Mastering Differences and Pitfalls when Switching SQL Databases: PostgreSQL vs. MySQL vs. SQLite vs. Hive vs. Presto (AWS Athena)

9 December 2022 in SQL, Database

Transitioning to another SQL database? This blog post is for you. Shifting from one SQL dialect to another can be a journey full of surprises. While the basic syntax (SELECT FROM WHERE) is similar, there are important differences that can make your queries run slower or faster, fail or, worse, fail silently!

In this blog post I’ll guide you through the intricate pathways of the databases I have come across during my work as a data scientist: PostgreSQL, MySQL, SQLite, Hive and Presto (AWS Athena). We’ll start with a brief introduction to the databases and some of their differences. Then we jump into three pitfalls you have to be aware of.
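As a small illustration of what "failing silently" can look like, here is a minimal sketch using Python's built-in `sqlite3` module. It is not necessarily one of the three pitfalls covered in the post, and the `orders` table is a hypothetical example. It shows two SQLite behaviours that raise no error: integer division truncates (PostgreSQL does the same, while MySQL's `/` operator returns a decimal), and an ordinary (non-STRICT) table accepts text in an INTEGER column.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Integer division truncates in SQLite: no error, just a different number
# than a dialect with decimal division would give.
int_div = conn.execute("SELECT 1 / 2").fetchone()[0]

# Hypothetical table: SQLite's type affinity lets text into an INTEGER column
# when the value cannot be converted to a number.
conn.execute("CREATE TABLE orders (id INTEGER, quantity INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'three')")
stored_type = conn.execute("SELECT typeof(quantity) FROM orders").fetchone()[0]

print(int_div)      # result of the integer division
print(stored_type)  # storage class SQLite actually used for 'three'
```

Neither statement raises an exception, which is exactly why such differences are easy to miss when porting queries.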

Biases in learning to rank models and three approaches to deal with them

29 April 2021 in machine learning

Search engines rely on models that rank the matching results for a given user query. These models optimize the order of items. They learn how to rank items in a result list, hence the name Learning-to-Rank (LTR) models.

Avro and Avro schemas - how they work and why they are useful

23 February 2021 in data engineering

You have Kafka up and running as your message broker and you may wonder: In which format should I send my data around? Maybe the string format pops up in your mind. Why not just put all fields into a long string and separate them with a comma?
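One reason the comma-separated string is fragile can be sketched in a few lines of Python. The `encode` and `decode` helpers and the sample record are hypothetical and only illustrate the idea; they are not taken from the post.

```python
def encode(fields):
    """Naive serialization: join all fields with a comma."""
    return ",".join(fields)


def decode(message):
    """Naive deserialization: split the message on every comma."""
    return message.split(",")


# Hypothetical record: the name field itself contains a comma.
record = ["42", "Doe, Jane", "jane@example.com"]
decoded = decode(encode(record))

print(len(record), len(decoded))  # field count before and after the round trip
print(decoded)
```

The consumer gets back more fields than the producer sent, and nothing in the message tells it which split was wrong. A format with an explicit schema avoids this kind of ambiguity.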

A Gentle Intro to the Basic Architecture of Message Brokers: RabbitMQ vs. Kafka

26 January 2021 in kafka, amqp, rabbitmq, message brokers

In this blog post you will get a basic understanding of message brokers. We will look at two very popular message brokers, Kafka and RabbitMQ, and learn how they handle messages.

Intro to APIs and how to access public REST APIs with `curl`

26 November 2020 in api, rest

This post will teach you the intuition behind REST APIs and how you can use them to get interesting datasets for your data projects. First, we will look at the four components of a request. In the second part of this blog post, we will go through one example and access the CoinGecko API via curl.

Pointwise, Pairwise and Listwise Learning to Rank Models - Three Approaches to Optimize Relative Ordering

15 October 2020 in machine learning

In many scenarios, such as a Google search or a product recommendation in an online shop, we have tons of data and limited space to display it. We cannot show all the products of an online shop to the user as a possible next best offer. Neither would a user want to scroll through all the pages indexed by a search engine to find the most relevant page that matches their search keywords. The most relevant content should be on top. Learning to rank (LTR) models are supervised machine learning models that attempt to optimize the order of items. So, compared to classification or regression models, they do not care about exact scores or predictions, only about the relative order. LTR models are typically applied in search engines, but have gained popularity in other fields such as product recommendations as well.
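The point about relative order can be made concrete with a minimal Python sketch. The two score lists below are made-up numbers for two hypothetical models scoring the same three items; the `ranking` helper is an illustration, not part of any LTR library.

```python
def ranking(scores):
    """Return item indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores from two models for the same three items.
model_a = [0.9, 0.2, 0.5]
model_b = [12.0, -3.0, 4.0]

# The raw scores differ wildly, but both models put the items in the same order.
same_order = ranking(model_a) == ranking(model_b)
print(ranking(model_a), ranking(model_b), same_order)
```

From a ranking perspective the two models are equally good, even though a regression metric on the raw scores would treat them very differently.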

AI-Machine-Learning-Buzzword-Bingo

10 September 2020 in machine learning, ai

I was recently invited to join a panel discussion among developers to dispel the myths behind the typical BS Buzzword Bingo around machine learning and AI. In this blog post, I will share some of the buzzwords we talked about, each with a short description and links. Oops, I already used some buzzwords. So let’s start. AI (Artificial Intelligence) is the magic potion to fix all problems of all companies and will make us unemployed in the future.

The Intuition of Word Embeddings: How you Teach A Computer to Understand Text

31 August 2020 in Machine Learning, NLP

Humans intuitively understand the meaning of words: Which words are similar, opposites or related to each other? But our machine learning models do not have this intuition. Word embeddings are numeric vectors that represent text. These vectors are learned through neural networks. The objective when creating these embedding vectors is to capture as much “meaning” as possible: Related words should be closer together than unrelated words. Also, they should be able to preserve mathematical relationships between words such as

Jupyter Notebooks: Boost your productivity with Extensions and Magic Commands

12 July 2020 in Python, Jupyter Notebooks

In this blog post I will share some tips for working with Jupyter Notebooks. These tips greatly improved my productivity, and I wish someone had told me about them earlier. The two main topics of this post are extensions and magic commands. Jupyter Extensions: Have you ever missed a feature in your Jupyter Notebook that IDEs have? E.g. were you hoping for autocompletion or automatic code formatting? Then there might be a Jupyter Notebook extension for you.

Mastering Elasticsearch Queries If You Have Only Worked With SQL Before

27 June 2020

Elasticsearch is often the storage engine of choice for storing and querying full-text data. But writing an Elasticsearch query is quite different from querying a relational database in SQL. In this blog post, you will learn some basics you need to understand before working with Elasticsearch. In the second part, you will learn how to write queries in Elasticsearch. Elasticsearch uses many of the same concepts as your SQL database. The terminology is just a little different.

Heike Maria (PhD)