Transitioning to another SQL database? This blog post is for you. Shifting from one SQL dialect to another can be a journey full of surprises. While the basic syntax (SELECT FROM WHERE) is similar, there are important differences, that will make your queries slow, fast, fail or worse: fail silently!

In this blog post I’ll guide you through the intricate pathways of databases I have come across during my work as a data scientist: Postgres, MySQL, SQLite, Hive and Presto (AWS Athena). We’ll start with a brief introduction into the databases and some differences. Then we jump into three pitfalls you have to be aware of.

Quick Overview

PostgreSQL: PostgreSQL is an open-source DBMS with an active community and regular updates. It is is an all-rounder database that excels in features, performance, and extensibility. You find it applied for all sorts of use cases from analytics to production usage. I started using Postgres in 2013 as a database during my PHD and have come across it at different companies and projects. Many larger companies like Netflix, Spotify and Appke chose Postgres for all sorts of use cases.

MySQL: MySQL has gained significant popularity as a go-to choice for web apps and small to medium-sized projects. It is fast in storing and retrieving data. I have used MySQL e.g. as production database for retrieving data used by a recommender system.

SQLite: SQLite is a self-contained, serverless, and embeddable DBMS. The entire SQLite database is contained within a single file on the local file system. Unlike PostgreSQL or MySQL, SQLite does not require a separate server process to run. That’s why I often use SQLite for my small side projects, where I do not want to go through the hassle of spinning up a database instance in a Docker containers.

Hive: Hive is a distributed data warehouse system built on top of Hadoop. HiveQL has an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Under the hood queries are implemented using the MapReduce Java API. Hive is not the fastest option to write and retrieve data. That’s why I have used Hivein an “offline” setting for analytical tasks, e.g. running analysis or creating training datasets for ml models.

Presto: Presto is not a database in the traditional sense. It is a distributed SQL query engine that allows you to run SQL queries on data stored in various data sources. Facebook originally developed on Presto, because Hive was too slow. Like Hive, Presto Query Engine applies MapReduce, but does not write intermediate results to disk. It can run on top and combine multiple datasources.

Parameter Hive Postgres MySQL SQLite Presto
Initial Release Date 2012 1996 1995 2000 2012
System Language Java C C, C++ C Java
Hosting Server Server Server Serverless Server
Concurrency Supported Supported Supported (writing is slow) Only one process can write Supported
Data Storage Column-based Row-based Row-based Row-based Column-based
Storage Formats ORC, Parquet, Text, others Custom, Text, JSON, others InnoDB, MyISAM, others SQLite3, others Various via connectors
Query Engine MapReduce PostgreSQL Executor MySQL Executor SQLite Executor Presto Executor
Read Speed Moderate Fast Fast Fast Fast
Write Speed Slow Fast Fast Fast Moderate
Aggregation Speed Moderate Fast Moderate Moderate Fast

Implications of data storage on speed

Data storage and data storage formats differ, which has implications of the speed of your queries.

In row-based storage, as seen in “traditional” relational database systems like PostgreSQL or MySQL, data is stored in rows. Imagine a library catalog card: each card (row) contains all the information (columns) about a single book. This approach is highly efficient if we are often interested in the entirety of the record, like retrieving all the details of a customer’s profile. However, calculating the average balance from a column in a table with millions of rows means the database has to scan each row, which can be resource-intensive and slow.

Column-based storage, as seen in databases like Hive and Presto, flips this model. Instead of storing data row by row, data is stored column by column. It’s like rearranging the library catalog to have a separate card for each type of information— one for titles, one for authors, and so on. Column based storage excels in analytical processing where queries often aggregate values from specific columns, like calculating averages or sums. Since the data in each column is of the same type, it allows for better compression, and the database can fetch the data more efficiently, improving query performance.

Understand how your query will be executed with EXPLAIN

The different databases use different query engines. The EXPLAIN statement is a powerful tool that you can use to gain insights into how a particular SQL query is executed by the database engine. Just prefix your SQL query with the EXPLAIN keyword and the database returns a query execution plan. This plan doesn’t execute the query, but instead provides a detailed roadmap of how the database intends to retrieve or modify the data, offering a glimpse into the internal strategies and operations used by the database optimizer. It can help you understand, which subqueries you should unnest or if you should update your index.

Types

Understanding the data types available in each database is essential for effective data storage and retrieval. For instance, using the appropriate numeric types can optimize storage, and leveraging specialized types like JSON can enhance the handling of complex data structures.

Here is a brief overview (please doublecheck the current docs!)

Data Type Categories PostgreSQL MySQL SQLite Hive Presto
Numeric INTEGER, NUMERIC, FLOAT8, SERIAL INT, FLOAT, DECIMAL, TINYINT INTEGER, REAL INT, FLOAT, DOUBLE INTEGER, DOUBLE, REAL
Character Strings VARCHAR, TEXT, CHAR VARCHAR, TEXT, CHAR TEXT STRING VARCHAR, CHAR
Binary Data BYTEA BLOB, BINARY BLOB BINARY VARBINARY
Date and Time TIMESTAMP, DATE, INTERVAL DATETIME, DATE, TIMESTAMP DATETIME, DATE TIMESTAMP, DATE TIMESTAMP, DATE
Boolean BOOLEAN BOOLEAN, TINYINT INTEGER (0 or 1) BOOLEAN BOOLEAN
JSON JSON, JSONB JSON TEXT (JSON as string) STRING (JSON as string) JSON
UUID UUID CHAR(36) TEXT STRING VARCHAR
Arrays and Composite ARRAY, Composite Types - - ARRAY, MAP, STRUCT ARRAY, MAP, ROW
Specialized ENUM, CIDR, INET SET, ENUM - UNIONTYPE IPADDRESS, HYPERLOGLOG, QDIGEST

Working with date fields is always painfully different. Here is a quick overview on how to do basic date operations.

Operation Hive Postgres MySQL SQLite Presto
Current Date CURRENT_DATE CURRENT_DATE CURDATE() date('now') CURRENT_DATE
Date Addition (Add 1 day to current date) date_add(CURRENT_DATE, 1) CURRENT_DATE + INTERVAL '1 day' DATE_ADD(CURDATE(), INTERVAL 1 DAY) date('now','+1 day') date_add('day', 1, CURRENT_DATE)
Extract Year year(CURRENT_DATE) EXTRACT(YEAR FROM CURRENT_DATE) YEAR(CURDATE()) strftime('%Y', 'now') EXTRACT(year FROM CURRENT_DATE)

Pitfalls

The Pitfalls of Type Casting: Where It Goes Wrong…

Be careful when casting types! Casting to types that don’t exist, will notifify you with an error. E.g. when trying to cast to a Boolean in SQLite.

SELECT CAST(1 AS BOOLEAN);

However if the type exists, but it is not clear how to cast it, the behavior may differ. So what happens if you try to cast the String abc to Integer?

SELECT CAST('abc' AS INT)

Postgres and Hive will return Null, Postgres will give you a 0 and MySQL and SQLite will throw an error.

CAST('abc' AS INT) Behaviour
Postgres 0 defaults to 0 when it can’t cast a non-numeric string to integer.
MySQL Error throws an error indicating it cannot cast the string to an integer.
SQLite Error throws an error due to the invalid cast.
Hive Null returns a Null value when it fails to cast a non-numeric string.
Presto Null returns a Null value for the unsuccessful cast.

The Pitfalls of Type Concatinating NULL values: Where It Goes Wrong…

What happens if you want to concatenate columns and one column contains a NULL value?

CONCAT('Hello', NULL, 'World')

Postgres, SQLite and Hive will return NULL while MySQL and Presto will just ignore the NULL value.

CONCAT('Hello', NULL, 'World') Behavior
PostgreSQL NULL Result is NULL
MySQL 'HelloWorld' Ignores NULL
SQLite NULL Result is NULL
Hive NULL Result is NULL
Presto 'HelloWorld' Ignores NULL

The Pitfalls of Integer Division: Where It Goes Wrong…

Different databases have distinct behaviors when it comes to division operations. For instance, some databases return an integer value in integer division, while others might return a decimal. E.g. let’s look at the example of dividing two integers:

SELECT 1/2;

The return type and the result will be different. Hive, MySQL and Presto will return a decimal, while Postgres and SQLite will return integer. And the result will be different, returning 0, 0.5 and 1 as a result. Postgres will round up to the next integeger, while SQLlite will round down.

Database Return Type For Integer Division Result of SELECT 1/2 Behaviour
Postgres Integer 1 Returns integer value for integer division, truncating any decimals.
MySQL Decimal 0.500000 Returns decimal value for integer division, preserving fractional parts.
SQLite Integer 0 Returns integer value for integer division, truncating any decimals.
Hive Decimal 0.500000 Returns decimal value for integer division, preserving fractional parts.
Presto Decimal 0.500000 Returns decimal value for integer division, preserving fractional parts.

You get consistent results, if you use a decimal instead of an integer.

SELECT 1.0/2;
Database Result of SELECT 1.0/2
Postgres 0.5
MySQL 0.5
SQLite 0.5
Hive 0.5
Presto 0.5

Syntax

There are slight differences in the syntax, but don’t worry too much about it. Wrong syntax will make your query fail and then you can fix it. E.g. aliases in group by statements

SELECT 
	CASE WHEN nr > 10 THEN 1 ELSE 0 END AS new_column,
	count(*)
FROM 
	table_name
GROUP BY 
	new_column
Group by alias Hive Postgres MySQL
no yes yes

Postgres and MySQL allow aliases in the GROUP BY clause, while Hive does not. When using aliases in your GROUP BY statement in Hive, it will simply fail and you can remove the alias. Also LIMIT and OFFSET to restrict query results may or may not exist.

It’s rather important to understand the “silent” failures.