Mastering Differences and Pitfalls when Switching SQL Databases: PostgreSQL vs. MySQL vs. SQLITE vs. Hive vs. Presto (AWS Athena)

Transitioning to another SQL database? This blog post is for you. Shifting from one SQL dialect to another can be a journey full of surprises. While the basic syntax (SELECT FROM WHERE) is similar, there are important differences, that will make your queries slow, fast, fail or worse: fail silently!

In this blog post I’ll guide you through the intricate pathways of databases I have come across during my work as a data scientist: Postgres, MySQL, SQLite, Hive and Presto (AWS Athena). We’ll start with a brief introduction into the databases and some differences. Then we jump into three pitfalls you have to be aware of.

Quick Overview

PostgreSQL: PostgreSQL is an open-source DBMS with an active community and regular updates. It is is an all-rounder database that excels in features, performance, and extensibility. You find it applied for all sorts of use cases from analytics to production usage. I started using Postgres in 2013 as a database during my PHD and have come across it at different companies and projects. Many larger companies like Netflix, Spotify and Appke chose Postgres for all sorts of use cases.

MySQL: MySQL has gained significant popularity as a go-to choice for web apps and small to medium-sized projects. It is fast in storing and retrieving data. I have used MySQL e.g. as production database for retrieving data used by a recommender system.

SQLite: SQLite is a self-contained, serverless, and embeddable DBMS. The entire SQLite database is contained within a single file on the local file system. Unlike PostgreSQL or MySQL, SQLite does not require a separate server process to run. That’s why I often use SQLite for my small side projects, where I do not want to go through the hassle of spinning up a database instance in a Docker containers.

Hive: Hive is a distributed data warehouse system built on top of Hadoop. HiveQL has an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Under the hood queries are implemented using the MapReduce Java API. Hive is not the fastest option to write and retrieve data. That’s why I have used Hivein an “offline” setting for analytical tasks, e.g. running analysis or creating training datasets for ml models.

Presto: Presto is not a database in the traditional sense. It is a distributed SQL query engine that allows you to run SQL queries on data stored in various data sources. Facebook originally developed on Presto, because Hive was too slow. Like Hive, Presto Query Engine applies MapReduce, but does not write intermediate results to disk. It can run on top and combine multiple datasources.

Parameter	Hive	Postgres	MySQL	SQLite	Presto
Initial Release Date	2012	1996	1995	2000	2012
System Language	Java	C	C, C++	C	Java
Hosting	Server	Server	Server	Serverless	Server
Concurrency	Supported	Supported	Supported (writing is slow)	Only one process can write	Supported
Data Storage	Column-based	Row-based	Row-based	Row-based	Column-based
Storage Formats	ORC, Parquet, Text, others	Custom, Text, JSON, others	InnoDB, MyISAM, others	SQLite3, others	Various via connectors
Query Engine	MapReduce	PostgreSQL Executor	MySQL Executor	SQLite Executor	Presto Executor
Read Speed	Moderate	Fast	Fast	Fast	Fast
Write Speed	Slow	Fast	Fast	Fast	Moderate
Aggregation Speed	Moderate	Fast	Moderate	Moderate	Fast

Implications of data storage on speed

Data storage and data storage formats differ, which has implications of the speed of your queries.

In row-based storage, as seen in “traditional” relational database systems like PostgreSQL or MySQL, data is stored in rows. Imagine a library catalog card: each card (row) contains all the information (columns) about a single book. This approach is highly efficient if we are often interested in the entirety of the record, like retrieving all the details of a customer’s profile. However, calculating the average balance from a column in a table with millions of rows means the database has to scan each row, which can be resource-intensive and slow.

Column-based storage, as seen in databases like Hive and Presto, flips this model. Instead of storing data row by row, data is stored column by column. It’s like rearranging the library catalog to have a separate card for each type of information— one for titles, one for authors, and so on. Column based storage excels in analytical processing where queries often aggregate values from specific columns, like calculating averages or sums. Since the data in each column is of the same type, it allows for better compression, and the database can fetch the data more efficiently, improving query performance.

Understand how your query will be executed with `EXPLAIN`

The different databases use different query engines. The EXPLAIN statement is a powerful tool that you can use to gain insights into how a particular SQL query is executed by the database engine. Just prefix your SQL query with the EXPLAIN keyword and the database returns a query execution plan. This plan doesn’t execute the query, but instead provides a detailed roadmap of how the database intends to retrieve or modify the data, offering a glimpse into the internal strategies and operations used by the database optimizer. It can help you understand, which subqueries you should unnest or if you should update your index.

Types

Understanding the data types available in each database is essential for effective data storage and retrieval. For instance, using the appropriate numeric types can optimize storage, and leveraging specialized types like JSON can enhance the handling of complex data structures.

Here is a brief overview (please doublecheck the current docs!)

Data Type Categories	PostgreSQL	MySQL	SQLite	Hive	Presto
Numeric	INTEGER, NUMERIC, FLOAT8, SERIAL	INT, FLOAT, DECIMAL, TINYINT	INTEGER, REAL	INT, FLOAT, DOUBLE	INTEGER, DOUBLE, REAL
Character Strings	VARCHAR, TEXT, CHAR	VARCHAR, TEXT, CHAR	TEXT	STRING	VARCHAR, CHAR
Binary Data	BYTEA	BLOB, BINARY	BLOB	BINARY	VARBINARY
Date and Time	TIMESTAMP, DATE, INTERVAL	DATETIME, DATE, TIMESTAMP	DATETIME, DATE	TIMESTAMP, DATE	TIMESTAMP, DATE
Boolean	BOOLEAN	BOOLEAN, TINYINT	INTEGER (0 or 1)	BOOLEAN	BOOLEAN
JSON	JSON, JSONB	JSON	TEXT (JSON as string)	STRING (JSON as string)	JSON
UUID	UUID	CHAR(36)	TEXT	STRING	VARCHAR
Arrays and Composite	ARRAY, Composite Types	-	-	ARRAY, MAP, STRUCT	ARRAY, MAP, ROW
Specialized	ENUM, CIDR, INET	SET, ENUM	-	UNIONTYPE	IPADDRESS, HYPERLOGLOG, QDIGEST

Working with date fields is always painfully different. Here is a quick overview on how to do basic date operations.

Operation	Hive	Postgres	MySQL	SQLite	Presto
Current Date	`CURRENT_DATE`	`CURRENT_DATE`	`CURDATE()`	`date('now')`	`CURRENT_DATE`
Date Addition (Add 1 day to current date)	`date_add(CURRENT_DATE, 1)`	`CURRENT_DATE + INTERVAL '1 day'`	`DATE_ADD(CURDATE(), INTERVAL 1 DAY)`	`date('now','+1 day')`	`date_add('day', 1, CURRENT_DATE)`
Extract Year	`year(CURRENT_DATE)`	`EXTRACT(YEAR FROM CURRENT_DATE)`	`YEAR(CURDATE())`	`strftime('%Y', 'now')`	`EXTRACT(year FROM CURRENT_DATE)`

Pitfalls

The Pitfalls of Type Casting: Where It Goes Wrong…

Be careful when casting types! Casting to types that don’t exist, will notifify you with an error. E.g. when trying to cast to a Boolean in SQLite.

SELECT CAST(1 AS BOOLEAN);

However if the type exists, but it is not clear how to cast it, the behavior may differ. So what happens if you try to cast the String abc to Integer?

SELECT CAST('abc' AS INT)

Postgres and Hive will return Null, Postgres will give you a 0 and MySQL and SQLite will throw an error.

	`CAST('abc' AS INT)`	Behaviour
Postgres	0	defaults to 0 when it can’t cast a non-numeric string to integer.
MySQL	Error	throws an error indicating it cannot cast the string to an integer.
SQLite	Error	throws an error due to the invalid cast.
Hive	`Null`	returns a `Null` value when it fails to cast a non-numeric string.
Presto	`Null`	returns a `Null` value for the unsuccessful cast.

The Pitfalls of Type Concatinating NULL values: Where It Goes Wrong…

What happens if you want to concatenate columns and one column contains a NULL value?

CONCAT('Hello', NULL, 'World')

Postgres, SQLite and Hive will return NULL while MySQL and Presto will just ignore the NULL value.

	`CONCAT('Hello', NULL, 'World')`	Behavior
PostgreSQL	`NULL`	Result is `NULL`
MySQL	`'HelloWorld'`	Ignores `NULL`
SQLite	`NULL`	Result is `NULL`
Hive	`NULL`	Result is `NULL`
Presto	`'HelloWorld'`	Ignores `NULL`

The Pitfalls of Integer Division: Where It Goes Wrong…

Different databases have distinct behaviors when it comes to division operations. For instance, some databases return an integer value in integer division, while others might return a decimal. E.g. let’s look at the example of dividing two integers:

SELECT 1/2;

The return type and the result will be different. Hive, MySQL and Presto will return a decimal, while Postgres and SQLite will return integer. And the result will be different, returning 0, 0.5 and 1 as a result. Postgres will round up to the next integeger, while SQLlite will round down.

Database	Return Type For Integer Division	Result of `SELECT 1/2`	Behaviour
Postgres	Integer	1	Returns integer value for integer division, truncating any decimals.
MySQL	Decimal	0.500000	Returns decimal value for integer division, preserving fractional parts.
SQLite	Integer	0	Returns integer value for integer division, truncating any decimals.
Hive	Decimal	0.500000	Returns decimal value for integer division, preserving fractional parts.
Presto	Decimal	0.500000	Returns decimal value for integer division, preserving fractional parts.

You get consistent results, if you use a decimal instead of an integer.

SELECT 1.0/2;

Database	Result of `SELECT 1.0/2`
Postgres	0.5
MySQL	0.5
SQLite	0.5
Hive	0.5
Presto	0.5

Syntax

There are slight differences in the syntax, but don’t worry too much about it. Wrong syntax will make your query fail and then you can fix it. E.g. aliases in group by statements

SELECT 
	CASE WHEN nr > 10 THEN 1 ELSE 0 END AS new_column,
	count(*)
FROM 
	table_name
GROUP BY 
	new_column

Group by alias	Hive	Postgres	MySQL
	no	yes	yes

Postgres and MySQL allow aliases in the GROUP BY clause, while Hive does not. When using aliases in your GROUP BY statement in Hive, it will simply fail and you can remove the alias. Also LIMIT and OFFSET to restrict query results may or may not exist.

It’s rather important to understand the “silent” failures.