Transitioning to another SQL database? This blog post is for you. Shifting from one SQL dialect to another can be a journey full of surprises. While the basic syntax (SELECT FROM WHERE) is similar, there are important differences, that will make your queries slow, fast, fail or worse: fail silently!
In this blog post I’ll guide you through the intricate pathways of databases I have come across during my work as a data scientist: Postgres, MySQL, SQLite, Hive and Presto (AWS Athena). We’ll start with a brief introduction into the databases and some differences. Then we jump into three pitfalls you have to be aware of.
Quick Overview
PostgreSQL: PostgreSQL is an open-source DBMS with an active community and regular updates. It is is an all-rounder database that excels in features, performance, and extensibility. You find it applied for all sorts of use cases from analytics to production usage. I started using Postgres in 2013 as a database during my PHD and have come across it at different companies and projects. Many larger companies like Netflix, Spotify and Appke chose Postgres for all sorts of use cases.
MySQL: MySQL has gained significant popularity as a go-to choice for web apps and small to medium-sized projects. It is fast in storing and retrieving data. I have used MySQL e.g. as production database for retrieving data used by a recommender system.
SQLite: SQLite is a self-contained, serverless, and embeddable DBMS. The entire SQLite database is contained within a single file on the local file system. Unlike PostgreSQL or MySQL, SQLite does not require a separate server process to run. That’s why I often use SQLite for my small side projects, where I do not want to go through the hassle of spinning up a database instance in a Docker containers.
Hive: Hive is a distributed data warehouse system built on top of Hadoop. HiveQL has an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Under the hood queries are implemented using the MapReduce Java API. Hive is not the fastest option to write and retrieve data. That’s why I have used Hivein an “offline” setting for analytical tasks, e.g. running analysis or creating training datasets for ml models.
Presto: Presto is not a database in the traditional sense. It is a distributed SQL query engine that allows you to run SQL queries on data stored in various data sources. Facebook originally developed on Presto, because Hive was too slow. Like Hive, Presto Query Engine applies MapReduce, but does not write intermediate results to disk. It can run on top and combine multiple datasources.
Parameter | Hive | Postgres | MySQL | SQLite | Presto |
---|---|---|---|---|---|
Initial Release Date | 2012 | 1996 | 1995 | 2000 | 2012 |
System Language | Java | C | C, C++ | C | Java |
Hosting | Server | Server | Server | Serverless | Server |
Concurrency | Supported | Supported | Supported (writing is slow) | Only one process can write | Supported |
Data Storage | Column-based | Row-based | Row-based | Row-based | Column-based |
Storage Formats | ORC, Parquet, Text, others | Custom, Text, JSON, others | InnoDB, MyISAM, others | SQLite3, others | Various via connectors |
Query Engine | MapReduce | PostgreSQL Executor | MySQL Executor | SQLite Executor | Presto Executor |
Read Speed | Moderate | Fast | Fast | Fast | Fast |
Write Speed | Slow | Fast | Fast | Fast | Moderate |
Aggregation Speed | Moderate | Fast | Moderate | Moderate | Fast |
Implications of data storage on speed
Data storage and data storage formats differ, which has implications of the speed of your queries.
In row-based storage, as seen in “traditional” relational database systems like PostgreSQL or MySQL, data is stored in rows. Imagine a library catalog card: each card (row) contains all the information (columns) about a single book. This approach is highly efficient if we are often interested in the entirety of the record, like retrieving all the details of a customer’s profile. However, calculating the average balance from a column in a table with millions of rows means the database has to scan each row, which can be resource-intensive and slow.
Column-based storage, as seen in databases like Hive and Presto, flips this model. Instead of storing data row by row, data is stored column by column. It’s like rearranging the library catalog to have a separate card for each type of information— one for titles, one for authors, and so on. Column based storage excels in analytical processing where queries often aggregate values from specific columns, like calculating averages or sums. Since the data in each column is of the same type, it allows for better compression, and the database can fetch the data more efficiently, improving query performance.
Understand how your query will be executed with EXPLAIN
The different databases use different query engines. The EXPLAIN
statement is a powerful tool that you can use to gain insights into how a particular SQL query is executed by the database engine. Just prefix your SQL query with the EXPLAIN
keyword and the database returns a query execution plan. This plan doesn’t execute the query, but instead provides a detailed roadmap of how the database intends to retrieve or modify the data, offering a glimpse into the internal strategies and operations used by the database optimizer. It can help you understand, which subqueries you should unnest or if you should update your index.
Types
Understanding the data types available in each database is essential for effective data storage and retrieval. For instance, using the appropriate numeric types can optimize storage, and leveraging specialized types like JSON can enhance the handling of complex data structures.
Here is a brief overview (please doublecheck the current docs!)
Data Type Categories | PostgreSQL | MySQL | SQLite | Hive | Presto |
---|---|---|---|---|---|
Numeric | INTEGER, NUMERIC, FLOAT8, SERIAL | INT, FLOAT, DECIMAL, TINYINT | INTEGER, REAL | INT, FLOAT, DOUBLE | INTEGER, DOUBLE, REAL |
Character Strings | VARCHAR, TEXT, CHAR | VARCHAR, TEXT, CHAR | TEXT | STRING | VARCHAR, CHAR |
Binary Data | BYTEA | BLOB, BINARY | BLOB | BINARY | VARBINARY |
Date and Time | TIMESTAMP, DATE, INTERVAL | DATETIME, DATE, TIMESTAMP | DATETIME, DATE | TIMESTAMP, DATE | TIMESTAMP, DATE |
Boolean | BOOLEAN | BOOLEAN, TINYINT | INTEGER (0 or 1) | BOOLEAN | BOOLEAN |
JSON | JSON, JSONB | JSON | TEXT (JSON as string) | STRING (JSON as string) | JSON |
UUID | UUID | CHAR(36) | TEXT | STRING | VARCHAR |
Arrays and Composite | ARRAY, Composite Types | - | - | ARRAY, MAP, STRUCT | ARRAY, MAP, ROW |
Specialized | ENUM, CIDR, INET | SET, ENUM | - | UNIONTYPE | IPADDRESS, HYPERLOGLOG, QDIGEST |
Working with date fields is always painfully different. Here is a quick overview on how to do basic date operations.
Operation | Hive | Postgres | MySQL | SQLite | Presto |
---|---|---|---|---|---|
Current Date | CURRENT_DATE |
CURRENT_DATE |
CURDATE() |
date('now') |
CURRENT_DATE |
Date Addition (Add 1 day to current date) | date_add(CURRENT_DATE, 1) |
CURRENT_DATE + INTERVAL '1 day' |
DATE_ADD(CURDATE(), INTERVAL 1 DAY) |
date('now','+1 day') |
date_add('day', 1, CURRENT_DATE) |
Extract Year | year(CURRENT_DATE) |
EXTRACT(YEAR FROM CURRENT_DATE) |
YEAR(CURDATE()) |
strftime('%Y', 'now') |
EXTRACT(year FROM CURRENT_DATE) |
Pitfalls
The Pitfalls of Type Casting: Where It Goes Wrong…
Be careful when casting types! Casting to types that don’t exist, will notifify you with an error. E.g. when trying to cast to a Boolean in SQLite.
SELECT CAST(1 AS BOOLEAN);
However if the type exists, but it is not clear how to cast it, the behavior may differ. So what happens if you try to cast the String abc
to Integer?
SELECT CAST('abc' AS INT)
Postgres and Hive will return Null
, Postgres will give you a 0
and MySQL and SQLite will throw an error.
CAST('abc' AS INT) |
Behaviour | |
---|---|---|
Postgres | 0 | defaults to 0 when it can’t cast a non-numeric string to integer. |
MySQL | Error | throws an error indicating it cannot cast the string to an integer. |
SQLite | Error | throws an error due to the invalid cast. |
Hive | Null |
returns a Null value when it fails to cast a non-numeric string. |
Presto | Null |
returns a Null value for the unsuccessful cast. |
The Pitfalls of Type Concatinating NULL values: Where It Goes Wrong…
What happens if you want to concatenate columns and one column contains a NULL
value?
CONCAT('Hello', NULL, 'World')
Postgres, SQLite and Hive will return NULL
while MySQL and Presto will just ignore the NULL
value.
CONCAT('Hello', NULL, 'World') |
Behavior | |
---|---|---|
PostgreSQL | NULL |
Result is NULL |
MySQL | 'HelloWorld' |
Ignores NULL |
SQLite | NULL |
Result is NULL |
Hive | NULL |
Result is NULL |
Presto | 'HelloWorld' |
Ignores NULL |
The Pitfalls of Integer Division: Where It Goes Wrong…
Different databases have distinct behaviors when it comes to division operations. For instance, some databases return an integer value in integer division, while others might return a decimal. E.g. let’s look at the example of dividing two integers:
SELECT 1/2;
The return type and the result will be different. Hive, MySQL and Presto will return a decimal, while Postgres and SQLite will return integer. And the result will be different, returning 0, 0.5 and 1 as a result. Postgres will round up to the next integeger, while SQLlite will round down.
Database | Return Type For Integer Division | Result of SELECT 1/2 |
Behaviour |
---|---|---|---|
Postgres | Integer | 1 | Returns integer value for integer division, truncating any decimals. |
MySQL | Decimal | 0.500000 | Returns decimal value for integer division, preserving fractional parts. |
SQLite | Integer | 0 | Returns integer value for integer division, truncating any decimals. |
Hive | Decimal | 0.500000 | Returns decimal value for integer division, preserving fractional parts. |
Presto | Decimal | 0.500000 | Returns decimal value for integer division, preserving fractional parts. |
You get consistent results, if you use a decimal instead of an integer.
SELECT 1.0/2;
Database | Result of SELECT 1.0/2 |
---|---|
Postgres | 0.5 |
MySQL | 0.5 |
SQLite | 0.5 |
Hive | 0.5 |
Presto | 0.5 |
Syntax
There are slight differences in the syntax, but don’t worry too much about it. Wrong syntax will make your query fail and then you can fix it. E.g. aliases in group by statements
SELECT
CASE WHEN nr > 10 THEN 1 ELSE 0 END AS new_column,
count(*)
FROM
table_name
GROUP BY
new_column
Group by alias | Hive | Postgres | MySQL |
---|---|---|---|
no | yes | yes |
Postgres and MySQL allow aliases in the GROUP BY clause, while Hive does not. When using aliases in your GROUP BY statement in Hive, it will simply fail and you can remove the alias. Also LIMIT
and OFFSET
to restrict query results may or may not exist.
It’s rather important to understand the “silent” failures.