The basic idea of complex datatypes is to store multiple values in a single column. So if you are working with a Hive database and you query a column, but then you notice “This value I need is trapped in a column among other values…” you just came across a complex a.k.a. nested datatype.
There are three types: arrays
, maps
and structs
. First, you have to understand, which types are present. Depending on the datatype, there are different ways how you can access the values.
-
array
(ARRAY<data_type>
): It is an ordered collection of elements. The elements in the array must be of the same type. So you have for example an array of strings, integers etc. -
map
(MAP<primitive_type, data_type>
): It is an unordered collection of key-value pairs. Keys must be of primitive types. Values can be of any type. A map can have an arbitary number of elements. -
struct
(:STRUCT<col_name : data_type
): It is a collection of elements of different types. The concept of a struct is very similar to a row in a table: structs have a fixed number of fields (“columns”). Every field has a predefined type.
type | what is it? | how do they look like | example |
---|---|---|---|
Array |
ordered collection of elements of the same type | a comma-separated list surrounded with […] | ["a","b"] |
Map |
unordered collection of key-value paris | comma-separated list of key:value pairs, surrounded with {…} | {"a":1, "b":2} |
Struct |
collection of elements of different types | written using the JSON map format | {"a": 1, "b":"c"} |
Use DESCRIBE <table>
to get a list of the datatypes of the columns in your table. You may notice, that one complex data type is nested into another, e.g. you may see an array of structs. But don’t worry, if you understand each type separately, you can untangle the nested structures aswell. So let’s have a look at each of those types and how we can work with them.
Arrays
So arrays are an ordered collection of elements of the same type. You could compare them to lists of the same type in Python or vectors in R. Since they are ordered, you can use a numeric index to access the elements. The index starts at 0.
For example, the table students
has a column extra_curriculum
, which is an array of strings.
first_name | extra_curriculum |
---|---|
Tom | [‘orchestra’] |
Ann | [‘orchestra’, ‘art’] |
You can get the first value in the array, by indexing it with [0]
.
SELECT
first_name,
extra_curriculum[0] AS first_extra_curriculum
FROM
students
;
This returns:
first_name | first_extra_curriculum |
---|---|
Tom | orchestra |
Ann | orchestra |
Let’s now say you want to have one row for each element in your array (The equivalent of Rs gather
-function). Hive has build-in table generating functions (UDTF) for that. These UDTFs transform a single input row into multiple output rows. The basic syntax is
LATERAL VIEW explode(expression) tableAlias AS columnAlias
This returns 0 to many rows: one row for each element from the input array. So running the following query on our students table
SELECT
first_name,
extra_curriculum
FROM
students
LATERAL VIEW EXPLODE (extra_curriculum) ec AS extra_curriculum
;
will result in a table, with each entry in extra_curriculum
as a row.
first_name | extra_curriculum |
---|---|
Tom | orchestra |
Ann | orchestra |
Ann | art |
Maps
Maps are used for key-value pairs. They are very similar to dictionaries in Python or a named vector in R. You can access the key-value pairs with the name of the key in brackets.
In this example, you have the table students
with column grade
, MAP<string, string>
, which maps different subjects to their letter grade.
first_name | grade |
---|---|
Tom | {‘math’: ‘B’, ‘english’: ‘B’} |
Ann | {‘math’: ‘A’, ‘english’: ‘B’, ‘biology’: ‘C’} |
You can extract all the math grades by using the key math
in brackets.
SELECT
first_name,
grade["math"] AS math_grade
FROM
students
;
first_name | math_grade |
---|---|
Tom | B |
Ann | A |
Structs
Structs are written in JSON format. You can access the values using the dot notation for the field to extact the value.
first_name | teacher |
---|---|
Tom | {‘math’: ‘Mrs Johnson’, ‘english’: ‘Mr Miller’, ‘nr_teachers’: 2} |
Ann | {‘math’: ‘Mrs Johnson’, ‘english’: ‘Mrs Thomson’, ‘biology’: ‘Mr Chu’, ‘nr_teachers’: 3} |
SELECT
first_name,
teacher.math AS math_teacher
FROM
student
;
first_name | math_teacher |
---|---|
Tom | Mrs Johnson |
Ann | Mrs Johnson |
Summary
To sum it up, depending on the complex data type, you can query elements differently. This last table gives an overview on how.
Type | How to extract elements | Example | |
---|---|---|---|
Array |
by index | my_array[0] |
|
Map |
by key | my_map['key'] |
|
Struct |
by dot notation | my_struct.name |