Elasticsearch is often the storage engine of choice for storing and querying full-text data. But writing an Elasticsearch query is quite different from querying a relational database in SQL. In this blog post, you will first learn some basics you need to understand before working with Elasticsearch. In the second part, you will learn how to write queries in Elasticsearch.
Elasticsearch uses many of the same concepts as your SQL database. The terminology is just a little different. The following table gives an overview:
Elasticsearch | RDBMS equivalent |
---|---|
Index | Database |
Type | Table |
Document | Row |
Field | Column |
Mapping | Schema |
Shard | Partition |
An index (“database”) stores documents (“rows in a table”) and is split into shards (“partitions”): 5 by default in versions before 7.0, 1 by default since then. All documents (“rows”) of a given type (“table”) in an index (“database”) have the same mapping (“schema”). Note that mapping types were deprecated in Elasticsearch 6 and removed in later versions, so a recent index effectively holds a single type. Documents (“rows”) are stored in JSON format (you might know this from the complex data type `struct` in Hive, for example). In a process called mapping, you define how a document, and the fields (“columns”) it contains, are stored and indexed. So the mapping is similar to the schema of a relational database. There are two options for string mappings:
- Exact values (`keyword`): `keyword` fields are useful for structured content such as city names, IDs or email addresses. The whole string is indexed as a single term and retrieval is based on exact matches. So “Hamburg” is not the same as “hamburg”. There is no partial matching with the usual exact-value queries. You would not choose a keyword field for full-text data.
- Full text, a.k.a. analyzed text (`text`): As the name implies, text data in this field will be analyzed. There are various built-in analyzers. Most analyzers convert the text into a set of tokens, perform some kind of standardization (lowercasing, stemming, lemmatization etc.) and add the tokens to the inverted index so the field can be searched quickly (you can check out my blog post on the inverted index and scoring if you want to know more).

By default, Elasticsearch sorts the results by relevance score. The higher the score, the higher the rank of the document. This relevance `_score` is returned as a metafield. There are different query types that can modify the scores and customize the order of search results. Speaking of queries: Let’s see how to write queries in Elasticsearch.
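Before moving on to queries, here is a minimal sketch of how the two string mappings described above might be declared when creating an index. The field names `city` and `description` are made up for illustration, and the body assumes the typeless mapping format used by Elasticsearch 7 and later.

```python
import json

# Hypothetical index body with one exact-value field and one full-text field.
# Field names are illustrative only; the layout assumes the typeless
# mapping format of Elasticsearch 7 and later.
index_body = {
    "mappings": {
        "properties": {
            "city": {"type": "keyword"},      # indexed as one exact value
            "description": {"type": "text"},  # analyzed into tokens
        }
    }
}

# This JSON would be the request body of a "create index" call.
print(json.dumps(index_body, indent=2))
```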
Query Tools
You have multiple options. For example:
- You can query the index directly with `curl`.
- You can use interfaces such as Kibana, which also provide some nice visualization options. This one might come most naturally to you if you are used to SQL clients.
- You can query from Python using the Elasticsearch client.
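Whichever tool you pick, the query ends up as a JSON body sent to the `_search` endpoint of an index. As a rough sketch, the request could be prepared with nothing but the Python standard library. The host `http://localhost:9200` and the index name `my-index` are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Prepare (but do not send) a POST request to the _search endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host and index name; match_all simply returns every document.
request = build_search_request(
    "http://localhost:9200", "my-index", {"query": {"match_all": {}}}
)
print(request.get_method(), request.full_url)

# Against a live cluster you could then run:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```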
Using ElasticSearch Query DSL
Instead of using `SELECT FROM WHERE` syntax, you will write your query in JSON format. This is the basic anatomy of an Elasticsearch query:
{
"query": { ... },
"sort": { ... },
"from": <offset>,
"size": <limit>
}
So what does that mean?
- `query`: is just a wrapper for your ES query.
- `sort`: is your `ORDER BY`. By default, results are sorted by descending `_score`.
- `from`: is your `OFFSET`.
- `size`: is your `LIMIT`.
The query clause is the core of the request; if you leave it out, Elasticsearch matches all documents. Sorting, offsetting and limiting your search results are optional as well. So let’s dive deeper into the queries:
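To make the anatomy concrete, the following sketch assembles a search body from its four parts. The helper name `build_search_body` and the field names `title` and `published` are hypothetical; only the keys `query`, `sort`, `from` and `size` come from the query DSL.

```python
def build_search_body(query, sort=None, offset=0, limit=10):
    """Assemble a search body; from/size play the role of OFFSET/LIMIT."""
    body = {"query": query, "from": offset, "size": limit}
    if sort is not None:
        body["sort"] = sort  # leave out to keep the default order by _score
    return body


# Third page of ten results, newest documents first (field names are made up).
body = build_search_body(
    query={"match": {"title": "elasticsearch"}},
    sort=[{"published": {"order": "desc"}}],
    offset=20,
    limit=10,
)
print(body)
```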
There are two different kinds of query clauses.
- Leaf query clauses: Leaf query clauses look for a particular value in a particular field. These queries can be used by themselves. Examples are `term` and `range`. We’ll see examples in a minute.
- Compound query clauses: Compound query clauses wrap leaf or other compound queries and therefore combine multiple queries, for example with boolean operators (`must`, `must_not`, `should`) or by boosting specific matches by altering the `_score`. We’ll get into some examples soon.
Leaf query clauses
Filtering with `match` and `term`
The `term` and `terms` query clauses are used for “strict filters”. You apply those queries to `keyword` (exact value) fields. The query value must then exactly match the field value, including whitespace and capitalization. It’s like a simple `WHERE` clause in SQL without regexes. No analyzer is applied to the query value. You can filter by single or multiple values. For example, this query finds all documents whose field `<field>` has the value `<value>`:
{ "term": { "<field>": "<value>" }}
while this query retrieves all documents with `<value_a>` or `<value_b>`:
{ "terms": { "<field>": ["<value_a>", "<value_b>"] }}
The match query clause is the most generic and commonly used query clause. You can use it for `text` and `keyword` fields. It automatically figures out what to do: If you apply `match` on a `keyword` field, it will behave like a filter and return documents with exact matches. If you use `match` on a `text` field, it will perform an analyzed search, meaning that it applies the same analyzer to the query string as the field mapping does, unless you explicitly choose a different analyzer. You can write a match query as follows:
{ "match": { "<field name>": "<text to be matched>" }}
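If you build queries from Python, the three leaf clauses above map onto plain dictionaries. The helper functions below are hypothetical convenience wrappers, not part of any client library, and the field names are invented for the example.

```python
def term_query(field, value):
    """Exact match on a single value (no analysis of the value)."""
    return {"term": {field: value}}


def terms_query(field, values):
    """Exact match on any of several values."""
    return {"terms": {field: list(values)}}


def match_query(field, text):
    """Analyzed match on text fields, exact match on keyword fields."""
    return {"match": {field: text}}


print(term_query("city", "Hamburg"))
print(terms_query("city", ["Hamburg", "Berlin"]))
print(match_query("description", "harbour city"))
```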
Compound query clauses
The bool query clause is an example of a compound query clause, as it is used to combine multiple query clauses using boolean operators. Just wrap the `bool` around your operators `must` (“AND”), `must_not` (“NOT”) and `should` (“OR”). The scores from matching `must` and `should` clauses are added together to provide the final score; `must_not` clauses only exclude documents and do not contribute to the score.
{
"bool": {
"must": { "term": { "tag": "math" }},
"must_not": { "term": { "tag": "probability" }},
"should": [
{ "term": { "favorite": true }},
{ "term": { "unread": true }}
]
}
}
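To see which documents such a bool query lets through, here is a toy evaluator over in-memory dictionaries. It is only a sketch of the matching logic for `term` clauses: it ignores scoring, `filter` clauses and `minimum_should_match`, and the sample documents are invented.

```python
def term_matches(doc, clause):
    """Evaluate one {"term": {field: value}} clause against a dict document."""
    (field, value), = clause["term"].items()
    stored = doc.get(field)
    if isinstance(stored, list):  # multi-valued field: any value may match
        return value in stored
    return stored == value


def as_list(clauses):
    """Accept a single clause, a list of clauses, or None."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def bool_matches(doc, bool_query):
    must = as_list(bool_query.get("must"))
    must_not = as_list(bool_query.get("must_not"))
    should = as_list(bool_query.get("should"))
    if not all(term_matches(doc, c) for c in must):
        return False
    if any(term_matches(doc, c) for c in must_not):
        return False
    # should is optional when must is present; without must,
    # at least one should clause has to match.
    if should and not must:
        return any(term_matches(doc, c) for c in should)
    return True


docs = [
    {"id": 1, "tag": ["math"], "favorite": True, "unread": False},
    {"id": 2, "tag": ["math", "probability"], "favorite": True, "unread": True},
    {"id": 3, "tag": ["physics"], "favorite": False, "unread": True},
]
query = {
    "must": {"term": {"tag": "math"}},
    "must_not": {"term": {"tag": "probability"}},
    "should": [{"term": {"favorite": True}}, {"term": {"unread": True}}],
}
matching_ids = [doc["id"] for doc in docs if bool_matches(doc, query)]
print(matching_ids)
```

In the real engine the `should` clauses would additionally raise the score of documents that match them, which this toy version does not model.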
You can also change the score of documents. For example, a constant score query sets the relevance score of every matching document to the value given by `boost`. For example, this query doubles the default score of 1 for an exact match in `<field>`:
{
"query": {
"constant_score": {
"filter": {
"term": { "<field>": "<some query string>" }
},
"boost": 2
}
}
}
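The same request can be generated programmatically. This is a small sketch with a hypothetical helper name; the field `city` and its value are placeholders.

```python
def constant_score_query(filter_clause, boost=1.0):
    """Wrap a filter so that every matching document gets the score `boost`."""
    return {
        "query": {
            "constant_score": {
                "filter": filter_clause,
                "boost": boost,
            }
        }
    }


body = constant_score_query({"term": {"city": "Hamburg"}}, boost=2)
print(body)
```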
You can also (de)boost documents without excluding them from the search results with a boosting query, or boost matches in some fields more than in others. You can even apply ranking models like XGBoost for personalized search. But that is a subject for another blog post.
How to start querying?
1. Get to know the mapping of your index
Check the mappings of the index with
GET <indexname>/_mapping
You have to understand whether a string field was mapped as `keyword` or as `text`. If it is mapped as `keyword` (or another structured type), you can only search for exact matches in the exact way the field was indexed. You find an overview of all term-level queries in the ES documentation. If the field is mapped as `text`, it was processed by an analyzer at index time. Your query string will be processed by the same analyzer, so you have to figure out which analyzer is in place. Most analyzers tokenize and lowercase the text; some also remove stop words, include synonyms, or use stemming or lemmatization techniques. You find an overview of all full-text queries in the ES documentation.
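Once you have the mapping response, a few lines of Python can summarize which fields are `keyword` and which are `text`. The sketch below works on a hand-written sample that follows the response layout of Elasticsearch 7 and later (older versions nest the properties under a type name); the index and field names are invented.

```python
def field_types(mapping_response, index):
    """Return {field name: mapped type} for the top-level fields of an index."""
    properties = mapping_response[index]["mappings"].get("properties", {})
    # Object fields carry nested "properties" instead of an explicit "type".
    return {name: spec.get("type", "object") for name, spec in properties.items()}


# Hand-written stand-in for the JSON returned by GET my-index/_mapping.
sample_response = {
    "my-index": {
        "mappings": {
            "properties": {
                "city": {"type": "keyword"},
                "description": {"type": "text"},
                "published": {"type": "date"},
            }
        }
    }
}

types = field_types(sample_response, "my-index")
exact_fields = [name for name, kind in types.items() if kind == "keyword"]
print(types)
print(exact_fields)
```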
2. Decide what you need: all documents that match vs. most relevant documents
All documents: You just want to know whether there is a match or not. Then you can use `filter` queries, which do not compute a score. You would mostly use them on structured fields like `keyword`, date or boolean fields. Just wrap a `filter` clause around your queries inside a `bool` query.
{
"query": {
"bool": {
"filter": [
{ "term": { "<field>": "<value>" }},
{ "range": { "<field>": { "gte": "<date>" }}}
]
}
}
}
Documents ordered by relevance: Put your queries inside a `query` clause (“query context”) to automatically compute a relevance score, and optionally change the scores for reranking, for example with `boost`.
{
"query": {
"bool": {
"must": [
{ "match": { "<field_a>": "<value_a>"}},
{ "match": { "<field_b>": "<value_b>" }}
]
}
}
}
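Both decisions can also be combined in a single request: scored `match` clauses under `must` and unscored clauses under `filter`. The following sketch shows one way to assemble such a body; the helper name, field names and date are placeholders.

```python
def relevance_with_filters(matches, filters):
    """Score documents by the match clauses, restrict them by the filters."""
    return {
        "query": {
            "bool": {
                # query context: contributes to _score
                "must": [{"match": {field: text}} for field, text in matches.items()],
                # filter context: yes/no only, no score
                "filter": list(filters),
            }
        }
    }


body = relevance_with_filters(
    matches={"description": "harbour city"},
    filters=[
        {"term": {"country": "DE"}},
        {"range": {"published": {"gte": "2020-01-01"}}},
    ],
)
print(body)
```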
3. Decide how you want to query: querystring as is vs. analyzed querystring
Send query string as is without modification: You have to use a `term` query or another term-level query. For example, if you search for “Hello World!” with a term query, the query engine will check the inverted index for an exact match for “Hello World!”. “hello world!”, “Hello World” or “Hallo World!” would not be retrieved.
Send query string analyzed: For that choice, you have to use one of the full-text queries from the ES documentation. The query string will be processed by the same analyzer as the field in the index, i.e. it will be tokenized, standardized and filtered. For example, if you search for the phrase “Hello World!”, the query engine might check the inverted index for “hello” and “world” (and possibly “!”, depending on the analyzer you chose before).
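The difference between the two choices can be imitated in a few lines of Python. The `toy_analyze` function below is a deliberately simplified stand-in for a real analyzer (it only lowercases and splits on non-alphanumeric characters), so treat this as an illustration of the idea rather than of Elasticsearch's actual behavior.

```python
import re


def toy_analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


indexed_value = "Hello World!"
keyword_term = indexed_value                    # keyword field: stored as is
text_tokens = set(toy_analyze(indexed_value))   # text field: analyzed tokens


def term_hit(query_string):
    """Term-level lookup: the query string is compared without modification."""
    return query_string == keyword_term


def match_hit(query_string):
    """Full-text lookup: the query string is analyzed, any token may match."""
    return any(token in text_tokens for token in toy_analyze(query_string))


for candidate in ["Hello World!", "hello world!", "HELLO", "Hallo"]:
    print(candidate, term_hit(candidate), match_hit(candidate))
```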
Overview
- `query`: main query container
- `bool`: compound query container for `must` (“AND”), `must_not` (“NOT”) and `should` (“OR”)
- `filter`: filter container for strict filtering. Each leaf query inside it won’t contribute to the score of the matching documents.
- `match`: a full-text query, meaning the text will pass through the analyzer and be transformed.
- `term`: a term-level query. The text won’t pass through the analyzer and will be sent as is to the search engine.
- Apply expert knowledge by boosting with `constant_score` or by boosting individual fields with the `^` operator, e.g. `["<field1>^2", "<field2>^3"]`.
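The `^` notation from the last bullet is typically used in the `fields` list of a `multi_match` query. As a closing sketch, this hypothetical helper turns a mapping of field names to boost factors into such a query; the field names and search text are invented.

```python
def boosted_multi_match(text, field_boosts):
    """Search several fields at once, weighting each with field^boost."""
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = boosted_multi_match("harbour city", {"title": 2, "description": 3})
print(body["query"]["multi_match"]["fields"])
```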