Elasticsearch is often the storage engine of choice for storing and querying full text data. But writing an ElasticSearch query is pretty different compared to querying a relational database in SQL. In this blogpost, you will learn some basics you need to understand before working with ElasticSearch. In the second part, you learn how to write queries in ElasticSearch. 

ElasticSearch uses many of the same concepts as your SQL Database. The terminology is just a little different. The following table gives an overview

Elastic Search RDBMS-Equivalent
Index Database
Type Table
Document Row
Field Column
Mapping Schema
Shards partitions

An index (“database”) stores documents (“rows in a tabe”) and has per default 5 shards (“partitions”). All documents (“rows”) given a type (“table”) in an index (“database”) have the same mapping (“schema”).  Documents (“rows”) are stored in JSON-Format (you might know this from the complex data type struct in Hive for example). In a process called mapping, you define how a document, and the fields (“columns”) it contains, are stored and indexed. So the mapping is similar to the schema of a relational database. There are two options for string mappings:

  • exact values (keyword): keyword fields are useful for structured content such as city names, ids or email addresses. The whole string is indexed and retrieval is based on exact matches. So “Hamburg” is not the same as “hamburg”. There is no option for partial matches. You would not choose a keyword field for full-text data.

  • Full text a.k.a. analyzed text(text): As the name implies, text data in this field will be analyzed.  There are various built in analyzers. Most analyzer convert the text into a set of tokens, perform some kind of standardization (lowercase, stemming, lemmatization etc.) and add the field to the inverted index so it is searchable quickly (You can checkout my blogpost. on the inverted index and scoring, if you want to know more).  By default, Elasticsearch sorts the results by relevance score. The higher the score the higher the rank of the document. This relevance _score is returned as a metafield. There are different query types that can modify the scores and customize the order of search results. Speaking of queries: Let’s see how to write queries in ElasticSearch.

Query Tools

You have multiple options. For example:

  • You can query the index directly with curl.
  • You can use interfaces such as Kibana, which also provide some nice vizualization options. This one might come the most natural to you if you are used to using SQL-Clients.
  • You could query from python using the ElasticSearch Client

Using ElasticSearch Query DSL

Instead of using SELECT FROM WHERE syntax, you will write your query in json format. This is the basic anatomy of an elastic search query. 

{
“query”: { ... } 
“sort”:  { ... } 
”from”:  { ... } 
“size”:  { ... } 
}

So what does that mean?

  • query: is just a wrapper for your ES query.
  • sort: is your ORDER BY. By default, it sorts by descending _score
  • from: is your OFFSET.
  • size: is your LIMIT.

The query clause is mandatory. Sorting, offsetting and limiting your search results is optional. So let’s dive deeper into the queries: There are two different kinds of query clauses. 

  • Leaf query clauses Leaf query clauses look for a particular value in a particular field. These queries can be used by themselves. Examples are term and range. We’ll see examples in a minute.

  • Compound query clauses Compound query clauses wrap leaf or other compound queries and therefore combine multiple queries (for example with boolean operators (must, must_not, should) or by boosting specific matches with altering the _score. We’ll get into some examples soon.

Leaf query clauses

Filtering with match and term

The term and terms query clauses are used for “strict filters”. You apply those queries to keyword (exact value) fields. The field then must exactly match the field value, which includes whitespace and capitalization. It’s like a simple WHERE clause in SQL without regexes. There is an analyzer applied. You can filter by single or multiple values. For example, this query finds all documents in field <<field>> with value <value>

{ "term": { "<field>": "<value>" }}

while this query retrieves all documents with <value_a> or <value_b>.

{ "terms": { "<field>": [""<value_a>"", "<value_b>"] }}

The match query clause is the most generic and commonly used query clause. You can use it for text and keyword fields. It automatically figures out what to do: If you apply match on a keyword field, it will work as a filter and return documents with exact matches. If you use match on a text field it will perform an analyzed search, meaning that it applies the same analyzer as the field mapping, if you do not explicitly choose a different analyzer. You can write a match query as follows: 

{ "match": { "<field name>": "<text to be matched>" }}

Compound query clauses

The bool query clause is an example of a compound query clause, as it is used to combine multiple query clauses using boolean operators. Just wrap the bool around your operators must (“AND”), must_not (“NOT”) and should (“OR”). The scores from must, must_not, and should will be added together to provide the final score.

{
    "bool": {
        "must":     { "term": { "tag":    "math" }},
        "must_not": { "term": { "tag":    "probability"  }},
        "should": [
                    { "term": { "favorite": true   }},
                    { "term": { "unread":  true   }}
        ]
    }
}

You can also change the score of documents. For example a constant score query sets the relevance score equal to a specified value given by boost. For example this query doubles the default score of 1 for an exact match in <field>.

{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "<field>": "<some query string>" }
      },
      "boost": 2
    }
  }
}

You can also (de)boost documents without excluding them from the search results with a boosting query or boost matches in some fields more than in others. You can even apply ranking models like XGBoost for personalized search. But this is subject for another blogpost.

How to start querying?

1. Get to know the mapping of your index

Check the mappings of the index with

GET <indexname>/_mapping

You have to understand, if the text field was mapped as a keyword  or as text. If it is mapped as keyword (or other structured fields) you can only search for exact matches in the exact way the field was indexed. You find an overview for all term-level queries in the ES documentation. If the field is mapped as text it means it was modified by an analyzer. Your query string will be processed by the same analyzer (You have to figure out which analyzer is in place. Most analyzers tokenize and remove stop words. Some analyzers include synonyms, some use stemming or lemmatization techniques). You find an overview of all full text queries in the ES documentation

2. Decide what you need: all documents that match vs. most relevant documents

All documents: You just want to know, if there is a match or there isn’t. Then you can use filter queries, which do not compute a score. You would mostly use them on structured fields like keyword, date or boolean fields. Just wrap a filter clause(s) around  your queries.

	{
			"filter": [ 
	        { "term":  { "<field>": "<value>" }},
	        { "range": { "<field>": { "gte": "<date>" }}}
	}

Documents ordered by relevance: Put your queries inside a query clause (“Query Context”) to automatically compute relevance score and maybe change the scores for reranking for example with boost.  

	{
	  "query": { 
	    "bool": { 
	      "must": [
	        { "match": { "<field_a>": "<value_a>"}},
	        { "match": { "<field_b>": "<value_b>" }}
	      ],
	    }
	  }
	}
 

3. Decide how you want to query: querystring as is vs. analyzed querystring

Send query string as is without modification: You have to use a term query or other term-level queries. For example, if you search for “Hello World!” with a term query, the query engine will check the inverted index for an exact match for “Hello World!”. “hello world!”, “Hello World” or “Hallo World!” would not be retrieved. 

Send query string analyzed: For that choice, you have to use the full text queries in the ES documentation. The querystring will process through the same analyzer as the field in the index, i.e. it will be tokenized, standardized, filtered. For example, if you search for the phrase “Hello World” then the query engine might check the inverted index for “hello” and “world” and ("!") (depending on the analyzer you chose before).

Overview

  • query— main query container
  • bool — compound query container for must (“AND”), must_not (“NOT”) and should (“OR”)
  • filter — filter container for strict filtering: Each leaf query inside it won’t contribute to the score of the matching documents
  • match — this is a full-text query, meaning the text will pass through the analyzer and transformed.
  • term — this is a term level query. The text won’t pass through the analyzer and will be sent as is to the search engine.
  • apply expert knowledge by boosting with constant_score or bosting .(by^ operator, e.g. ["<field1>^2", "<field2>^3]").