Searching through full text fields with regexes in relational database systems like PostgreSQL or MySQL is painful: The query latency is high and your results will be unordered, so you have no idea how relevant your query results are. Elasticsearch is often the storage engine of choice for storing and querying full text data. In ElasticSearch querying fulltext fields is among the least resource intensive tasks and your query results are ordered putting the most relevant results on top. But how does this work?
Elasticsearch uses an “inverted index”. The most obvious way to build up an index, is to store a list of all terms for each document that we are indexing (forward indexing):
|Elephant||mammal, largest, existing, animal, endangered, species, african, asian …|
|Turtle||reptile, shell, special, bony, cortilaginous, shell, ribs, shield, species…|
|Hippopotamus||herbivorous, mammal, species, african …|
You can imagine, this appoach is pretty fast when indexing. But it is not very efficient when querying terms as this requires the search engine to look through all entries in the index for a specfic term in order to return all documents containing this term.
For a faster querying, you would have to sort the index by terms in the document. For the example above this would mean, to create a mapping of terms to documents instead of documents to terms.
|mammal||Elephant, Hippopotamus, …|
|species||Elephant, Turtle, Hippopotamus, …|
This appoach is called an inverted index, because it is an inversion of the forward index. And this is basically what ElasticSearch does: In a first analysis step, it splits documents into terms. In a second steps it sorts the terms through relevance scoring. Let’s have a close look at the steps:
Step 1: Analysis: The documents are split into terms (Tokenization). Then Elasticsearch applies normalizers and transformations to these terms. You can find a list of all available analyzers here. The idea is to make terms in documents more general. E.g. in the sentences “he analyzes the data” and “Analyze the data!” the word “analyze” should be identified as the same word. Elasticsearch has a default analyzer which applies the following steps
- tokenize text into individual terms
- removes punctuation
- convert terms to lower case
- removes stopwords
Let’s take the following sentence “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
[The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone, .]
- Remove Punctuation:
[The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone]
[the, 2, quick, brown, foxes, jumpe, d, over, the, lazy, dog's, bone]
- Remove Stopwords :
[2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone]
Different analyzers apply compounding rules (if this is relevant in the language your having your text data in), split words into characters or remove stopwords. You can also build your own analyzer. Depending on your use case, you have to choose your analyzer. The analysis step results in a unique list of terms/tokens and an associating list of documents, where the word can be found.
Step 2: Relevance through scoring: The second step is about figuring out, how “good” a search term matches a document. The fit of the match is represented by
_score: the higher the score, the more relevant it is. The default scoring method uses Lucene’s Practical scoring function, which calculates the score based on
- Term Frequency: How often does a term occur in a document?
- Document Frequency: In how many documents does this term occur?
- Field Length: How many terms are in a document?
The more often a term appears in a document, the more relevant is a term. However, if a term appears a lot in all documents (e.g. stopwords like “the” or “and”, or very common words), the term may be less relevant. The inverse document frequency accounts for that. In addition, a term is less relevant, if a field contains more terms, i.e. is longer. Therefore, the score is normalized by taking into account the field length.
|Part of the score||Interpretation||How is it calculated|
|Term Frequency||higher TF -> higher relevance||tf(t in d) = √frequency|
|Inverse Document Frequency||higher IDF –> lower relevance||idf(t) = 1 + log ( numDocs / (docFreq + 1))|
|Field Length||shorter field -> higher relevance||norm(d) = 1 / √numTerms|
Putting this together, the relevance score is calculated by
tf * idf * norm(d)
You are free to create your own scoring functions and apply boosting and weighting functions.
Let’s summarize: An analyzer converts the text into a set of terms, which are added to the inverted index. The inverted index maps terms to documents containing the term. When searching, we do not search through the documents themselves, but through the inverted index. Terms are weighted by a score and the documents with the highest score are returned.