Deconstructing Scoring In Elasticsearch
Exploring the basics of relevance scoring in Elasticsearch and Lucene

This article explains the basics of relevance scoring in Elasticsearch (ES). Since Elasticsearch is built on Lucene, we will first look at the classic TF-IDF (Term Frequency-Inverse Document Frequency) algorithm and then at the BM25 similarity, which has been the default similarity algorithm since Lucene 6.0.
Introduction
Simply put, relevance ranking is the process of sorting document results so that the documents most likely to be relevant to the query appear at the top. The relevance of documents is an integral aspect of search engines because, as end-users, we want results that are closely associated with our intended query. Information Retrieval as a field contributes heavily to this process.
Elasticsearch uses one such scoring mechanism to rank the documents we see when running a query: the higher the score, the more relevant the document is to the intended query.
Let’s dive deep into the methods
In order to score the documents, Elasticsearch’s first step is to get the subset of the documents that match the query. This is achieved in a binary fashion. A document can either match our query or not. Yes or No. True or False.
Once this subset of documents is retrieved, the task of scoring them based on their relevance begins. The score of a document is broadly a function of the fields matched from the query and of any additional modifications to scoring, such as boosting.
TF-IDF : Classic Method
As specified earlier, Elasticsearch is based on Lucene, so it primarily uses the latter’s scoring function. This was the default method before Lucene 6.0. Lucene’s practical scoring formula is mainly built on the classic term frequency and inverse document frequency concepts from information retrieval, combined with a few additional factors.
Lucene’s practical scoring formula:
score(q,d) =
      queryNorm(q)
    · coord(q,d)
    · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )   summed over each term t in q
Where :
- q : query
- d : document
- t : term
So, effectively score(q,d) means that we are trying to calculate the relevance score of document d for query q.
Let's understand what each of the terms means:
- queryNorm(q) : The query normalization factor is an attempt to normalize a query so that the results from one query may be compared with the results of another. It is calculated as :
queryNorm = 1 / √sumOfSquaredWeights
- coord(q,d) : The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.
Consider a query that searches for documents containing the terms ‘hundred years later’, where the weight of each term is 2. Without coordination, the score would simply be the sum of the weights of the terms present in the document, i.e. if the document contains ‘hundred years’, the score would be 4.
The coordination factor multiplies the score by the number of matching terms in the document and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:
* Document with hundred → score: 2.0 * 1 / 3 = 0.66
* Document with hundred years → score: 4.0 * 2 / 3 = 2.66
* Document with hundred years later → score: 6.0 * 3 / 3 = 6
- tf(t in d): Term frequency, as the name suggests, is a measure of how often the term appears in the document. It is a positively correlated factor: the more often the term occurs, the higher the weight. The term frequency is calculated as:
tf(t in d) = √frequency
- idf : Inverse Document Frequency: How often does the term appear in all documents in the collection? The more often, the lower the weight. This is calculated as :
idf(t) = 1 + log ( numDocs / (docFreq + 1))
- norm: How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field length norm is calculated as follows:
norm(d) = 1 / √numTerms
The field-length norm (norm) is the inverse square root of the number of terms in the field.
- getBoost : Query-time boosting is a tool that can be used to tune the relevance of the documents according to our use cases.
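To see how these factors fit together, here is a minimal Python sketch of the classic formula. It is an illustration under simplifying assumptions (real Lucene encodes norms with limited precision and gathers statistics per shard); the function name classic_score and the toy statistics below are mine, not Lucene’s.

```python
import math

def classic_score(query_terms, doc_terms, num_docs, doc_freqs, boosts=None):
    """A simplified, illustrative take on Lucene's classic practical scoring formula."""
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc_terms]

    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    def idf(t):
        return 1.0 + math.log(num_docs / (doc_freqs.get(t, 0) + 1))

    # queryNorm = 1 / sqrt(sumOfSquaredWeights); here weight(t) = idf(t) * boost(t)
    sum_of_squared_weights = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)

    # coord(q,d) = number of matching terms / total number of query terms
    coord = len(matched) / len(query_terms)

    # norm(t,d) = 1 / sqrt(number of terms in the field)
    field_norm = 1.0 / math.sqrt(len(doc_terms))

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))  # tf(t in d) = sqrt(frequency)
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * field_norm
    return query_norm * coord * total

# Toy example: a 3-term query against a document containing two of the terms
doc = "hundred years ago something happened".split()
print(classic_score(["hundred", "years", "later"], doc,
                    num_docs=1000, doc_freqs={"hundred": 30, "years": 120, "later": 80}))
```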
BM25 Similarity : Current default algorithm
The classic TF-IDF method of scoring was left behind for a better alternative, the BM25 algorithm (BM stands for Best Matching), which has been the default similarity algorithm since Lucene 6.
The version of ES used in this article is 7.3, which uses Lucene 8.1 under the hood, so it only makes sense that we understand the BM25 similarity as well.
For this purpose, I have indexed 5000 documents related to Shakespeare using the Shakespeare dataset for Elasticsearch.
The index ‘shakespeare’ has 5 shards, with the 5000 documents distributed across them; the shard used in the examples below, [shakespeare][0], holds 1031 of them.
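In my setup this distribution was captured from Kibana. If you want to inspect it yourself, one way (a sketch assuming the official elasticsearch Python client and a locally running cluster) is the cat shards API:

```python
from elasticsearch import Elasticsearch

# Assumes a local single-node cluster with the 'shakespeare' index already loaded
es = Elasticsearch("http://localhost:9200")

# One row per shard, showing its document count ('h' selects the columns)
print(es.cat.shards(index="shakespeare", h="index,shard,prirep,docs", v=True))
```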
Given a query Q with keywords q₁ ... qₙ, the BM25 similarity score of a document D is defined as :
score(D,Q) = ∑ IDF(qᵢ) · f(qᵢ,D) · (k₁ + 1) / ( f(qᵢ,D) + k₁ · (1 - b + b · |D| / avgdl) ), summed over the query terms qᵢ
Where:
- f(qᵢ,D) : term frequency of qᵢ in document D
- |D| : length of document D in words
- k₁ : term saturation parameter. This parameter controls how quickly an increase in term frequency results in term-frequency saturation. The default value is 1.2. Lower values result in quicker saturation, and higher values in slower saturation.
- b : length normalization parameter. This parameter controls how much effect field-length normalization should have. A value of 0.0 disables normalization completely, and a value of 1.0 normalizes fully. The default is 0.75
- avgdl : average document length in the text collection
- IDF : inverse document frequency, defined as :
IDF(qᵢ) = log(1 + (N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5))
- n(qᵢ) : number of documents containing qᵢ
- N : total number of documents in the collection
For the calculations that follow, we will refer to the fraction freq / (freq + k₁ * (1 - b + b * dl / avgdl)) as tf.
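Putting these definitions together, here is a small Python sketch of the BM25 score as described above; a simplified illustration with the default k₁ = 1.2 and b = 0.75, not Lucene’s actual implementation (which, among other things, gathers its statistics per shard).

```python
import math

def bm25_score(query_terms, doc_terms, all_docs, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query, per the BM25 formula above."""
    N = len(all_docs)                                # total number of documents
    avgdl = sum(len(d) for d in all_docs) / N        # average document length
    dl = len(doc_terms)                              # |D|, length of this document

    score = 0.0
    for q in query_terms:
        n = sum(1 for d in all_docs if q in d)       # n(q): documents containing the term
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
        freq = doc_terms.count(q)                    # f(q, D): term frequency in D
        tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
        score += idf * tf * (k1 + 1)
    return score

# Toy example
docs = [line.split() for line in ["a hundred knights and squires",
                                  "some hundred years later",
                                  "exeunt all"]]
print(bm25_score(["hundred"], docs[1], docs))
```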
All the following queries have been run on Kibana, a part of the ELK stack.

Let’s deconstruct these one by one using the above Shakespeare example and search for documents where ‘text_entry’ contains the word ‘hundred’:
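The query itself appeared as a Kibana Dev Tools request in the original article. An equivalent request through the elasticsearch Python client (a sketch; the client, local URL, and exact request body are my assumptions) looks like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster with the shakespeare index

# Match query: documents whose text_entry field contains the word 'hundred'
response = es.search(index="shakespeare", body={
    "query": {"match": {"text_entry": "hundred"}}
})
print(response["hits"]["total"], response["hits"]["max_score"])
```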
On running this, we get 11 hits, with the maximum score belonging to the document with _id: 4932.
Let’s now make use of the Explain API to understand and calculate the score:
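Again, the original request was run in Kibana; a sketch of the equivalent call with the Python client (same assumptions as above) would be:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ask Elasticsearch to explain how document _id 4932 is scored for the same match query
explanation = es.explain(index="shakespeare", id=4932, body={
    "query": {"match": {"text_entry": "hundred"}}
})
print(explanation["explanation"])
```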
The output shows the score as the product of a boost, an idf component, and a tf component.
Let’s understand how we have arrived at each figure in this formula:
- Term Frequency (tf):
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
In our case, the values for the document with _id 4932 are:
1. freq = 1 since ‘hundred’ occurs only once in this document.
2. k₁ = 1.2; default value
3. b = 0.75; default value
4. dl : 6; since there are six words in the text_entry field in this document
5. avgdl : 7.583899; this is the average length of the text_entry field in shard [shakespeare][0]
Plugging the values into the formula, we arrive at tf = 0.49700928.
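As a quick sanity check, plugging these values into the formula in Python reproduces the figure from the Explain API output:

```python
freq, k1, b, dl, avgdl = 1, 1.2, 0.75, 6, 7.583899
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
print(tf)  # ≈ 0.49700928, matching the explain output
```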
Tip: It’s worth noting that tuning b and k₁ is generally not the first thing to try when search results are not as relevant as expected. The default values of b = 0.75 and k₁ = 1.2 work well for most corpora, so you’re likely fine with the defaults. Boosting, synonyms or fuzziness should be explored before altering b and k₁.
- Inverse Document Frequency (idf) : It is calculated as :
idf = log(1 + (N - n + 0.5) / (n + 0.5))
In our case, N = 1031 (in shard [shakespeare][0]) and n = 1; we arrive at idf = 6.5337887.
- The ‘boost’ value of 2.2 in the explain output is actually the value of (k₁+1), as specified in the BM25 formula.
So, the final score in our case for this document is : tf * idf * (k₁+1) = 0.49700928 * 6.5337887 * 2.2 = 7.144178
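A couple more lines of Python confirm the idf value and the final score reported by the Explain API:

```python
import math

tf = 0.49700928                                    # from the previous snippet
idf = math.log(1 + (1031 - 1 + 0.5) / (1 + 0.5))   # ≈ 6.5337887
print(tf * idf * (1.2 + 1))                        # ≈ 7.144178
```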
This is how the exact score is calculated under the hood.
If we provide additional boosting, then the entire score is multiplied by that boost to arrive at the final score. Consider the following query:
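The boosted query was shown in Kibana in the original; the sketch below assumes a query-time boost of 0.05 on the match clause, which is consistent with the final score reported next (7.144178 * 0.05 = 0.3572089).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Same match query, with an assumed query-time boost of 0.05 for illustration
response = es.search(index="shakespeare", body={
    "query": {
        "match": {
            "text_entry": {"query": "hundred", "boost": 0.05}
        }
    }
})
print(response["hits"]["max_score"])
```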
Here, final score = score * additional boost = 0.3572089. As is apparent, boosting directly scales the final score.
Summing Up
After this article, I hope the basics of the two scoring mechanisms employed by Elasticsearch at different junctures are clear. As demonstrated above, the deeper we go into boosting and relevance tuning for advanced ES queries, the more apparent the use of each parameter becomes. Other points to consider while designing search queries are:
- The use of synonyms in match queries as per requirements
- The use of stemming or fuzziness to assist with phonetic or lingual differences
- The use of boosting or adding constant scores
- The use of function score for decaying score and relevance of older documents