Beginner's guide to querying Elasticsearch (Scoring vs Sorting)

Background

Elasticsearch is an open source, highly scalable search and analytics engine. The Search API in Elasticsearch is very flexible and can scale to petabytes of data. We will discuss how to query Elasticsearch and introduce the concept of relevance. We will cover the following here:

  • Different types of queries
  • Querying Elasticsearch
  • Relevance

Objective

In this article, Abhishek Andhavarapu, the author of the book Learning Elasticsearch, shows how to query Elasticsearch. You will learn how full-text search differs from structured search, and the difference between sorting and scoring. The query language is very expressive and can be used to define filters, queries, sorting, pagination, and aggregations in the same query.

Different types of queries

Elasticsearch queries are executed using the Search API. As with everything else in Elasticsearch, requests and responses are represented as JSON.

Queries in Elasticsearch at a high level are divided as follows:

  • Structured queries: Structured queries are used to query numbers, dates, statuses, and so on. These are similar to the queries supported by a SQL database. For example, checking whether a number or date falls within a range, or finding all the employees with John as the first name
  • Full-text search queries: Full-text search queries are used to search text fields. When you send a full-text query to Elasticsearch, it first finds all the documents that match the query, and then ranks the documents based on how relevant each one is to the query. We will discuss relevance in detail in the Relevance section

Both structured and full-text search queries can be combined while querying. In the next section, we will describe the overall structure of request and response.

Querying Elasticsearch

One of the most powerful features of Elasticsearch is the Query DSL (Domain Specific Language), or the query language. If you are familiar with SQL, the following table shows the equivalent terms in Elasticsearch:

SQL:            Database   Table   Row        Column
Elasticsearch:  Index      Type    Document   Field

To execute a search query, an HTTP request should be sent to the _search endpoint. The index and type on which the query should be executed are specified in the URL. Both are optional. If no index or type is specified, Elasticsearch executes the request across all the indexes in the cluster. A search query in Elasticsearch can be executed in two different ways:

  • By passing the search request as query parameters.
  • By passing the search request in the request body.

A simple search query using query parameters is shown here:

GET http://127.0.0.1:9200/chapter6/product/_search?q=product_name:jacket

Simple queries like the one above can be executed using URL query parameters. Anything other than a simple query should be passed in the request body. The preceding query, when passed as a request body, looks like the following:

POST http://127.0.0.1:9200/chapter6/product/_search 
{
   "query": {
     "term": {
       "product_name" : "jacket"
     }
   }
 }

The preceding query is executed on the chapter6 index and the product type. The query can also be executed on multiple indexes and types at the same time, as shown here:

POST http://127.0.0.1:9200/chapter5,chapter6/product,product_reviews/_search 
{
   "query": {
     "term": {
       "product_name" : "jacket"
     }
   }
 }

The HTTP verb we used in the preceding example for the _search API is POST. You can also use GET instead of POST. Because most browsers do not support sending a request body with GET, we used POST.
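The URL patterns shown above can also be assembled programmatically. The following Python sketch is illustrative only: the helper name build_search_url and its parameters are hypothetical, and it only builds the URL string rather than contacting a cluster.

```python
from urllib.parse import urlencode


def build_search_url(host, indexes=(), types=(), q=None):
    """Build a _search URL. Empty indexes/types mean 'search everything'."""
    parts = [host.rstrip("/")]
    if indexes:
        parts.append(",".join(indexes))
        # This sketch only adds types when at least one index is given.
        if types:
            parts.append(",".join(types))
    parts.append("_search")
    url = "/".join(parts)
    if q is not None:
        # Simple queries can be passed as the q query parameter (URL-encoded).
        url += "?" + urlencode({"q": q})
    return url


host = "http://127.0.0.1:9200"
all_indexes_url = build_search_url(host)
multi_url = build_search_url(
    host, ["chapter5", "chapter6"], ["product", "product_reviews"]
)
simple_url = build_search_url(
    host, ["chapter6"], ["product"], q="product_name:jacket"
)
print(all_indexes_url)
print(multi_url)
print(simple_url)
```

Note that urlencode escapes the colon in the q value as %3A; the server decodes it back to product_name:jacket.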

The basic structure of the request body is shown here:

{ 
   "size" : // The number of results in the response. Defaults to 10.
 
   "from" : // The offset of the first result. Defaults to 0. For example, to get the third page with a page size of 20, set size to 20 and from to 40.
 
   "timeout" : // A timeout after which partial results are sent back in the response. By default, there is no timeout. If the request times out, timed_out is set to true in the response.
 
   "_source" : // The fields that should be included in the response. For example: "_source" : ["product_name", "description"].
  
   "query" : {
      // Query 
   }
 
   "aggs" : {
      // Aggregations
   }
 
   "sort" : {
      // How to sort the results
    }
 }
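To make the relationship between size and from concrete, here is a minimal Python sketch that turns a 1-based page number into a request body. The function name build_search_body and its parameters are hypothetical and exist only for illustration; the body keys are the ones described above.

```python
import json


def build_search_body(query, page=1, page_size=10, source_fields=None, timeout=None):
    """Assemble a search request body. Page numbers start at 1."""
    if page < 1 or page_size < 0:
        raise ValueError("page must be >= 1 and page_size must be >= 0")
    body = {
        "size": page_size,
        # Skip all the results that belong to the earlier pages.
        "from": (page - 1) * page_size,
        "query": query,
    }
    if source_fields is not None:
        body["_source"] = list(source_fields)
    if timeout is not None:
        body["timeout"] = timeout
    return body


# Third page with a page size of 20: size is 20 and from is 40.
body = build_search_body(
    {"term": {"product_name": "jacket"}},
    page=3,
    page_size=20,
    source_fields=["product_name", "description"],
)
print(json.dumps(body, indent=2))
```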

The structure of the response body is shown here:

{
  "took": // Time Elasticsearch took to execute the query. 
 
  "timed_out": // Whether the query timed out. By default, there is no timeout.
 
 // Elasticsearch doesn't fail the request if some shards don't respond or are not available. The response will contain partial results.
 
  "_shards": { 
    "total": // Number of shards the query needs to be executed on.
    "successful": // Number of shards the query succeeded on.
    "failed": // Number of shards the query failed on.
  },
  
  "hits": {
   "total": // Total number of hits
   "max_score": // Maximum score of all the documents
   "hits": [
      // Actual documents.
    ]
   }
 }
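Because a response can contain partial results, client code usually checks timed_out and the _shards section before trusting the hits. The following Python sketch shows one way to do that. The function name summarize_response is hypothetical, and the sample response is made up for illustration; it only mirrors the structure shown above.

```python
def summarize_response(response):
    """Pull out the parts of a search response a caller usually needs."""
    shards = response["_shards"]
    # Results are partial if the query timed out or any shard failed.
    partial = response["timed_out"] or shards["failed"] > 0
    hits = response["hits"]
    return {
        "partial": partial,
        "total": hits["total"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


# Made-up response: one of five shards failed, so the results are partial.
sample = {
    "took": 5,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 4, "failed": 1},
    "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [{"_id": "2", "_score": 0.2876821}],
    },
}
summary = summarize_response(sample)
print(summary)
```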

Sample data

To better explain the various concepts in this article, we will use an e-commerce site as an example. We will create an index with a list of products. This will be a very simple index called chapter6 with a type called product. The mapping for the product type is shown here:

#Delete existing index if any
DELETE http://127.0.0.1:9200/chapter6
 
#Mapping
PUT http://127.0.0.1:9200/chapter6
{
  "settings": {},
  "mappings": {
    "product": {
      "properties": {
        "product_name": {
          "type": "text",
          "analyzer": "english"
        },
        "description": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

Let’s index some product documents:

#Index Documents
PUT http://127.0.0.1:9200/chapter6/product/1
{
  "product_name": "Men's High Performance Fleece Jacket",
  "description": "Best Value. All season fleece jacket",
  "unit_price": 79.99,
  "reviews": 250,
  "release_date": "2016-08-16"
} 

PUT http://127.0.0.1:9200/chapter6/product/2
{
  "product_name": "Men's Water Resistant Jacket",
  "description": "Provides comfort during biking and hiking",
  "unit_price": 69.99,
  "reviews": 5,
  "release_date": "2017-03-02"
} 

PUT http://127.0.0.1:9200/chapter6/product/3
{
  "product_name": "Women's wool Jacket",
  "description": "Helps you stay warm in winter",
  "unit_price": 59.99,
  "reviews": 10,
  "release_date": "2016-12-15"
}

We will refer to the preceding three documents for the remainder of this article.

Basic query (finding the exact value)

The basic query in Elasticsearch is the term query. It is very simple and can be used to query numbers, booleans, dates, and text. The term query is used to look up a single term in the inverted index.

A simple term query looks like the following:

POST http://127.0.0.1:9200/chapter6/product/_search 
{
   "query": {
     "term": {
       "product_name" : "jacket"
     }
   }
 }
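To see what looking up a single term in the inverted index means, consider the following Python sketch. It is a toy model, not how Elasticsearch is implemented: the function names are hypothetical, and toy_analyze only lowercases and splits on whitespace, which is far simpler than the english analyzer used in the mapping. It does illustrate one real property of the term query: the search term is not analyzed, so it must match an indexed token exactly.

```python
from collections import defaultdict


def toy_analyze(text):
    """Toy analyzer: lowercase and split on whitespace (not the english analyzer)."""
    return text.lower().split()


def build_inverted_index(docs, field):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in toy_analyze(doc[field]):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    # The term is looked up exactly as given; it is not analyzed first.
    return sorted(index.get(term, set()))


docs = {
    "1": {"product_name": "Men's High Performance Fleece Jacket"},
    "2": {"product_name": "Men's Water Resistant Jacket"},
    "3": {"product_name": "Women's wool Jacket"},
}
index = build_inverted_index(docs, "product_name")
print(term_lookup(index, "jacket"))  # ['1', '2', '3']
print(term_lookup(index, "Jacket"))  # [] because indexed tokens are lowercase
```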

The term query works great for a single term. To query more than one term, we have to use the terms query. It is similar to the IN clause in a relational database. If the document matches any one of the terms, it’s a match. For example, suppose we want to find all the documents that contain jacket or fleece in the product name. The query will look like the following:

POST http://127.0.0.1:9200/chapter6/_search 
{
   "query": {
     "terms": {
       "product_name" : ["jacket","fleece"]
     }
   }
 }

The response of the query is as follows:

{
   ....
   "hits": {
     "total": 3,
     "max_score": 0.2876821,
     "hits": [
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "2",
         "_score": 0.2876821,
         "_source": {
           "product_name": "Men's Water Resistant Jacket",
           "description": "Provides comfort during biking and hiking",
           "unit_price": 69.99,
           "reviews": 5,
           "release_date": "2017-03-02"
         }
       },
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "1",
         "_score": 0.2824934,
         "_source": {
           "product_name": "Men's High Performance Fleece Jacket",
           "description": "Best Value. All season fleece jacket",
           "unit_price": 79.99,
           "reviews": 250,
           "release_date": "2016-08-16"
         }
       },
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "3",
         "_score": 0.25316024,
         "_source": {
           "product_name": "Women's wool Jacket",
           "description": "Helps you stay warm in winter",
           "unit_price": 59.99,
           "reviews": 10,
           "release_date": "2016-12-15"
         }
       }
     ]
   }
 }
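The any-of behaviour of the terms query can be pictured as a union of posting lists. The Python sketch below assumes a hand-written postings dictionary (hypothetical data, not taken from a real index) and a hypothetical terms_lookup helper.

```python
def terms_lookup(postings, terms):
    """Return the IDs of documents that contain at least one of the terms."""
    matches = set()
    for term in terms:
        # Union: a document matches if it appears in any posting list.
        matches |= postings.get(term, set())
    return sorted(matches)


# Hand-written posting lists: token -> IDs of the documents containing it.
postings = {
    "jacket": {"1", "2", "3"},
    "water": {"2"},
    "wool": {"3"},
}
print(terms_lookup(postings, ["water", "wool"]))  # ['2', '3']
```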

Relevance

A traditional database usually contains structured data. A query on a database limits the data depending on different conditions specified by the user. Each condition in the query is evaluated as true/false, and the rows that don’t satisfy the conditions are eliminated. However, full-text search is much more complicated. The data is unstructured, or at least the queries are. We often need to search for the same text across one or more fields. The documents can be quite large, and the query word might appear multiple times in the same document and across several documents. Displaying all the results of the search will not help as there could be hundreds, if not more, and most documents might not even be relevant to the search.

To solve this problem, all the documents that match the query are assigned a score. The score is assigned based on how relevant each document is to the query. The results are then ranked based on the relevance score. The results on top are most likely what the user is looking for. In the next few sections, we will discuss how the relevance is calculated and how to tune the relevance score.

Let’s query the chapter6 index we created at the beginning of this article. We will use a simple term query to find jackets. The query is shown here:

POST http://127.0.0.1:9200/chapter6/_search
 {
   "query": {
     "term": {
       "product_name" : "jacket"
     }
   }
 }

The response of the query looks like the following:

{
   ....
   "hits": {
     "total": 3,
     "max_score": 0.2876821,
     "hits": [
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "2",
         "_score": 0.2876821,
         "_source": {
           "product_name": "Men's Water Resistant Jacket",
           "description": "Provides comfort during biking and hiking",
           "unit_price": 69.99,
           "reviews": 5,
           "release_date": "2017-03-02"
         }
       },
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "1",
         "_score": 0.2824934,
         "_source": {
           "product_name": "Men's High Performance Fleece Jacket",
           "description": "Best Value. All season fleece jacket",
           "unit_price": 79.99,
           "reviews": 250,
           "release_date": "2016-08-16"
         }
       },
       {
         "_index": "chapter6",
         "_type": "product",
         "_id": "3",
         "_score": 0.25316024,
         "_source": {
           "product_name": "Women's wool Jacket",
           "description": "Helps you stay warm in winter",
           "unit_price": 59.99,
           "reviews": 10,
           "release_date": "2016-12-15"
         }
       }
     ]
   }
 }

From the preceding response, we can see that each document contains a _score value. The scores of the three jackets are as follows:

ID  Product name                          Score
2   Men’s water-resistant jacket          0.2876821
1   Men’s high-performance fleece jacket  0.2824934
3   Women’s wool jacket                   0.25316024

We can see that the document with the ID 2 is scored slightly higher than documents 1 and 3. The score is calculated using the BM25 similarity algorithm. By default, the results are sorted using the _score values.

At a very high level, BM25 calculates the score based on the following:

  • How frequently the term appears in the document: term frequency (tf). The more often it appears, the higher the score
  • How common the term is across all the documents: inverse document frequency (idf). Rarer terms contribute more to the score
  • Documents that contain all or most of the query terms are scored higher than documents that don’t
  • The score is normalized by document length, so a match in a shorter document scores higher than the same match in a longer one

To learn more about how the BM25 similarity algorithm works, please visit https://en.wikipedia.org/wiki/Okapi_BM25.
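The factors listed above can be seen in a small Python sketch of the textbook BM25 formula for a single query term, in the form given on the linked page with the idf variant that adds 1 inside the logarithm. The function name bm25_term_score is hypothetical. The defaults k1=1.2 and b=0.75 are the defaults of the Elasticsearch BM25 similarity, but the sketch is a simplification: it does not reproduce Lucene details such as how document lengths are stored.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with textbook BM25."""
    # idf: terms found in fewer documents get a higher weight.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: above-average document length lowers the score.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # tf saturates: each extra occurrence adds less than the previous one.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * length_norm)
    return idf * tf_part


# One document in the statistics, containing the term once, at average length.
single = bm25_term_score(tf=1, doc_len=4, avg_doc_len=4, doc_count=1, doc_freq=1)
print(round(single, 7))

# Same term statistics, different lengths: the shorter document scores higher.
short = bm25_term_score(tf=1, doc_len=3, avg_doc_len=4, doc_count=3, doc_freq=3)
long_ = bm25_term_score(tf=1, doc_len=5, avg_doc_len=4, doc_count=3, doc_freq=3)
print(short > long_)
```

The first call evaluates to ln(4/3), about 0.2877, which is the same value as the top _score in the response above. That would be consistent with the document being scored against a shard that holds only that document, but the shard layout is not stated in this article, so treat it as an observation rather than an explanation.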

Not every query needs relevance. You can search for the documents that exactly match a value, such as status, or search for the documents within a given range. Elasticsearch allows combining both structured and full-text search in the same query. An Elasticsearch query can be executed in a query context or a filter context. In the query context, a relevance _score is calculated for each document matching the query. In a filter context, all the results that match the query are returned with a default relevancy score of 1.0; we will discuss more details in the next section.

Queries versus Filters

By default, when a query is executed, a relevance score is calculated for each result. When running a structured query (such as age equal to 50) or a term query on a non-analyzed field (such as gender equal to male), we do not need scoring because these queries simply answer yes or no. Calculating the relevance score for each result can be an expensive operation. By running a query in the filter context, we tell Elasticsearch not to score the results.

The relevance score calculated for a query applies only to that query and cannot be reused. As discussed in the preceding section, the score is based on term frequency (tf) and inverse document frequency (idf), so the results of scored queries are not cacheable. Filters, on the other hand, play no part in relevance and can be cached automatically. To run a query in the filter context, we wrap it in a constant_score query, as shown here:

POST http://127.0.0.1:9200/chapter6/_search
 {
   "query": {
     "constant_score": {
       "filter": {
         "term" : {
           "product_name" : "wool"
         }
       }
     }
   }
 }

The results of the preceding query are not scored, and every document has a score of 1. The query runs in the filter context and can be cached. We can also run the queries that need scoring in the query context and the others in the filter context, using the bool query to combine them, as shown in the following example:

#The match query runs in the query context; the range query, wrapped in constant_score, runs in the filter context
POST http://127.0.0.1:9200/chapter6/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "product_name": "jacket"
           }
         },
         {
           "constant_score": {
             "filter": {
               "range": {
                 "unit_price": {
                   "lt": 100
                 }
               }
             }
           }
         }
       ]
     }
   }
 }

In the preceding query, the match query is executed in the query context, and the range query is executed in the filter context.
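When request bodies are built in application code, the wrapping can be factored into small helpers. The following Python sketch is one way to do that; the function names in_filter_context and scored_and_filtered are hypothetical, and the sketch only builds the JSON body without sending it.

```python
import json


def in_filter_context(query):
    """Wrap a query in constant_score so it runs in the filter context."""
    return {"constant_score": {"filter": query}}


def scored_and_filtered(scored_query, filter_query):
    """Combine a scored query with an unscored filter using bool/must."""
    return {
        "query": {
            "bool": {
                "must": [scored_query, in_filter_context(filter_query)]
            }
        }
    }


body = scored_and_filtered(
    {"match": {"product_name": "jacket"}},
    {"range": {"unit_price": {"lt": 100}}},
)
print(json.dumps(body, indent=2))
```

The bool query also accepts a filter clause, whose queries run in the filter context without the constant_score wrapper. The sketch keeps the constant_score form to stay close to the example above.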
