Harness nuanced query features to efficiently index large datasets for complex searches

Elasticsearch provides scalable, RESTful, full-text search capability not available in traditional database solutions. Built on Apache Lucene, Elasticsearch indexes large datasets in an efficient manner in order to perform complex searches, as well as pagination, filtering, scoring, and sorting without much downtime. However, crafting queries can sometimes be frustrating. This article highlights some of the nuanced query features of Elasticsearch.

Let’s briefly cover some key terminology before diving into the details.

  • Index — A collection of documents. Think of an index as a database. In our example, it will be an inventory index.
  • Document — JSON objects within the index. Think of a document as a row in a database table. In our example, a document could be “apple,” “eggs,” or “pie crust.”
  • Mapping — Defines different data types for fields and how fields should be indexed and stored in Elasticsearch. You can think of this as a schema. “Date,” “boolean,” “long,” and “keyword” are defined here, as well as analyzers to determine how to index certain fields. This is where you would define special search behavior to implement something like typeahead functionality. Even though Elasticsearch will assign a default, it is best to double-check or pass in a strict mapping you define.

    All examples going forward will be in the context of a grocery wholesale inventory using the sample data below:

   {
  "_index": "inventory",
  "_type": "food",
  "_id": "fruit-citrus-sumo-orange",
  "_source":{
  "month": "2020-08",
  "type": "fruit",
  "kind": "citrus",
  "brand": "Cloud Farms",
  "product": "Sumo Orange",
  "popularity_score": 79,
  "price": 4.99,
  "priceUnit": "pound",
  "quantity": 10000,
  "sku": "123456789",
  "transactions: [
    {
      "date": "2020-08-04",
      "profit": 2310,
      "customer_id": "SDOG",
    },
    {
      "date": "2020-08-06",
      "profit": 620,
      "customer_id": "CLUW",
    },
    {
      "date": "2020-08-10",
      "profit": 1614,
      "customer_id": "PZJF",
    },
  ]
}

While not the best option for making complex search queries, it is useful for making straightforward queries in the browser.

Find all documents of type: fruit: http://localhost:9200/inventory/_search?q=type:fruit

Find all documents of type: fruit and has a brand that starts with the letter C: http://localhost:9200/inventory/_search?q=type:fruit AND brand:c*

Find all documents of type: fruit, has a brand that starts with the letter C, and a popularity score between 75 and 100: http://localhost:9200/inventory/_search?q=type:fruit AND brand:c* AND popularity_score:[75 TO 100]

2. Fetch a few fields only

To fetch only certain fields from the index (when generating a drop-down filter menu, for example), it may come with a small performance improvement. You can use:

{
    "_source": ["product", "brand"],
    "query": {},
    "size": 10
}

3. When to use filters vs. queries

Using filters and queries can accomplish the same thing, but they have a slight difference. Filters return non-ambiguous Boolean answer of yes or no if it satisfies the criteria. Some of these filters are cached, which will improve subsequent searches. Queries assign a score to the document that indicates how well it satisfies the criteria. This can be used for something like a search engine where you want to show the most relevant results first. Filters and queries can be used together. Filters are used for exact matches — ranges, exact text matching, Booleans. Queries are used for partial text matching.

4. Let’s talk about text vs. keyword

By default, the mapping on a string field looks like this:

brand: {
   type: "text",
   fields:
   {
     keyword: {
       type: "keyword",
       ignore_above: 256
     }
   }
},

It is mapped as text and keyword. That means you can perform full-text search of the brand name and do keyword search and aggregation by using the brand.keyword field.

This is a common error you will see if you are using text when you should be using keyword: “Fielddata is disabled on text fields by default. Set fielddata=true on [brand] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.”

5. Many kinds of aggregations

There are metrics aggregations, bucket aggregations, and pipeline aggregations. Within these categories, there are cardinality, stats (seen in the previous tip), date histogram, terms, range, top hits, geo bounds aggregations to name a few. You can also nest aggregations. We will cover several in the upcoming tips.

6. Find top 10

Elasticsearch aggregations are useful for creating responsive dashboards. Users can build out custom queries to execute. Oftentimes, people like to see eye-catching metrics, most popular product by x, quarterly breakdown of profits, top 10 services, sales numbers over time, total revenue this year, etc. Elasticsearch is ideal for those metrics:

aggs: {
   "quantity": { "sum": { "field": "quantity" }},
   "priceStats": { "stats": { "field": "price" } }
}

Using stats will return min, max, sum, count, and avg. You can also get them individually.

7. Break down a data set by type by month or vice-versa

Let’s say you want to show the quantity by brand by month. You would use the date histogram aggregation nested in a terms aggregation. The terms aggregation will group all results by brand and create a bucket per month:

"aggs": {
    "byBrand": {
    "terms": {
      "field": "brand.keyword",
      "order": { "_count": "desc" },
      "size": 500
    },
    "aggs": {
      "byMonth": {
        "date_histogram": {
          "field": "timestamp",
          "interval": "month", // explicit in ES7+ as calendar_interval and fixed_interval
          "min_doc_count" : 0
        },
        "aggs": {
          "quantity": { "sum": { "field": "quantity" }}
        }
      }
    }
  }
}

This data can go directly into a stacked bar chart, which will show what brands have the most quantity of items each month. Also note that order refers to the order of the buckets. By default, they are ordered by descending doc_count. In the example above, the data will be sorted alphabetically (i.e., Apple Farms, Cheese Cow, Forgotten Vegetables Frozen Food, Penguin Foods Inc, Yoo-gurt). The options include _term (or _key in later ES versions), _count, or a field. For example:

"brand": {
  "field": "brand.keyword",
  "order" : { "popularity_score" : "desc" },
  "size": 500
}

Here we are sorting by the popularity score. You can change the aggregations to get the opposite — by month, by product:

"aggs": {
  "byMonth": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "month",
      "format" : "yyyy-MM-dd"
    },
    "aggs": {
      "byBrand": {
        "terms": {
          "field": "brand.keyword",
          "size": 50
        }
      }
    }
  }
}

8. Find or aggregate within a nested array

Elasticsearch flattens documents. Using default query and aggregation syntax, you will be unable to do queries on inner objects. So from the sample data, you would be unable to add up the earnings in the the transactions object or find out how much customer CLUW spent in August. You must use the nested aggregations to solve this. You will need to delete the index (which will delete all the data), create a new mapping, and re-index the data to this new index. So transactions in the mapping should look like this:

"transactions" : {
  "type" : "nested",
  "properties" : {
     "purchase_date" : { "type" : "date" }
     "profit" : { "type" : "integer" },
     "customer_id" : { "type" : "text"}
  }
},

To add up profit for August and show a daily histogram:

"aggs": {
  "nested_data_object": {
    "nested": {
      "path": "transactions"
    },
    "aggs": {
      "profit_ytd": {
        "filter": {
          "range": {
            "date": {
              "gte": start,
              "lte": end
            }
          }
        },
        "aggs": {
          "profit": {
            "sum": {
              "field": "transactions.profit"
            }
          },
          "profitByDay": {
            "date_histogram": {
              "field": "transactions.date",
              "interval": "day",
              "format" : "yyyy-MM-dd"
            },
            "aggs": {
              "profit": { "sum": { "field": "transactions.profit" }}
            }
          }
        }
      }
    }
  }
}

To find customer CLUW:

{
    "query": {
        "nested" : {
            "path" : "transactions",
            "query" : {
                "bool" : {
                    "must" : [
                        { "match" : {"transactions.customer_id" : "CLUW"} }
                    ]
                }
            }
        }
    }
}

However, querying nested fields is slower compared to the other fields. To avoid this, you can also flatten your nested fields.

9. Fill in the gaps in a date_histogram

Use min_doc_count to fill in missing intervals in the date_histogram. This will return empty buckets between the first and last document found. You can also add extended_bounds to populate the missing buckets in the whole range.

Summary

As you can see, Elasticsearch efficiently indexes large datasets for complex searches, filtering and sorting without much downtime. Our tips and tricks should prove helpful when you’re utilizing the nuanced query features of Elasticsearch.