Mango: Declarative Querying For Apache CouchDB™

Mango allows users to declaratively define and query Apache CouchDB indexes. (It’s also the open source library that powers Cloudant Query in IBM Cloudant). For an introduction to its features, please refer to this post.

Recently, Cloudant open-sourced its Apache Lucene™-based full-text search capabilities for CouchDB as well. Mango leverages Lucene not only to perform text search, but also to enable ad-hoc querying capabilities.

The two different index types for Mango (or Cloudant Query): json or text.Users can now use either the original CouchDB view-based indexes or the new search-based indexes. In this post, I’ll compare the two index types to give users an idea of when to use each (“JSON” or view-based vs. “text” or search-based).

When JSON Syntax Gets Tricky

View-based indexes are most efficient for large datasets, but contain limitations on how a user could query an index. For example, given an index defined as:

{
    "index": {
        "fields": ["foo", "bar"]
    },
    "name" : "foo-index",
    "type" : "json"
}

The following query would fail:

{
  "selector": {"$or": [{"foo": "val1"}, {"bar": "val2"}]}
}

{"error":"no_usable_index","reason":"There is no index available for this selector."}

To understand the limitation above, users must realize that the underlying index is still a CouchDB view-based index. The values of the fields are used to compose the keys in the index. When performing a query, a selector is transformed into a start_key and end_key range search against the index.

To satisfy the $or query above, Mango would have to scan the index twice, once for "foo", then another for "bar" and perform merging logic. This can get extremely complicated as queries become more complex.

In order to bypass this limitation, users need to add a “sub-query” that will allow the query engine to scan the index once and return results. The rest of the query will then be used as an in-memory filter.

To continue with the example above, the query must become:

{
  "selector": {"_id": {"$gt" : null}, "$or": [{"company": "x"}, {"twitter": "ba"}]}
}

The above query essentially does a full index scan to return all the documents and then applies the rest of the $or query as a filter on those documents.

Cleaner Syntax With Search Index Types

Mango search-based indexes resolve this issue by using Lucene indexes. To see how to create those indexes, refer to the cloudant-query-grows-up blog linked to above. Users no longer have to add these “sub-queries” to perform operations such as $or, $in, or $elemMatch.

A user might be tempted to always use search-based indexes due to their ad-hoc query capabilities. Mango’s view-based indexes, however, will perform better in scenarios where query patterns are well known and the user is already familiar with their data model. Imagine these view-based indexes as a more pleasant abstraction of the traditional CouchDB Map-Reduce view indexing system.

If users don’t know in advance what queries will be executed, then Mango search-based indexes are the way to go. This flexibility, however, comes with its own tradeoffs.

Underneath the covers, Mango search indexes create a single default field that catalogs every field in the document. Moreover, individual elements in an array are also indexed and enumerated in this field. This comprehensive approach allows the user to perform a full-text search via the $text operator. The behavior is turned on automatically when the user creates a search-based index. So for large databases, index build times can be long. The system provides an option to disable search-based index builds, but disabling it also turns off the full-text search feature.

Users who want full ad-hoc capabilities can index the entire database with search-based indexes. It’s important to note that this approach is different than the default field mentioned above. The default field is a single field that has all the values in the document stored in that one field. When a user indexes the entire database, all the fields in the document will have their own respective values stored in the index. Again, indexing the entire database will create long index build times.

Given a document such as:

{ "first_name" : "john", "last_name" : "doe"}

… with BOTH text search enabled AND the entire database indexed will have an index that has:

"default_field" -> "john", "doe"
"first_name" -> "john"
"last_name" -> "doe"

Users who don’t want to index their entire database can specify fields individually. Suppose a user only wants to index "first_name". Then the index would look like:

"default_field" -> "john", "doe"
"first_name" -> "john"

A query that searches for "last_name" would then throw an "index not found" error.

Finally, a user can turn off “default_field” and only index specific fields:

"first_name" -> "john"

But this would limit the ad-hoc capabilities of search-based indexes, and the user should use a view-based index instead.

Arrays

Arrays can also be confusing for first-time users of Mango. Subtle differences also exist for arrays when using view-based indexes vs. search-based indexes.

Arrays: JSON

Currently, view-based indexes cannot index individual array elements with one field definition. Given an array such as:

"array_field": [10, 20, 30]

If the view-based index is defined as:

{
    "index": {
        "fields": ["array_field"]
    },
    "name" : "array-index",
    "type" : "json"
}

Users can query against the index to match the array exactly:

{
  "selector": {"array_field" : [10, 20, 30]}
}

However, the user cannot access an individual array element. Note that mango uses dot-notation to access the individual elements, i.e., my_array.0, my_array.1, etc.

In the example above, if a user tried:

{
  "selector": {"array_field.0" : 10}
}

… then the user would get:

{"error":"no_usable_index","reason":"There is no index available for this selector."}

Users would have specifically index each element in the array to access the individual elements, For example:

{
    "index": {
        "fields": ["array_field.0", "array_field.1", "array_field.2"]
    },
    "name" : "array-index",
    "type" : "json"
}

However, if a user did not specify individual elements — and indexed the array as a whole — he or she can still perform operations such as $in on the array. For example:

{"selector": {"_id": {"$gt": null},"array_field": {"$in": [10]}}}

The reason this works is, again, because Mango performs the above $in operation as a filtering mechanism against all the documents. As we saw in the conclusion of the previous section on JSON syntax, the performance tradeoff with the query above is that it, essentially, performs a full index scan and then applies a filter.

Arrays: Text

With Mango search-based indexes, the user can query the index however he or she likes with one index definition:

{
    "index": {
        "fields": [{"name": "array_field.[]", "type": "number"}]
    },
    "name" : "array-index",
    "type" : "text"
}

This not only indexes the entire array, but also individual elements in the array. Users can then ad-hoc query the array.

What’s Next?

Hopefully this post helps clarify Mango view-based indexes vs. search-based indexes. In order to enable search-based indexes, currently, users must first enable text search in their CouchDB distribution. For instructions on recompiling the current release of CouchDB to use the new search features, read this article by fellow Apache CouchDB project committer Robert Kowalski.

If recompiling seems like too much work, don’t worry! Lucene text search, along with Mango’s declarative query system, will be included in the upcoming 2.0 release of Apache CouchDB. For updates, follow the project on Twitter at @CouchDB or join one of the many excellent mailing lists.

© “Apache”, “CouchDB”, “Lucene”, “Apache CouchDB”, “Apache Lucene”, and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.

Join The Discussion

Your email address will not be published. Required fields are marked *