In this blog we are going to discuss why choosing the index type for querying your geospatial data is important and why IBM Cloudant supports configurable secondary geospatial index types. That’s why Cloudant allows GIS developers to use more than one of these indexes in the same application. Don’t let a database restrict your index choice!
We will discuss:
- R-Tree and R*-Tree
- Distributed indexes
- Temporal Indexes
In later blog postings we will discuss K-D Trees, Quadtrees and Octrees which complete the more common spatial index types.
An index over a data collection enables an optimized query of the data. Index types can vary
according to the type of the data and should be chosen carefully to optimize query speed.
Cloudant’s MapReduce views use a B-Tree where the emitted key is stored as a
[key, docid] key in the B-Tree. This is very effective for key look-ups, but it is not as effective for queries when using geospatial data such as features with geometries. For indexing geometries, an index based on the minimum-bounding rectangle of the geometry is more efficient with look up since it groups geometries together. This results in significantly faster queries.
Cloudant supports multiple index types per database, enabling you to have multiple B-Tree, geospatial and full-text indexes over the same data. This was developed to support customer use cases who need to index their data efficiently in different ways. As with existing Cloudant index types, the geospatial indexes are distributed and built for scale, speed and flexibility.
Why We Didn’t Use A GeoHash As The Primary Spatial Index Type
It is possible to create a geospatial index using Apache CouchDB™ and a B-Tree by using a GeoHash and storing the generated hash values as keys in the B-Tree. A GeoHash is useful, however it is typically only used to index points or centroids. A GeoHash is a textual representation of a Morton number. Since a GeoHash string represents a bounding box where a location lies it is possible to have two spatially close locations with different GeoHashes which result in the query being inaccurate and missing data.
The image above shows the Morton Numbers across the whole world which shows how a GeoHash breaks up data of similar localilty.
Cloudant can use a GeoHash (a consistent hash) to route particular data to particular nodes within its distributed database. This enables users to peg data to specific region so that it remains closer to the end users accessing it. Alternatively, you can restrict data from a specific region for compliance reasons.
R-Tree Or R*-Tree
An R-Tree is a spatial index, it groups nearby features and represents by their minimum bounding rectangle in the tree. An R-Tree also pages the data, making it suitable for serialization.
In typical geospatial database applications, a quadtree or an R-Tree is used as the spatial index implementation. Quadtrees typically recursively divide a two dimensional space into quadrants, this can be effective but we need more flexibility.
For Cloudant the attraction of an R*-Tree was minimizing the overlap of the minimum bounding rectangle between tree nodes for faster reads and because Cloudant is a distributed database, we can distribute the load of slower writes across a cluster, minimizing the costs of writing data to the R*-Tree.
In the above image we can see considerable overlap between nodes using an R-Tree as compared to the R*-Tree below. Those overlaps mean that an R-Trees can accept writes faster, but are slow on reads because of the inefficient indexing.
Cloudant is a distributed database, each node within the cluster holds a portion of the data, not all of it. In addition, to satisfy our guarantees of partition tolerance and availability, we typically hold three copies of that data. A successful read or write operation requires two responses from the nodes in the cluster.
Since a write to Cloudant is guaranteed to be written to disk if successful, then it makes sense to use an R*-Tree as part of that spatial index to minimize overlap between nodes. Writes may be slightly slower as a trade-off but reads will be significantly quicker. Add to that the scale of data that Cloudant typically operates with and we want to minimize tree traversal by using an R*-Tree.
A temporal index is required when data needs to be queried in a predictive or historical manner. Currently Cloudant supports using a TPR-Tree which is a temporal predictive index.
The aim of a temporal index is reduce the size of the index on disk (or in memory) and to enable a quick traversal of the tree for query.
A TPR-Tree is a predictive trajectory index which currently based on a linear path through time (a.k.a., dead reckoning). In addition to movement, TPR-Tree supports the growing of features, (i.e., the dispersion of a feature) which is particularly useful when modeling how feature spread across time and space, like the dispersion of smoke from a wildfire or oil from an oil spill.
Web, mobile and GIS developers should be able to choose the right index for their queries, not try to solve all challenges with a single index type. With Cloudant, you can build an application that uses more than one of these indexes in the same application.
Cloudant builds on the excellent LibSpatialIndex library and we are keen contributors to this and its open source Erlang binding. Let us know what index types you would like added and we will get to work!
© “Apache”, “CouchDB”, “Lucene”, “Apache CouchDB”, “Apache Lucene”, and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.