“Building a Powerful Data Tier from Open Source Datastores”
At the excellent OSCON Europe conference in London, I saw many great and memorable sessions. My favourite by far was the talk on open source datastores by Joey Lynch from Yelp. It had such excellent coverage of the available options, alongside pragmatic advice about when best to pick each one. Best of all it focussed entirely on open source databases, many of which we have available on our Bluemix platform.
Datastores: No One Can Deploy Just One.™
Joey started by discussing why more than one datastore is a good idea. Many traditional application architectures have one thing labelled “database”, and the rest of the system has to work around the constraints of whichever database was chosen. In reality, today’s applications are complex, componentised, and can easily take advantage of multiple datastores and their respective strengths. Given that this was a short session for the size of the topic (only 40 minutes!), five specific areas were covered in detail: relational databases, document databases, key value stores, configuration services and search.
The traditional relational database still has much to offer modern applications. In this section there were shout-outs to both MySQL and PostgreSQL, which have been doing sterling work in our stacks for so long. These relational databases are great when working with relational data (surprising, I know) and are based on well-established academic research. For data that is not relational, such as objects, Joey’s advice is to consider adding appropriate datastores to your application stack.
Document databases are not brand new, but many organisations haven’t yet identified the best way to make use of their power. The main players in this space are MongoDB (very popular in the Node.js community), CouchDB (fresh from its recent 2.0 release milestone) and RethinkDB (now safely in community control after its commercial caretakers announced their retirement). Joey gave us all a few words of encouragement about how easy it is to get started with document databases and recommended using them in place of relational databases for object storage. These datastores are definitely growing in adoption and it’s easy to see why: they are very modern and web-friendly with JSON document storage and strong analytical features.
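To make the relational versus document distinction a little more concrete, here is a minimal sketch of my own (not from Joey's slides). It stores the same nested object twice: once split across two tables using Python's built-in `sqlite3` module, and once as a single JSON document. The `order` object, the table names and the plain dictionary standing in for a document database are all assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical nested object of the kind an application might want to store.
order = {
    "id": 1,
    "customer": "Ada",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Relational style: the object is split across two tables and joined back by id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_items (order_id INTEGER REFERENCES orders(id), sku TEXT, qty INTEGER)"
)
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["id"], item["sku"], item["qty"]) for item in order["items"]],
)


def load_order_relational(conn, order_id):
    """Reassemble the nested object from its two tables."""
    row = conn.execute(
        "SELECT id, customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    items = conn.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return {
        "id": row[0],
        "customer": row[1],
        "items": [{"sku": sku, "qty": qty} for sku, qty in items],
    }


# Document style: the whole object is stored and fetched as one JSON document.
# A plain dict stands in for a document database here.
documents = {order["id"]: json.dumps(order)}


def load_order_document(order_id):
    """Fetch the object back in a single lookup, already in its nested shape."""
    return json.loads(documents[order_id])
```

Both loaders return the same object. The difference is how much reassembly the application has to do, which is roughly why object-shaped data tends to sit more comfortably in a document store.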
Key Value Stores
Key value stores are always an auxiliary datastore; they are blazing fast for small and simple data but aren’t suitable for some other data types. The key value stores include our old friend Memcached and its newer cousin Redis. Both are open source projects seeing excellent adoption and impressive performance in a wide range of applications.
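One common way to use a key value store as an auxiliary datastore is the cache-aside pattern: read from the cache first, and only go to the primary datastore on a miss. The sketch below is my own illustration rather than anything shown in the talk. `TinyCache` is an in-memory stand-in for something like Memcached or Redis, and `PrimaryStore` and `get_user` are hypothetical names.

```python
import time


class TinyCache:
    """In-memory stand-in for a key value store such as Memcached or Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries are dropped so the caller falls back to the primary store.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


class PrimaryStore:
    """Hypothetical slower primary datastore; counts reads so the effect is visible."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self._rows.get(key)


def get_user(cache, primary, key, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to the primary store."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = primary.load(key)
    if value is not None:
        cache.set(key, value, ttl_seconds)
    return value


cache = TinyCache()
primary = PrimaryStore({"user:1": {"name": "Ada"}})
first = get_user(cache, primary, "user:1")   # miss: read from the primary store
second = get_user(cache, primary, "user:1")  # hit: served from the cache
```

The second lookup never touches the primary store. The expiry time is there because cached copies go stale, which is one reason the cache stays auxiliary rather than becoming the system of record.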
If it’s durability you need in your key value store, Redis offers optional on-disk persistence alongside its eventually consistent replication. Other options include the DynamoDB and BigTable databases, which are based on modern academic research and perform well in particular use-cases. For applications requiring serious write scalability, Joey’s advice was to check out HBase, Riak or Cassandra, but to bear in mind that at this level of scalability the design emphasis is very much on the queries required rather than the data shape or structure.
In contrast to the stores mentioned so far in the talk, configuration services are foundational datastores that don’t store application data; rather, they keep the information needed to solve problems such as distributed consensus. In this space the leading offerings are Apache ZooKeeper, Consul and etcd. Joey’s advice was to look for examples of these deployed well before trying to solve problems with a homespun solution, especially in distributed systems.
The search space would easily have offered a talk by itself, but the quick overview here was valuable. Search options include Lucene, Elasticsearch and Solr, all of them offering a wide range of awesome features and with support for a variety of different data types.
The main challenge when working with search indexes is keeping them updated at a rate matching the other datastores in the application stack. We were also treated to some sage advice around relying on these types of datastores for persistence: search engines are difficult to keep consistent and their indexes should always be expected to lose data on occasion. The recommendation is to deploy these solutions as secondary stores and never to rely on search engines as primary storage.
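Treating the index as a secondary store means it should always be possible to throw it away and rebuild it from the primary data. Here is a toy sketch of that idea, using a hand-rolled inverted index rather than Lucene, Elasticsearch or Solr. The `listings` data and the function names are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index (term -> set of document ids) from the primary data."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the ids of documents containing the term, in sorted order."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical primary datastore: the source of truth for the documents.
listings = {1: "Cheap tacos downtown", 2: "Downtown coffee roasters"}

index = build_index(listings)
matches = search(index, "downtown")

# If the index is lost or drifts out of date, nothing of value is gone:
# it can be rebuilt from the primary store at any time.
index = build_index(listings)
```

A real deployment would update the index incrementally rather than rebuilding it wholesale, but the principle is the same: the primary store holds the truth and the index is derived from it.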
My favourite slide of the talk was where Joey laid out his approach for choosing a datastore to add to your stack.
- Does this datastore satisfy the business requirements which are causing us to consider a new datastore?
- Is the community behind this project solid?
- Will it play nicely with your existing stack?
- Given that all datastores are broken in some way: how broken is it?
Deploying, Operating and Managing Datastores at Scale
The final section of the talk walked us through the evolution of the datastores landscape at Yelp over the last 5 years, with some excellent explanations of what drove the need and selection for each addition. Check out the slide deck for some lovely overview diagrams of how the Yelp data pipeline looks now and how they manage the movement and aggregation of all their data at huge scale.
Here at IBM we have a good proportion of the datastores mentioned here available for you to painlessly add to your applications (or just play around with on our platform). Check them out:
Which are your favourites? Are there any you’d have included if you were giving this talk yourself? Let us know in the comments! My colleague Matt Collins and I also have our own series on getting started with some of these datastores. See our series “Seven Databases in Seven Days,” parts 1, 2, 3, 4, 5 and 6 for more.