Drilling into analytics with NoSQL data stores

One question many app developers find themselves asking is, “OK, I have plenty of data from my application’s users. Now how do I gain insight from that data?” Much of the developer’s dilemma here stems from the NoSQL data stores that support their web or mobile apps, because NoSQL databases (like the JSON document stores we’ll discuss here) are notoriously resistant to traditional analytics approaches.

The rise of NoSQL is intimately tied to the rise of “the app economy.” As the web (and our use of it) evolved, the nature of the data we create and store changed – and it got much bigger. As the client-server applications that relational databases once powered moved to the web, a new problem emerged. Yesterday’s desktop applications came out of a software box from CompUSA (may they rest) and were installed to support an entire company’s user base. But web applications grow to many thousands, if not millions, of users. The traditional relational database simply isn’t efficient at supporting the data needs of web-scale apps. Hence the rise of NoSQL.

The challenge we now have of bringing analytics to NoSQL ultimately boils down to the debate around the merits of schema vs. schema-less data structures. This is an age-old debate which, in our new world of the “Three V’s of Data” (Volume, Variety and Velocity), is here again with a vengeance.

 

How Did We Get Here?

Let’s take a quick historical tour of the schema vs. schema-less debate and how we got here. Once upon a time, data was stored in flat files that repeated the same values over and over. For example, let’s imagine a hardware store a few decades ago. This hardware store is very progressive and has implemented a bookkeeping system on a PC.

Back then, if you bought a Stanley Drill, it was made to be the last drill you ever bought (a favorite talking point of my grandfather Earl’s). However, you would still wear out drill bits, which meant you kept going back to the hardware store to replace them. And with our hardware store’s PC bookkeeping system, each time you came back to buy a new drill bit, your customer information would be stored with the sales data in a table like this:


[Table 1: a flat sales table that repeats each customer’s name and address alongside every purchase]

But there were two major pains associated with this method.

Pain 1:
This data took up a lot of space in a world where storage was very expensive. I think back to my first Apple II+ and then I look at my iPhone – it’s incredible to think that the storage capacity we hold in our hands today used to fill a room of mainframes. So, the repetition of data in these files back in those days was very costly.

Pain 2:
Updating data. If the hardware store owner wanted to pull a list of customers to send out a holiday circular, we would see that the data entered above is incorrect. Unless George Carpenter has gone the way of Prince in 1993 and officially changed his name to “Georg3,” there has been a typo. That typo creates a second result when we query this sales table, which means a second holiday circular mailed to the same address – unnecessary extra cost for our neighborhood small business, and an annoyed customer receiving unwanted junk mail. Expand this example to something a bit larger, like a department store chain, and it grows into a huge problem for bottom-line concerns like marketing expenditures and customer retention.

Long story short, the original answer to these pain points was the invention of the relational database. With a relational database, customers could be assigned a CustomerID and separated into their own table, where their records could be updated and edited in one place. Then, using a Foreign Key, you could quickly determine the identity of the customer with a query that joined these tables. We would also do the same for the product data. Let’s look at what that new model looks like.

Breaking up our customer sales data into a relational architecture and implementing our foreign keys, we get three tables as follows.


[Table 2: the sales data split into separate Customers, Products, and Sales tables]

 

Our new schema looks like this:


[Table 3: the resulting relational schema, with Sales referencing Customers and Products by their IDs]
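Since the original table figures aren’t reproduced here, a minimal sketch in code may help. This uses SQLite as a stand-in for any RDBMS, and the column names are illustrative assumptions, not the article’s originals:

```python
import sqlite3

# In-memory database for illustration only; column names are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    Name       TEXT,
    Address    TEXT,
    ZipCode    TEXT
);
CREATE TABLE Products (
    ProductID   INTEGER PRIMARY KEY,
    ProductName TEXT,
    Price       REAL
);
CREATE TABLE Sales (
    SaleID     INTEGER PRIMARY KEY,
    SaleDate   TEXT,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    ProductID  INTEGER REFERENCES Products(ProductID)
);
""")

con.execute("INSERT INTO Customers VALUES (1, 'George Carpenter', '12 Oak St', '06512')")
con.execute("INSERT INTO Products VALUES (1, 'Stanley Drill Bit', 2.50)")
con.execute("INSERT INTO Sales VALUES (1, '1984-03-01', 1, 1)")

# A join reassembles the original flat view on demand.
for row in con.execute("""
    SELECT s.SaleDate, c.Name, c.Address, p.ProductName, p.Price
    FROM Sales s
    JOIN Customers c ON c.CustomerID = s.CustomerID
    JOIN Products  p ON p.ProductID  = s.ProductID
"""):
    print(row)
```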

 

Now, this may all be basic information, causing you to tilt your head and say:

“Why are you telling me this? Everyone knows this.”

Well, I’ll tell you. This schema is actually helping us police our application. With a relational database management system (RDBMS), you are not allowed to enter data that “doesn’t exist.” Because you have placed that foreign key in the Sales table, you can’t insert a sale record that has no CustomerID associated with it – the database rejects it. This is very helpful if the success of your app depends on not storing junk data and not losing data in a sale. You now have much more control over the entities in your application. And adding an attribute to your Products table is no longer a nightmare of editing giant records, where inserting an extra field is cumbersome, time-consuming and prone to error.
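To make the “schema police” concrete, here is a minimal sketch, again with SQLite standing in for any RDBMS and with illustrative table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
con.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Sales (
    SaleID     INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID)
);
""")

try:
    # No customer 99 exists, so the schema police reject this sale outright.
    con.execute("INSERT INTO Sales VALUES (1, 99)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # -> rejected: FOREIGN KEY constraint failed
```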

Going back to our first example, what if you wanted to add the color of the product to that sales data? In our new schema world, we just add a Product_Color column to the Products table and set the color for each product. All legacy sales queries now have that data available – you just run the report again with the new field included.

Schemas are great for reporting and analytics. You know where things are. The schema is your map and you can pick which attributes you are going to analyze. In our example above, it would be very easy to get the sum and count of sales by product, by zip code and by individual customers. And with our new Product_Color field, we can do some digging on what colors sell the most and when.
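A quick sketch of that kind of digging, once more with illustrative names and SQLite standing in for the database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Products  (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                        Product_Color TEXT, Price REAL);
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, ZipCode TEXT);
CREATE TABLE Sales     (SaleID INTEGER PRIMARY KEY, CustomerID INTEGER,
                        ProductID INTEGER);
""")
con.executemany("INSERT INTO Products VALUES (?,?,?,?)",
                [(1, "Drill Bit", "Silver", 2.50), (2, "Hammer", "Red", 9.99)])
con.executemany("INSERT INTO Customers VALUES (?,?,?)",
                [(1, "George Carpenter", "06512"), (2, "Ann Mason", "06511")])
con.executemany("INSERT INTO Sales VALUES (?,?,?)",
                [(1, 1, 1), (2, 1, 1), (3, 2, 2)])

# Sales count and revenue by product color -- the schema is our map.
for row in con.execute("""
    SELECT p.Product_Color, COUNT(*) AS n_sales, SUM(p.Price) AS revenue
    FROM Sales s JOIN Products p ON p.ProductID = s.ProductID
    GROUP BY p.Product_Color
"""):
    print(row)
```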

 

So, What About NoSQL Analytics?

“Great. But aren’t we here to discuss getting some insight from a NoSQL database?”

Yes. One benefit of a NoSQL data store, like a JSON database, is that because there is no pre-defined schema, you can iterate quickly and data can flow into your document store as needed. This is in contrast to the RDBMS “schema police” mentioned above: by making sure every sales record passes the rules of the defined schema, they act as sticks in the spokes, bringing data-in-motion – flowing from our apps, our web events, and our devices – to a grinding halt. For many, the move to NoSQL came about because in the time it took the relational database to make sure a user’s data was “okay” for storage, the user had already left and the app failed.

Calling NoSQL “schema-less” is a bit of a misnomer in this use case. The data is being stored in a type of structure that is flexible to additions and subtractions of attributes. The benefit is the ability to perform “fast writes” – incredibly fast insertions of data. With no schema police, you can keep up with your users. Reading from a JSON document store (e.g. IBM Cloudant) is great for apps and we know front-end JavaScript loves it some JSON. The problem comes when you want to query this data to start gaining insights from it. Creating “design docs” that will grab counts and do some simple metrics aggregations will keep you going, but when you want to hook up an analytics application, in most cases, it’s going to be looking for a schema.
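As an illustration, here is roughly the shape of such a design doc in CouchDB/Cloudant terms, expressed in Python for consistency with the other sketches. The sales-document field name (product) is an assumption:

```python
import json

# Shape of a CouchDB/Cloudant design document that counts sales per product.
# The map function is a JavaScript string stored in the database; "_count"
# is one of CouchDB's built-in reduce functions.
design_doc = {
    "_id": "_design/sales",
    "views": {
        "by_product": {
            "map": "function (doc) { if (doc.product) { emit(doc.product, 1); } }",
            "reduce": "_count",
        }
    },
}

# PUT this document to the database, then query the view with ?group=true
# to get a count per product key.
print(json.dumps(design_doc, indent=2))
```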

So what about when you want to start aggregating daily statistics into summary tables for charts and graphs? You are definitely going to be looking to move your JSON data into a structure built for speed. (This isn’t just about JSON either. Many relational databases also start to fall down when we enter the world of analytics).

Remembering our simple schema above, imagine a sales table that is fully normalized, using all those IDs as foreign keys to other tables. We could end up with queries that need dozens of joins just to produce a single line item. This query-processing overhead gave rise to the dimensional model in data warehousing, popularized by Ralph Kimball. The main goal of this architecture is to take a highly normalized relational model and “flatten” it, so that a reporting application doesn’t need to do as much work and can crunch more and more numbers.
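Here is a minimal sketch of that flattening step, reusing the illustrative schema from earlier. In a real warehouse this wide table would be a fact table maintained by an ETL job rather than built by hand:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, ZipCode TEXT);
CREATE TABLE Products  (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
CREATE TABLE Sales     (SaleID INTEGER PRIMARY KEY, SaleDate TEXT,
                        CustomerID INTEGER, ProductID INTEGER);

INSERT INTO Customers VALUES (1, 'George Carpenter', '06512');
INSERT INTO Products  VALUES (1, 'Drill Bit', 2.50);
INSERT INTO Sales     VALUES (1, '2016-09-01', 1, 1);

-- Flatten the normalized model once, so reporting queries need no joins.
CREATE TABLE Sales_Flat AS
SELECT s.SaleID, s.SaleDate, c.Name, c.ZipCode, p.ProductName, p.Price
FROM Sales s
JOIN Customers c ON c.CustomerID = s.CustomerID
JOIN Products  p ON p.ProductID  = s.ProductID;
""")

# Reporting queries now scan one wide table instead of joining narrow ones.
print(con.execute("SELECT * FROM Sales_Flat").fetchall())
```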

 

Now, if you are thinking…

“Wait, flattening? Isn’t that starting to head back towards the beginning of our story?”

…the answer is, yes. It’s a little different, but yes. Just wait.

There are a few metrics you are going to be able to get rather easily, for example: number of downloads from an app store, number of people signing up for a beta via a website, and number of views you are grabbing from a popular web analytics platform you embedded into your website.

But there are many more metrics around collecting web events and user engagement that can prove highly valuable to your business. How many people sign up but then don’t come back? How active are people on your app? Your product team benefits from knowing which features of your app people are using the most. This information can inform which popular features you should build out and which features should be abandoned or re-architected because users aren’t engaging with them.

There’s also your event and logging data stored as JSON. As you grow and expand, you will likely end up with more than one operational system collecting data and metrics about your business. Graduating to a SaaS billing system, like Stripe, provides useful metrics of its own, but it would be even better to bring all that information into one dashboard – where you can compare spikes in sales to different marketing campaigns, for example.

Unless you want to build a dashboard that makes API calls to all your different SaaS systems and operational databases (from web apps, event collection, etc.) for point-in-time reporting, you now need to think about getting your data into a format that lets you see across your business historically. This is no small task; it is the basis for the entire art and science of the data warehousing and business intelligence industry. Implementing a giant system out of all the great technologies out there, like Spark / Hadoop / Elasticsearch, can be a bit overwhelming if you are only looking for a first-stage reporting solution. So let’s learn about a simple data warehouse architecture that will accomplish your most important goals easily.

 

Simple Data Warehousing

“Great, I get it. Now, how do I get JSON into a schema for analysis?”

Once your data is centralized in a JSON database, it can be “piped” to a variety of data stores and analytics tools optimized for visualizing and exploring data. Of course, the JSON data will still need to be put into a schema for analysis. This can be done in multiple ways, but one simple approach is to leverage a schema discovery process (SDP) – like the one used in the integration between Cloudant and IBM dashDB, a managed columnar data warehouse service. Cloudant’s SDP inspects a given JSON document, then determines and builds a schema in dashDB for it to fit into. It also identifies nested structures in your JSON and creates additional tables to hold that data appropriately, keyed by the document ID native to your JSON document.
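As a toy illustration of the idea (not Cloudant’s actual SDP code), here is how one JSON document with a nested array could be split into a parent row plus child rows keyed by the document ID. All field names are made up:

```python
# Toy schema discovery: flatten one JSON document into a parent row plus
# child rows keyed by the document's _id. This mimics the concept behind
# Cloudant's SDP, not its implementation.
doc = {
    "_id": "order-1001",
    "customer": "George Carpenter",
    "total": 12.49,
    "items": [  # nested structure -> becomes a separate child table
        {"product": "drill bit", "qty": 2},
        {"product": "sandpaper", "qty": 1},
    ],
}

# Scalar fields form the parent table's row; nested lists become child rows.
parent_row = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}
child_rows = [{"_parent_id": doc["_id"], **item} for item in doc["items"]]

print("ORDERS table row:  ", parent_row)
print("ORDERS_ITEMS rows: ", child_rows)
```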

Once JSON data has been given a schema and is stored in a relational data warehouse, like dashDB, it is ripe for visualization, analytics and exploration using whatever SQL-compatible tools you like. This offers many opportunities for “embedded BI” – which means using metrics alongside the operational processes that your app is recording.

If your app lets users enter data as part of their daily workflows, embedded BI could surface a trended graph showing each user their history, or any other helpful information collected in your data warehouse (e.g., a salesperson who needs to make a quarterly quota can now see their performance and how it is trending toward that goal). Data coming from your data warehouse isn’t just showing up on management reports – it is informing your users in real time about their own usage.
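A sketch of the kind of query that could feed such an embedded chart, assuming the flattened sales table from the earlier sketches (all names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales_Flat (SaleDate TEXT, Salesperson TEXT, Amount REAL);
INSERT INTO Sales_Flat VALUES
    ('2016-01-15', 'pat', 1200.0),
    ('2016-02-03', 'pat',  800.0),
    ('2016-02-20', 'pat',  950.0);
""")

# Monthly sales trend for one salesperson, ready to feed an in-app chart.
for row in con.execute("""
    SELECT substr(SaleDate, 1, 7) AS month, SUM(Amount) AS total
    FROM Sales_Flat
    WHERE Salesperson = 'pat'
    GROUP BY month
    ORDER BY month
"""):
    print(row)
```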

The rise of non-relational databases has created great flexibility for app developers in terms of writing data and leveraging a format made for web technologies. When you look at the history of how the database has evolved, it’s hard not to notice that we have circled back to a structure similar to where we started – schema-less – one that sidesteps the problems created by normalizing data in relational architectures.

With the decreasing costs of data storage and the inception of cloud platforms, the trend of data denormalization has been a vital force in accelerating innovation. By building a bridge from NoSQL to SQL, developers can easily move their application data, discover its schema, and connect on an integrated platform built for analytical advancement.

 

Learn More: A Field Guide to the World of Modern Data Stores

Navigating today’s cloud databases and analytics options can be challenging. But it doesn’t have to be intimidating. Start (or level-set) your journey today with A Field Guide to the World of Modern Data Stores – read this FREE eBook to learn:

  • What are the defining characteristics and strengths of today’s different cloud databases?
  • How can data in NoSQL stores be analyzed to learn more about your customers?
  • How are different open source databases used together to achieve polyglot persistence?

 



Get A Field Guide to the World of Modern Data Stores from IBM Cloud Data Services.

 


2 comments on “Drilling into analytics with NoSQL data stores”

  1. Overall solid post, thanks for sharing. However, innovation is moving beyond “make it all relational fixed schema,” and for good reasons. In JSON, for example, there is no difference between data and schema, meaning a given piece of data can be either or both. When you arbitrarily declare something as schema using SDP, that data is no longer available to query in a relational structure. There are many more issues; we have documented them all in a whitepaper that defines the requirements of true NoSQL analytics without loss of data fidelity or integrity. Additionally, the step “Great, I get it. Now, how do I get JSON into a schema for analysis?” is not needed. Open source projects like Quasar-Analytics and others do not require any fixed schema, mapping, or data relocation of any kind – they simply analyze JSON in place, as it exists in the DB, using SQL^2 (squared), and create all the viz, charts, and reports you desire using tools like SlamData. Incidentally, Quasar could easily support Cloudant with a few weeks’ work on a connector. Support for Couchbase, Spark, and Marklogic is coming in October, with more to follow by end of year.

    • Thanks for your comment, Jeff. I’ve had a look at your tech and it’s pretty cool! We agree that transferring JSON data into a relational format is not the best solution in all use cases. Many organizations that aren’t necessarily ‘born on the cloud’ are still heavily invested in relational BI technology, and for them, a solution like Cloudant SDP and dashDB makes a lot of sense – for example, piping data into an ODS ( https://developer.ibm.com/clouddataservices/2015/08/24/operational-data-store/ ) and landing JSON data (from weblogs, or SaaS APIs like Salesforce) next to operational sources from an RDBMS for unified reporting with a tool like Looker or Chartio.
      You’re correct, however, that in some cases it makes more sense to look into new analytics purpose-built for NoSQL data (take a look at the Spark-Cloudant connector for an example: https://developer.ibm.com/clouddataservices/2016/03/09/introducing-spark-cloudant-connector/). We don’t see these trends as mutually exclusive, though. With Cloudant, you have options:

      1) built-in aggregations/analytics with MapReduce;
      2) relational analytics/BI with the dashDB integration;
      3) ad hoc analytics via notebooks, spark submit, spark streaming with spark-cloudant.

      We’re continuing to improve upon #1, and with regard to #3, we’re building more native integrations to provide a SQL API for Cloudant with a Spark wrapper. Stay tuned for more on this!
