Those of you who have explored our sample applications know that we are tracking deployments to measure interest. In preparation of the analysis process, we are continuously loading the tracking records (JSON documents identifying when a sample application was deployed) from the Cloudant repository database into a dashDB data warehouse.

Data movement scenario

Data movement scenario

Cloudant’s Data Warehousing feature makes it easy to continuously extract JSON documents and load them into the relational data warehouse. Unfortunately, some of these tracking records don’t meet the data warehouse quality guidelines and have to be filtered out or transformed before an attempt is made to load them.

To support document filters and transformations, we’ve built a simple lightweight service that handles two main steps:

  1. Copies JSON documents from one Apache CouchDB™ or Cloudant database to a staging database
  2. Applies user-defined filters and/or transformation routines

Together, they prepare your JSON for cleaner continuously-running data warehousing tasks.

Solution overview
Solution overview

We implemented the service as a simple Node.js web application. It’s available on our couchdb-db-transform GitHub repo.

Unlike CouchDB-style replication, our copy-and-transform service was designed to process an incoming flow of new documents, not new and updated documents.

Once you’ve downloaded the app, you have to customize it by defining a filter or transformation routine that meets your needs. Out of the box, no filtering or
transformation is applied.

Service details
Service details

Filtering documents

Certain documents in the source database might not be of analytical interest and should therefore not be loaded into the data warehouse. To exclude these documents — which, for example, could be invalid tracking records or design documents in our project — the service provides server-side filters and client-sider filters. Filter routines return a boolean value indicating whether a document should be processed (true) or not (false).

Server-side filters

Server-side filters are implemented in Javascript and defined within design documents in the source database. These filters are applied on the source database and exclude documents that should not be sent to the service, which, depending on the filter condition, can significantly reduce the network traffic. The GitHub repository includes a couple of sample server-side filters that you can use as a starting point to build your own. The following example depicts a simple filter that removes deleted documents:

{
  "_id": "_design/transform_service",
  "filters": {
    "exclude_deleted_docs": "function(doc, req) { if(doc._deleted) { return false; }  else { return true; }}"
  },
  "language": "javascript"
}

Client-side filters

Client-side filters are applied by the service, removing documents that should not be transformed and loaded into the target database. Filter routines are implemented in Node.js and packaged with the service. The GitHub repository includes sample client-side filters, such as the following that ignores design documents:

module.exports = function(change) {
    if((! change) || (! change.doc) || (change.doc._id.startsWith('_design/'))) {
        return false;
    }
    else {
        return true;
    }
};

The docproperty in the change object contains the complete document from the source database.

If possible, use server-side filters instead of client-side filters to reduce the network overhead if filter conditions are highly selective. Use client-side filters if you are not authorized (or don’t want) to create a design document in the source database.

Transforming documents

Documents that have passed server-side and client-side filters will be processed by the services’ transformation routine, if one is configured. Like client-side filters, transformation routines are implemented in Node.js. A transformation routine receives a document as its input and is expected to return the modified document, as shown in the example below, which adds a timestamp property (in ISO 8601 format) to the document:

module.exports = function(doc) {
    if(doc) {
        doc.timestamp = new Date().toISOString();
    }
    return doc;
};

To get you started quickly, we’ve included a couple of example transformation routines.

The design was inspired by Glynn Bird’s popular couchimport utility, which is used to bulk load large data sets from text files into CouchDB/Cloudant.

Running the service

You can install and run this service locally or deploy it to Bluemix by following the setup instructions in the project’s README.

Join The Discussion

Your email address will not be published. Required fields are marked *