Today, we’re introducing a refactored and streamlined Simple Data Pipe, our open-source data movement project. While the workflow for piping data has changed, the new architecture opens up more free options for data movement onto, or off of, the IBM cloud.

Why change The Pipe?

Services are changing rapidly on IBM’s Bluemix application platform. As these services evolve, we wanted to create a more modular Simple Data Pipe that could better deal with new features and brand new products.

If you’re already using the Simple Data Pipe, don’t worry: we can still move data to dashDB, IBM’s cloud data warehouse. I’ll cover the mechanics of analytics workflows later on. For now, let’s look at The Pipe’s new architecture and our motivations behind it.

A simpler data pipe architecture

It’s all about getting data. The big problem the Simple Data Pipe solves has always been sourcing data from disparate Web APIs. The Pipe captures that data in its native structure and persists it in a database that’s flexible enough to adapt to your plans for processing it.

The new Simple Data Pipe no longer assumes that you plan to process data for a particular use (analytics), in a particular place (dashDB). We’ve modularized the architecture of The Pipe by separating the step of landing data in Cloudant from the step of moving data to a different, more specialized place. Here’s an “annotated” architecture diagram:

The new Simple Data Pipe lands data in Cloudant

Instead of automating the process of moving data from REST sources → Cloudant → dashDB, the new Simple Data Pipe is scoped more narrowly to REST sources → Cloudant and ends the process there. It’s a cleaner, more modular approach that we believe better handles the rate of innovation in the Bluemix ecosystem and makes the data pipe more useful to applications beyond analytics use-cases.
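To make that narrower scope concrete, here is a minimal sketch of the REST sources → Cloudant step in Scala. It is an illustration under stated assumptions, not part of the project itself: it assumes the scalaj-http library is on the classpath, and the source URL, Cloudant account, database name, and credentials are all placeholders.

```scala
import scalaj.http.Http

// Pull a page of records from a hypothetical REST source (placeholder URL).
val sourceResponse = Http("https://api.example.com/records").asString
val records = sourceResponse.body // assume the source returns a JSON array

// Land the records, in their native structure, in Cloudant using the
// standard _bulk_docs endpoint (placeholder account, database, credentials).
val cloudantResponse = Http("https://ACCOUNT.cloudant.com/pipe_db/_bulk_docs")
  .postData(s"""{"docs": $records}""")
  .header("Content-Type", "application/json")
  .auth("USERNAME", "PASSWORD")
  .asString

println(s"Cloudant responded with HTTP ${cloudantResponse.code}") // 201 on success
```

Everything after that bulk write is up to you: the data sits in Cloudant until you decide where it goes next.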

What the Pipe has lost in push-button, end-to-end data movement, it has gained in flexibility. It also leaves room for future implementations that do move data end-to-end, whenever free APIs become available for analytics engines like IBM’s Apache Spark service, warehouses like dashDB, and other tools.

More options for your next move

For users who are focused on analytics use-cases, the new Simple Data Pipe can still connect to dashDB, although that connection is no longer baked in. It’s now a separate step completed in Cloudant. While this roster will expand, here is the current set of options for moving data out of Cloudant:

  • dashDB, via native Cloudant integration with dashDB. Finish movement using Cloudant’s web dashboard.
  • Apache Spark, via native Cloudant integration with Bluemix’s Spark service. Finish movement by calling the Cloudant connector in a Spark Scala Notebook (see the sketch after this list).
  • Transporter, the open source ETL pipeline by Compose.io. Finish movement by configuring package info and associated JavaScript code.
  • DataWorks, enterprise-grade APIs for data shaping & movement. A paid service on Bluemix as of February 2016. Provision DataWorks on Bluemix first, before deploying the new Simple Data Pipe.
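For the Spark option, “finish movement” amounts to reading the landed documents from Cloudant inside a notebook. Here is a minimal sketch of what that might look like in a Spark Scala Notebook, assuming the Cloudant–Spark connector is available in the notebook environment; the host, credentials, and `pipe_db` database name are placeholders.

```scala
import org.apache.spark.sql.SQLContext

// `sc` (the SparkContext) is provided by the Bluemix notebook environment.
val sqlContext = new SQLContext(sc)

// Read the documents the Simple Data Pipe landed in Cloudant into a DataFrame.
// Host, credentials, and database name are placeholders.
val pipeData = sqlContext.read.format("com.cloudant.spark")
  .option("cloudant.host", "ACCOUNT.cloudant.com")
  .option("cloudant.username", "USERNAME")
  .option("cloudant.password", "PASSWORD")
  .load("pipe_db")

pipeData.printSchema()                  // inspect the structure of the landed data
pipeData.registerTempTable("pipe_data") // expose it to Spark SQL
sqlContext.sql("SELECT COUNT(*) FROM pipe_data").show()
```

From there you can run whatever Spark analysis you like, or write the results back out to Cloudant, dashDB, or Object Storage.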

Compared to the previous version of the Simple Data Pipe, the biggest change beyond the streamlined architecture is that we’ve removed The Pipe’s dependence on DataWorks. Connecting the DataWorks APIs to the data pipe is still an option, but removing that hard dependency lets Cloudant offer more options for data movement.

Moving "Piped" data into dashDB via the Cloudant dashboard
Moving “Piped” data into dashDB via the Cloudant dashboard

Where to get the new Pipe

The same place as always on our developerWorks site. There you’ll find links to our GitHub repos and other instructions. In the coming weeks we’ll be updating content to reflect the new Simple Data Pipe. We’ll also kick off a new series of tutorials that shows all the ways you can work with the Data Pipe’s additional targets.

Let’s get that data moving, y’all.
