IBM Developer Advocacy

Introducing spark-cloudant, an open source Spark connector for Cloudant data


We would like to introduce you to the spark-cloudant connector, allowing you to use Spark to conduct advanced analytics on your Cloudant data. The spark-cloudant connector can be found on GitHub or the Spark Packages site and is available for all to use under the Apache 2.0 License. As with most things Spark, it’s available for Python and Scala applications.

If you haven’t heard of Apache Spark™, it is the new cool kid on the block in the analytics space. Spark is touted as being an order of magnitude faster and much easier to use than its analytic predecessors, and its popularity has skyrocketed in the past couple of years. If you would like to learn more about Spark in general, I recommend checking out the Spark Fundamentals classes on Big Data University and the great tutorials on IBM developerWorks.

Cloudant + Apache Spark logos

Flexible JSON database plus in-memory analytics, ftw!

Start fast with Spark on Bluemix

So how do you get going quickly analyzing your Cloudant data in Spark? Luckily, IBM has a fully managed Spark-aaS offering in IBM Bluemix that comes with the latest version of the spark-cloudant connector already loaded for you. Head on over to the Bluemix catalog to sign up and create a Spark instance to get started. Since the spark-cloudant connector is open source, you are also free to use it in your own stand-alone Spark deployments with Cloudant or Apache CouchDB™. Next, check out the README on GitHub, the Bluemix docs on Spark-aaS, and the great video tutorials on the Learning Center showing how to use the connector in both Scala and Python notebooks.

The integration with Spark opens the door to a number of new analytical use cases for Cloudant data. You can load whole databases into a Spark cluster for analysis. Alternatively, you can read from a Cloudant secondary index (a.k.a. “MapReduce view”) to pull a filtered subset or cleansed version of your Cloudant JSON. Once you have the data in Spark, use SparkSQL for full ad hoc querying capabilities in familiar SQL syntax. Spark can efficiently transform or filter your data and write it back into Cloudant or another data source. Because Spark has a variety of connection capabilities, you can also use it to conduct federated analytics over disparate data sources such as Cloudant, dashDB, and Object Storage.
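As a rough sketch of the read path described above, here is what loading a Cloudant database and querying it with SparkSQL might look like in a Python notebook. The host, username, and password below are placeholders for your own account, and `sqlContext` is assumed to be pre-created for you (as it is in Bluemix Spark notebooks); see the README on GitHub for the authoritative option names.

```python
# Hypothetical connection details -- substitute your own account's values.
# In a Bluemix Spark notebook, sqlContext is already created for you.
sales = sqlContext.read.format("com.cloudant.spark") \
    .option("cloudant.host", "ACCOUNT.cloudant.com") \
    .option("cloudant.username", "USERNAME") \
    .option("cloudant.password", "PASSWORD") \
    .load("spark_sales")

# Register the DataFrame so it can be queried with ad hoc SQL
sales.registerTempTable("sales")

# SparkSQL example: total sales per month, largest first
monthly = sqlContext.sql(
    "SELECT month, SUM(amount) AS total FROM sales "
    "GROUP BY month ORDER BY total DESC")
monthly.show()
```

The same `format("com.cloudant.spark")` data source accepts a view path in place of a database name when you want to read from a secondary index rather than the whole database.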

Example: Cloudant analytics with Spark

For a concrete example of using the spark-cloudant connector, check out this example Python Notebook on GitHub and load it into your Spark service running on Bluemix. (It becomes interactive once you upload it to a Spark notebook using the instructions below.) This notebook does the following:

  • Loads a Cloudant database spark_sales from Cloudant’s examples account containing documents with sales rep, month, and amount fields.

    (Feel free to replicate the database into your own Cloudant account and update the connection details if you prefer.)

  • Detects and prints the schema found in the JSON documents.
  • Counts the number of documents in the database.
  • Prints out a subset of the data and shows how to print out a specific field in the data.
  • Uses SparkSQL to perform counts, sums, and order by value queries on the data.
  • Prints a graph of the monthly sales.
  • Filters the data based on a specific sales rep and month.
  • Counts and shows the filtered data.
  • Saves the filtered data as documents into a Cloudant database in your own account.

    (You need to create the database in your Cloudant account and enter credentials for your account in the notebook before this final step will work.)
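The final two steps above, filtering and writing back, might look roughly like the sketch below, assuming a DataFrame named `sales` loaded as shown in the notebook. The field names follow the sales rep, month, and amount schema described above, and the rep value, credentials, and target database name are all placeholders; adjust them to match the actual documents and your own account.

```python
# Filter down to one hypothetical sales rep and month; the column names
# 'rep' and 'month' are assumed from the schema described above.
filtered = sales.filter(sales.rep == "SOME_REP").filter(sales.month == "May")
print(filtered.count())  # how many documents matched the filter

# Persist the filtered rows as JSON documents in a database you have
# already created in your own Cloudant account (placeholder credentials).
filtered.write.format("com.cloudant.spark") \
    .option("cloudant.host", "YOURACCOUNT.cloudant.com") \
    .option("cloudant.username", "YOURUSERNAME") \
    .option("cloudant.password", "YOURPASSWORD") \
    .save("sales_filtered")
```

Each row of the DataFrame becomes one JSON document in the target database, which is why the notebook asks you to create that database before running the save step.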

Notes for new Bluemix users:

  1. After provisioning the IBM Analytics for Apache Spark service, click on its service tile in the Bluemix dashboard and open the UI to manage Spark instances.
  2. Create a new instance (if needed) and a new notebook within that instance.
  3. On the Create Notebook page, choose “From URL” and paste the URL of the raw IPython notebook data.
  4. Run the code block by block using the triangular play button in the menu bar, but be sure to read the code comments before running block 10 and modify the snippet accordingly.

We hope you find the Spark integration a powerful tool to conduct analytics on your Cloudant data. If you have any feedback or encounter an issue with the spark-cloudant connector, please open an issue in GitHub.

© “Apache”, “CouchDB”, “Spark”, “Apache CouchDB”, “Apache Spark”, and the Spark logo are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.
