IBM Developer Advocacy

Load Cloudant Data in Apache Spark Using a Scala Notebook

Please note: This tutorial may show a legacy Bluemix interface rather than the current DSx interface; the functionality is the same. If your workflow differs, refer to the Spark docs or the DSx docs.

Learn how to load Cloudant data into IBM Analytics for Apache Spark using the beta Cloudant-Spark connector in a Scala notebook, so you can easily filter and refine that data.

You can download the Scala notebook shown in the video and referenced in this tutorial, or create your own notebook by copying and pasting the code from this tutorial into a new notebook.


Try the tutorial

Learn how to use Spark SQL to load and filter Cloudant data in a Scala notebook in IBM Analytics for Apache Spark.

What you’ll learn

At the end of this tutorial, you should be able to:

  • Replicate a Cloudant database into your Cloudant account.
  • Create a Scala notebook in IBM Analytics for Apache Spark.
  • Use Spark SQL to load and filter the Cloudant data.
  • Write data back to a Cloudant database.

Before you begin

Watch the Getting Started on Bluemix video to add the IBM Analytics for Apache Spark service to your Bluemix account.

Procedure 1: Replicate the Crimes database into your Cloudant account

  1. Sign in to your Cloudant account or sign in to Bluemix, and access the Cloudant Dashboard.
  2. Click the Replication tab.
  3. Complete the form on the right side of the screen to create a new replication job with the following specifications.
    1. For the _id, type crimes_replication.
    2. In this tutorial, you want to replicate a database from the Education account to your own personal account, so indicate that the source database is a Remote Database and type the URL to the database as https://education.cloudant.com/crimes.
      In this case, you don’t need to set any special permissions because this database is already set to allow anyone to replicate it locally.

    3. For the target database, click New Database, select Create a new database locally, and then specify the database name as crimes.
    4. Leave Make this replication continuous unchecked so that this is a one-time replication.
  4. Click Replicate.
  5. Next, type your password, and click Continue.

    Under the covers, the process base64-encodes your credentials and includes that authentication information in the replication document (a sketch of such a document follows these steps).

  6. You get the success message: This replication has been posted to the _replicator database but hasn’t been fired yet. Check the _replicator DB to see its state.
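
For reference, here is a rough sketch of what the resulting document in the _replicator database might look like. The field values are illustrative (youraccount stands in for your Cloudant account host), and the exact layout can vary with the dashboard version:

    {
      "_id": "crimes_replication",
      "source": "https://education.cloudant.com/crimes",
      "target": {
        "url": "https://youraccount.cloudant.com/crimes",
        "headers": {
          "Authorization": "Basic <base64 of username:password>"
        }
      }
    }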

Procedure 2: Create a Scala notebook to analyze the Cloudant data

  1. From your Bluemix Dashboard, open the Apache Spark instance, and launch IBM Analytics for Apache Spark.
  2. Open the existing Spark instance.
  3. On the Analytics tab, click New Notebook, select Scala, type a name and description for the notebook, and click Create Notebook.
  4. Paste the following statement into the first cell, and then click Run. This command creates a SQLContext, which is the entry point to all Spark SQL functionality and is required to execute SQL queries. The sc variable is the SparkContext that the notebook predefines for you.

    Command: val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  5. Paste the following statement into the second cell, and then click Run. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command reads the crimes database from the Cloudant account and assigns it to the cloudantdata variable. Setting schemaSampleSize to -1 tells the connector to sample all documents when inferring the schema.
    Command:
    val cloudantdata = sqlContext.read.format("com.cloudant.spark").
    option("cloudant.host","hostname").
    option("cloudant.username", "username").
    option("cloudant.password", "password").
    option("schemaSampleSize", "-1").
    load("crimes")
  6. Paste the following statement into the third cell, and then click Run. This command lets you take a look at the schema the connector inferred.
    Command: cloudantdata.printSchema
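    The output is a tree of the inferred fields. A heavily abridged sketch (the actual fields depend on the documents in your database) looks something like:
    root
     |-- _id: string (nullable = true)
     |-- _rev: string (nullable = true)
     |-- properties: struct (nullable = true)
     |    |-- naturecode: string (nullable = true)
     ...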
  7. Paste the following statement into the fourth cell, and then click Run. A DataFrame can be created directly from a Cloudant database. This line creates and displays a DataFrame containing just the crime codes (the properties.naturecode field) from cloudantdata.
    Command: val resultsDF = cloudantdata.select("properties.naturecode")
    resultsDF.show()
  8. Paste the following statement into the fifth cell, and then click Run. This line creates a DataFrame containing only the crimes whose nature code indicates a public disturbance. Notice that the .select statement in the previous step specified which column to select, while the .filter statement here specifies which rows to keep. Refer to the SQL Programming Guide for more information on the .select and .filter syntax. An equivalent Spark SQL query is sketched after the command.
    Command: val disturbDF = cloudantdata.filter(cloudantdata.col("properties.naturecode").startsWith("DISTRB"))
    disturbDF.show()
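
    As a side note, the same filter can be expressed as a Spark SQL query by registering the DataFrame as a temporary table. This is only a sketch; the table name crimes_tbl is an arbitrary choice:
    // Register the DataFrame so Spark SQL queries can reference it
    cloudantdata.registerTempTable("crimes_tbl")
    // Equivalent to the .filter call above
    val disturbSqlDF = sqlContext.sql("SELECT * FROM crimes_tbl WHERE properties.naturecode LIKE 'DISTRB%'")
    disturbSqlDF.show()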
  9. Paste the following statement into the sixth cell, and then click Run. This line persists the DataFrame back to another Cloudant database. The Cloudant-Spark connector does not create the database, so the database needs to already exist; create it from the Cloudant dashboard first. This command writes the 7 matching documents, containing just the properties of each crime, into a database named crimes_filtered. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account.
    Command: disturbDF.select("properties").write.format("com.cloudant.spark").
    option("cloudant.host","hostname.cloudant.com").
    option("cloudant.username","username").
    option("cloudant.password","password").
    save("crimes_filtered")

Procedure 3: View the database from the Cloudant dashboard

  1. Open the Cloudant dashboard by signing in to your Cloudant account.
  2. In the list of databases, notice the original crimes database contains 273 documents, while the crimes_filtered database contains only 7 documents.
  3. Open the crimes_filtered database.
  4. Open the documents in the database to verify that all documents contain the naturecode “DISTRB”.