Learn how to use the SQL-Cloudant connector in a Scala notebook to load, filter, and refine Cloudant data using Apache Spark in IBM Data Science Experience. Download the Scala notebook shown in the video and referenced in this tutorial, or create your own notebook by copying and pasting the code from this tutorial into a new notebook.
Here are some other examples:
- Watch the Python version of this notebook in Data Science Experience.
- Download the Sales Cloudant database Python notebook example.
Try the tutorial
Before you begin
Watch the Getting Started on IBM Cloud video to add the IBM Analytics for Apache Spark service to your IBM Cloud account.
Procedure 1: Replicate the Crimes database into your Cloudant account
- Sign in to your Cloudant account or sign in to IBM Cloud, and access the Cloudant Dashboard.
- Click the Replication tab.
- Complete the form to create a new replication job with the following specifications.
- For the _id, type
- In this tutorial, you replicate a database from the Education account into your own account, so indicate that the source database is a Remote Database and enter the URL of the source database.
In this case, you don’t need to set any special permissions because this database is already set to allow anyone to replicate it locally.
- For the target database, click New Database, select Create a new database locally, and then specify the name of the new database.
- Leave Make this replication continuous unchecked so that this is a one-time replication.
- Click Replicate.
- Next, type your password, and click Continue.
Under the covers, the process base64 encodes your credentials and includes that authentication information in the replication document.
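As a sketch of that encoding step (the credentials below are placeholders, and the header layout follows the standard CouchDB replication-document convention), Basic authentication is built like this:

```scala
import java.util.Base64

// Illustrative only: how "username:password" becomes the Basic auth
// value carried in the replication document's headers.
val credentials = "username:password"  // placeholder credentials
val encoded = Base64.getEncoder.encodeToString(credentials.getBytes("UTF-8"))
// The replication document then includes something like:
// "headers": { "Authorization": "Basic <encoded value>" }
println(s"Authorization: Basic $encoded")
```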
Procedure 2: Install the SQL-Cloudant package
- Log in to Data Science Experience at http://datascience.ibm.com.
- Open an existing project, or create a new project.
- Create a new notebook, specifying a name, description, Spark service to use, Python 2.7, and Spark 2.1.
- Paste the following statement into the first cell, and then click Run. This command imports PixieDust and installs Bahir’s sql-cloudant connector and its play-json dependency.
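The cell itself is not reproduced here; a typical form, assuming the Bahir sql-cloudant Maven coordinates for Scala 2.11 and Spark 2.1 (verify the versions against the Bahir release you use), looks like this:

```python
import pixiedust

# Install Apache Bahir's sql-cloudant connector and its play-json
# dependency from Maven Central. The coordinates below are assumptions;
# check the Bahir documentation for the release matching your Spark version.
pixiedust.installPackage("org.apache.bahir:spark-sql-cloudant_2.11:2.1.1")
pixiedust.installPackage("com.typesafe.play:play-json_2.11:2.5.9")
```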
- If you see a warning that a newer version is available, paste the following statement into the second cell, and then click Run.
!pip install --user --upgrade pixiedust
- Restart the Python kernel.
Procedure 3: Create a Scala notebook to analyze the Cloudant data
- Create a new notebook, specifying a name, description, Spark service to use, Scala 2.11, and Spark 2.1.
- Paste the following statement into the first cell, and then click Run. This command creates a SQLContext, which is the entry point into all functionality in Spark SQL and is necessary to execute SQL queries.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- Paste the following statement into the second cell, and then click Run. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command reads the crimes database from the Cloudant account and assigns it to the cloudantdata variable.
val cloudantdata = sqlContext.read.format("org.apache.bahir.cloudant").option("cloudant.host", "hostname").
  option("cloudant.username", "username").option("cloudant.password", "password").load("crimes")
- Paste the following statement into the third cell, and then click Run. This next command lets you take a look at that schema.
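The statement itself is not reproduced above; in Spark, inspecting a DataFrame’s schema is typically done with:

```scala
// Print the schema that Spark inferred from the Cloudant documents
cloudantdata.printSchema()
```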
- Paste the following statement into the fourth cell, and then click Run. A DataFrame object can be created directly from a Cloudant database. This next line creates and displays a DataFrame containing all of the crime codes from the cloudantdata.
val resultsDF = cloudantdata.select("properties.naturecode")
resultsDF.show()
- Paste the following statement into the fifth cell, and then click Run. This next line creates a DataFrame containing only the crime data where the crime code indicates a public disturbance. Notice that the .select statement specifies which column to select, and the .filter statement specifies which rows to keep. Refer to the Spark SQL Programming Guide for more information on the .select and .filter syntax.
val disturbDF = cloudantdata.filter(cloudantdata.col("properties.naturecode").startsWith("DISTRB"))
- Paste the following statement into the sixth cell, and then click Run. This line persists the DataFrame to another Cloudant database. It writes 7 documents, each containing the properties of the crime, into a database named crimes_filtered. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. Note: By default the connector does not create the target database, so it must already exist; the ‘createDBOnSave’ option used here creates the database if it doesn’t exist.
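The save statement is not shown above; a sketch consistent with the options described (hostname, username, and password are placeholders for your Cloudant account details) is:

```scala
// Write the filtered crime properties to the crimes_filtered database.
// createDBOnSave tells the connector to create the database if it
// doesn't already exist.
disturbDF.select("properties").write.format("org.apache.bahir.cloudant").
  option("cloudant.host", "hostname").
  option("cloudant.username", "username").
  option("cloudant.password", "password").
  option("createDBOnSave", "true").
  save("crimes_filtered")
```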
Procedure 4: View the database from the Cloudant dashboard
- Open the Cloudant dashboard.
- In the list of databases, notice the original crimes database contains 273 documents, while the crimes_filtered database contains only 7 documents.
- Open the crimes_filtered database.
- Open the documents in the database to verify that all documents contain the naturecode “DISTRB”.