Learn how to use the SQL-Cloudant connector in a Python notebook to easily load, filter, and refine Cloudant data using Apache Spark in IBM Watson Studio.


Try the tutorial

Try using Spark SQL to load, filter, and visualize Cloudant data in a Python notebook in IBM Watson Studio.

Before you begin

Watch the Getting Started on IBM Cloud video to add the IBM Analytics for Apache Spark service to your IBM Cloud account. You may also want to watch the Use the Cloudant-Spark Connector in Scala Notebook video to see how to accomplish these same tasks using a Scala notebook.

You can download the Python notebook shown in the video and referenced in this tutorial, or create your own notebook by copying and pasting the code below into a new notebook.

Procedure 1: Replicate the Crimes database into your Cloudant account

  1. Sign in to your Cloudant account or sign in to IBM Cloud, and access the Cloudant Dashboard.

    If attempting to access the Cloudant Dashboard through IBM Cloud:

    1. Access your IBM Cloud Dashboard.
    2. Click Work With Data.
    3. Select the Services tab.
    4. Click Portfolio.
    5. Select the Cloudant service.
  2. Click the Replication tab.
  3. Complete the form on the right side of the screen to create a new replication job with the following specifications.
    1. For the _id, type crimes_replication.
    2. In this tutorial, you want to replicate a database from the Education account to your own personal account, so indicate that the source database is a Remote Database and type the URL to the database as https://education.cloudant.com/crimes.
      In this case, you don’t need to set any special permissions because this database is already set to allow anyone to replicate it locally.

    3. For the target database, click New Database, select Create a new database locally, and then specify the database name as crimes.
    4. Leave Make this replication continuous unchecked so that this is a one-time replication event.
  4. Click Replicate.
  5. Next, type your password, and click Continue.

    Under the covers, the process base64-encodes your credentials and includes that authentication information in the replication document.
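
    For reference, the dashboard creates this replication job as a document in your account's _replicator database. If you prefer to script this step, the following Python sketch does the same thing through the standard Cloudant/CouchDB HTTP API. It assumes the requests library is available, and ACCOUNT, USERNAME, and PASSWORD are placeholders for your own Cloudant credentials.

    Command:
    # Sketch: create the same one-time replication job via the _replicator database.
    # ACCOUNT, USERNAME, and PASSWORD are placeholders for your Cloudant credentials.
    import requests

    ACCOUNT = "account"
    USERNAME = "username"
    PASSWORD = "password"

    replication_doc = {
        "source": "https://education.cloudant.com/crimes",
        "target": "https://{0}:{1}@{2}.cloudant.com/crimes".format(USERNAME, PASSWORD, ACCOUNT),
        "create_target": True,   # create the local crimes database if it doesn't exist
        "continuous": False      # one-time replication, as in the tutorial
    }

    response = requests.put(
        "https://{0}.cloudant.com/_replicator/crimes_replication".format(ACCOUNT),
        json=replication_doc,
        auth=(USERNAME, PASSWORD)
    )
    print(response.status_code, response.json())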

Procedure 2: Install the SQL-Cloudant package

  1. Log in to IBM Watson Studio at http://datascience.ibm.com.
  2. Open an existing project, or create a new project.
  3. Create a new notebook, specifying a name, description, Spark service to use, Python 2.7, and Spark 2.1.
  4. Paste the following statements into the first cell, and then click Run. These commands import PixieDust and install Bahir's sql-cloudant connector along with its play-json dependency.

    Command:
    import pixiedust
    pixiedust.installPackage("com.typesafe.play:play-json_2.11:2.5.9")
    pixiedust.installPackage("org.apache.bahir:spark-sql-cloudant_2.11:0")

  5. If you see a warning that a newer version of PixieDust is available, paste the following statement into the second cell, and then click Run.
    Command: !pip install --user --upgrade pixiedust
  6. Restart the Python kernel.
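
    To confirm that the connector and its dependency are available after the restart, you can list the packages PixieDust manages. This sketch assumes PixieDust's printAllPackages helper from its package-management utilities.

    Command:
    import pixiedust
    pixiedust.printAllPackages()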

Procedure 3: Create a Python notebook to load and filter the data

  1. Log in to IBM Watson Studio at http://datascience.ibm.com.
  2. Open an existing project, or create a new project.
  3. Create a new notebook, specifying a name, description, Spark service to use, Python 2 or 3, and Spark 2.1.
  4. Paste the following statement into the first cell, and then click Run. This command imports and initializes SparkSession.
    Command:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
  5. Paste the following statement into the second cell, and then click Run. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command reads the crimes database from the Cloudant account and assigns it to the cloudantdata variable.
    Command:
    cloudantdata = spark.read.format("org.apache.bahir.cloudant")\
    .option("cloudant.host","hostname")\
    .option("cloudant.username", "username")\
    .option("cloudant.password","password")\
    .load("crimes")
  6. Paste the following statements into the third cell, and then click Run. These commands display the schema and a record count.
    Command:
    cloudantdata.printSchema()
    cloudantdata.count()
  7. Paste the following statement into the fourth cell, and then click Run. This code displays the values of the naturecode field.
    Command: cloudantdata.select("properties.naturecode").show()
  8. Paste the following statement into the fifth cell, and then click Run. This code creates a DataFrame containing only the crimes whose naturecode is DISTRB (public disturbance), and then displays it.
    Command:
    disturbDF = cloudantdata.filter("properties.naturecode = 'DISTRB'")
    disturbDF.show()
  9. Paste the following statement into the sixth cell, and then click Run. This code persists the DataFrame back to another Cloudant database: it writes the filtered documents (7 in this dataset) to a database named crimes_filtered, saving only each crime's properties. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. Note: The createDBOnSave option creates the database if it doesn't exist.
    Command:
    disturbDF.select("properties").write.format("org.apache.bahir.cloudant")\
    .option("cloudant.host","hostname.cloudant.com")\
    .option("cloudant.username","username")\
    .option("cloudant.password","password")\
    .option("createDBOnSave", "true")\
    .save("crimes_filtered")

Procedure 4: Visualize the Cloudant data

  1. Next, you’ll see how to create a visualization of the crimes data. Paste the following statement into the next cell, and then click Run. This code creates a DataFrame containing all of the naturecodes with a count of the crime incidents for each code, and prints its schema.
    Command:
    reducedValue = cloudantdata.groupBy("properties.naturecode").count()
    reducedValue.printSchema()
  2. Paste the following statements into the next cell, and then click Run. These lines import two Python modules. The pprint module helps to produce pretty representations of data structures, and the Counter subclass from the collections module helps to count hashable objects (a short illustration appears after this procedure).
    Command:
    import pprint
    from collections import Counter
  3. Paste the following statements into the next cell, and then click Run. These lines import PySpark classes for Spark SQL and DataFrames.
    Command:
    from pyspark.sql import *
    from pyspark.sql.functions import udf, asc, desc
    from pyspark import SparkContext, SparkConf
    from pyspark.sql.types import IntegerType
  4. Paste the following statements into the next cell, and then click Run. This code sorts the naturecode counts in descending order (with ties broken by naturecode) and converts the result to a pandas DataFrame for plotting.
    Command:
    import pandas as pd
    pandaDF = reducedValue.orderBy(desc("count"), asc("naturecode")).toPandas()
    print(pandaDF)
  5. Paste the following statement into the next cell, and then click Run. This line is required to actually see the plots.
    Command: %matplotlib inline
  6. Paste the following statement into the next cell, and then click Run. This line imports matplotlib.pyplot which is a collection of command style functions that make matplotlib work like MATLAB.
    Command: import matplotlib.pyplot as plt
  7. Paste the following statements into the next cell, and then click Run. These lines assign the count and naturecode data to the values and labels objects.
    Command:
    values = pandaDF['count']
    labels = pandaDF['naturecode']
  8. Paste the following statement into the next cell, and then click Run. The first two statements provide the format for the plot. The next two statements specify that the plot should display as a horizontal bar chart with values for the x-axis and labels for the y-axis. The last statement displays the plot.
    Command:
    plt.gcf().set_size_inches(16, 12, forward=True)
    plt.title('Number of crimes by type')
    plt.barh(range(len(values)), values)
    plt.yticks(range(len(values)), labels)
    plt.show()

    A bar chart displays the number of crime incidents for each naturecode.
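
    The pprint and Counter imports from step 2 are not used by the plotting code itself. As an illustration of what they can do, this sketch (hypothetical, not part of the original notebook) counts the naturecodes on the driver and pretty-prints the five most common:

    Command:
    codes = [row.naturecode for row in cloudantdata.select("properties.naturecode").collect()]
    pprint.pprint(Counter(codes).most_common(5))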
