Skill Level: Any Skill Level


Using your NoSQL data in Spark with Python








  1. Let's quickly review Part 1.

    To follow the entire tutorial, use these links for all of its parts.

    Part 1 of this tutorial showed us how to create and load a Cloudant NoSQL database; you can read it here. Now let's see how we can use this data to gain a basic level of insight into it.

    What we have done to date is highlighted in blue in the diagram below. The task ahead of us is to read the data, import it into Spark and subsequently produce insights from the data.

  2. Creating a Spark connector to our Cloudant NoSQL database

    Now that we have a Cloudant NoSQL database, the next step is to create a connector to it in Spark that will allow us to read its data directly and then find insights in it.


    Create your Spark instance by selecting Spark from the Bluemix dashboard. Once you have established your Spark instance in Bluemix, the first thing to do is create a Notebook and a connection to our database.

    To create a data connection, select “Data” on the left hand side. This will open a screen that allows us to connect to our “my_exampledb” Cloudant database.

    Once the create connection screen is open, enter a connection name and a description of the underlying database. Meaningful names and descriptions help as you accumulate versions and generations of the data you are analysing. You may choose whether your database is internal or external. As we are using Cloudant in Bluemix, we will choose internal, and select both the instance of Cloudant we created and the database we created, “my_exampledb”. Once all this is done, simply click “Create Connection”.
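    The connection you create here can later be inserted into a notebook as a small credentials dictionary. The sketch below shows the general shape with placeholder values; the field names mirror those used later in this tutorial, but every value is hypothetical and your generated credentials will differ.

    ```python
    # Illustrative sketch only: the real dictionary is generated by the
    # "insert into code" button and its values are specific to your service.
    credentials_1 = {
        'username': 'example-user',      # placeholder, not a real account
        'password': 'example-password',  # placeholder
        'host': '',
        'port': '443',
        'url': ''

    # Later notebook code reads individual fields out of the dictionary:
    db_username = credentials_1['username']
    db_host = credentials_1['host']
    print(db_username)
    ```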

  3. What is a Spark Notebook?

    Spark notebooks were created to help data scientists manage and organize their work. Initially, much of the analysis was static: reports and models built from whatever data was available. As things grew more complex and the data driving the reports became more dynamic, reporting and results needed to be more agile and change on demand. In addition, data sources grew and the underlying databases became not just big, but enormous.

    As researchers and scientists become more distributed and skills more fragmented, sharing and collaboration on analysis will only grow. Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It provides a comprehensive, unified framework to manage big data processing requirements across a variety of data sets such as databases, text files and graphs. Its in-memory capability delivers performance benefits when working with large data sources. Importantly, as we will see, it lets the developer quickly write applications in R, Scala or, in our case, Python.
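    As a loose analogy in plain Python (no Spark installation required), the lazy transform-then-act style behind Spark's in-memory speed looks like the sketch below; in PySpark these would be distributed RDD or DataFrame operations, not generator expressions.

    ```python
    # Plain-Python analogy of Spark's transformation style (not actual PySpark).
    values = [4471, 4188, 4322, 3588]

    # "Transformations": describe the work lazily, nothing is computed yet...
    doubled = (v * 2 for v in values)
    large = (v for v in doubled if v > 8000)

    # ..."action": only materialising the result triggers the computation.
    result = list(large)
    print(result)  # [8942, 8376, 8644]
    ```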

  4. Getting started with your Spark Python Notebook

    We will create a Python Notebook in our Spark instance. A notebook gives us an interactive working environment in which we can enter Python code fragments to interact with our Cloudant data. It also lets us complete mathematical computations as well as plot and display findings from our data. To start the process, start up your Spark instance from your Bluemix dashboard. It should look something like this.

    Now we create our new Notebook. We will call it “MyDemoNotebook”; start the process by clicking “New Notebook”.

    You will be presented with a screen like the one below. Enter a name and a good description; you will probably end up with a library of notebooks that you can use and share, so the more detail you enter here the better. Finally, this example's language of choice is Python, so for this recipe we will select the Python radio button and click Create.

    You should then be presented with something like the screen shown in the next topic.

  5. Adding your Cloudant Data-connection to your Spark Notebook

    If you remember, in a previous step we created a connection to the database we loaded and called it “Australian Housing Data”. Now we need to use that connection in our Spark Notebook. Let's see how to do that. After you create your “empty” notebook you should see a screen similar to this. On the right hand side you will see the Palette and the Data Sources available to your notebook. Click on Data Sources to start the connection.

    On selecting “+ Add Source”, you will be prompted with a full list of all the data sources you have created that can be selected for analysis. This is a fast way to connect to various data sources for your analysis. Select the Australian Housing Data that we previously created and then click “Add Data Source”.

    You will see on the right hand side the connection to the data source, and on clicking “insert into code” the authorization information for that data source will be included in the code box. It should look like the example below, but with your own database credentials.

  6. How to run your Python Spark Code.

    To run your Python Spark code (we will have more examples coming), simply add your Python code in the code box and press “Run”.

    Other useful keys in the menu are Stop and Save / Checkpoint.

    It is important to remember that each time you click “+” you will get a new code box. Values stored in previous code boxes remain available in subsequent code boxes.
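    One way to picture this shared state is a single namespace that every code box reads and writes. The sketch below mimics that behaviour in plain Python; it is only an analogy, not the notebook's actual internals.

    ```python
    # Two "code boxes" sharing one namespace, as notebook cells do.
    notebook_ns = {}

    cell_1 = "greeting = 'Hello from Spark'"
    cell_2 = "message = greeting + '!'"

    exec(cell_1, notebook_ns)  # run the first code box
    exec(cell_2, notebook_ns)  # a later box still sees `greeting`

    print(notebook_ns['message'])  # Hello from Spark!
    ```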

  7. Let's run some Python Code and Connect to our Cloudant Database.

    The code below is quite straightforward and is a good starting place for using Python in your Spark Notebook. The first part is the credentials dictionary. The code then performs the following steps:

    • print “Hello from Spark”
    • extract the credentials to host variables
    • print the user credentials
    • connect to the “my_exampledb” database using the credentials
    • print the schema of the “my_exampledb” database
    • print the number of documents loaded into the database

    Copy the code below and try connecting to your database.

    credentials_1 = {
      # credentials generated by “insert into code” go here;
      # the values are specific to your Cloudant instance
    }

    print('Hello from Spark')

    db_username = credentials_1['username']
    db_password = credentials_1['password']
    db_host = credentials_1['host']
    db_port = credentials_1['port']
    db_url = credentials_1['url']

    print(db_username)
    print(db_password)
    print(db_host)
    print(db_port)
    print(db_url)

    # connection to cloudant database here #
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    cloudantdata ="com.cloudant.spark")\
        .option("", db_host)\
        .option("cloudant.username", db_username)\
        .option("cloudant.password", db_password)\
        .load("my_exampledb")

    cloudantdata.printSchema()
    print(cloudantdata.count())

    Hello from Spark

    root
     |-- _id: string (nullable = true)
     |-- _rev: string (nullable = true)
     |-- constructions_total: string (nullable = true)
     |-- constructions_value: long (nullable = true)
     |-- date: string (nullable = true)
     |-- established_dwellings_no: long (nullable = true)
     |-- established_dwellings_value: long (nullable = true)
     |-- new_dwellings_purchased: long (nullable = true)
     |-- new_dwellings_value: long (nullable = true)
     |-- total_dwellings_no: long (nullable = true)
     |-- total_dwellings_value: long (nullable = true)

  8. Querying and displaying some data from our NoSQL database

    The last thing we will do in this tutorial is read the Cloudant database and create a dataframe. There are a couple of terms to know here. The first is the NoSQL database: that is our external source of data, and we showed previously how to create it and connect to it. The next is the term DataFrame. Often in code you will see a variable tagged with DF, e.g. weatherDF. This denotes that the variable is a dataframe.

    What is a DataFrame? A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a relational table. A DataFrame can be constructed from an array of different sources such as Hive tables and structured data files. The DataFrame API was designed for modern big data and data science applications, taking inspiration from data frames in R and Pandas in Python.
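    To make the “named columns” idea concrete without Spark, here is a plain-Python sketch; the column names and values are taken from the housing data used in this tutorial, and the dict-based structure is only an illustration, not how Spark stores data.

    ```python
    # Minimal sketch of "named columns" using plain Python dicts.
    rows = [
        {'constructions_total': 4471, 'constructions_value': 83700, 'date': 'Nov-1975'},
        {'constructions_total': 4188, 'constructions_value': 82583, 'date': 'May-1976'},

    # Selecting a subset of named columns, roughly what does:
    selected = [{k: r[k] for k in ('constructions_total', 'date')} for r in rows]
    print(selected[0])  # {'constructions_total': 4471, 'date': 'Nov-1975'}
    ```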

    Finally, Pandas (Python Data Analysis) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

    In the example below we have expanded on the Connection described previously with the definition of a dataframe.

    df =['constructions_total'], cloudantdata['constructions_value'], cloudantdata['date'])

    The dataframe in this example selects 3 columns from the Cloudant NoSQL documents: 'constructions_total', 'constructions_value' and 'date'. You will see the dataframe (df) below, looking like a table.

    sqlContext = SQLContext(sc)
    cloudantdata ="com.cloudant.spark")\
        .option("", db_host)\
        .option("cloudant.username", db_username)\
        .option("cloudant.password", db_password)\
        .load("my_exampledb")

    df =['constructions_total'],
                          cloudantdata['constructions_value'],
                          cloudantdata['date'])

    DataFrame[_id: string, _rev: string, constructions_total: string, constructions_value: bigint, date: string, established_dwellings_no: bigint, established_dwellings_value: bigint, new_dwellings_purchased: bigint, new_dwellings_value: bigint, total_dwellings_no: bigint, total_dwellings_value: bigint]

    +-------------------+-------------------+--------+
    |constructions_total|constructions_value|    date|
    +-------------------+-------------------+--------+
    |               4471|              83700|Nov-1975|
    |               4188|              82583|May-1976|
    |               4322|              85208|Nov-1976|
    |               3588|              77061|May-1977|
    |               4279|              90503|Nov-1977|
    |               4030|              86571|May-1978|
    |               4267|             100877|Nov-1978|
    |               4537|             109524|May-1979|
    |               4028|             101623|Nov-1979|
    |               4227|             113996|May-1980|
    |               3684|              95472|Nov-1980|
    |               3683|              97424|May-1981|
    |               3224|              88666|Nov-1981|
    |               2986|              83015|May-1982|
    |               3080|              89739|Nov-1982|
    |               4003|             123466|May-1983|
    |               5078|             163802|Nov-1983|
    |               6338|             218457|May-1984|
    |               5093|             191285|Nov-1984|
    |               6064|             244188|May-1985|
    +-------------------+-------------------+--------+
    only showing top 20 rows


  9. Summary:

    We have covered a lot in this tutorial: creating a Spark instance in Bluemix, learning how to connect it to our Cloudant database, and then writing code that uses that connection to select data in a form that can be easily read, processed and displayed. We will continue with some specialist data analysis techniques in the next tutorial.
