

Overview

Skill Level: Any Skill Level

Intermediate: some database and programming knowledge is needed to follow the examples.

Solving data-science problems with Python, Spark, Cloudant and Bluemix Part 1: Creating a Cloudant database and loading data with Python

Ingredients

Bluemix, Python, Spark, Cloudant

Step-by-step

  1. Create a Cloudant data-source in Bluemix

    To work through the entire tutorial, follow these links to all of its parts.

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-3/

    Cloudant is a simple-to-use NoSQL database service that takes away the complexity and time of configuration, creation, architecture design, and hardware and storage selection. These tasks often require you to work closely with database administration teams and can be time-consuming when all you wish to do is analyse some data to solve or understand a problem. In this recipe you will see how simple it is to use Cloudant to manage data for large online analytics. As a NoSQL JSON document store, Cloudant is ideal for managing semi-structured or unstructured data.

    There are plenty of examples of getting a Bluemix login ID and starting your Bluemix session, so I will not go into this and will jump straight into Cloudant. The first thing we are going to do is create a Cloudant instance and a database for us to use and load data into. To create a Cloudant instance, go to the Data & Analytics section of your Bluemix catalog and select Cloudant. While there are literally hundreds of things you can do with Cloudant, in this recipe we will only create the database, load data and read the data. However, for a twist we will do this via Python and the REST APIs, because it is a capability you will need as you start to analyse data: you will want to do this as a step within your analysis.

    Once you have created your Cloudant instance within Bluemix, you will need to note the username, password and URL in order to use it. This information authenticates you to the instance and allows you or your program to interact with it, i.e. create databases, load data, establish indexes, etc. You will notice that the credentials are stored in a familiar-looking JSON document structure; in Python this equates to a dictionary, so extracting the authentication components is quite simple. Before looking at the code, find the credentials for your Cloudant service in Bluemix.

    Now let's have a look at the Python code to access your Cloudant database instance. You will need this code snippet again when you start using Spark (Python).

    credentials = {"credentials": 
    {"username": "------------------------------------------",
    "password": "-----------------------------------------------------------------",
    "host": "------------------------------bluemix.cloudant.com",
    "port": 443,
    "url": "https://----------------------------------bluemix:---------------------bluemix.cloudant.com"
    } }

    # Pull the individual fields out of the nested dictionary
    username = credentials['credentials']['username']
    password = credentials['credentials']['password']
    host = credentials['credentials']['host']
    url = credentials['credentials']['url']
    print(username)
    print(password)
    print(host)
    print(url)
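Because the credentials arrive as a JSON document, they can also be parsed from text instead of being pasted in as a dictionary. The following is a minimal sketch that assumes the same structure as the credentials above; the helper name `extract_credentials` and the placeholder values are illustrative only and are not part of Bluemix or Cloudant.

```python
import json


def extract_credentials(json_text):
    """Parse a credentials JSON document and return the fields used in this recipe."""
    inner = json.loads(json_text)['credentials']
    return {key: inner[key] for key in ('username', 'password', 'host', 'url')}


# Placeholder values only (hypothetical); substitute your own credentials document.
sample = '''{"credentials": {
    "username": "my-user",
    "password": "my-password",
    "host": "my-user.cloudant.com",
    "port": 443,
    "url": "https://my-user:my-password@my-user.cloudant.com"
}}'''

creds = extract_credentials(sample)
print(creds['username'])
print(creds['host'])
```

Returning only the four fields the recipe uses keeps the rest of the code independent of any extra keys the credentials document may carry.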

    Now that we have created a Cloudant database instance and have access to our credentials, let's create a database. Rather than doing it via Bluemix, let's learn how to do it in Python.

  2. Creating a working database in Cloudant - from Python

    Installing Python Cloudant

    To use Python and Cloudant together, you need the Python Cloudant library. If you plan to run your Python code outside of Bluemix, install it on your server; if you plan to run it under Bluemix, include it in your requirements file. The example here shows how to install the Python Cloudant library on your server (or desktop), taken directly from the manual: “Because 2.0.0 is still in development (2.0.0a2) and we wish to give developers time to upgrade, version 0.5.10 will remain the latest stable version on PyPI until at least early 2016. In order to install version 2.0.0a1 or greater, execute from your command line after downloading from GitHub:”

    pip install --pre cloudant

    Once you have installed the Python libraries for Cloudant, creating a database with your credentials is a simple matter of using the REST API. Let's have a look at the code that does this.

    import os
    import json
    import requests
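    # Note: this import path matches the 2.0.0 pre-releases; later python-cloudant
    # releases use "from cloudant.client import Cloudant" instead
    from cloudant.account import Cloudant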
    from cloudant.result import Result

    #---------------------------------------------------------------------------------------------------------------#
    # Authentication details below are extracted from the Bluemix credentials. In this example they are
    # hard-coded because the code is being run outside of Bluemix
    #---------------------------------------------------------------------------------------------------------------#
    USERNAME = '--------------------------------------------'
    PASSWORD = '--------------------------------------------'
    url = 'https://---------------------------------------------bluemix.cloudant.com'
    auth = ( USERNAME, PASSWORD )
    hostname = '-------------------------------------------bluemix.cloudant.com'

    #---------------------------------------------------------------------------------------------------------------#
    # Name of database to create
    # and the connection string for Cloudant in Bluemix
    #---------------------------------------------------------------------------------------------------------------#

    my_database = 'my_exampledb'

    client = Cloudant(USERNAME, PASSWORD, url='https://'+hostname)
    client.connect()


    #---------------------------------------------------------------------------------------------------------------#
    # The code below creates a database (named in the my_database variable)
    #---------------------------------------------------------------------------------------------------------------#
    requests.put( url + '/' + my_database, auth=auth )

    Let's have a look at the key lines in this code:
    Import statements: these load the Python libraries that need to be used.
    my_database = 'my_exampledb' sets a global variable with the name of the database.
    Create a connection to the Cloudant instance at the location given in the URL string.

    client = Cloudant(USERNAME, PASSWORD, url='https://'+hostname)

    Connect to it.

    client.connect()

    Create the database by using a PUT request.

    requests.put( url + '/' + my_database, auth=auth )

    On running this code, you will see a database named my_exampledb created in your instance on Bluemix. You can explore it in Bluemix.
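The PUT call above discards the response, so a failure would go unnoticed. The sketch below wraps the same request and reports the outcome. It assumes the CouchDB-style status codes that Cloudant uses for database creation (201, 202 and 412); the names `create_database`, `PUT_DB_STATUS` and the `http_put` parameter are hypothetical, introduced only for illustration.

```python
# Meanings of the status codes returned by PUT /{db} (CouchDB-style API).
PUT_DB_STATUS = {
    201: 'created',
    202: 'accepted',
    412: 'already exists',
}


def create_database(base_url, db_name, auth, http_put=None):
    """Create db_name under base_url and describe the result.

    http_put is a hypothetical hook so the helper can be exercised offline;
    when it is omitted, requests.put is used, as in the recipe.
    """
    if http_put is None:
        import requests  # imported lazily so the sketch loads without the library
        http_put = requests.put
    response = http_put(base_url + '/' + db_name, auth=auth)
    return PUT_DB_STATUS.get(response.status_code,
                             'unexpected status %d' % response.status_code)
```

With the variables from the recipe, `print(create_database(url, my_database, auth))` would report whether the database was newly created or already existed.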

  3. Preparing your data

    A common task is the preparation and loading of data. In this example, I have a flat file with the number of houses built each month in Australia since 1975. The data is pretty simple and contains two fields, housingdate and housingbuilds. Here is a small fragment of the data:
    Oct-1975 5013
    Nov-1975 4471
    Dec-1975 4042
    The objective is to turn each row into a document so it can be stored in a NoSQL database such as Cloudant. Once that is done, it becomes a simple task to access it in Spark to analyse the data. So let's see how to create documents that convert the data from the list above to look like:
    {'housingdate': 'Oct-1975', 'housingbuilds': 5013}
    {'housingdate': 'Nov-1975', 'housingbuilds': 4471}
    {'housingdate': 'Dec-1975', 'housingbuilds': 4042}

    Here is an example of the code to do this:

    f = open('/Users/savio/documents/fivecopy.txt')
    ## Read the first line
    line = f.readline()

    ## If the file is not empty keep reading one line at a time
    ## until the end of the file
    output_array = []
    while line:
        a = line
        ## Strip the trailing newline character
        a = a.replace('\n', '')
        ## Fixed-width columns: date in the first 8 characters, count from character 15
        housingdate = a[:8]
        housingbuilds = int(a[15:])
        b = {"housingdate": housingdate, "housingbuilds": housingbuilds}
        output_array.append(b)
        line = f.readline()
    f.close()
    print(output_array)

    Again, let's have a quick look at the key lines in this code:

    Remove any newline characters
    a = a.replace('\n', '')
    Extract the first 8 characters into the variable housingdate
    housingdate = a[:8]
    Extract from character 15 onwards into the variable housingbuilds, converted to an integer
    housingbuilds = int(a[15:])
    Create a dictionary variable (b) with the extracted data and the keys housingdate and housingbuilds
    b = {"housingdate": housingdate, "housingbuilds": housingbuilds}
    Append it to the output array.
    output_array.append(b)
    Read the next line
    line = f.readline()

    This should produce a list that looks like the one below, which we can now simply loop through to load into a Cloudant database.

    [{'housingdate': 'Oct-1975', 'housingbuilds': 5013}, {'housingdate': 'Nov-1975', 'housingbuilds': 4471}, {'housingdate': 'Dec-1975', 'housingbuilds': 4042}, {'housingdate': 'Jan-1976', 'housingbuilds': 3442}, {'housingdate': 'Feb-1976', 'housingbuilds': 3727}, {'housingdate': 'Mar-1976', 'housingbuilds': 4548}, {'housingdate': 'Apr-1976', 'housingbuilds': 3642}]
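The fixed-column slicing above depends on the exact spacing of the input file. As an alternative, here is a minimal sketch that splits each line on whitespace instead, assuming every non-blank line holds exactly a date and a count. The function names `parse_housing_line` and `parse_housing_lines` are illustrative, not part of the original recipe.

```python
def parse_housing_line(line):
    """Turn one 'Mon-YYYY count' line into a document-style dictionary."""
    housingdate, housingbuilds = line.split()
    return {'housingdate': housingdate, 'housingbuilds': int(housingbuilds)}


def parse_housing_lines(lines):
    """Parse every non-blank line from an iterable of lines."""
    return [parse_housing_line(line) for line in lines if line.strip()]


# Sample rows taken from the data fragment shown earlier.
sample_lines = ['Oct-1975 5013\n', 'Nov-1975 4471\n', 'Dec-1975 4042\n']
output_array = parse_housing_lines(sample_lines)
print(output_array)
```

Because an open file is itself an iterable of lines, the same helper can be applied to the data file with `with open(path) as f: output_array = parse_housing_lines(f)`, which also closes the file automatically.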

  4. Loading your documents into your Cloudant database

    Note that we will use the same basic connection code as we did previously:

    import os
    import json
    import requests
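    # Note: this import path matches the 2.0.0 pre-releases; later python-cloudant
    # releases use "from cloudant.client import Cloudant" instead
    from cloudant.account import Cloudant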
    from cloudant.result import Result

    #---------------------------------------------------------------------------------------------------------------#
    # Authentication details below are extracted from the Bluemix credentials. In this example they are
    # hard-coded because the code is being run outside of Bluemix
    #---------------------------------------------------------------------------------------------------------------#
    USERNAME = '--------------------------------------------'
    PASSWORD = '--------------------------------------------'
    url = 'https://---------------------------------------------bluemix.cloudant.com'
    auth = ( USERNAME, PASSWORD )
    hostname = '-------------------------------------------bluemix.cloudant.com'

    #---------------------------------------------------------------------------------------------------------------#
    # Name of database to create
    # and the connection string for Cloudant in Bluemix
    #---------------------------------------------------------------------------------------------------------------#

    my_database = 'my_exampledb'

    client = Cloudant(USERNAME, PASSWORD, url='https://'+hostname)
    client.connect()

    my_database_connection = client[my_database]

    Let's add to this code (note that I have removed the PUT statement from the previous example). Append the code below to the earlier code that created the JSON-format documents.

    for x in range(0, len(output_array)):
        # Store each dictionary as one document in the database
        my_document = my_database_connection.create_document(output_array[x])
        print("loaded document " + str(output_array[x]))

    Let's go through this code:

    Start a loop over each entry in the output_array that we previously created.
    for x in range(0, len(output_array)):

    Create a document in the database using the output_array entry.
    my_document = my_database_connection.create_document(output_array[x])

    The output should look like this (this run also printed the array, its length and each document before loading it):

    [{'housingdate': 'Oct-1975', 'housingbuilds': 5013}, {'housingdate': 'Nov-1975', 'housingbuilds': 4471}, {'housingdate': 'Dec-1975', 'housingbuilds': 4042}, {'housingdate': 'Jan-1976', 'housingbuilds': 3442}, {'housingdate': 'Feb-1976', 'housingbuilds': 3727}, {'housingdate': 'Mar-1976', 'housingbuilds': 4548}, {'housingdate': 'Apr-1976', 'housingbuilds': 3642}]
    7
    {'housingdate': 'Oct-1975', 'housingbuilds': 5013}
    loaded document {'housingdate': 'Oct-1975', 'housingbuilds': 5013}
    {'housingdate': 'Nov-1975', 'housingbuilds': 4471}
    loaded document {'housingdate': 'Nov-1975', 'housingbuilds': 4471}
    {'housingdate': 'Dec-1975', 'housingbuilds': 4042}
    loaded document {'housingdate': 'Dec-1975', 'housingbuilds': 4042}
    {'housingdate': 'Jan-1976', 'housingbuilds': 3442}
    loaded document {'housingdate': 'Jan-1976', 'housingbuilds': 3442}
    {'housingdate': 'Feb-1976', 'housingbuilds': 3727}
    loaded document {'housingdate': 'Feb-1976', 'housingbuilds': 3727}
    {'housingdate': 'Mar-1976', 'housingbuilds': 4548}
    loaded document {'housingdate': 'Mar-1976', 'housingbuilds': 4548}
    {'housingdate': 'Apr-1976', 'housingbuilds': 3642}
    loaded document {'housingdate': 'Apr-1976', 'housingbuilds': 3642}
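To confirm the load from Python rather than from the Bluemix console, the documents can be read back over the REST API. The sketch below assumes the CouchDB-style `_all_docs` endpoint with `include_docs=true`, which Cloudant exposes. The helper name `fetch_all_documents` and its `http_get` parameter are hypothetical and exist only so the example can be exercised offline.

```python
def fetch_all_documents(base_url, db_name, auth, http_get=None):
    """Return every document in db_name as a list of dictionaries.

    Uses GET /{db}/_all_docs?include_docs=true. http_get is a hypothetical
    hook for testing; when omitted, requests.get is used.
    """
    if http_get is None:
        import requests  # imported lazily so the sketch loads without the library
        http_get = requests.get
    response = http_get(base_url + '/' + db_name + '/_all_docs',
                        params={'include_docs': 'true'}, auth=auth)
    # Each row wraps the stored document under the 'doc' key.
    return [row['doc'] for row in response.json()['rows']]
```

With the variables from the recipe, `fetch_all_documents(url, my_database, auth)` should return the loaded housing documents, each with the `_id` and `_rev` fields that Cloudant adds.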

    Looking back into Bluemix and Cloudant, you can see the data you just loaded. Now try it with a larger data file.
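For a larger file, sending one request per document can become slow. One possible alternative, sketched here under the assumption that the CouchDB-style `_bulk_docs` endpoint is available on your Cloudant instance, is to post all documents in a single request. The name `bulk_load` and the `http_post` parameter are hypothetical, introduced for illustration only.

```python
def bulk_load(base_url, db_name, auth, documents, http_post=None):
    """POST all documents to /{db}/_bulk_docs in one request.

    Returns the HTTP status code of the response. http_post is a
    hypothetical hook for testing; when omitted, requests.post is used.
    """
    if http_post is None:
        import requests  # imported lazily so the sketch loads without the library
        http_post = requests.post
    # The endpoint expects a JSON body of the form {"docs": [...]}.
    payload = {'docs': list(documents)}
    response = http_post(base_url + '/' + db_name + '/_bulk_docs',
                         json=payload, auth=auth)
    return response.status_code
```

With the variables from the recipe this would be called as `bulk_load(url, my_database, auth, output_array)`. The response body lists a result for each document, so it is worth inspecting it for individual failures rather than relying on the status code alone.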

    In the next example, we will see how to use this data with Spark to query and report on it. You can access it here:

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/
