Skill Level: Beginner

Knowledge of Python, and of functional programming idioms such as map and filter, is an advantage.

In this recipe you will go through a typical data scientist scenario and generate statistics over IoT event data: connecting to a database, preparing the data for analysis with Python, and finally generating diagrams using matplotlib.


  1. Introduction

    The recipe ‘Simulate an Elevator – All using Node-RED nodes‘ showed how to simulate a real-world IoT scenario, taking an elevator as example and using Node-RED. A second recipe, titled ‘Monitor Elevator events in real time to take corrective actions using Watson IoT Platform‘, performed real-time analytics on the device events and implemented a closed control loop that puts the elevators in maintenance mode should the motor temperature go beyond 200 degrees. A third recipe, titled ‘Store Elevator device data and alerts in Cloudant NoSQL DB‘, went through different scenarios for storing event data in an IBM Cloudant database for later use, e.g. by a data scientist or a business analytics tool.

    In this recipe we shall go through some basic steps in using IBM Data Science Experience (DSX for short) to generate a few archetypical statistical diagrams over the data that was collected in the Cloudant NoSQL DB, thus taking the role of a data scientist. For this purpose we shall use Apache Spark Resilient Distributed Dataset (RDD for short) as underlying data storage, Python as programming language, matplotlib for plotting diagrams and Jupyter notebooks to keep everything organized nicely.

    The recipe will take you through the following main steps:

    • Log into Data Science Experience
    • Create a project
    • Create a Jupyter Notebook
    • Connect to the Cloudant Database
    • Prepare data for analysis using Python
    • Create a Histogram using Matplotlib
    • Create a Boxplot diagram using Matplotlib
    • Create a Time Series diagram using Matplotlib
  2. Architecture

    Before you start, it is important to provide some details on the data flow and the architecture of the solution. Elevator event data from the Watson IoT Platform will be saved to Cloudant NoSQL databases using the Historical Data Storage Extension. From there, Apache Spark running under the hood of IBM Data Science Experience can extract the data, transform it and prepare it for analysis using matplotlib.



    For this purpose a Jupyter notebook will be used to define the logic of extracting the event data from Cloudant, transforming the event data to make it available for analysis, and finally passing it to matplotlib for rendering.

    The database used will be one of the buckets generated by the Historical Data Storage Extension of the Watson IoT Platform. At the end we shall however discuss some alternative scenarios to try out with the data available from the previous recipe, such as the elevator events generated by Node-RED and the elevator alerts generated by the trigger that fires when the motor temperature goes beyond 200 degrees.

  3. Getting Started

    IBM Data Science Experience supports the data scientist community to learn, create, and collaborate. It allows the user community to organize analytic and data assets into projects so that you can share work with collaborators. It allows data scientists to analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

    You can register with the IBM Data Science Experience by going through the following steps:

    1. Click Data Science Experience. This will take you to DSX.
    2. Click Sign Up (or alternatively Sign Up for a Free Trial).
    3. Click the link Sign in with your IBM ID.
    4. Enter your IBM ID and click Continue.
    5. Enter your IBM password and click Sign In.
    6. Once in DSX, you will see the welcome page.

    Notice that inside DSX you will have access to articles, data sets, notebooks and tutorials.

  4. Create Project

    Once registered you will need to create a project. Projects allow you to orient a team of collaborators around a set of notebooks, data sets, articles, and analysis work streams. This makes it easier for organizations to re-use assets, and for teams to establish and enforce shared best practices while collaborating with each other and learning from the community.

    1. Click Create New Project (shown in the previous screenshot). This will open up the New Project dialog for defining the properties of the project.
    2. Enter the properties of the Project as follows:
      1. Name as ‘iot-elevators’.
      2. Description as you like.
      3. Spark Service as DSX-Spark.
      4. Storage Type as Object Storage.
    3. Click Create

    The project allows cooperation with other data scientists and data engineers by creating and sharing notebooks, data sets and data connections. It also provides access to RStudio. The focus in this recipe however will be on using Jupyter notebooks and Python to perform some basic task of the data scientist.


  5. Create a Jupyter Notebook for Python

    Jupyter notebooks are a web-based environment for interactive computing. You can run small pieces of code that process your data, and you can immediately view the results of your computation. Notebooks include all of the building blocks you need to work with data: data, code, computations, visualizations, text and rich media to enhance understanding.

    To create a notebook, do the following:

    1. Click add notebooks.
    2. This will open the Create Notebook dialog for defining the properties of the notebook.
    3. Set the properties of the notebook as follows:
      1. Name: ‘elevator-statistics’.
      2. Description: whatever you like.
      3. Language: Python 2.
      4. Spark version: 1.6.
    4. Select the tab named From URL. You will need to import a notebook from GitHub to be used as the starting point.
    5. Click the following link for GitHub project iot-dsx-notebook, copy the URL from the browser and insert it into the Notebook URL field as shown below.
    6. Click Create Notebook

    DSX provides support for Python, Scala and R, but we shall focus solely on Python in this recipe. Once created, the notebook should look like this:


    The toolbar at the top provides commands for sharing (e.g. on Github), scheduling, versioning, collaborating and finding resources. The Kernel menu provides commands for e.g. restarting the underlying kernel runtime system. When you open a notebook in edit mode, exactly one interactive session connects to a Jupyter kernel for the notebook language you select. This kernel executes code that you send and returns the computational results. The kernel remains active even if the web browser window is closed. Reopening the same notebook from Data Science Experience will connect the web application to the same kernel.

    There are menus as well for handling interaction with the file system (saving, printing, downloading) and for editing the cells of the notebook. Each cell has a specific format: it can be a header, a markdown or a piece of code that can be executed. The format can be controlled in the Format selection box (marked red above). The first two cells in the notebook are markdowns; the third, which defines the readDataFrameFromCloudant function, is of type Code. To run the code you will use the Run Cell command – the triangle surrounded by a circle in the screenshot above.

  6. Connect to the Cloudant Database

    The Jupyter notebook allows you to select your data sources and review the data to ensure it’s right for your needs, before you prepare it for analysis. Before you can continue you will need to get hold of the credentials for the Cloudant NoSQL DB service that stores the elevator event data and alerts. The steps for doing so are defined in Step 2 (Getting Started) of the recipe ‘Store Elevator device data and alerts in Cloudant NoSQL DB‘.


    You will further need to open up the dashboard for the Cloudant NoSQL DB and retrieve the name of the historic database to be used. This is the one for the current bucket interval:


    Having undertaken these steps, you can now continue with the notebook:

    1. Select the cell in the notebook that allows you to specify the credentials.
    2. Insert the credentials (hostname, user and password) for your Cloudant NoSQL DB service from the VCAP file.
    3. Enter the name of a database that was created by the Historical Data Storage Extension.
    4. Invoke File > Save Version to save the changes to a new version of the notebook.
    5. Select the second cell in the notebook, i.e. the one that defines the function named readDataFrameFromCloudant, and click the Run Cell command. The code will be executed and the next cell will be selected automatically.
    6. Click the Run Cell command again. Notice that executing markdowns has no effect. DSX will simply select the next cell.
    7. Click the Run Cell command again. This will execute the piece of code that defines your credentials and calls the readDataFrameFromCloudant function.
    8. Wait until DSX returns with a result, namely the first 20 documents in the database.

    The table contains the deviceId, the timestamp and the event data that is relevant for further analysis (amongst others). The interesting information is contained in the column named data. The Historical Data Storage Extension of the Watson IoT Platform stored the properties of the device as a hierarchically nested JSON structure, but in the printout above it looks as if it is a list of unnamed numbers. To complete the next step, we need to know how this data is represented internally in Python.

    The function readDataFrameFromCloudant returns a Resilient Distributed Dataset (RDD) to access the data in the Cloudant database. You can find more information on RDDs in the Spark RDD Tutorial. The RDD is a fundamental data structure of Spark, and is used to achieve faster and more efficient MapReduce operations. An RDD is an immutable distributed collection of objects. Each dataset in the RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Formally, an RDD is a read-only, fault-tolerant and partitioned collection of elements that can be operated on in parallel. A language such as Python that supports functional idioms offers an ideal framework for parallel processing (as long as purely functional, side-effect-free constructs of the language are used). There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system. In this recipe, we shall use the second option.

  7. Prepare Data for Analysis using Python

    An important part of the work of a data scientist is to prepare the raw data coming from various data sources for analysis. This recipe is no different. The matplotlib functions that we will use require the data to be in the form of lists. For example, to print a histogram of motor temperatures showing the distribution for all elevators as a bar chart, we will need to pass the plot function a list of temperatures. On the other hand, the data that we get from the Cloudant database is a table of rows and columns. So basically, the information will need to be extracted, transformed and possibly also filtered before we are in a position where the plot functions of matplotlib can be called.
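    To make the extract-transform-filter idea concrete, here is a minimal plain-Python sketch. The rows below are hypothetical stand-ins (the real table comes from Cloudant and has more columns), but the shape of the transformation is the same: from a table of rows to the flat list of temperatures that a plot function expects.

```python
# Each hypothetical row mimics the shape (timestamp, deviceId, telemetry-dict).
rows = [
    ("2017-05-01T10:00:00", "Elevator01", {"motorTemp": 165.0}),
    ("2017-05-01T10:00:10", "Elevator02", {"motorTemp": 212.5}),
    ("2017-05-01T10:00:20", "Elevator03", {"motorTemp": None}),
]

# Extract the temperature column and filter away missing readings.
temps = [r[2]["motorTemp"] for r in rows if r[2]["motorTemp"] is not None]
print(temps)  # [165.0, 212.5]
```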

    Before we can start transforming the data and preparing it for matplotlib, we need to know more about how the data in the table is represented internally in Python. The printout does not give much information on that. To determine by inspection how the rows are represented, do the following:

    1. Select the cell that defines your credentials. This will cause DSX to select the output table as well.
    2. Invoke Insert > Insert Cell Below.
    3. Check that the Format of the cell is Code. If not, change it.
    4. Enter cloudantdata.collect()[0] as code in the new cell and click Run Cell.
    5. This will return the value of the first row in the table.

    You should now see the structure of the data in more detail which is important in the next step where we will need to access the data and prepare it for use by matplotlib. The variable cloudantdata is a Resilient Distributed Dataset (RDD). This is the basic abstraction in Spark and represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD collect method returns a list that contains all the elements in the RDD. You can find a brief documentation of all the RDD methods in the Apache Spark documentation.

    The table is a list with each element in the list being a named tuple of type Row. The row has in turn a property named ‘d’, which in turn contains a named tuple (again a Row) holding the individual properties such as ‘cabinSpeed’ and ‘motorTemp’.

    1. Add the following lines of code to the end of the cell and invoke the Run Cell command again. This will print the data field of the first row in the table as well as the motor temperature:
    row = cloudantdata.collect()[0]
    print row[2]
    print type(row[2])
    print row[2].d.motorTemp
    print type(row[2].d.motorTemp)

    Notice: Python indices start with 0 so the data field has index 2.
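    The nested Row structure described above can be mimicked locally with named tuples, which support both positional and dotted access just like Spark Rows. The field names and values below are hypothetical, for illustration only:

```python
from collections import namedtuple

# Hypothetical stand-ins for the nested Spark Row structure.
Telemetry = namedtuple("Telemetry", ["cabinSpeed", "motorTemp"])
Data = namedtuple("Data", ["d"])
Row = namedtuple("Row", ["deviceId", "timestamp", "data"])

row = Row("Elevator01", "2017-05-01T10:00:00", Data(Telemetry(5.0, 167.0)))

# The data field has index 2, and its 'd' property nests the telemetry:
print(row[2].d.motorTemp)    # positional access, as in the notebook
print(row.data.d.motorTemp)  # equivalent dotted access
```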

    Following this step there will be other plot jobs which require more information than just the temperature. Knowing this, it makes sense to define a function that extracts the needed data from the table and returns a list that can then be used as a basis for all plotting jobs in this recipe. This limits the need for redundant and tedious work. The function getData does exactly this.

    1. Select the next cell in the notebook that defines the getData function.
    2. Invoke Run Cell. DSX will interpret the code and move on to the next cell.
    3. Invoke Run Cell again.
    4. If you would like to test whether it works, simply add print elevator_telemetrics[0:10] to the end of the cell and click Run Cell again.
    5. You should now see the elements in the variable.

    In the function getData, the data is collected from the RDD first using a sample factor of 0.5 (50%). This returns a list of the sampled elements in the table with all columns. However, we are just interested in a subset of the information. The map function is therefore called next to select the timestamp, the deviceId and the motor temperature. The filter function removes tuples with null values (None in Python). The sorted function sorts the data by the first element in each tuple (i.e. the timestamp). There are better ways of doing this in the context of large data sets, which we will look into shortly.
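    The sample-map-filter-sort pipeline can be sketched in plain Python on a local list (the rows below are hypothetical tuples, not the real Cloudant rows, and random sampling stands in for the RDD sample method):

```python
import random

# Plain-Python sketch of the getData pipeline: sample -> map -> filter -> sort.
def get_data(rows, sfactor):
    sampled = [r for r in rows if random.random() < sfactor]              # sample
    mapped = [(r[0][:19], r[1], r[2].get("motorTemp")) for r in sampled]  # select columns
    filtered = [t for t in mapped if t[2] is not None]                    # drop nulls
    return sorted(filtered, key=lambda t: t[0])                          # sort by timestamp

rows = [
    ("2017-05-01T10:00:20", "Elevator02", {"motorTemp": 171.0}),
    ("2017-05-01T10:00:00", "Elevator01", {"motorTemp": 165.0}),
    ("2017-05-01T10:00:10", "Elevator03", {"motorTemp": None}),
]
# With sfactor 1.0 every row is kept, the null reading is dropped,
# and the result comes back sorted by timestamp.
print(get_data(rows, 1.0))
```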

    A functional programming style is an ideal choice for transforming the data and preparing it for plotting. Higher-order functions – functions as first-class values – combined with functional composition, map (for transforming data) and filter (for filtering data) come in very handy in this context.

    We are now ready to use matplotlib. matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It supports a variety of diagrams such as traditional line, bar and pie charts, as well as more statistical plots like boxplot diagrams and histograms. For more information, please go to the tutorial.

  8. Create a Histogram using Matplotlib

    We will print a simple histogram of temperature values for all elevators using matplotlib next. The plot function requires as argument a single list of temperatures which will need to be computed from the elevator telemetry already extracted from Cloudant NoSQL DB.

    To get a histogram of motor temperatures using matplotlib, do the following:

    1. Select the first cell below.
    2. Click Run Cell. This will cause matplotlib to be imported. DSX will in turn select the next cells.
    3. Click Run Cell.
    4. Click Run Cell again and wait shortly until the histogram is displayed.

    The histogram shows that most temperatures are in the 160-180 degree range, that the distribution is left skewed, and that there are some outliers in the 210-220 degree range caused by overheated motors.

    Notice that the Python variable elevator_temp is simply computed by mapping a list of tuples containing the timestamp, deviceId and temperature (as a float value) to a list of temperatures in the form of integer values.
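    As a self-contained sketch of this step (using hypothetical telemetry tuples rather than the Cloudant data, and the off-screen Agg backend rather than the notebook's inline rendering), the mapping and the histogram call look roughly like this:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the DSX notebook renders inline instead
import matplotlib.pyplot as plt

# Hypothetical telemetry tuples of (timestamp, deviceId, motorTemp as float).
telemetrics = [
    ("2017-05-01T10:00:00", "Elevator01", 162.0),
    ("2017-05-01T10:00:10", "Elevator02", 171.4),
    ("2017-05-01T10:00:20", "Elevator03", 168.9),
    ("2017-05-01T10:00:30", "Elevator01", 215.2),
    ("2017-05-01T10:00:40", "Elevator02", 166.1),
]

# Map the tuples to a flat list of integer temperatures, as described above.
elevator_temp = [int(t[2]) for t in telemetrics]

# 10-degree buckets spanning the plausible temperature range.
counts, edges, patches = plt.hist(elevator_temp, bins=list(range(130, 231, 10)))
plt.xlabel("Motor temperature")
plt.ylabel("Frequency")
plt.savefig("histogram.png")
print(int(sum(counts)))  # 5: every reading falls into some bucket
```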

    Moreover, matplotlib.pyplot is a collection of command-style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a figure, decorates the plot with labels, etc. In this recipe we shall deliberately keep the commands very simple. The focus is not on the virtues of matplotlib, but rather on how to fetch the data from Cloudant, transform it and then pass it to a basic matplotlib plot function (without any syntactic sugar at all).


  9. Create a Boxplot Diagram using Matplotlib

    In this step we shall generate a boxplot diagram that shows the mean value, the variance and the outliers of the motor temperature for all elevators. The boxplot diagram requires as key input a list of temperatures. This means that we can reuse what we have already prepared and call the plot functions right away.

    To get a boxplot diagram of the motor temperatures using matplotlib, do the following:

    1. Select the following cell.
    2. Click Run Cell and wait shortly until the boxplot diagram is displayed.

    The boxplot diagram shows again that most temperatures are in the 160-180 range, but that there are some outliers in the 210-220 and 130-140 ranges caused by overheated or cold motors. The mean value is approximately 167 degrees, with 50% of the temperature values in the range between 165 and 172.
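    Quartile figures like these can be checked by hand with a small helper. The sketch below uses a rough nearest-rank method on hypothetical temperatures; matplotlib's boxplot uses a slightly different interpolation, so the values may differ a little:

```python
# Rough nearest-rank quartiles: (first quartile, median, third quartile).
def quartiles(values):
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0
    return s[n // 4], median, s[(3 * n) // 4]

# Hypothetical sample: mostly 160-180, with a cold and an overheated outlier.
temps = [135, 160, 165, 166, 167, 168, 170, 172, 180, 215]
print(quartiles(temps))  # (165, 167.5, 172)
```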

  10. Create a Time Series Diagram using Matplotlib

    To get a time series diagram using matplotlib showing the motor temperatures over time, we will need to pass the plot function two parameters: a list of timestamps and a list of temperatures. Some preprocessing steps are therefore required which shall be briefly discussed below.

    To get a time series diagram of the motor temperatures using matplotlib, do the following:

    1. Select the first cell as shown below. This cell contains Python code to transform the data into an appropriate form for matplotlib.
    2. Click Run Cell. DSX will select the second cell with the plot command.
    3. Click Run Cell again and wait shortly until the time series diagram is displayed.

    The time series diagram shows that most temperatures are in the 160-180 degree range over time, but with some peak values in the 210-220 range caused by overheated motors.

    The variable date_to_ts is basically a function that converts dates to timestamps. The function is constructed in the usual way in functional programming by using a lambda expression. The second line of code takes elevator_telemetrics – basically a list of tuples – and maps this list to a new list that just contains the timestamps.
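    A sketch of this preprocessing, assuming an ISO-like timestamp format (the exact format string and the sample tuples are assumptions, not the notebook's actual values):

```python
import time

# A lambda that turns a timestamp string into seconds since the epoch.
date_to_ts = lambda d: time.mktime(time.strptime(d, "%Y-%m-%dT%H:%M:%S"))

# Hypothetical telemetry tuples of (timestamp, deviceId, motorTemp).
telemetrics = [
    ("2017-05-01T10:00:00", "Elevator01", 165.0),
    ("2017-05-01T10:00:10", "Elevator01", 166.0),
]
timestamps = list(map(lambda t: date_to_ts(t[0]), telemetrics))  # x-axis values
temps = list(map(lambda t: t[2], telemetrics))                   # y-axis values
print(timestamps[1] - timestamps[0])  # 10.0 (seconds between the two samples)
```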

  11. Alternative Scenarios

    There are a couple of alternative scenarios that you can try out which require some programming in Python and use of other data sources such as the Cloudant NoSQL databases named elevator_events and elevator_alerts created in the previous recipe.

    Some possible scenarios are:

    1. Rerun the various cells with different values of the sampling factor, e.g. use 1.0 (100%) and 0.1 (10%) and observe the difference.
    2. Change scope of the analysis to focus on one or more elevators. This would require you to define a function that would return just the data for a specific device.
    3. Generate statistics using the database elevator_events (which mirrors a flat relational table) rather than the hierarchically structured documents created by the Historical Data Storage Extension. This would require changing the database name as well as the function getData.
    4. Generate statistics using the elevator_alerts database, which stores the relevant data in a single string. This would require changing the database name and again the function getData so that it parses the bulk string holding the telemetry to extract the motor temperature. In this case it would make sense to generate a histogram showing the number of alerts for the various elevators to determine whether they are equally distributed or not.

    For example, to get the data for a single elevator from the elevator_events database, the following variant of getData could be used (called getElevatorData):

    def getElevatorData(df,sfactor,elevator):
        srdd = df.rdd.sample(False,sfactor)
        srdd1 = srdd.filter(lambda row : (not row.deviceId is None) and (row.deviceId == elevator) \
                            and (not row.motorTemp is None) and (not row.timestamp is None))
        srdd2 = srdd1.map(lambda row : (row.timestamp[:19],row.deviceId,row.motorTemp))
        return sorted(srdd2.collect(), key = lambda data : data[0])

    elevator_telemetrics = getElevatorData(cloudantdata,1.0,"Elevator01")

    Here the filtering and mapping is appropriately done at the level of the RDD – before the collect method is called to return a list of elements from the RDD. This is a more efficient method lending itself to parallel processing in the context of large data sets. Also notice that the map function reflects that each row is basically a flat structure: there is no nesting in the documents of the elevator_events database. The motor temperature can consequently be selected directly using row.motorTemp.

    When it comes to the elevator_alerts database, we will need to look carefully into how the data is returned:


    Notice that the deviceId is a more complex string value than previously, and that the event data is encoded in a string in the column named data. Jupyter notebooks are however excellent interactive environments that allow us to call Python functions in order to inspect how data is stored. We can therefore perform various experiments in order to determine how a given representation of the data can be transformed into a form that is more suitable for the purpose at hand:

    print cloudantdata.collect()[0]
    deviceId = cloudantdata.collect()[0].deviceId[16:] # fetches the deviceId as simple string.
    msg = cloudantdata.collect()[0].message # fetches the message as a string
    print deviceId
    print msg

    import json
    decoded = json.loads(msg) #decodes the string and returns an object
    print decoded
    temp = decoded['d']['motorTemp'] #the elements must be fetched using indexing
    print temp
    print type(temp)

    The code is useful for inspecting how the information is represented in the table and for determining how we can extract the individual properties needed. The remaining piece of work is just to update the getData function accordingly.
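    Putting the two inspection steps together, the per-document extraction might look like the sketch below. The document layout (the prefix format of deviceId and the field names) is a hypothetical stand-in for the real elevator_alerts documents, and splitting on “:” replaces the fixed-offset slicing used in the notebook:

```python
import json

# Hypothetical alert document: deviceId carries a type prefix and the
# telemetry is a JSON-encoded string in the message field (names assumed).
doc = {
    "deviceId": "iot-elevators:Elevator01",
    "message": '{"d": {"motorTemp": 214.5}}',
}

device_id = doc["deviceId"].split(":")[-1]  # strip the prefix part
decoded = json.loads(doc["message"])        # decode the string into a dict
temp = decoded["d"]["motorTemp"]            # fetch the nested property by key
print(device_id)  # Elevator01
print(temp)       # 214.5
```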

    It is recommended to create different notebooks for the alternative scenarios. The way to do this is as follows:

    1. Save the current notebook by invoking File > Save
    2. If you want to save the notebook to your file system as well, invoke e.g. File > Download as > Notebook (.ipynb)
    3. In the top of the window, select the project name (iot-elevators).
    4. This will take you back to your project and show the list of notebooks.
    5. Click the “” button to the right of your notebook. Then select Duplicate to create a copy.
    6. Click the Pencil to the right of the notebook named “elevator-statistics-copy” to open it.
    7. Once in the Jupyter Notebook editor, click the “i” button as shown below.
    8. Rename the notebook as shown above and give it a unique name.
    9. Select the cell that contains the credentials and change the database name.
    10. Update the function getData.
    11. Rerun all cells required to generate the matplotlib diagrams.

    Finally, here’s a couple of tips when programming in Python:

    1. The data from Cloudant may contain null values – in Python represented as None. Make sure that these values are filtered away before transformation functions (like int) are called.
    2. Spark RDD transformations are lazily evaluated, which means that they are not computed before the result is actually needed. To trigger evaluation, use an action such as collect, or e.g. the print function.
    3. Python is not statically typed like other functional programming languages such as Haskell or ML. In case something goes wrong, use type to check the type of an expression.
    4. It is possible to write very condensed code in Python that does a lot of computation and shows off lambda expressions. The downside of this is that the code becomes very hard to understand and to debug. This is, for example, such a version of the function getData:
    def getData(df,sfactor):
        etm = sorted(df.rdd.sample(False,sfactor).map(lambda row : (row.timestamp[:19],row.deviceId,row.data.d.motorTemp)).collect(), \
                     key = lambda data : data[0])
        return filter(lambda (ts,dev,temp) : (not temp is None) and (not ts is None),etm)
  12. Conclusion

    The value of IoT can be found within the analysis of data gathered from the system under observation. In this recipe we have taken the role of the data scientist and gone through some basic steps in using IBM Data Science Experience to generate a few archetypical statistical diagrams over the data that was collected in the Cloudant NoSQL DB. For this purpose we used Apache Spark Resilient Distributed Dataset (RDD for short) as underlying data storage, Python as programming language, matplotlib for plotting diagrams and Jupyter notebooks to keep everything organized nicely.

    The intent was by no means to provide a deep dive into the subject matter, but rather to provide a first introduction showing how the IoT event data stored in a Cloudant database can be used to generate archetypical statistical plots using IBM Data Science Experience.

  13. References

    Many of the techniques covered in this recipe are introduced in more depth in the Coursera course titled ‘A developer’s guide to Exploring and Visualizing IoT Data’. This online course provides a very good introduction to data exploration and visualization in the form of presentations, quizzes, exercises and hands-on assignments for an end-to-end IoT scenario using the IBM Watson IoT Platform, Cloudant NoSQL Database and Apache Spark under the hood of IBM Data Science Experience. In fact, fragments of code used in this recipe have their origin in that course.
