Overview

Skill Level: Intermediate

  INTERMEDIATE: Basic knowledge of:
  1. IBM Watson IoT Platform
  2. Apache Spark
  3. Cloudant NoSQL DB
  4. Pandas for data manipulation


Ingredients

Step-by-step

  1. Recipes to enhance Analytics in IBM Watson IoT Platform

    Note: Before you proceed, evaluate the analytical and cognitive recipes from the list that suits your needs. Click on the respective image below.

    list-of-analytical-recipes  list-of-cognitive-recipes

     

    Deploy Watson IoT Platform & Cloudant NoSQL DB on Bluemix

    Now that you have evaluated your needs with respect to the analytical and cognitive recipes, let us continue with deploying the Watson IoT Platform and Cloudant NoSQL DB on Bluemix.

    Click on the Create Toolchain button provided below to quickly deploy the Watson IoT Platform – Node-RED Bluemix starter app. After successful deployment, you will have both of the above-mentioned services up and running on Bluemix, along with a Cloudant NoSQL DB at your disposal.

    Toolchain-8

    Note: If you are using the United Kingdom region in your Bluemix environment, please follow the steps in the IoT recipe Deploy Internet of Things Platform Starter service on Bluemix to deploy the setup. Alternatively, you can try the Deploy to Bluemix button to deploy the setup under your United Kingdom region, provided your Jazzhub account is validated. Users of the US South region can ignore this step.

    deploy

  2. Introduction

    In the previous recipe, "Engage Machine Learning for detecting anomalous behaviors of things", we saw how one can integrate the IBM Watson IoT Platform, the Apache Spark service, the Predictive Analysis service, and Real-Time Insights to take timely action before an (unacceptable) event occurs. In this recipe, we will use the historical data produced by the previous recipe to discover hidden patterns and the temperature trend over days, months, and years using Apache Spark SQL, Pandas DataFrames, and the Jupyter Notebook.

    What are Spark SQL and DataFrames?

    Apache Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL, the DataFrames API, and the Datasets API. In this recipe, we will use DataFrames to analyze and visualize the temperature data.

    A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
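    As a tiny illustration of the last route, a DataFrame can be built from an RDD of Rows. This sketch assumes the sc SparkContext provided by the Bluemix notebook and the sqlContext created later in step 5, with made-up sample values:

      from pyspark.sql import Row

      # two sample readings expressed as Rows in an RDD
      rdd = sc.parallelize([Row(name="datacenter", temperature=17.38),
                            Row(name="datacenter", temperature=18.66)])

      # turn the RDD into a DataFrame with named columns, then display it
      sampleDF = sqlContext.createDataFrame(rdd)
      sampleDF.show()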

    To keep things simple, this recipe does not alter the previous recipe's setup; rather, it just adds a Cloudant NoSQL DB service on top of part 1 of the recipe, as shown below,

    architecture

    If you want to look at the overall architecture showing the components of part 1 and part 2, take a look at this link.

    As shown, the out-of-the-box Cloudant connector will be used to store the device events (which contain the actual temperature, forecasted temperature, zscore, and wzscore values) from the Watson IoT Platform in a Cloudant NoSQL DB, which acts as the historical data storage. Once the Cloudant NoSQL DB is filled with enough data, this recipe will use the Jupyter Notebook to load the data into the Spark engine and use Spark SQL, Pandas DataFrames, and other graphical libraries to analyze the data and show the results in charts or graphs.

    Also, one can use the sample application available on GitHub to generate the historical data without running the previous recipe's code. The steps are detailed in the following section.

  3. Store events into Cloudant NoSQL DB

    In this step, we will access the Cloudant NoSQL DB and use the Watson IoT Platform Cloudant connector to send the device events and result events to the Cloudant DB.

    Access the Cloudant NoSQL DB service

    Open the Bluemix dashboard and click on the application that you deployed earlier (using the Create Toolchain button), if you have already moved away from it. You should see the Watson IoT Platform listed under the bound services. Click on the WIoTP service and then click the Launch button to launch the WIoTP dashboard.

    Connect Cloudant NoSQL DB from Watson IoT Platform

    1. On Watson™ IoT Platform dashboard, click Extensions in the navigation bar,

      iot-extension

    2. In the Historical Data Storage tile, click Setup,
      setup
    3. All available Cloudant NoSQL DB services within the same Bluemix space as your Watson™ IoT Platform service are listed in the Configure historical data storage section.
    4. Select the Cloudant NoSQL DB service that you wish to connect to.
    5. Select Cloudant NoSQL DB configuration options:
      1. Select a bucket interval.
      2. Select a time zone.
      3. Choose options that determine the database name.
        configure-cloudant
    6. Click Done. A popup window appears, asking you to authorize the connection to the Cloudant service.
    7. Click CONFIRM in the authorization dialog box. You should see an Authorization Success message in the popup window.
    8. With the Cloudant NoSQL DB successfully configured as the historian service for Watson IoT, the temperature events from the IoT device and the result events from the Spark service will be stored in the Cloudant NoSQL database.

    View the events in Cloudant NoSQL DB

    1. Go to Bluemix Dashboard,
    2. Click on the Cloudant NoSQL DB service that you created in this step.
    3. Observe that a database named similar to "iotp_9ts1nd_historicdb_2016-12" is created, where all the result events are stored.
    4. Click on iotp_9ts1nd_historicdb_2016-12 to enter the database, and click on any document to view the events.
      events

     Retrieve the credentials of Cloudant DB to load the events in Spark

    1. Click on the Cloudant NoSQL DB service's Service Credentials tab and generate credentials as shown below,
      credentials
    2. View the generated credentials and note down the username & password. These will be required to load the events into the Jupyter notebook later.
      username-pass

    In this step, we have successfully configured the connector to store the events into Cloudant NoSQL DB.

  4. Create a view in Cloudant to group result events

    By default, the Cloudant connector forwards all events to the Cloudant NoSQL DB; that is, the database will contain both the temperature events (sent by the device) and the result events (sent by the Spark service from recipe #1). In this step, we will create a view to filter and group only the result events that we are interested in.

    • Launch the Cloudant NoSQL DB service that you created in the previous step,
    • Click on the corresponding database where the events are stored,
    • Click on the iotp tab as shown below to see the list of default views configured by the Watson IoT Cloudant connector,
      list-of-views
    • Click on the + sign to create a new view, as shown below, that will group all the result events together,
      new-view
    • Specify the name as result-events and enter the following code into the function area, (See the below screenshot for better clarity)
      function (doc) {
          if (doc.eventType === "result") {
              emit(doc.eventType, doc);
          }
      }
    • Click Create to create the new view,
      result-event
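    Once the view is created, you can optionally verify that it returns only result events by querying it over Cloudant's HTTP API. The following Python sketch does so; HOSTNAME, USERNAME, PASSWORD, and DBNAME are placeholders for your own values from the credentials step:

      import requests

      # query the result-events view directly; limit to 2 rows for a quick check
      url = "https://HOSTNAME/DBNAME/_design/iotp/_view/result-events"
      resp = requests.get(url, params={"limit": 2}, auth=("USERNAME", "PASSWORD"))
      print resp.json()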

    In this step, we created a view to filter and group the result events that we are interested in.

  5. Create a Spark SQL DataFrame

    In this step, we will create the Notebook application and load the Cloudant data into Apache Spark service.

    What is Jupyter Notebook?

    The Jupyter Notebook is a web application that allows one to create and share documents that contain executable code, mathematical formulae, graphics/visualizations (matplotlib), and explanatory text. Its primary uses include:

    1. Data cleaning and transformation,
    2. Numerical simulation,
    3. Statistical modeling,
    4. Machine learning and much more.

    Create a Notebook

    1. While the first notebook (from recipe 1) is still running, go back to the Bluemix catalog and open the same Apache Spark service that you created as part of recipe 1.
    2. Click NOTEBOOKS button to show existing Notebooks. Click on NEW NOTEBOOK button.
    3. Enter a Name, under Language select Python and click CREATE NOTEBOOK button to create a new notebook.

    Load data into Spark and perform basic operations

    1. Go to the notebook. In the first cell (next to In [ ]), enter the following commands, which create the SQLContext, and click Run. The SQLContext is the main entry point for all functionality in Spark SQL and is required to create DataFrames. (The import is included here in case SQLContext is not preloaded in your notebook environment.)
      from pyspark.sql import SQLContext
      sqlContext = SQLContext(sc)
    2. Enter the following statements into the second cell, and then click Run. Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command will load the result events from the Cloudant NoSQL DB into the Notebook.
      cloudantdata=sqlContext.read.format("com.cloudant.spark").\
      option("cloudant.host","hostname").\
      option("cloudant.username", "username").\
      option("cloudant.password", "password").\
      option("view","_design/iotp/_view/result-events").\
      load("dbname")
    3. Enter the following statement into the third cell, and then click Run. This command will return the schema as shown below,
      cloudantdata.printSchema()

      root
       |-- id: string (nullable = true)
       |-- key: string (nullable = true)
       |-- value: struct (nullable = true)
       |    |-- _id: string (nullable = true)
       |    |-- _rev: string (nullable = true)
       |    |-- data: struct (nullable = true)
       |    |    |-- forecast: double (nullable = true)
       |    |    |-- name: string (nullable = true)
       |    |    |-- temperature: double (nullable = true)
       |    |    |-- timestamp: string (nullable = true)
       |    |    |-- wzscore: double (nullable = true)
       |    |    |-- zscore: double (nullable = true)
       |    |-- deviceId: string (nullable = true)
       |    |-- deviceType: string (nullable = true)
       |    |-- eventType: string (nullable = true)
       |    |-- format: string (nullable = true)
       |    |-- timestamp: string (nullable = true)

    4. Enter the following command in the next cell to select only the columns that we are interested in and assign the result to a new variable:
      df = cloudantdata.selectExpr("value.data.forecast as forecast", "value.data.name as name", "value.data.temperature as temperature", "value.data.timestamp as timestamp", "value.data.wzscore as wzscore", "value.data.zscore as zscore")
    5. Enter the following command in the next cell to look at the first record (document) and click Run:
      df.take(1)

      [Row(forecast=17.455556497952305, name=u'datacenter', temperature=17.38, timestamp=u'2016-Dec-01 07:37:23', wzscore=-1.3100530965238832, zscore=-0.31371467787699114)]
    6. Enter the following command in the next cell to get only the temperature values and click Run (note that it returns only the top 20 rows):
      df.select("temperature").show()

      +-----------+
      |temperature|
      +-----------+
      |      18.66|
      |       18.5|
      |      18.53|
      |      18.56|
      |     17.595|
      ...
      |       17.5|
      +-----------+
      only showing top 20 rows

    In this step, we have successfully loaded the historical data (from the Cloudant NoSQL DB) into the Spark service and explored its schema.
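    If you prefer SQL syntax over the DataFrame API, the same data can be queried with plain SQL as well. The following optional sketch uses the df DataFrame created above and registerTempTable, the Spark 1.x API available in this environment:

      # expose the DataFrame as a SQL table and query it
      df.registerTempTable("readings")
      hot = sqlContext.sql("SELECT timestamp, temperature FROM readings WHERE temperature > 19")
      hot.show()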

  6. Create a Pandas DataFrame

    In this step we will convert the Spark SQL DataFrame into Pandas timeseries Dataframe and perform basic operations.

    The Python Data Analysis Library (a.k.a. pandas) provides high-performance, easy-to-use data structures and data analysis tools that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

    Create a Pandas DataFrame

    1. Enter the following commands in the next cell to create a Pandas DataFrame from the Spark SQL DataFrame and click Run. The last line prints the columns of the newly created Pandas DataFrame, which match those of the Spark SQL DataFrame:
      import pprint
      import pandas as pd
      pandaDF = df.toPandas()
      #Fill NA/NaN values to 0
      pandaDF.fillna(0, inplace=True)
      pandaDF.columns

      Index([u'forecast', u'name', u'temperature', u'timestamp', u'wzscore', u'zscore'], dtype='object')

    2. Using len on a DataFrame will give the number of rows as shown below,
      len(pandaDF)

      33980

    3. Columns can be accessed in two ways in Pandas. The first is using the DataFrame like a dictionary with string keys,
      pandaDF["temperature"]

      0 17.69
      1 17.38
      2 16.56
      3 17.69
      4 17.50
      ………..

    4. You can get multiple columns out at the same time by passing in a list of strings as shown below,
      pandaDF[["timestamp","temperature"]]

                        timestamp  temperature
      0      2016-Dec-01 07:16:01       17.530

       

    5. The second way to access columns is the dot syntax. This only works if your column name could also be a Python variable name (i.e., no spaces), and if it doesn't collide with an existing DataFrame property or method name (e.g., count, sum).
      pandaDF.temperature
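    As an optional aside before indexing by time, describe() summarizes a column (count, mean, standard deviation, min, quartiles, and max) in a single call:

      # quick statistical summary of the temperature column
      pandaDF.temperature.describe()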

    Create datetime as the index

    By default, a Pandas DataFrame uses a sequence number as its index. Since we are analyzing timeseries data, it is better to use datetimes instead of integers for the index; we will also get some extra benefits from pandas when plotting later on. This section focuses on doing exactly that.

    1. Enter the following code in the next cell to make the timestamp the index.
      # import the datetime library
      from datetime import datetime

      # convert the time from string to pandas datetime
      pandaDF.timestamp = pandaDF.timestamp.apply(lambda d: datetime.strptime(d, "%Y-%b-%d %H:%M:%S"))
      pandaDF.index = pandaDF.timestamp

      # drop the timestamp column, as the index now carries the timestamp
      pandaDF = pandaDF.drop(["timestamp"], axis=1)
      pandaDF.head()

      # also, sort the index by timestamp
      pandaDF.sort_index(inplace=True)
    2. Enter the following commands in the next cell and click Run to retrieve the row corresponding to a particular time. One can retrieve the temperature reading for an approximate datetime by first finding the closest time and then querying for it, as shown below:
      '''
      One can query the temperature based on the datetime;
      in case you are not sure about the exact time,
      use the searchsorted() method to get the nearest date.
      '''
      date = pandaDF.index.searchsorted(datetime(2016, 2, 18, 17, 44, 23))
      pandaDF.ix[date]

      forecast 17.61406
      name datacenter
      temperature 17.69
      wzscore 1.695702
      zscore 0.3832717
      Name: 2016-02-18 17:46:00, dtype: object

    As shown above, the row corresponding to the closest time is retrieved.
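    Another benefit of the datetime index is that rows can now be sliced with plain date strings. A small illustrative sketch (the date range is a placeholder; pick one covered by your data):

      # select all readings within a date range using string slicing on the index
      subset = pandaDF['2016-12-01':'2016-12-02']
      subset.head()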

    In this step, we have successfully created a Pandas DataFrame and performed a few basic operations. In the next section, we will see how one can visualize the temperature data using the matplotlib visualization library.

  7. Visualize Temperature Readings

    When working with interactive notebooks, one can decide how to present results and information. So far, we have used normal print functions which are informative. In this section, we will show how one can visualize the temperature data using the Pandas DataFrames and matplotlib library.

    1. Enter the following commands in the next cell to generate a histogram of the temperature values and click Run. This shows how the temperature readings are distributed:
      #tell Jupyter to render charts inline:
      %matplotlib inline
      import matplotlib.pyplot as plt

      pandaDF.temperature.hist()

    2. Enter the following commands in the next cell to plot the overall temperature and click Run. Observe that the graph is drawn with the timestamp on the x-axis and the temperature values on the y-axis. Also observe the two red lines showing the upper and lower thresholds:
      # Draw overall temperature
      %matplotlib inline
      import matplotlib.pyplot as plt
      import numpy as np
      import matplotlib.dates as dates

      plotDF = pandaDF[['temperature']]
      fig, ax = plt.subplots()
      plotDF.plot(figsize=[20,10], ax=ax, grid=True)
      ax.set_xlabel("Timestamp",fontsize=20)
      ax.set_ylabel("Temperature",fontsize=20)
      ax.set_title("Overall Temperature", fontsize=20)
      ax.set_ylim([12,22])
      # Draw lines to showcase the upper and lower threshold
      ax.axhline(y=19,c="red",linewidth=2,zorder=0)
      ax.axhline(y=15,c="red",linewidth=2,zorder=0)
      ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False))
      ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n%a\n%H:%M:%S'))
      ax.xaxis.grid(True, which="minor")
      ax.xaxis.set_major_locator(dates.MonthLocator())
      ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n\n%b\n%Y'))
      plt.tight_layout()
      plt.show()

    3. If you want to plot only the last 400 values, enter the following commands in the next cell and click Run. This helps in understanding the recent state of the system.
      # Draw Last 400 temperature values
      plotDF = pandaDF[['temperature']]
      fig, ax = plt.subplots()
      plotDF.tail(400).plot(figsize=[20,10], ax=ax, grid=True)
      ax.set_xlabel("Timestamp",fontsize=20)
      ax.set_ylabel("Temperature",fontsize=20)
      ax.set_title("Recent 400 Temperature Values", fontsize=20)
      ax.set_ylim([12,22])
      ax.axhline(y=19,c="red",linewidth=2,zorder=0)
      ax.axhline(y=15,c="red",linewidth=2,zorder=0)
      ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False))
      ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n%a\n%H:%M:%S'))
      ax.xaxis.grid(True, which="minor")
      ax.yaxis.grid()
      ax.xaxis.set_major_locator(dates.MonthLocator())
      ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n\n%b\n%Y'))
      plt.tight_layout()
      plt.show()

    4. Similarly, you can plot the temperature values along with zscore and wzscore by entering the following commands into the next cell and clicking Run. In the following example, we plot the graph between two days.
      # Draw temperature chart with normal zscore & wzscore
      start = datetime(2016, 4, 25)
      end = datetime(2016, 4, 26)
      plotDF = pandaDF.ix[start:end]
      plotDF = plotDF[['temperature','zscore','wzscore']]
      if (len(plotDF) > 0):
          fig, ax = plt.subplots()
          plotDF.plot(figsize=[20,10], ax=ax, grid=True)
          # format the axis
          ax.set_xlabel("Timestamp",fontsize=20)
          ax.set_ylabel("Temperature and zscore",fontsize=20)
          ax.set_title("Temperatures between " + str(start) + " and " + str(end) + " with zscore", fontsize=20)
          ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=False))
          ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n%a\n%H:%M:%S'))
          ax.xaxis.grid(True, which="minor")
          ax.yaxis.grid()
          ax.xaxis.set_major_locator(dates.MonthLocator())
          ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n\n%b\n%Y'))
          plt.tight_layout()
          plt.show()
      else:
          print "There are no rows matching the given condition"

    5. Enter the following commands to overlay the zscore and wzscore on the temperature; this helps in understanding the deviations better:
      # Draw temperature chart with scaled zscore & wzscore
      # define a method that scales zscore with the temperature
      def scaleZscore(row):
          return row['zscore'] + row['temperature']

      # define a method that scales wzscore with the temperature
      def scaleWZscore(row):
          return row['wzscore'] + row['temperature']

      # apply the functions
      pandaDF['scaledzscore'] = pandaDF.apply(scaleZscore, axis=1)
      pandaDF['scaledwzscore'] = pandaDF.apply(scaleWZscore, axis=1)

      start = datetime(2016, 2, 19)
      end = datetime(2016, 2, 20)
      plotDF = pandaDF.ix[start:end]
      if (len(plotDF) > 0):
          # create a dataframe with the required fields that we want to plot
          plotDF = plotDF[['temperature','scaledzscore','scaledwzscore']]
          fig, ax = plt.subplots()
          plotDF.plot(figsize=[23,12], ax=ax)
          ax.set_xlabel("Timestamp",fontsize=20)
          ax.set_ylabel("Temperature and zscore",fontsize=20)
          ax.set_title("Temperatures between " + str(start) + " and " + str(end) + " with scaled zscore", fontsize=20)
          ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True))
          ax.axhline(y=19,c="purple",linewidth=2,zorder=0)
          ax.axhline(y=15,c="purple",linewidth=2,zorder=0)
          ax.set_ylim([13,21])
          ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n%a\n%H:%M:%S'))
          ax.xaxis.grid(True, which="minor")
          ax.yaxis.grid()
          ax.xaxis.set_major_locator(dates.MonthLocator())
          ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n\n%b\n%Y'))
          plt.tight_layout()
          plt.show()
      else:
          print "There are no rows matching the given condition. Try changing the date"

    6. One can visualize the temperature readings between two specified times. For example, the following code visualizes the temperature over the last 2 days. (Note: you might observe a failure if there isn't any data in the last 2 days.)
      from datetime import *
      import pytz

      # retrieve the current time
      now = datetime.now(pytz.timezone('UTC'))

      '''
      get the start time 2 days behind now; modify "days=2" to "hours=2"
      in case you want to retrieve the temperature from the last 2 hours.
      '''
      last_n_days = now - timedelta(days=2)

      plotDF = pandaDF.ix[last_n_days:now]
      plotDF = plotDF[['temperature','scaledzscore']]
      fig, ax = plt.subplots()
      plotDF.plot(figsize=[20,10], ax=ax)
      # choose the colours for each column
      with pd.plot_params.use('x_compat', True):
          plotDF.temperature.plot(color='b')
          plotDF.scaledzscore.plot(color='r')
      ax.set_xlabel("Timestamp",fontsize=20)
      ax.set_ylabel("Temperature and scaledzscore",fontsize=20)
      ax.set_title("Temperature in last 2 days", fontsize=20)
      ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True))
      ax.axhline(y=19,c="purple",linewidth=2,zorder=0)
      ax.axhline(y=15,c="purple",linewidth=2,zorder=0)
      ax.set_ylim([13,21])
      ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n%a\n%H:%M:%S'))
      ax.xaxis.grid(True, which="minor")
      ax.yaxis.grid()
      ax.xaxis.set_major_locator(dates.MonthLocator())
      ax.xaxis.set_major_formatter(dates.DateFormatter('\n\n\n%b\n%Y'))
      plt.tight_layout()
      plt.show()

    In this step, we have successfully analyzed the temperature data and visualized the results using bar and line charts.
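    Beyond the charts above, a rolling mean often makes the trend easier to read than the raw readings. This optional sketch assumes pandas 0.18 or later (older versions use pd.rolling_mean instead):

      # overlay a 60-sample rolling mean on the raw temperature readings
      fig, ax = plt.subplots()
      pandaDF.temperature.plot(figsize=[20,10], ax=ax, label='raw')
      pandaDF.temperature.rolling(window=60).mean().plot(ax=ax, color='r', label='60-sample rolling mean')
      ax.legend()
      plt.show()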

  8. Operations related to Maximum temperature

    In this step, we will see how to use Pandas DataFrames to find the maximum temperature over an hour, a day, a year, and so on.

    1. Enter the following command in the next cell to find out the overall maximum temperature and click Run,
      # find the maximum temperature
      maximum = pandaDF.temperature.max()
      maximum

      19.78

    2. Enter the following statements in the next cell to find all the instances where the temperature crossed 19 degrees and click Run. Observe that it returns all the rows where the temperature is greater than 19 degrees.
      threshold_crossed_days = pandaDF[pandaDF.temperature > 19]
      threshold_crossed_days
    3. Enter the following commands to return only the days (rather than the full timestamps) on which the temperature crossed the threshold:
      threshold_crossed_days['timestamp'] = threshold_crossed_days.index
      days = threshold_crossed_days.timestamp.map(lambda t: t.date()).unique()
      print "Number of times the threshold is crossed: " + str(threshold_crossed_days.temperature.count())
      print "The days are --> " + str(days)

      Number of times the threshold is crossed: 100
      The days are --> [datetime.date(2016, 2, 19) datetime.date(2016, 2, 21)
      datetime.date(2016, 2, 22) datetime.date(2016, 2, 24) .......]

    4. Enter the following commands to find the hourly maximum temperature for each year. The result will show 24 rows per year, where each row shows the maximum temperature of the corresponding hour. This is useful for finding the utilization of the equipment (assuming the temperature is directly proportional to its utilization) in each hour; for example, how much the equipment is utilized in the first hour compared to the second hour, and so on. A good example could be space utilization (office space, parking space, and so on) for each hour over the year.
      # Find out hourly maximum temperature for each year
      year_hour_max = pandaDF.groupby(lambda x: (x.year, x.hour)).max()

      fig, ax = plt.subplots()
      plotDF = year_hour_max[['temperature']]
      plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year')
      ax.set_xlabel("Hour of each year",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    5. You can also create a bar chart for better visualization by typing the following commands in the next cell and clicking Run:
      # draw a bar chart for hourly maximum temperature
      fig, ax = plt.subplots()
      plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year')
      ax.set_xlabel("Hour of each year",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    6. If you instead want to observe the maximum temperature for each hour of every day and plot it, enter the following code snippet:
      # Find out hourly maximum temperature for each day
      each_hour_max = pandaDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).max()

      fig, ax = plt.subplots()
      plotDF = each_hour_max[['temperature']]
      plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each Day')
      ax.set_xlabel("Hour of each day",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    7. Enter the following command to find out the maximum temperature for each day over the years,
      # Maximum temperature of each Day
      df = pandaDF

      df = df.drop(["scaledzscore"], axis=1)
      df = df.drop(["scaledwzscore"], axis=1)
      df = df.drop(["forecast"], axis=1)
      df = df.drop(["zscore"], axis=1)
      df = df.drop(["wzscore"], axis=1)

      df['Year'] = map(lambda x: x.year, df.index)
      df['Month'] = map(lambda x: x.month, df.index)
      df['Day'] = map(lambda x: x.day, df.index)
      plotDF = df.groupby(['Day','Month','Year']).max()

      fig, ax = plt.subplots()

      plotDF.plot(kind='bar', figsize=[20,10], ax=ax)

      ax.axhline(y=19,c="purple",linewidth=2,zorder=0)
      ax.axhline(y=15,c="purple",linewidth=2,zorder=0)
      ax.set_ylim([10,25])
      ax.set_title("Daily maximum temperature", fontsize=20)
      ax.set_xlabel("Day",fontsize=20)
      ax.set_ylabel("Temperature",fontsize=20)
      ax.xaxis.grid(True, which="minor")
      ax.yaxis.grid()
      plt.tight_layout()
      plt.show()

    In this step, we have seen how to use Pandas DataFrames to explore and plot the maximum temperatures from the historical data. Similarly, you can use the min() function to find the minimum temperatures.
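    Related to max() above: since the index is the timestamp, it is equally easy to find out when the maximum occurred. A small optional addition uses idxmax(), which returns the index label (here, the timestamp) of the maximum value:

      # timestamp at which the overall maximum temperature was recorded
      pandaDF.temperature.idxmax()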

  9. Operations related to Average temperature

    In this step, we will see how to use the Pandas DataFrames to explore and plot the average temperature data from the historical data.

    1. Enter the following command in the next cell to find out the average temperature and click Run,
      #calculate temperature mean
      pandaDF.temperature.mean()

      17.593230723955266

    2. Enter the following commands to find the average temperature for the last hour:
      from datetime import *
      import pytz

      # retrieve the current time
      now = datetime.now(pytz.timezone('UTC'))

      last_n_hours = now - timedelta(hours=1)
      pandaDF.ix[last_n_hours:now].temperature.mean()

      17.589529391059482

    3. Similarly, to find the average temperature over the last day, enter the following commands:
      # Calculate average temperature for the last day
      from datetime import *
      import pytz
      # retrieve the current time
      now = datetime.now(pytz.timezone('UTC'))
      last_n_days = now - timedelta(days=1)

      pandaDF.ix[last_n_days:now].temperature.mean()

      17.583351550960117

    4. Similarly, use the following commands to find the average temperature for the last month:
      # retrieve the current time
      now = datetime.now(pytz.timezone('UTC'))
      '''
      get the start time n days behind now; modify "days=n" to "hours=n"
      if you want to retrieve the temperature from the last n hours
      '''
      last_n_days = now - timedelta(days=30)

      pandaDF.ix[last_n_days:now].temperature.mean()

      17.592428135954886

    5. Enter the following commands to find the hourly average temperature for each year. The result will show 24 rows per year, where each row shows the average temperature of the corresponding hour. This is useful for finding the utilization of the equipment (assuming the temperature is directly proportional to its utilization) in each hour; for example, how much the equipment is utilized in the first hour compared to the second hour, and so on. A good example could be space utilization for each hour over the year.
      # Find out hourly Average temperature for each year
      year_hour_avg = pandaDF.groupby(lambda x: (x.year, x.hour)).mean()

      fig, ax = plt.subplots()
      plotDF = year_hour_avg[['temperature']]
      plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature for each year')
      ax.set_xlabel("Hour of each year",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    6. You can also create a bar chart for better visualization by typing the following commands in the next cell and clicking Run:
      # draw a bar chart for hourly average temperature
      fig, ax = plt.subplots()
      plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Average temperature for each year')
      ax.set_xlabel("Hour of each year",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    7. If you instead want to find the average temperature for each hour and plot it, enter the following code snippet. In the following example, we plot the hourly average for the last 2 days:
      # retrieve the current time
      now = datetime.now(pytz.timezone('UTC'))

      '''
      get the start time 2 days behind now; modify "days=2" to "hours=2"
      if you want to retrieve the temperature from the last 2 hours
      '''
      last_n_days = now - timedelta(days=2)

      plotDF = pandaDF.ix[last_n_days:now]
      # Find out hourly average temperature for each day
      plotDF = plotDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).mean()

      fig, ax = plt.subplots()
      plotDF = plotDF[['temperature']]
      plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature of each Day')
      ax.set_xlabel("Hour of each day",fontsize=12)
      ax.set_ylabel("Temperature",fontsize=12)
      plt.show()

    8. Enter the following command to find out the average temperature for each day over the years,
      # Average temperature of each Day
      df = pandaDF

      df = df.drop(["scaledzscore"], axis=1)
      df = df.drop(["scaledwzscore"], axis=1)
      df = df.drop(["forecast"], axis=1)
      df = df.drop(["zscore"], axis=1)
      df = df.drop(["wzscore"], axis=1)

      df['Year'] = map(lambda x: x.year, df.index)
      df['Month'] = map(lambda x: x.month, df.index)
      df['Day'] = map(lambda x: x.day, df.index)
      plotDF = df.groupby(['Day','Month','Year']).mean()

      fig, ax = plt.subplots()

      plotDF.plot(kind='bar', figsize=[20,10], ax=ax)

      ax.axhline(y=19,c="purple",linewidth=2,zorder=0)
      ax.axhline(y=15,c="purple",linewidth=2,zorder=0)
      ax.set_ylim([10,25])
      ax.set_title("Daily Average temperature", fontsize=20)
      ax.set_xlabel("Day",fontsize=20)
      ax.set_ylabel("Temperature",fontsize=20)
      ax.xaxis.grid(True, which="minor")
      ax.yaxis.grid()
      plt.tight_layout()
      plt.show()

    In this step, we have seen how to use the Pandas DataFrames to explore and plot the average temperature data from the historical data.
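    As a side note, with the datetime index the group-by-day pattern from item 8 can also be written with resample, which is purpose-built for timeseries bucketing (assuming pandas 0.18 or later; not part of the original recipe):

      # daily average temperature via resample, equivalent to grouping by day
      daily_avg = pandaDF.temperature.resample('D').mean()
      daily_avg.plot(kind='bar', figsize=[20,10])
      plt.show()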

  10. Conclusion and the Road Ahead

    This recipe showed how to analyze historical timeseries data with Spark SQL and the Jupyter Notebook to understand the temperature trend over the day, month, and year, as well as the maximum and average temperatures over the year. One can use the average/maximum temperature derived from the data analysis to set rules accordingly in the IBM Real-Time Insights service to create alerts. Developers can take a look at the code made available in this recipe, and at the Notebook in the GitHub repository, to understand what is happening under the hood. The Notebook in GitHub has more operations than what is shown in this recipe. Developers can treat this recipe as a template for timeseries historical data analysis and modify the Python code depending upon the use case.

     

    As a next step, look at the list of analytical and cognitive recipes around Watson IoT Platform to analyze the IoT events.

12 comments on "Timeseries Data Analysis of IoT events by using Jupyter Notebook"

  1. mfalhimohamed April 27, 2016

    thank you for this information

  2. RomeoKienzler September 07, 2016

    This code works, yours not, maybe you can update the tutorial?

    cloudantdata=sqlContext.read.format("com.cloudant.spark"). \
    option("cloudant.host","0495289b-1beb-4e6d-888e-315f36925447-bluemix.cloudant.com"). \
    option("cloudant.username", "0495289b-1beb-4e6d-888e-315f36925447-bluemix"). \
    option("cloudant.password", "kdjfkljsdflkjlksdjf"). \
    load("recipedb")

    • @RomeoKienzler, Thank you for the comments. Can you please describe the change that you have made? I don’t see any changes apart from substituting the actual values for cloudant hostname, username and password? Am i missing something?

  3. sanket.bhalerao September 22, 2016

    i am having issue while executing “mvn exec:java -Dexec.mainClass=”com.ibm.iot.iotdatagenerator.HistoricalDataGenerator” -Dexec.args=” ” “. i have replaced the username and password with actual values.

    i have http and https proxy already set in the environment, i am able to ping to the dtabase URL, if i paste the same in browser i am able to see a welcome json.
    however while executing the command i am getting below stack trace.
    java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:294)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: org.lightcouch.CouchDbException: Error executing request.
    at org.lightcouch.CouchDbClientBase.executeRequest(Unknown Source)
    at org.lightcouch.CouchDbClientBase.getCookie(Unknown Source)
    at org.lightcouch.CouchDbClientBase.(Unknown Source)
    at org.lightcouch.CouchDbClient.(Unknown Source)
    at com.cloudant.client.api.CloudantClient.doInit(Unknown Source)
    at com.cloudant.client.api.CloudantClient.(Unknown Source)
    at com.ibm.iot.iotdatagenerator.HistoricalDataGenerator.main(HistoricalDataGenerator.java:66)
    … 6 more
    Caused by: org.apache.http.conn.HttpHostConnectException: Connect to something.cloudant.com:443 [408fed74-3c2f-4bd4-97b5-76f1731cbaed-bluemix.cloudant.com/23.246.202.151] failed: Connection timed out: connect
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    … 13 more
    Caused by: java.net.ConnectException: Connection timed out: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:79)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
    … 22 more

  4. 1622_Holger_Hellebro September 23, 2016

    There seems to be backslashes missing from date formatting strings in the Python code. For example, dates.DateFormatter("%dn%an%H:%M:%S") should be dates.DateFormatter("%d\n%a\n%H:%M:%S")

  5. humayunkm May 18, 2017

    Heloo, Could anyone advise which plan I got to upgrade to resolve following message

    {“flag”:false,”message”:”Failed to score predict, msg=Prediction exceeded plan limitation 5000,

  6. YMDH_sathish_Palaniappan May 23, 2017

    Hi, you need to upgrade to standard or paid plan to score more than 5000 times.

  7. If your Cloudant plan is 'lite', you'll probably get an error when running df.select("temperature").show() in step 5. As a workaround, you can reduce the number of RDD partitions by adding another option entry when creating the cloudantdata object: option("jsonstore.rdd.partitions", 4)
