Overview

Skill Level: Beginner

With basic knowledge of Jupyter Notebook, Python, the Pandas DataFrame, and IBM Watson IoT Platform

This recipe shows how to use the IBM Data Science Experience tool to detect anomalies in historical timeseries data and create rules in IBM Watson IoT Platform based on these anomalies.

Ingredients

  • Bluemix account

Step-by-step

  1. Introduction

    The usecase

    You have historical IoT timeseries data for a device and want to identify abnormal events. From the abnormal events that you identify, derive threshold values that you can use to create rules in IBM Watson IoT Platform. With these rules you can get alerted when your IoT device sends an abnormal reading in the future.

    Accepted file format

    Note that the sample Notebook in this recipe accepts a CSV file in one of the following formats (a sample is shown after the list):

    • 2-column format: <Date and time in DD/MM/YYYY or MM/DD/YYYY format, Numeric value>
    • 1-column format: <Numeric value>
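
    For example, a 2-column file might look like the following. The header row and values here are illustrative; the sample notebook reads the first row as column names:

      timestamp,temperature
      10/01/2017 00:00,17.9
      10/01/2017 01:00,18.2
      10/01/2017 02:00,45.3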

    IBM Data Science Experience

    Traditionally, data scientists were trained to use commercial analytics tools and had strong backgrounds in social sciences, economics, and mathematics. A new generation is now emerging that is self-trained, uses mainly open source technologies, and is not afraid of programming or using APIs. However, because existing tools require different levels of expertise, collaboration across tools is difficult.

    The IBM Data Science Experience (DSX) is an environment that has everything a data scientist needs to be successful. It provides an interactive, collaborative, cloud-based environment where data scientists can use multiple tools to activate their insights. Data scientists can use the best of open source tools such as R and Python, tap into IBM's unique features, grow their capabilities, and share their successes.

    The workflow

    In this recipe, we will use the Jupyter Notebook that is available in IBM Data Science Experience to load your historical timeseries data (IoT data) and to detect anomalies in the data using z-score.

    The recipe also shows how to derive threshold values from your historical data and how to use these to create a rule in Watson IoT Platform cloud analytics. The rules alert you whenever an IoT device associated with the rule reports a reading outside of the derived threshold limits.

    Z-score

    Z-score is a standard score that indicates how many standard deviations an element is from the mean.

    A z-score can be calculated from the following formula

    z = (X - µ) / σ

    where z is the z-score, X is the value of the element, µ is the population mean, and σ is the population standard deviation.

    A higher z-score represents a larger deviation from the mean, which can be interpreted as abnormal.
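
    As a quick illustration of the formula, the following sketch computes z-scores for a handful of made-up readings with NumPy (the values and variable names here are illustrative only):

      import numpy as np

      values = np.array([21.0, 22.5, 21.8, 22.1, 60.0])  # hypothetical sensor readings
      # NumPy's std() defaults to the population standard deviation (ddof=0)
      z = (values - values.mean()) / values.std()
      print(z)  # the last reading has a much larger z-score than the rest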

  2. Getting started quickly with an existing Notebook

    This section shows how to use an already-built Jupyter Notebook to obtain the results faster. The subsequent sections show in detail how to build the Notebook from scratch, so we recommend that you go through all the sections to understand what is happening under the hood.

    1. Use a supported browser to log in to DSX at http://datascience.ibm.com/.
      Note!
      If you have a Bluemix ID, you can log in with it.
    2. Set up a new Project. Projects create a space for you to collect and share notebooks, connect to data sources, create pipelines, and add data sets, all in one place. As shown below, click “+” and then select Create Project to create a new project.
    3. Specify the name and create the Project. Note: If there is no Spark service and Object Storage instance created yet, create them before creating the project.
    4. Go to the project and click the “add notebooks” link to create a new Jupyter Notebook as shown below. The Jupyter Notebook is a web application that allows you to create and share documents that contain executable code, mathematical formulae, graphics/visualizations (matplotlib), and explanatory text.
    5. Select From URL to load an existing notebook, specify a descriptive name for the Notebook, and enter the following URL to load the sample Notebook: https://github.com/ibm-watson-iot/predictive-analytics-samples/raw/master/Notebook/Anomaly-detection-DSX.ipynb
    6. Click Create Notebook. Note: Observe that the Notebook is created with metadata, code, and output.
    7. In the DSX menu, select the Find and Add Data option to load the CSV file. You should see the following screen.
    8. Drag and drop your CSV file onto the Files option. Alternatively, you can load data from one or more databases where the historical data is stored.
      Tip!
      If you do not have a file, you can download the sample file from this link. When the file is successfully uploaded, the data file is listed on the Data Source pane and is saved in the Object Storage instance that is associated with your Analytics for Apache Spark service.

    Access the file in Notebook

    1. To access the file, do the following steps:
      • In the Notebook, scroll down and place the cursor in the third input cell.
      • In the Data Source pane, click Insert to code and then click Insert Credentials as shown below.
      • Observe that the credentials for accessing the CSV file are added to the cell as a Python dictionary. If you have already inserted credentials a number of times, read the Note in this section about “credentials_1”.
        Note!
        If the name of the dictionary object is not credentials_1, change the name used in the next cell to match this identifier.
    2. You have now loaded the CSV file that contains the historical timeseries data and added code to access the file in the Notebook. Now we need to run the code and observe the results.
    3. In the menu row, select Cell > Run All to run the notebook.
    4. In the Notebook, scroll down to view the anomalies in your data and the threshold values for the spike and dip. You should see a chart like the one below if you have loaded the sample CSV file provided in this recipe.

    As shown, the red marks are the unexpected spikes and dips whose z-score value is greater than 3 or less than -3.

    In this section, we showed how to load an already built Notebook to see anomalies in your data. Go through the following sections to build a Notebook from scratch and create rules in Watson IoT Platform.

  3. Load your data into DSX

    In this step, you will create a Jupyter Notebook in DSX and load your CSV data file into it.

    1. Use a supported browser to log in to DSX at http://datascience.ibm.com/.
      Note!
      If you have a Bluemix ID, you can log in with it.
    2. Set up a new Project. Projects create a space for you to collect and share notebooks, connect to data sources, create pipelines, and add data sets, all in one place. As shown below, click “+” and then select Create Project to create a new project.
    3. Specify the name and create the Project. Note: If there is no Spark service and Object Storage instance created yet, create them before creating the project.
    4. Go to the project and click the “add notebooks” link to create a new Jupyter Notebook as shown below. The Jupyter Notebook is a web application that allows you to create and share documents that contain executable code, mathematical formulae, graphics/visualizations (matplotlib), and explanatory text.
    5. Specify a descriptive name for the Notebook, select Python as language and click Create Notebook.
    6. In the DSX menu, select the Find and Add Data option to load the CSV file. You should see the following screen.
    7. Drag and drop your CSV file onto the Files option. Alternatively, you can load data from one or more databases where the historical data is stored.
      Tip!
      If you do not have a file, you can download the sample file from this link. When the file is successfully uploaded, the data file is listed on the Data Source pane and is saved in the Object Storage instance that is associated with your Analytics for Apache Spark service.

    Access the file in Notebook

    1. To access the file, do the following steps:
      • In the Notebook, scroll down and place the cursor in the third input cell.
      • In the Data Source pane, click Insert to code and then click Insert Pandas DataFrame as shown below.
      • Observe that the data is read from the CSV file and assigned to a Pandas DataFrame.
        Note!
        If the data set is very large, consider creating a Spark SQL DataFrame instead; a minimal sketch follows this list.
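
    The following is a minimal, hedged sketch of that Spark alternative. It assumes a Spark 2.x session (predefined as spark in DSX notebooks) and that the object-storage credentials are already configured; the file path shown is hypothetical, so in practice use the path generated by the Insert to code helper:

      from pyspark.sql import SparkSession

      # In DSX a SparkSession is usually predefined as `spark`; otherwise create one
      spark = SparkSession.builder.getOrCreate()

      # hypothetical object-storage path; substitute the path produced by
      # the "Insert to code" helper for your uploaded file
      sparkDF = spark.read.csv('swift://notebooks.keystone/sample-data.csv',
                               header=True, inferSchema=True)
      sparkDF.show(5)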

    Running code cells in a notebook

    To run code cells in a notebook, click Run Cell in the notebook toolbar. While the code in the cell is running, a [*] appears next to the cell. After the code has run, the [*] is replaced by a number indicating that the code cell is the Nth cell to run in the notebook.

    In this step, we showed how to create a Jupyter Notebook in DSX and load the CSV file into it.

  4. Accessing the data in Notebook

    In this step, you will load the data into a Pandas DataFrame and explore the data.

    The Python Data Analysis Library (pandas) provides high-performance, easy-to-use data structures and data analysis tools that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional), illustrated in the sketch below.
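
    As a quick, illustrative sketch of the two structures (the values and column names here are made up):

      import pandas as pd

      # a 1-dimensional Series
      s = pd.Series([20.1, 20.4, 19.8], name='temperature')

      # a 2-dimensional DataFrame with a timestamp column and a value column
      df = pd.DataFrame({'timestamp': pd.date_range('2017-01-10', periods=3, freq='H'),
                         'temperature': [20.1, 20.4, 19.8]})
      print(df.head())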

    1. Enter the following code in the next cell to show the first 5 rows of data and click Run. Note: You may need to change df_data_1 to the identifier that was generated in the previous step,
      pandaDF = df_data_1
      pandaDF.head()

      You should see the following output:

    2. Enter the following code in the next cell to show the last 5 rows of data and click Run.
      pandaDF.tail() 

      You should see the following output:

    3. Enter the following command in the next cell to get the number of rows in the CSV file (DataFrame) and click Run.
      pandaDF.count()

      You should see the following output:

      timestamp 720
      temperature 720
      dtype: int64

    4. Enter the following commands in the next cell to set the timestamp as the index if it is present, and click Run,
      # use the timestamp column as the index if it is present
      import pandas as pd

      header_list = pandaDF.columns.values
      valueHeaderName = 'value'
      timeHeaderName = None

      if len(header_list) == 2:
          timeHeaderName = header_list[0]
          valueHeaderName = header_list[1]
      else:
          valueHeaderName = header_list[0]

      if timeHeaderName is not None:
          # parse the timestamp strings into datetimes and use them as the index
          pandaDF[timeHeaderName] = pd.to_datetime(pandaDF[timeHeaderName])
          pandaDF.index = pandaDF[timeHeaderName]
          # drop the timestamp column, as the index now carries the timestamp
          pandaDF = pandaDF.drop([timeHeaderName], axis=1)
          # also, sort the index by timestamp
          pandaDF.sort_index(inplace=True)

      pandaDF.head(n=5)
       

      You should see the following output:

    In this step, we have successfully created a Pandas DataFrame from the CSV file and explored the data a bit. If you want to explore the data further, refer to the recipe Timeseries Data Analysis of IoT events by using Jupyter Notebook which provides a list of basic commands to explore the SQL and Pandas DataFrame.

  5. Show Anomalies

     In this step, you will calculate the z-score and plot anomalies using the Pandas DataFrame and matplotlib library.

    1. Enter the following commands in the next cell to calculate the z-score for each value and add it as a new column in the same DataFrame,
      # calculate z-score and populate a new column
      pandaDF['zscore'] = (pandaDF[valueHeaderName] - pandaDF[valueHeaderName].mean())/pandaDF[valueHeaderName].std(ddof=0)
      pandaDF.head(n=5)

      You should see the following output:

    2. Enter the following code snippet in the next cell to view the anomaly events in your data and click Run.
      Note!
      Copying and pasting the code directly from the snippet below might result in errors. If you experience errors, copy the code from the GitHub location.
      # ignore warnings if any
      import warnings
      warnings.filterwarnings('ignore')
      
      # render the results as inline charts:
      %matplotlib inline
      import numpy as np
      import matplotlib.pyplot as plt
      
      '''
      This function detects spikes and dips by returning a non-zero value
      when the z-score is above 3 (spike) or below -3 (dip). If you want
      to capture smaller spikes and dips, lower the z-score threshold
      from 3 to 2 in this function.
      '''
      def spike(row):
          if row["zscore"] >= 3 or row["zscore"] <= -3:
              return row[valueHeaderName]
          else:
              return 0

      pandaDF['spike'] = pandaDF.apply(spike, axis=1)
      # select the columns that are required for plotting
      plotDF = pandaDF[[valueHeaderName, 'spike']]
      # calculate a y-axis margin (a tenth of the value range)
      y_min = (pandaDF[valueHeaderName].max() - pandaDF[valueHeaderName].min()) / 10
      fig, ax = plt.subplots(num=None, figsize=(14, 6), dpi=80, facecolor='w', edgecolor='k')
      ax.set_ylim(plotDF[valueHeaderName].min() - y_min, plotDF[valueHeaderName].max() + y_min)
      x_filt = plotDF.index[plotDF.spike != 0]
      plotDF['xyvaluexy'] = plotDF[valueHeaderName]
      y_filt = plotDF.xyvaluexy[plotDF.spike != 0]
      # plot the raw data in blue
      line1 = ax.plot(plotDF.index, plotDF[valueHeaderName], '-', color='blue', linewidth=1, label=valueHeaderName)
      # plot the anomalies as red circles
      line2 = ax.plot(x_filt, y_filt, 'ro', linewidth=2, label='anomaly')
      # fill the area under the raw data
      ax.fill_between(plotDF.index, (pandaDF[valueHeaderName].min() - y_min), plotDF[valueHeaderName], interpolate=True, color='blue', alpha=0.6)

      # label the axes
      ax.set_xlabel("Sequence", fontsize=20)
      ax.set_ylabel(valueHeaderName, fontsize=20)

      plt.tight_layout()
      plt.legend()
      plt.show()
      

      You should see a chart:

    As shown, the red marks are the unexpected spikes and dips whose z-score is greater than 3 or less than -3. If you want to detect smaller spikes and dips, lower the threshold to 2 (and -2 for dips) or even lower and run the cell again. Similarly, if you want to detect only the larger spikes and dips, raise the z-score threshold from 3 to 4 (and -4 for dips) or beyond. One way to make that tuning explicit is sketched below.
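
    The following is a variant of the spike function above that factors the cutoff into a single variable. It assumes pandaDF and valueHeaderName from the earlier cells; ZSCORE_THRESHOLD is a name introduced here for illustration and is not part of the original Notebook:

      # lower to 2 to catch smaller spikes/dips, raise to 4 to keep only extremes
      ZSCORE_THRESHOLD = 3

      def spike(row):
          # flag the value when the z-score magnitude crosses the threshold
          if abs(row["zscore"]) >= ZSCORE_THRESHOLD:
              return row[valueHeaderName]
          return 0

      pandaDF['spike'] = pandaDF.apply(spike, axis=1)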

  6. Derive threshold values

    This section shows how to derive threshold values from your historical data using the z-score, and how to use them to create rules in the Watson IoT Platform that detect anomalies in the current IoT device events in real time. The rules create an alert in real time when the current sensor reading crosses a threshold value.

    1. Enter the following command in the next cell to derive the spike threshold value corresponding to a z-score of 3, and click Run. Rearranging the z-score formula gives X = µ + zσ, so the spike threshold is the mean plus three standard deviations.
      # calculate the value that is corresponding to z-score 3
      (pandaDF[valueHeaderName].std(ddof=0) * 3) + pandaDF[valueHeaderName].mean()

      70.601299674769308

    2. Similarly, enter the following command in the next cell to derive the dip threshold value corresponding to a z-score of -3, and click Run.
      (pandaDF[valueHeaderName].std(ddof=0) * -3) + pandaDF[valueHeaderName].mean() 

      20.066561436341793
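
    As a small convenience sketch, you can compute both thresholds together by evaluating X = µ ± 3σ in one cell (the variable names here are illustrative):

      mean = pandaDF[valueHeaderName].mean()
      std = pandaDF[valueHeaderName].std(ddof=0)  # population standard deviation
      spike_threshold = mean + 3 * std
      dip_threshold = mean - 3 * std
      print('spike threshold:', spike_threshold)
      print('dip threshold:', dip_threshold)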

    In this section, we saw how to derive threshold values for the given historical data. In the next section, we will see how to create rules in Watson IoT Platform.

  7. Create Rules in Watson IoT

    This section shows how to create a rule based on the threshold values that you just derived. To get familiar with the Watson IoT Platform and with connecting devices to it, refer to the recipe Visualizing Data in Watson IoT Platform. Simulate a temperature device by using that recipe, and then proceed with the steps below.

    Create a Schema

    1. In the Devices tab, select the Manage Schemas tab as shown below,
    2. Click Add Schema to add a new schema,
    3. Select the DeviceType for which the schema is created and click Next,
    4. Click Add a property to add the datapoint from the connected devices.
    5. Select “From Connected” option as shown below and select the temperature property. Note that the device must be sending the events for you to see the datapoint. You can also add a property manually if the device is not connected and sending events.
    6. Click Finish to create the schema.

    Create an Action

    In this example, let us create an E-mail action, so that an E-mail is sent to the concerned person whenever the temperature crosses the threshold values that we derived.

    1. In the Rules tab, select the Actions tab as shown below,
    2. Click the “Create An Action” button, provide a name, and select Email as the action as shown below,
    3. Click Next and provide the E-mail address and click Finish to create the action.

     Create a Rule

    1. In the Rules tab, select the Browse tab and click “Create Cloud Rule”,
    2. Provide a name for the Rule, select the schema name in the “Applies to” column and click Next,
    3. Set the first condition as shown below,
    4. Then select OR and add the second condition as shown below,
    5. Once the conditions are set, set the action to the E-mail action as shown below and click Activate to activate the rule,
    6. This will send an alert whenever the temperature crosses the set threshold values.

    In this section, we saw how to create a rule and actions with the derived threshold values to monitor the current IoT device's data in real time.

  8. Conclusion and the Road Ahead

    This recipe showed how to use the z-score to detect anomalies in historical timeseries data using the IBM Data Science Experience in a few simple steps. It also showed how to derive threshold values from the given historical data and set rules accordingly in IBM Watson IoT Platform to create real-time alerts. Developers can look at the code made available in this recipe, and at the Notebook in the GitHub repository, to understand what is happening under the hood. Developers can treat this recipe as a template for detecting anomalies in their historical IoT data and modify the Python code to suit their use case.
