Skill Level: Any Skill Level

This recipe shows how to extract, transform and load archived IoT event data from the IBM Watson IoT Platform using IBM Cloud Object Storage, Apache Spark, Jupyter notebooks and Scala under the hood of IBM Watson Studio.


Software Requirements

Recipe Requirement
The following three IoT recipes are optional prerequisites.

The first two recipes demonstrate how to capture IoT event data coming from an Elevator Simulator serving as a digital twin. The first recipe covers the elevator simulator and basic functions of the IBM Watson IoT Platform, whereas the second recipe shows how to define triggers and rules to react to problems with the elevators caused by maintenance stops when the motor temperature goes beyond 200 degrees Fahrenheit. The recipe titled ‘Store Elevator device data and alerts in Cloudant NoSQL DB’ shows how to store IoT device data in Cloudant NoSQL DB using e.g. the Historical Data Storage Extension.

If you have run through all three recipes, you will be able to use your own IoT event data as input to the Jupyter notebooks developed in this recipe. If you haven't, no problem: you can download the input files with archived IoT event data from a GitHub repository and then upload these files to IBM Cloud Object Storage.


  1. Introduction

    In this recipe you will get to extract, transform and load (ETL for short) archived IoT event data from the IBM Watson IoT Platform using IBM Cloud Object Storage, Apache Spark, Jupyter notebooks and the Scala programming language under the hood of IBM Watson Studio.

    The use case behind this solution is quite straightforward: when using IBM Watson IoT Platform for managing IoT devices, incoming IoT event data can be stored in a Cloudant NoSQL database serving as a landing zone for the IoT event data. Eventually this data – in the form of time series information – will need to be archived in storage that is cheaper and has larger capacity than a NoSQL database. Object storage such as IBM Cloud Object Storage is ideal for this purpose. However, the archived data may still be needed for retrospective analysis should issues occur later on with specific IoT devices. In such a situation it is possible to access and extract the archived data using IBM Watson Studio, and then filter, aggregate and transform the raw IoT event data (in the form of time series information) until it turns into information in a form relevant for the business.

    This recipe takes its starting point in a single archived file of IoT event data for elevators. These elevators will – unfortunately – go out of regular operation and enter maintenance mode should the motor temperature go beyond the threshold value of 200 degrees Fahrenheit. In this recipe you will use IBM Watson Studio to extract the archived time series information for the elevators and then filter, transform and aggregate this information until it reaches a form that clearly shows the date and maximum motor temperature when an elevator went into maintenance mode caused by an overheated motor. In a first step, you will use a Jupyter notebook for Scala that takes a single archive file as input. In a second step we will show how features of the Scala programming language, in particular the higher-order functions map and reduce together with Spark SQL union, can be used to extend the notebook to take a collection of archive files as input rather than a single archive, with just two additional lines of code.
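    The two-line map/reduce extension mentioned above can be sketched in plain Scala, without Spark at all. In this sketch the archives are hypothetical in-memory lists of records, loadArchive stands in for spark.read.json, and list concatenation (++) stands in for Spark SQL's union:

```scala
// Plain-Scala sketch of the map/reduce pattern used later in the recipe.
// `loadArchive` and the archive contents are made up for illustration;
// in the notebook, spark.read.json and DataFrame union play these roles.
object ArchiveUnionSketch {
  // Hypothetical archives keyed by file name
  val archives: Map[String, List[String]] = Map(
    "db05.json" -> List("event-a", "event-b"),
    "db07.json" -> List("event-c"),
    "db08.json" -> List("event-d", "event-e")
  )

  def loadArchive(name: String): List[String] = archives(name)

  // The two lines that generalize a single-archive notebook to many archives:
  val frames   = List("db05.json", "db07.json", "db08.json").map(loadArchive)
  val combined = frames.reduce(_ ++ _)
}
```

    The shape of the real notebook code is identical; only the element type (DataFrame instead of List) and the combining operation (union instead of ++) differ.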

    This recipe will take you through the following basic steps needed to create, test and schedule the ETL job:

    • Getting started by signing in to IBM Watson Studio.
    • Create the project that will be used to manage the relevant services, data connections and notebooks.
    • Import, configure and run a simple Scala notebook used to filter, transform and aggregate archived IoT event data before it is stored in a Db2 table.
    • Use IBM Watson Studio to view the content of the database and thus check that the notebook has worked as intended.
    • Import, configure and run a more refined Scala notebook that can perform the same set of transformations but on a collection of archived files rather than just a single file.
    • Create a Catalog and share relevant assets such as data connections (and notebooks) for use in other projects.


    Before you start, let’s take a short look at the architecture and rationale behind the solution.

  2. Architecture

    The main goal of this recipe is to define a Jupyter notebook using Scala and Apache Spark as the underlying engine that can load and analyze archived time series data stored as hierarchically structured JSON documents in a file. The following diagram shows the architecture in terms of the involved IBM Cloud services and data flows for the recipe:


    IoT events from the elevators are captured by the IBM Watson IoT Platform and stored in a Cloudant NoSQL database using the Historical Data Storage Extension of the IBM Watson IoT service. The event data is stored in so-called buckets, where each bucket serves as storage for event data in a defined time period, e.g. for a specific day or month. When the data is no longer needed in the Cloudant NoSQL DB it will be archived to IBM Cloud Object Storage, and the bucket can then be deleted from Cloudant to free up space and reduce storage costs. However, there may still be a need to perform retrospective analysis on such archived data, which can then be done using IBM Watson Studio and the underlying Apache Spark technology. In this recipe we shall define a Jupyter notebook for Scala that will extract, transform and finally load the aggregated data into a database table managed by the IBM Db2 Warehouse on Cloud service running on IBM Cloud. The following diagram outlines the transformation done:

    Apache Spark – one of the key technologies used in this recipe – is a distributed computing platform designed to be a fast and general-purpose engine for processing big data, be it at rest or in motion. Spark applications run as independent sets of processes on a cluster, coordinated by a driver program. The root object is SparkContext, which can be used to connect to the cluster manager, which allocates resources across the cluster. Spark comes with its own scheduler but also works with YARN and Mesos. Once connected to the cluster manager, Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for the application.

    When a Spark program is run, data is partitioned into different blocks across the cluster. The driver program sends the code to be executed to each worker node in the cluster. The executors on each node/block will perform the work using tasks. After the first action completes, the result is sent back to the driver. Data local to each node may be cached to speed up re-use of intermediate results. The following figure (from https://spark.apache.org/docs/latest/cluster-overview.html) outlines the cluster architecture:


    Apache Spark achieves high speed by using in-memory computations and by distributing the computations across a cluster of nodes. Like Hadoop, it supports the MapReduce paradigm for transforming data, but with significantly better performance. Like Hadoop, it supports batch applications, but also interactive queries and stream processing.

    Spark's primary abstraction is the so-called Resilient Distributed Dataset – or RDD for short. An RDD is a distributed collection of elements stored across the nodes of the Spark cluster, which provides the basis for parallelized processing of data across the cluster. It offers two forms of operations: transformations and actions. Transformations are lazily evaluated operations that take one or more RDDs as input and yield an RDD as result. No operations are, however, performed immediately on the RDD by transformations until an action is performed; rather, Spark builds up a Directed Acyclic Graph (DAG) that can be optimized before actual execution. This mode of operation resembles the inner workings of lazy functional programming languages such as Haskell. Actions are the operations that trigger execution of a sequence of transformations on an RDD and return a value. This mode contributes to the fault tolerance of a Spark cluster: should a node disappear or intermediate results be lost, Spark has the option to create a new node, load the data and perform the operations of the DAG again. To improve performance when handling large datasets and complex transformations it is furthermore possible to cache intermediate results – either in memory, on disk, or using a combination of both.
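    The distinction between lazy transformations and eager actions can be mimicked in plain Scala with collection views (a hypothetical analogue, not Spark itself): a view records map and filter steps without running them, just as Spark builds up a DAG, and a terminal operation such as toList plays the role of an action that forces evaluation.

```scala
// Plain-Scala analogue of lazy transformations vs. eager actions.
// The counter shows that nothing is computed until `toList` (the "action") runs.
object LazySketch {
  var evaluations = 0

  val pipeline = (1 to 10).view
    .map { n => evaluations += 1; n * 2 } // transformation: recorded, not run
    .filter(_ > 10)                       // transformation: recorded, not run

  val beforeAction = evaluations          // still 0: nothing evaluated yet

  val result = pipeline.toList            // "action": forces the whole pipeline
}
```

    The same discipline is what lets Spark optimize an entire chain of transformations before executing any of them.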

    The Spark Core is the foundation of the overall Spark system. It provides distributed task dispatching, scheduling, and basic I/O functionality, exposed through an application programming interface centered on the RDD abstraction. On top of the core, Spark provides four libraries: Spark SQL for defining transformations using SQL-like queries, Spark Streaming for handling real-time processing of data in motion, MLlib for machine learning and GraphX for graph processing. RDDs, and the operations on RDDs in the form of transformations and actions, however remain the underlying kernel abstraction of Spark.

    Spark provides APIs for Scala, Python, Java and R. Its native language is Scala, which will also be used in this recipe. Scala uses the Java Virtual Machine as its runtime system but extends Java with features of functional programming languages, such as functions being first-class values and pattern matching: ideal constructs for defining programs that transform data. At the end of this recipe you will end up with a Scala notebook whose kernel logic looks like this:
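    As a small taste of those two language features, here is a hypothetical example in the spirit of the recipe's domain (the event types and field names are made up, not the notebook's actual schema):

```scala
// First-class functions and pattern matching on a made-up elevator-event type.
object ScalaFeatures {
  sealed trait Event
  case class MotorTemp(deviceId: String, fahrenheit: Double) extends Event
  case class DoorCycle(deviceId: String) extends Event

  // Pattern matching turns a raw event into an optional temperature reading
  def temperature(e: Event): Option[Double] = e match {
    case MotorTemp(_, f) => Some(f)
    case _               => None
  }

  val events: List[Event] =
    List(MotorTemp("01", 231.0), DoorCycle("02"), MotorTemp("03", 180.0))

  // `temperature` is a first-class value passed straight to flatMap
  val overheated = events.flatMap(temperature).filter(_ > 200.0)
}
```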


    The notebook will load the archived files first (top). The kernel logic for transforming a single archive file is defined by a transformation function of the object ElevatorEventDataTransformer. The key lines of code are, however, provided in the third cell, where map and reduce functions on lists are used to perform the bulk operations on a collection of archived files in two lines of code. The code for storing the result in a database table is not shown above. We will get to these details later in the recipe. Notice that the notebook uses Spark SQL and data frames – an abstraction over RDDs – as the main vehicle for performing the transformations. The operations in the transformation function (cell 2) are therefore very much akin to SQL, using select, filter, groupBy and orderBy.

    IBM Watson Studio provides users with environment and tools to solve business problems by collaboratively working with data. Users can choose the tools needed to analyze and visualize data, to cleanse and shape data, to ingest streaming data, or to create, train, and deploy machine learning models.


    The main functionality offered relates to components for:

    • Create Projects to organize the resources (such as data connections, data assets, collaborators, notebooks) to achieve an analytics goal.
    • Access data from Connections to your cloud or on-premises data sources. Upload files to the project’s object storage.
    • Create and maintain Data Catalogs to discover, index, and share data.
    • Refine data by cleansing and shaping the data to prepare it for analysis.
    • Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run code that processes data and then view the results inline. Alternatively use RStudio for R.
    • Ingest and Analyze Streams data with the Streams Designer tool.
    • Create, test and deploy Machine Learning and Deep Learning models.
    • Classify images by training deep learning models to recognize image content.
    • Create and share Dashboards of data visualizations without coding.


    IBM Watson Studio is technically based on a variety of Open Source technology as depicted in the following diagram:


    In this recipe we shall use IBM Cloud Object Storage as the archive for IoT event data. We might as well have used Hadoop and Hadoop-related technology, such as Hive, to transform the raw IoT event data into a form suitable for analysis. Using Hadoop rather than IBM Cloud Object Storage would just change the way in which we connect to the data source and load the IoT event data into Spark. The rest could remain the same (unless, of course, the decision is taken to transform the data using Hive).

    IBM Cloud Object Storage is designed for use cases such as archiving, backup/restore, large content repositories for video and images, born-on-cloud social and mobile workloads, advanced analytics and IoT. It is not limited to a public cloud, but provides a consistent, uniform and open technology across on-premises, dedicated and public clouds, which makes it simpler and more efficient to manage data in a hybrid setting. IBM Cloud Object Storage is designed for unstructured data and stores data as discrete objects in a flat address space (called buckets). It is a massively scalable, simple-to-manage storage technology that is particularly comfortable to use inside IBM Watson Studio, where e.g. a connection and a load of data can be provided with simple button clicks and no user-defined programming at all. It is implemented as a virtual pool of storage with distributed access, which in turn provides very high data durability, integrity and availability by replicating data automatically. Its continuous-availability architecture means it is always available, tolerating catastrophic regional outages without any downtime or intervention.

    Last but not least, it has built-in security, providing strong data-at-rest confidentiality by combining encryption and information dispersal, protecting against digital and physical breaches. IBM Cloud Object Storage is a dispersed storage mechanism that uses a cluster of storage nodes to store pieces of the data across the available nodes. IBM Cloud Object Storage uses an Information Dispersal Algorithm (IDA) to break files into unrecognizable slices that are then distributed to the storage nodes. No single node has all the data, which makes it safe and less susceptible to data breaches while needing only a subset of the storage nodes to be available to fully retrieve the stored data. This ability to reassemble all the data from a subset of the chunks dramatically increases the tolerance to node and disk failures. By automatically encrypting the data it also increases the level of security significantly. More information on IBM Cloud Object Storage can be found in the Redbook “IBM Cloud Object Storage Concepts and Architecture“.

  3. Getting Started

    To get started you will need to go through the following 2 steps:

    1. Register for IBM Watson Studio.
    2. Create a Db2 service on IBM Cloud.


    If you have already obtained access to IBM Watson Studio (or its predecessor IBM Data Science Experience) then you can skip (1). Moreover, if you have already instantiated a Db2 Warehouse on Cloud or Db2 on Cloud service instance that you can use for this exercise (e.g. during the recipe Create Data Warehouse for reporting on Elevator device data using IBM DB2 Warehouse on Cloud) then you can skip (2) as well and just get hold of the credentials for that service. If not, follow the instructions in this section to provision the service and create the service credentials.

    Let’s obtain access to IBM Watson Studio first. You can register with IBM Watson Studio by going through the following steps:

    1. Click the link IBM Watson Studio. This will open up a browser window for IBM Watson Studio.
    2. Click Sign Up (or alternatively Sign Up for a Free Trial). You should then see the following page:
    3. Click the link that allows you to sign in with your existing IBM ID.
    4. Enter your IBM ID and click Continue.
    5. Enter your IBM Password and click Sign In. You should now see a screen like this:
    6. Select Watson Studio, then select the Lite plan that will allow you to use the tool without costs.
    7. Once in IBM Watson Studio, you will see the welcome page.


    Next, create a Db2 Warehouse service on IBM Cloud. To achieve this, follow these steps:

    1. Log into IBM Cloud.
    2. Select the Catalog.
    3. Select the Data and Analytics tab to the left.
    4. Select the Db2 Warehouse service.
    5. Accept the defaults for the service (service name, pricing plan) on the next screen and click Create.
    6. Select the Service credentials tab to the left.
    7. Click the New Credentials button to the right.
    8. Select View credentials to the right.
    9. Copy the credentials to a local file and save them on your file system for later use.


    Having created the project and the Db2 Warehouse on Cloud service you are now almost ready to create, configure and test a Scala notebook for extracting, transforming and loading data from Cloud Object Storage into a table managed by the Db2 service.  However, before doing so, you will need to create and configure a project for managing the relevant assets such as the notebook and its data connections. We will do this in the next section.

  4. Create and Configure Project

    Projects allow you to orient a team of collaborators around a set of notebooks, data sets, articles, and analysis work streams (amongst others). This makes it easier for organizations to re-use assets, and for teams to establish and enforce shared best practices while collaborating with each other and learning from the community.

    In this section we shall create a new project for the artifacts of the recipe. Since we need to use the Spark service for the notebook, this will require 3 steps:

    1. Provision a Spark service.
    2. Create the project.
    3. Associate the Spark service with the project.


    We will create the Spark service right away. This service can be shared across projects, which in turn means that if you have already created one you can reuse that service and skip this step and continue with step (2):

    1. Select the menu item Services > Compute Services from the toolbar menu.
    2. Click the Add Service button to create a new service.
    3. In the next screen showing a choice of services, Add the Apache Spark service.
    4. Select the Lite plan and click Create.
    5. In the next screen, check that the Spark service has been created:


    To create a project do the following:

    1. Select Projects > View all Projects from the toolbar menu.
    2. Click New Project. This will open up the New Project dialog for defining the properties of the project.
    3. Enter the name and description of the project (e.g. using ‘iot-elevators’ as project name).
    4. Click Create to create the new project.
    5. Close any “Welcome to project” dialog that may pop up.


    You should now see the project in IBM Watson Studio and will be ready to establish connection to data sources and create assets like notebooks. Next you will need to add the service that you created to the project:

    1. Select the Settings tab of the project (it is the last in the list of tabs in the top of the screen).
    2. Scroll down to the section named “Associated services”.
    3. Select the Add Service pull-down menu to the right and select Spark.
    4. Click the + button to add a new Spark service.
    5. Add an existing service instance by clicking the down-arrow, then select the service instance that you created in the beginning of this section (it is likely that it will be named ‘DSX-Spark’):
    6. Click the Select Button.
    7. The selected Spark service should now appear as an associated service of the project.


    Notice that Cloud Object Storage has already been provisioned for storing the artefacts of the project, such as the Jupyter notebook that you will create in the next section.

    The project has now been created and IoT event data has been archived on IBM Cloud Object Storage, which means that we are ready to create and test a Scala notebook for analyzing the event data.


  5. Create and Configure Scala Notebook

    Jupyter notebooks are web-based environments for interactive computing. You can run small pieces of code that process your data and immediately view the results of your computation. Notebooks include all of the building blocks you need to work with data: data, code, computations, visualizations, text and rich media to enhance understanding.

    In this section of the recipe you shall:

    1. Create a new notebook for Scala programming using an existing notebook on GitHub as starting point.
    2. Upload an archived Cloudant NoSQL database to the Cloud Object Storage.
    3. Update the notebook to use this file as input source for analysis.
    4. Run the notebook one step at a time – interactively – to load and transform the event data in the file.


    First, download the needed artefacts from GitHub:

    1. Click the link “https://github.com/EinarKarlsen/iot-cos-ws-analysis“.
    2. Download the files in the GitHub repository to your local computer and unzip the zip file.


    For this section you will need the file ‘db08.json’ that contains sample IoT event data that shall be uploaded to IBM Cloud Object Storage, as well as the file ‘historic-elevator-event-data-analysis template.ipynb’ that contains the Jupyter notebook that will read the event data and aggregate and filter it before it stores the results in a Db2 database. You will need the first file to create the input data and the second as a starting point for the Jupyter notebook.

    To create the notebook, do the following:

    1. Select the Assets tab for the current project (you may need to scroll up to the top of the page you are on first in order to see the list of tabs relevant for a project):
    2. Click New notebook. This will open the Create Notebook dialog for defining the properties of the notebook:


    3. Click the down arrow at the bottom of the page. This will display the possible set of runtime environments for the notebook.
    4. Select DSX Spark.
    5. Select the File tab to import the notebook from the file system.
    6. Click Choose File and navigate to the file that you downloaded (it is named ‘historic-elevator-event-data-analysis template.ipynb’).
    7. Click Create Notebook.


    Notice that you could alternatively have created the notebook using the GitHub URL as the starting point (https://github.com/EinarKarlsen/iot-cos-ws-analysis/blob/master/historic-elevator-event-data-analysis%20template.ipynb) if you had selected the From URL tab rather than the From file tab.

    The notebook uses Scala as the programming language together with the Spark API for Scala. In this recipe we shall use Data Frames and Spark SQL, which allow the program to query structured data inside Spark programs using SQL-like commands. We will do this step by step, which will allow you to print out intermediate results and observe the effect of the commands. This is not exactly production-ready code. In section 7 we will therefore introduce another version of the notebook, which uses Scala in a more proficient manner and which also allows you to perform the analysis using a collection of archive files as input rather than just a single archive file.

    The notebook has three sections. The first section is for extracting event data from the input file on IBM Cloud Object Storage and reading the data into a Spark Data Frame. This code is actually missing and you will need to insert it, which can be achieved in IBM Watson Studio with a few simple user interactions:


    The second section of the notebook transforms the data into an aggregated and filtered form that just shows on which days specific elevators had issues because the motor temperature went beyond 200 degrees Fahrenheit:


    The third and last part of the notebook loads the data from the Data Frame to a database table managed by a Db2 service running on the IBM Cloud. The interaction with the Db2 database is achieved using a JDBC driver. Before you can run the notebook and save the results you will need to modify the notebook to use the credentials for the Db2 service that you provisioned in section 3:


    Before you can start testing the notebook you will need to upload the archive file with IoT event data and then configure the notebook to connect to that file. Let’s do that now:

    1. Select the Find and add data button in the top right corner of the toolbar.
    2. Drag and drop the file ‘db08.json’ that you downloaded from GitHub to the drop area.
    3. Wait until the file has been properly uploaded.
    4. Select the cell number ‘[1]’:
    5. Click the Insert to code pulldown menu and select Insert SparkSession DataFrame.
    6. This command will insert all the connection details for you, including a command to initialize the Spark session and load the data frame with data from the uploaded file:
    7. Delete the placeholder “<here>” at the beginning and the show command at the end. Also make sure that the variable holding the loaded data is named dfData1.
    8. Invoke File > Save to save the changes.


    You can now start testing the notebook:

    1. Click the Run button in the toolbar. This will load the data into the data frame associated with the variable dfData1.
    2. Wait, however, until the cell marker changes from “[*]” (indicating that the cell is executing) to “[1]” (indicating that execution has finished).
    3. In the second cell, set the variable named debug to true. This will ensure that intermediate results are printed out so that you can see what the individual statements do.
    4. Click Run. This will print out the first 5 rows of the data and the schema. There are null rows as well as null columns that need to be ignored. Moreover, the relevant information – i.e. the IoT event data – is stored in the doc column.
    5. Execute the next 2 rows that will remove the null rows and select the relevant columns:
    6. Then run the next two cells of the notebook that will only select rows where the motor temperature is above 200 and add a new column that turns the timestamp into a proper date:
    7. Finally, execute the last two cells of the transformation section to group and order the event data accordingly:
    8. Invoke File > Save to save the changes.


    The IoT event data has now been filtered, transformed and aggregated into a form relevant for analysis. At the least, we can determine that for the current set of data (from August 2017) there were 3 incidents where the motor temperature exceeded 200 degrees Fahrenheit, and that it impacted elevators 01 and 03 only.
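    The filter, date-derivation and aggregation steps you just ran can be sketched in plain Scala collections (the sample readings and field names below are made up for illustration; the notebook performs the equivalent operations on a Spark Data Frame):

```scala
// Plain-Scala analogue of the notebook's transformation: keep readings above
// 200 °F, derive the date from the timestamp, then take the maximum
// temperature per (deviceId, date). Sample data is hypothetical.
object TransformSketch {
  case class Reading(deviceId: String, timestamp: String, motorTemp: Double)

  val readings = List(
    Reading("01", "2017-08-03T10:15:00", 231.0),
    Reading("01", "2017-08-03T10:20:00", 226.0),
    Reading("02", "2017-08-04T09:00:00", 180.0),
    Reading("03", "2017-08-07T14:30:00", 215.0)
  )

  val issues: Map[(String, String), Double] =
    readings
      .filter(_.motorTemp > 200.0)                       // filter: overheated only
      .groupBy(r => (r.deviceId, r.timestamp.take(10)))  // groupBy device + date
      .map { case (key, rs) => key -> rs.map(_.motorTemp).max } // max per group
}
```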

    Before you can load that data into the Data Warehouse you will need to establish a connection to the Db2 service on the Cloud and get the credentials for that service assigned with a variable of the notebook.

    To achieve this, do the following:

    1. Select the Data button in the top right of the toolbar.
    2. Select the Connections tab.
    3. Select the link named “project page” to take you to the project page. The project should show the notebook that you created as well as the file that you uploaded to Cloud Object Storage:
    4. Select the Add to project command.
    5. In the pull-down menu, select the Connection menu item to create a new connection. This will take you to the page that shows the possible connections in form of IBM and Third-party services:
    6. Select Db2 on Cloud.
    7. In the next dialog, enter the credentials for the Db2 service that you obtained and saved to your local file system in section 3. You will need to provide the host name, the database name, the user name and the password.
    8. Enter a name for the connection such as “Db2 Warehouse on Cloud”.
    9. Click Create to create the connection. This will take you back to the project page:


    You should now see the connection as a data asset of the project. Next, you will need to modify the notebook so that the credentials are properly included, in order to establish the connection to the Db2 service on Cloud. To achieve this, go back to the web page in your browser that contains your notebook and do the following:

    1. Select the cell in the notebook where the credentials are supposed to be inserted:
    2. Select the Data icon.
    3. Select the Insert to code pull down menu.
    4. Select the command Insert Credentials.
    5. When the credentials are inserted, delete the placeholder “<here>” and rename the variable storing the credentials to credentials_1:
    6. Make sure that the properties are named as shown above. These names will be used in the next cell to construct the URL for the Db2 JDBC driver. Also make sure that the port property is properly defined. If you need to enter it, look up the port number in the file holding the credentials that you saved in section 3.
    7. Execute the updated cell and all remaining cells in the notebook to store the information in a Db2 table:


    Observe that the code for storing information in Db2 is not exactly production-ready code. It will simply append the information stored in the Data Frame to the table named “sample_data_table” in your default schema (which happens to have the same name as your Db2 user name). Every time the write command is called it will append the information to the table rather than overwriting records with the same keys (which in this case would amount to deviceId and Date). But the code is sufficient for now to show how to quickly save a data frame to Db2. In the next section we will check that this has worked properly and use the Data Catalog feature of IBM Watson Studio to publish the relevant data connections to a data catalog.
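    The append-versus-upsert issue just described can be illustrated in plain Scala, modelling the table as an in-memory list (a hypothetical sketch, not JDBC code): a blind append duplicates rows on every run, whereas a merge keyed on (deviceId, date) keeps a single row per key.

```scala
// Why repeated appends duplicate rows, while a keyed merge does not.
// The "table" is just an in-memory list standing in for the Db2 table.
object AppendVsUpsert {
  case class Row(deviceId: String, date: String, maxTemp: Double)

  val batch = List(Row("01", "2017-08-03", 231.0))

  // Blind append, as the notebook does: running it twice duplicates the row
  val appendedTwice: List[Row] = batch ++ batch

  // Keyed merge ("upsert"): later rows replace earlier ones with the same key
  def upsert(table: List[Row], rows: List[Row]): List[Row] =
    (table ++ rows)
      .groupBy(r => (r.deviceId, r.date))
      .values.map(_.last).toList

  val mergedTwice: List[Row] = upsert(upsert(Nil, batch), batch)
}
```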


  6. Manage Data Sources

    To test whether the data has been correctly stored in the database, you do not need to leave IBM Watson Studio at all (as was previously the case with the predecessor product, IBM Data Science Experience). We can view the table by adding it as a data source to the project. To achieve this, do the following:

    1. On the project page, select Add to project:
    2. In the pull-down menu, select Connected data as shown above.
    3. In the resulting dialog “Add data asset from connection”, click the link titled “Select Source“.
    4. In the next dialog, select the Db2 Warehouse data connection that you created in the previous section:
    5. Select the schema with the same name as your Db2 service user name (it starts with DASH).
    6. Select the table created in the previous section of the recipe.
    7. Click Select.
    8. Back in the “Add data asset from connection” dialog, enter a name and description for the data asset:
    9. Click Create to create the asset. This will take you back to the project page. Observe that a new data asset has been created:
    10. To test if the data has been stored correctly, simply click the link named “Elevator Issues 2017.08”:


    As you can see, the data has been inserted correctly.

  7. Complementary Scenarios

    So far you have worked with a single archive file that was downloaded from a GitHub repository. In this section we shall show you how to use multiple archive files as input so that the analysis can cover a larger time interval. We will also show you how you can export data out of a Cloudant NoSQL database so that you can analyze any data that you stored while running through the steps of the recipe ‘Store Elevator device data and alerts in Cloudant NoSQL DB’.

    We will show you how to extend the initial notebook to cover multiple archive files as input first. You will need to go through the following steps:

    1. Upload more archive files to IBM Cloud Object Storage.
    2. Create a new notebook using a template on GitHub following the same procedure as in section 5.
    3. Add your IBM Object Store credentials to the notebook following the same procedure as in section 5.
    4. Add your credentials for IBM Db2 Warehouse on Cloud following the same procedure as in section 5.
    5. Run the notebook.
    6. Test that the notebook has updated the Db2 database using the same procedure as in section 6.


    The GitHub repository (https://github.com/EinarKarlsen/iot-cos-ws-analysis) contains two additional archive files, for May and July 2017, named ‘db05.json’ and ‘db07.json’ respectively. Upload these archive files to IBM Cloud Object Storage using the same procedure as you used when uploading ‘db08.json’ in section 5 of the recipe. You will then have 3 archive files to work with as input.

    Next, create a new notebook using the notebook ‘historic-elevator-event-data-collection-template.ipynb’ that you downloaded as a template. Go through the same steps as in section 5. The notebook should look like this:


    Having done this, configure the notebook to connect to one of the uploaded archive files, going through the same steps as you did in section 5. However, replace the last inserted line of code that opens a single input file in cell ‘[1]’ with the following:

    val dfList1 = List("db05.json","db07.json","db08.json").map(db => spark.read.json(cos.url("iotelevator-donotdelete-pr-dchdyhr697izhi", db)))

    You will probably need to replace the string “iotelevator-donotdelete-pr-dchdyhr697izhi” with the corresponding string that was generated in your case. Also update the notebook to use the names of the archive files that you have uploaded to IBM Cloud Object Storage. When this line of code is executed, the list of archive file names is passed to the map function, which will, for each element of the list, call spark.read.json to open the corresponding archive file. This function in turn yields a Data Frame as a result.

    Cell ‘[2]’ declares a Scala object with a single transformation that does exactly the same as in the first notebook, but in a more condensed way using functional composition. The Data Frame passed as input to the transform function is first filtered by calling filter. This transformation returns a new Data Frame, which is used as input to the select function. The select function selects a subset of the columns in the Data Frame and returns a handle to a new Data Frame as a result. This Data Frame is passed as input to the next transformation (filter) – and so forth – until the resulting Data Frame is finally returned to the calling context.
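    The notebook's transformer works on Spark Data Frames, but the functional-composition pattern itself can be illustrated in plain Scala. The following is a simplified, hypothetical analogue – the field names, the 200-degree threshold and the exact chain of steps are assumptions for illustration, not the notebook's actual code:

```scala
// Hypothetical, simplified analogue of the notebook's transformer object.
// Each step returns a new collection that feeds the next step, in the same
// style as the Spark version, where each transformation returns a new Data Frame.
case class ElevatorEvent(deviceId: String, date: String, motorTemp: Double)

object ElevatorEventDataTransformer {
  // Keep only events above the temperature threshold, then project the
  // columns of interest -- mirroring DataFrame.filter followed by select.
  def transform(events: Seq[ElevatorEvent]): Seq[(String, String)] =
    events
      .filter(_.motorTemp > 200.0)      // analogue of DataFrame.filter
      .map(e => (e.deviceId, e.date))   // analogue of DataFrame.select
      .distinct                         // one row per device and day
}
```

    In the Spark version each step (filter, select and so on) likewise returns a new Data Frame, so the transformations compose in exactly the same way.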

    The essential lines of code are given in cell ‘[3]’:

    val dfListT = dfList1.map(df => ElevatorEventDataTransformer.transform(df))
    val dfReduced = dfListT.reduce((df1,df2) => df1.union(df2))

    The first line takes a list of data frames as input and applies the map function to this list in order to transform each element using the function ElevatorEventDataTransformer.transform. This yields as a result a list of data frames that have all been filtered, transformed and aggregated as in the first notebook. The second line of code reduces the list to a single data frame by applying the union transformation pairwise, thus yielding as the overall result the relational union of all data frames in the list dfListT – in other words, a data frame showing all days on which specific elevators had issues with too high motor temperatures, covering the time interval spanned by all the input archive files. Voila!
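    The map/reduce shape of cell ‘[3]’ is ordinary Scala collections code. A minimal, self-contained sketch of the same pattern – using plain lists of numbers in place of Data Frames and an invented stand-in for the transform function – looks like this:

```scala
// Plain-Scala analogue of the map/reduce step: each "archive" is a list of
// rows, a transformation is applied per archive via map, and the results are
// combined with reduce -- where list concatenation (++) plays the role of
// the Data Frame union transformation.
val archives: List[List[Int]] = List(List(1, 2, 3), List(4, 5), List(6))

// Stand-in for ElevatorEventDataTransformer.transform: keep even rows only.
def transform(rows: List[Int]): List[Int] = rows.filter(_ % 2 == 0)

val transformed = archives.map(a => transform(a))      // one result per archive
val reduced     = transformed.reduce((a, b) => a ++ b) // "union" of all results
```

    In the notebook the code has exactly the same shape, with df1.union(df2) in place of ++.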

    There is one last step that you need to perform before the notebook is ready to be executed from beginning to end. You will need to insert the credentials for the Db2 Warehouse service in cell ‘[4]’ so that the resulting Data Frame can finally be saved in the database. Follow exactly the same steps as in section 5 to achieve this, then run all steps of the notebook. To check whether the notebook works as intended, see section 6 for instructions.

    As an alternative to using the sample data files on GitHub, you can use one of your own Cloudant databases created by the Historical Data Storage Extension. This requires, of course, that you have run through the recipe “Store Elevator device data and alerts in Cloudant NoSQL DB”. The data can then be downloaded using curl, but you will need to know the credentials of the service and the names of the databases that you want to download.

    To obtain the credentials and database names for the service do the following:

    1. Go to IBM Cloud.
    2. Select the application for the Elevator Simulator.
    3. Click the Connections tab.
    4. Select the View credentials menu item for the Cloudant service as shown below:
    5. Copy the credentials for the service and store them in a file on your local computer.
    6. Select the IBM Cloudant NoSQL DB service to open up the dashboard for the service.
    7. Click the Launch button to open the console.
    8. In the console, click Databases to view the existing database names. Pick the database that you would like to use for reporting; it must, however, be one of the databases created by the Historical Data Storage Extension.


    You can now download a database with the name “<dbname>” and service url “<url>” with the following command:

    curl -X GET "<url>/<dbname>/_all_docs?include_docs=true" > /home/developer/Desktop/db.json

    From there, repeat the steps in this section of the recipe using the file that you just created as source file.

  8. Create Catalog

    IBM Watson Studio offers the concept of a Data Catalog, which allows you to share assets such as data connections, data and notebooks (amongst others) with users in other projects. Suppose that you would like to share the data connection to Db2 Warehouse on Cloud with other projects, so that you and other users will not need to redo the rather mundane steps of creating the connection every time you author a notebook that needs it. This is easily done. You will first need to create a catalog, then add the data connection to the catalog from within your project page.

    To achieve this do the following:

    1. Select Catalog from the toolbar menu.
    2. Invoke the New Catalog command, either by clicking the button or selecting the link:
    3. Enter a name and a description for the new catalog.
    4. Click Create.


    From the data catalog, it is possible to add new connections and data assets. However, this would require you to specify all the credentials again. We can avoid that by publishing artefacts from the project to the catalog. Simply do the following:

    1. Go back to the project page by selecting Projects > iot elevators (or whatever the name of your project is) from the toolbar:
    2. In the project page, select the data connection and invoke Publish from the pull down menu:
    3. Enter a description for the asset.
    4. Enter “elevators” as a tag for the asset (in the field with the text “Start typing to add tags”) and click the + button.
    5. Click Publish to publish the connection to the catalog.
    6. From the top toolbar menu, invoke Catalog > IoT Elevators to view the catalog:


    Notice that the tags can be used later on to filter assets.

    A major advantage of having catalogs is that it becomes more convenient to share and re-use assets. Rather than going through the tedious steps of getting hold of the credentials for a data connection and then creating that connection manually, you can simply add it to a project – either from within the catalog or from within a project:


  9. Conclusion

    In this recipe we have taken on the task to extract, transform and load (ETL for short) archived IoT event data from the IBM Watson IoT Platform using IBM Cloud Object Storage, Apache Spark, Jupyter notebooks and the Scala programming language under the hood of IBM Watson Studio.

    IBM Watson Studio provides users with an environment and tools for performing tasks of relevance to data engineers, data scientists and data stewards in a highly integrated manner. We have created, configured and executed a Jupyter notebook using Scala that analyzed IoT event data and filtered, transformed and aggregated the time series information to yield business-relevant information showing the malfunction of specific IoT devices. The underlying runtime platform, in the form of a Spark cluster, was provisioned with a few user interactions. During this process we took advantage of the fact that IBM Watson Studio can automatically insert the code required to connect to external storage such as IBM Cloud Object Storage, without any need for tedious manual programming. Having executed the notebook, there was no need to leave the environment in order to check whether the database table was created correctly; this could be achieved inside IBM Watson Studio by creating a new data source for the target table with a few user interactions. Finally, having created notebooks and data connections, we have shown how the concept of Catalogs can be used to share these assets with other projects in order to increase the level and convenience of reuse between projects.
