Overview

Skill Level: Any Skill Level

This recipe shows how to extract, transform and load weather data stored in IBM Cloud Object Storage using IBM SQL Query, pandas, Apache Spark, IBM Watson Studio and Jupyter notebooks for Python.

Ingredients

Software Requirements

To obtain an IBM Cloud Account and get access to the IBM Cloud and to IBM Watson Studio, please follow the instructions outlined here:

 

Recipe Requirement

This recipe covers part 3 of the 'Weather Data meets IBM Cloud' series.

Get the setup ready for this recipe, by covering the following prerequisite recipe:

Step-by-step

  1. Introduction

    The recipe titled ‘Weather Data meets IBM Cloud. Part 1 – Ingestion and Processing’ showed how to ingest and process weather data from the Weather Company into the IBM Cloud Platform using IBM Functions and IBM Event Streams. A second recipe titled ‘Weather data meets IBM Cloud. Part 2 – Storage and Query of Weather Data’ followed up by showing how to store and query weather data using IBM Cloud Object Storage and IBM SQL Query.

    In this recipe we shall continue with a more comprehensive demonstration of how to transform and aggregate weather data to prepare it for analysis and visualization. In doing so we will implement a number of Extract, Transform and Load (ETL) jobs for weather observation and weather lightning data using a variety of technologies, each with its pros and cons, namely IBM SQL Query, the pandas toolkit for Python and Apache Spark. We will start out by aggregating weather observation data, computing daily minimum, average and maximum temperatures using IBM SQL Query, and store the resulting information in IBM Cloud Object Storage as a CSV file. In a second step we will then look into lightning data and transform it into a form that can be used for analytics by using pandas – a toolkit frequently used for data wrangling purposes when developing data science and machine learning solutions. In a third step we will briefly demonstrate how Spark SQL can be used to aggregate weather observation data as well, and then finally store the results in a PostgreSQL database on IBM Cloud rather than a CSV file. For this step we will use the IBM Analytics Engine as well as IBM Cloud Databases for PostgreSQL. Key components in all tasks will be IBM Watson Studio and its support for Jupyter notebooks with Python.

    In the recipe you will go through the following exercises:

    • Section 3: Get started by provisioning the required services – e.g. IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics Engine running Apache Spark.
    • Section 4 & 5: Define transformations by aggregating weather data using IBM SQL Query as a standalone tool interactively, then continue running the same query using Jupyter notebooks and the IBM SQL Query API.
    • Section 6: Define transformations by transforming lightning data using Jupyter notebooks and pandas.
    • Section 7: Define transformations by aggregating weather observation data using Jupyter notebooks and Apache Spark SQL to finally store the data in a PostgreSQL database.

     

    Before the hands-on exercises we will briefly introduce the overall architecture and the main new components: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics Engine running Apache Spark.

  2. Architectural Overview

    Before you start it is important to provide you with some details on the architecture of the solution. The architecture diagram is shown below and is a typical IoT architecture with services for event data ingestion, event handling (message bus), a landing zone (object storage) and event data processing to transform, aggregate, load and visualize event data in a form so that it becomes business relevant. In other words, it is a data refinement pipeline turning “raw oil” into “business relevant information” covering data at rest.

    02.01b-Architecture-1

    In the first part of this series, weather data was retrieved from the Weather Company Data service by IBM Cloud Functions (based on Apache OpenWhisk) and delegated onwards to IBM Event Streams (based on Apache Kafka). In the second part of the series the IBM Cloud Object Storage Bridge was then configured to write data from IBM Event Streams to IBM Cloud Object Storage. Once the weather data had been stored in IBM Cloud Object Storage, it became possible to query the data using IBM SQL Query. As a prerequisite for writing and reading data from IBM Cloud Object Storage, the proper access permissions had, however, to be created using IBM Identity and Access Management (behind the scenes).

    In this part we shall continue transforming and aggregating the weather data in IBM Cloud Object Storage using a variety of technologies and then briefly discuss the pros and cons of each. None of the transformations performed will be complex. The main purpose of these transformations will be to prepare the data for analytics so that it can be used for defining dashboards; we have deliberately chosen to focus on the principles of using the technologies rather than delving into complex transformations requiring lengthy computations. For all tasks we will use IBM Watson Studio to develop Jupyter notebooks for Python.

    During the sections of this recipe we will work with two kinds of data: the weather observation data that was ingested and stored in IBM Cloud Object Storage using IBM Functions and IBM Event Streams, as well as worldwide lightning data for a given point in time. The latter will be available as a JSON file that has been extracted from the Weather Company API and made available to you. You will work with a pre-extracted file that just needs to be uploaded to IBM Watson Studio rather than real-time data for a very simple reason: extracting real-time lightning data from the Weather Company API requires a special API key with limited access and charges involved. So in this recipe we shall simply pretend that the data has already been ingested and work with an uploaded file.

    In this recipe three different kinds of technologies will be used for defining the transformations:

    1. IBM SQL Query: For the weather observation data we will initially aggregate the data to show minimum, average and maximum temperatures per location on a daily basis, making a deliberate distinction between night and day temperatures. For this we will use IBM SQL Query and continue the work from the second recipe by defining a SQL query that loads weather observation data from a COS bucket, aggregates it and then stores the results in a dedicated CSV file.
    2. Pandas: In a second step we will look into the lightning data. Unfortunately, this data comes in a form which can neither be processed by IBM SQL Query, nor be used for BI using IBM Watson Studio. The data must therefore be transformed into an acceptable form, and for this purpose we will use the pandas library for Python, which is frequently used for data wrangling purposes as long as the amount of data can adequately be processed in the memory of a single machine.
    3. Apache Spark: For large data sets we will need to adopt technologies other than pandas. In a third step we will therefore show how to use Apache Spark and Spark SQL to aggregate the weather observation data, but rather than loading the result into a CSV file as was the case for IBM SQL Query, we will load the aggregated data into a PostgreSQL database using the SQLAlchemy toolkit for Python. For this purpose you will need two additional services in IBM Cloud: IBM Analytics Engine offering Apache Spark (as well as Hadoop, which will not be needed since we are using IBM Cloud Object Storage as the data lake) and IBM Cloud Databases for PostgreSQL.

     

    In the context of data transformations, the terms Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) are usually mentioned as two different kinds of approaches.

    • ETL is the traditional approach adopted for data warehouses and is normally a well-defined workflow implemented using specific ETL tools. ETL first extracts data from various data sources. The data is then cleansed, enriched and transformed to finally be stored in a data warehouse. The data warehouse is typically well defined according to a schema, and adding new information to the DW is typically a rigorous process requiring changes to ETL jobs, the data warehouse schema and the required reports and dashboards.
    • ELT (Extract, Load, Transform) on the other hand is of newer origin and is a variant of ETL where the extracted data is first loaded into a target system. Transformations are then performed after the data is loaded. ELT typically works well when the target system is powerful enough to handle the transformations. Distributed repositories such as Hadoop or IBM Cloud Object Storage combined with e.g. Apache Spark are often used in ELT pipelines because they are highly efficient in performing transformations on large sets of data. ELT is less rigorous than traditional approaches using ETL and DWs, and it does not (necessarily) presume that the data is stored according to a schema.

     

    If we look at the architecture above ‘in the large’, it follows the ELT paradigm: data is extracted from The Weather Company and loaded into IBM Cloud Object Storage to eventually be transformed by e.g. IBM SQL Query. However, there is also an instance of a classical ETL approach in the architecture insofar as Apache Spark is used to extract data from IBM Cloud Object Storage, transform it by performing appropriate aggregations and finally load it into the data warehouse on PostgreSQL for further analysis.

    An overview of IBM Functions, IBM Event Streams, IBM Cloud Object Storage, IBM SQL Query and IBM Identity and Access Management has already been given in the previous two parts of this series. We shall therefore only provide a short overview of the three new components in this recipe: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics Engine.

     

    IBM Watson Studio

    IBM Watson Studio provides users with an environment and tools to solve business problems by working collaboratively with data. Users can choose the tools needed to analyse and visualize data, to cleanse and shape data, to ingest streaming data, or to create, train and deploy machine learning models.

    02.2-Watson-Studio

    The main functionality offered relates to components for:

    • Create Projects to organize the resources (such as data connections, data assets, collaborators, notebooks) to achieve an analytics goal.
    • Access data from Connections to your cloud or on-premises data sources.
    • Upload files to the project’s object storage.
    • Create and maintain Data Catalogs to discover, index, and share data.
    • Refine data by cleansing and shaping the data to prepare it for analysis.
    • Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run code that processes data and then view the results inline.
    • Ingest and Analyze Streams data with the Streams Designer tool.
    • Create, test and deploy Machine Learning and Deep Learning models.
    • Classify images by training deep learning models to recognize image content.
    • Create and share Dashboards of data visualizations without coding.
    • Reuse content such as data sets, notebooks and tutorials available on the Watson Community.

     

    IBM Watson Studio is technically based on a variety of open source and IBM technologies. In this recipe we shall define several Jupyter notebooks for Python that will extract, transform and then finally load the result of the transformation into either a file in IBM Cloud Object Storage or a database table. The picture below provides an overview of some of the components of IBM Watson Studio and the IBM Watson Data Platform. The ones with solid yellow lines are used in this recipe, the ones with dotted lines in previous or upcoming recipes:

    02.03-Watson-Data-Platform

    During the hands-on in this recipe you will work with a number of Jupyter notebooks as shown below – either using IBM SQL Query, pandas or Spark SQL. The notebooks will take data assets such as the weather data bucket of IBM Cloud Object Storage (connection ‘COSWeatherData’) holding weather observation or lightning data (‘Lightning_global_1min’) as input, and produce several CSV files as the result:

    1. ‘weather-observations.csv’ containing detailed weather data extracted from the JSON objects in IBM Cloud Object Storage.
    2. ‘weather-observation-temperatures.csv’ containing aggregated min, max and average temperatures.
    3. ‘lightning-data.csv’ containing lightning data in tabular form.

    02.04-Project-Artefacts

     

    IBM Analytics Engine and Apache Spark

    IBM Analytics Engine has been designed for the development and deployment of analytics applications using open source Apache Spark and Apache Hadoop. Using the service, it is possible to customize the cluster with your own analytics libraries and open source packages. The service also integrates with IBM Watson Studio and third-party applications to submit jobs to the cluster. In this recipe we shall take a narrow view of the service, focusing only on the ability to run a Spark job from within a Jupyter notebook managed by IBM Watson Studio.

    Apache Spark is a distributed computing platform designed to be a fast and general-purpose engine for processing big data – be it at rest or in motion. Spark applications run as independent sets of processes on a cluster, coordinated by a driver program. The root object is the SparkContext, which connects to the cluster manager, which in turn allocates resources across the cluster. Spark comes with its own scheduler but also works with YARN and Mesos. Once connected to the cluster manager, Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for the application.

    When a Spark program is run, data is partitioned into different blocks across the cluster. The driver program sends the code to be executed to each worker node in the cluster. The executors on each node/block perform the work using tasks. After the first action completes, the result is sent back to the driver. Data local to each node may be cached to speed up re-use of intermediate results. The following figure (from https://spark.apache.org/docs/latest/cluster-overview.html) outlines the cluster architecture:

    02.5-Spark-cluster-overview

    Apache Spark achieves high speed by using in-memory computations and by distributing the computations across a cluster of nodes. Spark's primary abstraction is the so-called Resilient Distributed Dataset – or RDD for short. An RDD is a distributed collection of elements stored across the nodes of the Spark cluster, which provides the basis for parallelized processing of data across the cluster. It offers two forms of operations: transformations (e.g. map, filter) and actions. Transformations are lazily evaluated operations that take one or more RDDs as input and yield an RDD as result. No operations are, however, performed immediately on the RDD by transformations until an action is performed; rather, Spark builds up a Directed Acyclic Graph (DAG) that can be optimized before actual execution. This mode of operation resembles the inner workings of lazy functional programming languages such as Haskell. Actions are the operations that trigger execution of a sequence of transformations on an RDD and return a value. This mode contributes to the fault tolerance of a Spark cluster: should a node disappear, or intermediate results be lost, Spark has the option to create a new node, load the data and perform the operations of the DAG again. To improve performance when handling large datasets and complex transformations it is furthermore possible to cache intermediate results – either in memory, on disk or using a combination of both.
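    The following minimal PySpark sketch illustrates the difference between lazy transformations and actions. It assumes a SparkContext named 'sc' is available (as in a Watson Studio notebook with a Spark runtime); the numbers are purely illustrative:

    # Transformations are only recorded; nothing runs until an action is invoked.
    rdd = sc.parallelize(list(range(1, 11)))    # RDD from a local collection
    evens = rdd.filter(lambda x: x % 2 == 0)    # transformation: added to the DAG
    squares = evens.map(lambda x: x * x)        # another transformation, still not executed
    total = squares.reduce(lambda a, b: a + b)  # action: triggers execution of the DAG
    print(total)                                # prints 220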

    The Spark Core is the foundation of the overall Spark system. It provides distributed task dispatching, scheduling and basic I/O functionality, exposed through an application programming interface centered on the RDD abstraction. On top of the core, Spark provides four libraries: Spark SQL for defining transformations using SQL-like queries, Spark Streaming for handling real-time processing of data in motion, MLlib for machine learning and GraphX for graph processing. RDDs, and the operations on RDDs in the form of transformations and actions, remain however the underlying kernel abstraction of Spark. Spark provides APIs for Scala, Python, Java and R. Its native language is Scala, but in this recipe we will use Python and Spark SQL, which provides a much higher level of abstraction than RDDs. When running SQL from within another programming language such as Python, the results are returned as a Dataset/DataFrame. Datasets provide the benefits of RDDs (strong typing, the ability to use powerful lambda functions) combined with the benefits of Spark SQL's optimized execution engine. A DataFrame is a Dataset organized into named columns and is conceptually equivalent to a table in a relational database; unlike Datasets, DataFrames are supported by Python.
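    As a small illustration of the DataFrame and Spark SQL abstractions used later in this recipe, the sketch below creates a tiny DataFrame, registers it as a temporary view and aggregates it with a SQL query (the column names and values are illustrative only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build a small DataFrame and query it with Spark SQL.
    df = spark.createDataFrame(
        [("Hamburg", 4.0), ("Hamburg", 6.5), ("Munich", 2.0)],
        ["obs_name", "temp"],
    )
    df.createOrReplaceTempView("observations")
    spark.sql(
        "SELECT obs_name, ROUND(AVG(temp), 1) AS avgtemp FROM observations GROUP BY obs_name"
    ).show()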

     

    IBM Cloud Database Services

    IBM Cloud™ Databases for PostgreSQL offers an object-relational SQL database, complemented by powerful enhancements like indexable JSON, publish/subscribe functions and drivers. The service is fully managed, with IBM Cloud taking care of high availability, backups, logging, monitoring, scaling, hardware setup and software patching. IBM provides ready-to-use integration with IBM Identity and Access Management to extend access control and auditing. IBM Cloud Databases for PostgreSQL deployments are built as highly available clustered configurations leveraging multiple availability zones in a given region, and they have been designed with scalability and economics in mind: Databases for PostgreSQL is billed hourly and enables you to scale disk and RAM separately for each of your deployments.

    02.02-IBM-Cloud-Databases

    High availability of IBM Cloud Databases is provided by running the databases in several availability zones, with one zone taking the role of leader and the other database instance the role of follower. Resiliency is achieved by backing up data to a cross-regional instance of IBM Cloud Object Storage.

  3. Getting Started

    We will assume that you have already gained access to IBM Cloud and possibly also to IBM Watson Studio (see the “Prerequisites” section at the beginning of the recipe for the links needed for registering).

    In this section of the recipe you will get started by doing the following:

    1. Create an instance of IBM Databases for PostgreSQL.
    2. Create an instance of the IBM Analytics Engine service (running Apache Spark).
    3. Create an instance of IBM Watson Studio.
    4. Create an IBM Watson Studio project.
    5. Upload a dataset for lightning data from GitHub.

     

    Provision IBM Cloud PostgreSQL Database Service

    You will need to create an instance of the PostgreSQL database and configure credentials for the service:

    1. Select the IBM Cloud Catalog.
    2. Enter ‘post’ as search string.
    3. Select the ‘Databases for PostgreSQL’ service.
      03.01-Create-PostgreSQL
    4. On the ‘Databases for PostgreSQL’ page:
      1. Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
      2. As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’).
      3. Add a tag (e.g. ‘weather-data-demo’) to the service.
      4. Scroll down and select for pricing the Standard plan.
    5. Click Create to create the service.
    6. Wait until the service has been created.
    7. Select Service Credentials in the toolbar to the left.
    8. Select the New Credential button.
      1. Enter a name for the credentials in the Add new credential dialog.
      2. Click Add to create the credentials.
    9. Select the View credentials action.
    10. Copy and paste the JSON object that defines the new credential to a local file for later use.

     

    Provision IBM Cloud Analytics Service

    You will need to create an instance of the IBM Cloud Analytics service:

    1. Select the IBM Cloud Catalog.
    2. Enter ‘spark’ as search string.
    3. Select the ‘Analytics Engine’ service.
      03.03-Create-Analytics-Engine
    4. On the ‘Analytics Engine’ page:
      1. Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
      2. As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’).
      3. Add a tag (e.g. ‘weather-data-demo’) to the service.
      4. Scroll down and select for pricing the Lite plan.
    5. Click Configure to configure the service.
    6. On the Configure service instance page:
      1. Keep the default for the Hardware configuration.
      2. Keep the default of 1 for Number of compute nodes.
      3. Keep the default of ‘Spark and Hive’ for the Software package.
      4. Click Create.
    7. Close the dialog that informs you that the service creation may take some time.

     

    You do not need to wait for the service to be fully provisioned but can continue with the next steps right away. The service will not be needed until you start using Apache Spark in section 7 of this recipe.

     

    Create IBM Watson Studio Service

    To create the Watson Studio service, do the following:

    1. Select the IBM Cloud Catalog.
    2. Enter ‘studio’ as search string.
    3. Select the ‘Watson Studio’ service.
      03.02-Create-Watson-Studio-Service
    4. On the Watson Studio page:
      1. Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
      2. As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’).
      3. Add a tag (e.g. ‘weather-data-demo’) to the service.
      4. Scroll down and select for pricing the Lite plan.
    5. Click Create to create the service.
    6. Wait until the service has been created.
    7. On the dashboard of the service, select Get Started.
    8. This will take you to the Home Page showing you commands for creating a project and searching a Catalog.

     

    If you already have an account for IBM Watson Studio you can of course decide to use that service. You will then need to log into the service and optionally change the region to the region used for the other services of this recipe (or preferably a region that is close to the location of those services).

     

    Create IBM Watson Studio Project

    Next you will need to create an IBM Watson Studio project. Go to the dashboard of IBM Watson Studio and do the following:

    1. Click Create a project.
    2. On the next page titled ‘Create a Project’, select the Standard Project template.
    3. Click Create Project.
    4. In the New Project dialog,
      1. Enter a name for the project such as “Weather Data Demo”.
      2. Provide a short description of the project.
      3. Select the Object Storage Instance where you want project artefacts to be stored.
      4. Click Create.
    5. Wait until the project has been created.

     

    Watson Studio will now provide you with an overview of the project, as well as tabs for viewing assets, environments, bookmarks, collaborators and settings that allow you to add services, access tokens and integrations with e.g. GitHub.

     

    Upload Lightning Data Set to IBM Watson Studio Project

    To upload the lightning data set and the notebook do the following:

    1. Download the ‘Lightning_global_1min_201812041030.json‘ data set from the GitHub project weather-data-demo.
    2. Select the Find and add data command in the main toolbar of Watson Studio (located in the upper right corner).
    3. Drag the file named ‘Lightning_global_1min_201812041030.json’ from your local file system to the pane with the title ‘drop files here’.
    4. Select the command Add to project.

     

    If you go back to the IBM Cloud dashboard and select Resource List from the Navigation Menu in the top left corner, you should now be able to view the resources tagged with ‘weather-data-demo’. The list should (eventually) show you the following services:

    03.05-Resource-list-1

    The Analytics Engine will appear in the list as well, as soon as it has been completely provisioned.

  4. Transformations using IBM SQL Query

    In the previous recipe we used IBM SQL Query in a very simple example to query all the weather observations stored in a large number of JSON files. In this recipe we shall continue with a more refined and realistic example by aggregating the weather data by observation location, date and time of day (night/day) and then computing statistical features such as minimum, maximum and average temperatures.

    The SQL Query that we will use is defined as follows:

    WITH weatherdata as (
        SELECT class, day_ind, CAST(from_unixtime(valid_time_gmt) AS DATE) as obs_date, obs_name, pressure, temp, wdir, wspd
        FROM cos://eu-de/weather-data-demo/* STORED AS JSON o
        WHERE o.class = "observation"
       )
    SELECT
       weatherdata.obs_name,
       weatherdata.obs_date,
       weatherdata.day_ind,
       ROUND(MIN(weatherdata.temp),1) AS mintemp,
       ROUND(AVG(weatherdata.temp),1) AS avgtemp,
       ROUND(MAX(weatherdata.temp),1) AS maxtemp
    FROM weatherdata
    GROUP BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
    ORDER BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
    INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/ STORED AS CSV

     The query will, when executed, return results as shown in the screenshot below:

    04.01-SQL-Query

    The SELECT statement in the WITH clause is similar to the query used in the previous recipe and will return a consolidated table of weather observations from the bucket named ‘weather-data-demo’. The next SELECT statement groups the information by observation location, date and time of day and then computes the average, maximum and minimum temperatures for each group. The last part of the query outputs the results to the bucket that contains the IBM Watson Studio project artefacts. This way we can easily import the result into the project once it has been generated. For more details regarding the supported SQL syntax, please see the SQL Reference in the online documentation for the IBM SQL Query service.

    There is a crucial point worth mentioning, which has to do with the URL for the IBM Cloud Object Storage endpoint. The source bucket that contains the weather data (‘weather-data-demo’) and the target bucket that holds the project artefacts (‘weatherdatademo-donotdelete…’) have different endpoint locations, although both buckets share the same IBM Cloud Object Storage instance. The FROM clause for the weather data demo bucket uses the endpoint location ‘eu-de’, whereas the INTO clause for the bucket holding the project artefacts uses the location ‘eu-geo’:

    04.02-COS-Buckets

    The reason for this is simple. The weather data demo bucket was created as a regional bucket located in the IBM Cloud availability zones in Frankfurt, whereas the bucket containing the project artefacts has been defined as a cross region bucket replicated across the IBM Cloud data centres in Europe. IBM Watson Studio uses cross region buckets to store project artefacts in order to achieve a higher level of availability. Usually this is transparent to the user, but the difference becomes important when defining queries or data source connections inside IBM Watson Studio that access the bucket: if you use the wrong location, e.g. ‘eu-geo’ for the weather-data-demo bucket rather than ‘eu-de’, the query or the connection will fail and you will get an error.

    To run the query, do the following:

    1. Open the dashboard for your IBM Cloud Object Storage service.
    2. Select Buckets in the toolbar to the left and note (copy) name and location of the two relevant buckets.
    3. Open the dashboard of your IBM SQL Query service.
    4. Copy the query from the recipe into the editor of the service.
    5. Modify the FROM clause and the INTO clause so that the names and locations match your buckets.
    6. For the first test run, prefix the INTO clause with ‘--’ to comment it out. This way you avoid cluttering up your project workspace with erroneous results; they will be written to the default location instead (which is us-south).
    7. Select the Run command to run the query.
    8. Check that the query has returned meaningful results.
    9. If this is the case, delete the comment marker in the INTO clause and rerun the query to send the results to your project bucket.

     

    Once the query has run successfully, it is time to import the results into the project. To achieve this, do the following:

    1. Open the dashboard for IBM Watson Studio.
    2. Open the weather data demo project that you created.
    3. Select Assets.
      04.03-Add-as-data-asset
    4. Select the Find and add data button in the top right corner (labelled with 0 and 1).
    5. Select the Files tab.
    6. You should now see a number of files that have been generated by the SQL Query jobs.
    7. Select the last file.
    8. Select the menu “…” as shown on the screenshot.
    9. Select the Add data asset menu item in the resulting pull-down menu.
    10. Select Apply in the resulting dialog named ‘Apply changes’.
      04.04-Added-CSV-File

     

    This will add the file. In a last step you can then change the name of the file to something more meaningful than the cryptic generated file name:

    1. Select the name of the generated CSV file in the list of data assets. This will open up the file and render its content.
    2. In the upper left part of the screen click the pencil to the right of the generated file name.
      04.05-File-content
    3. Select the part of the name before ‘.csv’ and delete the text.
    4. Change the name of the data asset to ‘weather-observation-temperatures.csv’
    5. Select the Apply command.
    6. In the upper left part of the screen select the link named Weather Data Demo to take you back to your list of project assets.

     

    You have now transformed and aggregated the weather data to finally add the generated result file as a data asset to your project. This file can conveniently be used e.g. as input to a dashboard, which we will look into in the next recipe.

    04.06-Project-Assets

     

    As a last step of this section we will also need to generate a CSV file containing the detailed weather observations. Basically, this means re-running the query that was used in the previous recipe, but with the addition that the data is this time stored in the bucket that contains the project artefacts. This file will then be used in the next recipe for generating statistical diagrams using matplotlib. This time you will have to use the query from Part 2:

    The SQL Query that we will use is defined as follows:

    SELECT
       class,
       clds,
       day_ind,
       expire_time_gmt,
       valid_time_gmt,
       CAST(from_unixtime(valid_time_gmt) AS DATE) as date,
       obs_name,
       pressure,
       temp,
       uv_desc,
       wdir,
       wdir_cardinal,
       wspd,
       wx_phrase
    FROM cos://eu-de/weather-data-demo/* STORED AS JSON o WHERE o.class = "observation"
    ORDER BY valid_time_gmt DESC
    INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/weather-observations.csv STORED AS CSV

     

    You will need to go through the following steps – very similar to what you have already done so far to generate ‘weather-observation-temperatures.csv’:

    1. Copy, paste and run the above query using IBM SQL Query, but for the URLs in the FROM and INTO clauses use the URLs for the bucket containing the weather data and the bucket containing your project artefacts, respectively.
    2. Add the generated CSV file as a data asset to your project in the same way as you added ‘weather-observation-temperatures.csv’.
    3. Rename the added file to ‘weather-observations.csv’.
    4. Open the data asset and view the results.

     

    04.07-Weather-Obsaervation-Data

  5. Transformations using Jupyter Notebooks and IBM SQL Query API

    In this section of the recipe we shall demonstrate how to run the SQL query using API calls. For this purpose we will reuse and configure an existing notebook named ‘IBM Cloud SQL Query’ that is available in the IBM Watson Studio Community. IBM Watson Studio comes with a fairly comprehensive set of assets that are shared with the community of Watson Studio users. These assets amount to articles, data sets, notebooks, papers and tutorials.

    To import the required notebook into your project, do the following:

    1. Select the Navigation Menu of IBM Watson Studio in the upper left corner of the dashboard.
    2. Select the menu item named Community from the pull-down menu. This will take you to the Community page:
    3. Enter the search string ‘SQL Query’. This will provide you with icons for a number of various artefacts.
      05.01-Community-Artefacts
    4. Select All Filters to filter the result set further.
    5. Select Notebook to just show the Jupyter notebooks.
    6. Select the notebook named ‘IBM Cloud SQL Query’ (see screenshot).
    7. In the notebook, click the Copy command in the upper right part of the toolbar to copy the notebook into your project.
    8. In the ‘New Notebook’ dialog,
      1. Select your weather data demo project as target project.
      2. Select the Free runtime system for Python.
      3. Click Create Notebook.

     

    You have now imported the notebook into your project. However, the notebook will need to be modified as follows in order to get it to work with weather data:

    1. Provide the credentials for the COS bucket containing the weather data – including API Key and Cloud Resource Name.
    2. Insert project token for the COS project (target) bucket, which is needed to store the resulting artefact of the query in the project using API calls.
    3. Provide a definition for the SQL query to be submitted.
    4. Delete superfluous cells in the notebook.
       

    The notebook starts in section 1 by installing and importing libraries needed to configure and run the SQL query. Make sure that the libraries get installed by doing the following:

    1. Select the first code cell in section 1 that installs the ‘pyarrow’ and ‘sqlparse’ libraries.
    2. Click the Run Cell command in the toolbar and wait for the import to complete (it is complete when the cell marker changes from * to a number).
    3. Run the two remaining cells in the notebook.

     

    05.03-Notebook-Setup-Libraries

    Section 2 of the notebook titled ‘Configure SQL Query’ sets credentials and other access information that are needed for obtaining access to IBM Cloud Object Storage and IBM SQL Query. Follow the instructions in the notebook with a couple of modifications:

    1. For the API Key:
      1. Invoke the command Manage > Access (IAM).
      2. Select IBM Cloud API Keys in the toolbar to the left.
      3. Invoke the Create an IBM Cloud API Key command.
      4. In the ‘Create API Key’ dialog enter a name for the key, e.g. ‘Weather Data Demo Key’ and click Create.
      5. In the ‘API Key successfully created’ dialog, copy and download the key. Then close the dialog window.
      6. Paste the key into the notebook.
    2. For the IBM SQL Query Cloud Resource Name (CRN):
      1. Go to the IBM Cloud dashboard and select the resource lists.
      2. Filter the resources using the tag ‘weather-data-demo’.
      3. Select your instance of IBM SQL Query and copy the CRN from the dashboard of the service.
      4. Paste the CRN into the notebook so that it is assigned to the variable ‘instancecrn’.
    3. For the SQL COS Endpoint:
      1. Open the dashboard for your IBM Cloud Object Storage service and select the Buckets tab.
      2. For the bucket that contains your weather data, invoke the Bucket Actions > SQL URL command in the pulldown menu to the right of the bucket.
      3. Copy the value and assign it to a variable named ‘sql_cos_endpoint’ in the notebook as shown above in the screenshot.
    4. Run the cell.

     

    The ‘Configure SQL Query’ section continues with cells that set the project id and project token and, on this basis, compute the name of the bucket holding the project artefacts as well as the target URL for the target project. This information is needed in order to get access to the project bucket when the results of the query are saved:

    05.04-Notebook-Configure-SQL

    To modify the notebook, you will need to obtain and insert the project token into the notebook:

    1. Open a new window for the project, e.g. by pressing CTRL-Left Click on the name (Weather Data Demo) of the project in the upper left part of the dashboard.
    2. Select the Settings tab in for the project in the project dashboard.
    3. Scroll down to the section named Access Tokens.
    4. Invoke the command New token.
      05.06-Create-Project-Token
    5. In the dialog named ‘New token’:
      1. Enter a name for the token, e.g. ‘WeatherDataDemoWriter’.
      2. Select Editor as role.
      3. Click Create.
    6. The project token will now be listed in the section titled Access Tokens.
    7. You can view the token by invoking the Action pull-down menu to the left of the access token and then select the menu item named View.
    8. Go back to your notebook.
    9. Select the cell that contains the heading ‘inserting the project token’.
    10. Invoke More > Insert project token, which will create a hidden cell akin to the selected cell in the screenshot above.
      05.07-Insert-Project-Token
    11. Check that the inserted project access token matches the one in step (7). If not, copy the one from step (7) and replace it with the access token in the notebook.
    12. Run the 3 cells in the screenshot above in the order in which they are listed. When asked for input (e.g. for the ‘apikey’), just leave the text field blank and press Return.

     

    You have now created a project token and the necessary statements in the notebook that will provide the SQL Query API with sufficient credentials to create assets in the project for storing the results of the query.

    The notebook continues in section 3 with Python statements that define and format the SQL query to be submitted:

    05.05-Notebook-Run-SQL

    Section 3 builds the query string, whereas section 4 contains the final statements for executing the SQL query and returning the results in the form of a pandas data frame. In this recipe we shall first use the following simple query definition just to get a confirmation that everything works:

    sql = 'SELECT class, day_ind, valid_time_gmt, obs_name, pressure, temp, uv_desc, wdir, wdir_cardinal, wspd, wx_phrase ' + \
          ' FROM ' + sql_cos_endpoint + \
          '/* STORED AS JSON o WHERE o.class = "observation"'
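    For orientation, the execution cells in section 4 of the notebook roughly follow the pattern below. This is only a sketch based on the ibmcloudsql library imported in section 1; the variable names ('apikey', 'instancecrn', 'targeturl') are the ones configured earlier, and the exact calls in the community notebook may differ slightly:

    import ibmcloudsql

    # Create a client for the SQL Query instance and point it at the project bucket.
    sqlClient = ibmcloudsql.SQLQuery(apikey, instancecrn, target_cos_url=targeturl)

    # Submit the query, wait for the job to finish and fetch the result as a pandas DataFrame.
    result_df = sqlClient.run_sql(sql)
    result_df.head(10)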

    Do the following to change the query and then run it:

    1. Replace the SQL query in section 3 of the notebook with the one above.
    2. Run the cell.
    3. Press Return if asked to enter the query.
    4. Select the first cell in section 4.
    5. Invoke Run and wait for the execution to complete (this may take some time).
    6. Invoke Run in the next cell to see the first 10 records.

     

    The results returned should look similar to the following:

    05.08-Results-of-Query

    We will only use the first part of the notebook, so you can delete all cells following the first two cells in section 4, down to section 8 of the notebook. The easiest way to do this is to use keyboard shortcuts. You can get information on the available shortcuts from the Help menu by invoking Help > Keyboard Shortcuts. To delete the superfluous content:

    1. Set the editor from insert mode (where cells are marked green in the left margin) into command mode (where cells are marked blue in the left margin) by pressing Esc.
    2. Scroll down and select cell 12 (this is the cell after the cell with the orange text ‘Orders by Employee’).
    3. Then cut it by pressing X.
    4. Continue until all superfluous cells are deleted and you have reached the section titled ‘Next Steps’.
    5. Invoke File > Save As to save the file.
    6. Provide a new name for the notebook, e.g. ‘Weather Data ETL – IBM Cloud SQL Query’.

     

    You can now change the query and try the one that was defined in section 4, which aggregates weather observation temperatures by location, date and time of day. Replace the code cell in section 3 with the following to change the definition of the query:

    sql = "WITH wd AS (SELECT day_ind, CAST(from_unixtime(valid_time_gmt) AS DATE) as obs_date, obs_name, temp \n" + \
    "FROM " + sql_cos_endpoint + "/* STORED AS JSON o WHERE o.class = 'observation') \n " + \
    "SELECT wd.obs_name, wd.obs_date, wd.day_ind, ROUND(MIN(wd.temp),1) AS mintemp, ROUND(AVG(wd.temp),1) AS avgtemp, ROUND(MAX(wd.temp),1) AS maxtemp \n " + \
    "FROM wd \n " + \
    "GROUP BY wd.obs_name, wd.obs_date, wd.day_ind \n " + \
    "ORDER BY wd.obs_name, wd.obs_date, wd.day_ind \n " + \
    "INTO {}result/ STORED AS CSV".format(targeturl)
    print('\nYour SQL statement is:\n')
    print(sql)

     

    Then run all cells in sections 3 and 4 to get the aggregated figures:

    05.09-Aggregation-Query

    Rename the notebook to ‘Weather Data ETL – IBM Cloud SQL Query V2’ and save it.

    A minimal version of the notebook titled ‘ETL Weather Data Using IBM Cloud SQL Query 2‘ that does not format the query or ask the user for inputs has been uploaded to GitHub. To use this notebook, you will need to provide your project credentials and bucket name in cell 3 and the project token in cell 4.

  6. Transformations using Jupyter Notebooks and Pandas

    In this section you will work with a Jupyter notebook that extracts lightning data from the Weather Company, transforms it from a multi-line JSON representation into a tabular representation as a pandas data frame, and finally loads it into a CSV file in a format that can be used for analytics. Rather than fetching the data from the Weather Company API directly in real time (which would require a special API key), you will work with the file that was uploaded to IBM Watson Studio in section 3.

    The data in the input file (named ‘Lightning_global_1min_201812041030.json’) comes in a multi-line JSON representation. If you select the file in IBM Watson Studio it will be opened, and you can investigate the structure in more detail:

    06.01-Input-File-Structure

    The measures such as latitude, longitude and intensity are represented as single properties with an associated list that contains the values for all observations. Each single observation is delimited by a comma and a newline. In principle it would be possible to turn such a structure into a collection of observations in table format, with one row for each observation and one column for each property, using an SQL query of the form:

    SELECT explode(lightning.intensity) intensity
    FROM cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/Lightning_global_1min_201812041030.json STORED AS JSON

    A more comprehensive example of a query using the explode function, which has been used as the basis for defining the query above, can be found in the sample queries of IBM SQL Query (see ‘Samples > CSV and JSON Parsing Queries > Advanced JSON flattening‘).

    However, there is one caveat here: for the time being, IBM SQL Query does not accept multi-line JSON structures of the form that the file is in. There are several solutions to the problem. One could for example use the command line utility jq with the -c option (e.g. jq -c ".[]"), or a conversion using functions of the Python json package as described in ‘Converting JSON into newline delimited JSON in Python‘. A third option would be to use pandas, which in general comes in very handy for transforming data sets that are not very large – which is actually the case for the 1.3 MB file that we are dealing with just now.
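    To illustrate the second option, the sketch below converts the pretty-printed JSON document into newline-delimited JSON using only the Python json module. The property names ('latitude', 'longitude', 'intensity') are assumptions based on the structure described above and must be adapted to the actual file:

    import json

    with open('Lightning_global_1min_201812041030.json') as f:
        doc = json.load(f)

    # Write one JSON object per observation, one observation per line.
    keys = ['latitude', 'longitude', 'intensity']   # assumed property names
    with open('Lightning_global_1min_ndjson.json', 'w') as out:
        for values in zip(*(doc[k] for k in keys)):
            out.write(json.dumps(dict(zip(keys, values))) + '\n')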

    We shall therefore convert the file using pandas and a Jupyter notebook for Python that you can download from GitHub. You will need to go through the following steps:

    1. Add the notebook on GitHub to the project.
    2. Insert pandas commands for reading the data set into your notebook.
    3. Run the cells in the notebook in the given order and observe the effects.

     

    To create and download the notebook from GitHub do the following in IBM Watson Studio:

    1. Select the command Add to project from the toolbar.
    2. Select Notebook in the ‘Choose asset type’ dialog.
    3. Select From URL in the ‘New Notebook’ dialog.
    4. Copy the URL ‘https://github.com/EinarKarlsen/weather-data-demo/blob/master/Weather%20Lightning%20Data%20ETL%20-%20Pandas.ipynb‘ to the URL Field. This should set the name of the notebook as well (‘Weather Lightning Data ETL – Pandas’).
    5. In the Select runtime field, select the runtime ‘Default Python 3.5 Free’ which provides a runtime free of cost.
    6. Click Create Notebook.

     

    Once the notebook has been imported, you will initially need to insert commands for reading the lightning data set in your project into the first cell of the notebook, which is empty.

    06.03-Insert-Pandas-Data-Frame

    Fortunately, this is supported by commands of IBM Watson Studio:

    1. Select the first cell in the notebook.
    2. Select the Find and add data command in the toolbar
    3. Select Insert to code > Insert pandas DataFrame for the data set containing the lightning data. This will insert all the code needed to read the data.
    4. Uncomment the last two lines in the first cell that reads the data frame and assigns it to a variable.
    5. In the next cell, assign the variables ‘cos’, ‘wsbucket’ and ‘df1’ to the relevant values used in cell 1.
    6. Invoke File > Save to save the changes.
    7. Run the first 4 cells to extract the information and print out some information about the structure of the data set.

     

    Having imported the data, the notebook outlines the structure of the dataset:

    06.04-Extracted-File-Properties

    This data will need to be transformed into a format where the attributes of interest are represented as columns and where each observation is represented as a row. This is easily done in pandas by constructing a new data frame:

    06.05-Transform-Data-Set

    Run cells 5 and 6 to transform the data and print out the first 5 observations in the new data frame.

    Having transformed the data, we can save it as a CSV file in IBM Cloud Object Storage for later use. The data frame is first written to CSV using the ‘to_csv’ command. In a next step, this file is then uploaded to the project:

    06.06-Load-to-CSV
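    For orientation, cells 5-10 roughly amount to the following sketch: build a tabular data frame from the per-property lists, write it to a CSV file and upload that file to the project bucket. The column names and the variables 'df1', 'cos' and 'wsbucket' mirror the earlier cells, but the exact structure of 'df1' depends on the generated insert-to-code cell and may require adaptation:

    import pandas as pd

    # One column per attribute of interest, one row per observation (assumed structure).
    lightning_df = pd.DataFrame({
        'latitude':  df1['latitude'],
        'longitude': df1['longitude'],
        'intensity': df1['intensity'],
    })
    lightning_df.to_csv('lightning-data.csv', index=False)                  # write locally in the notebook runtime
    cos.upload_file('lightning-data.csv', wsbucket, 'lightning-data.csv')   # upload to the project bucket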

    To generate the CSV file and import it into the project do the following:

    1. Run the cells 7-10 to create and upload the CSV file.
    2. Select the Assets tab for the project in IBM Watson Studio.
    3. Select the Find and add data button in the top right corner (labelled with 0 and 1).
    4. Select the Files tab.
    5. Enter ‘lightning’ in the Find in storage text field.
    6. Select the file named ‘lightning-data.csv’.
    7. Select the menu “…” as shown on the screenshot.
    8. Select the Add data asset menu item in the resulting pull-down menu.
    9. Select Apply in the resulting dialog named ‘Apply changes’.
      06.08-Add-to-Project

    For additional information on how to use notebooks to work with COS see for example the notebook ‘Working With IBM Cloud Object Storage In Python‘ and the online documentation ‘Using Python functions to work with Cloud Object Storage‘. The uploading and downloading of files is touched upon in the blog ‘How to Upload/Download Files to/from Notebook in my Local machine‘ by Charles Gomes.

  7. Transformation using Jupyter Notebooks and Apache Spark

    In this section of the recipe we shall again deal with weather observation data rather than lightning data. Suppose that we would like to aggregate the data just like in section 4 of the recipe, but rather than storing the data in a CSV file (or Parquet file), we would like to store it in a PostgreSQL database (or some other kind of database not currently supported as a target for IBM SQL Query). A simple solution would be to do it in two steps: let IBM SQL Query aggregate and store the data in an intermediate CSV file and then, in a second step, let another program in the ETL pipeline read the CSV file and store the information in PostgreSQL. The second step can be achieved with a few (around 10) lines of code using pandas, as sketched below.
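    A minimal sketch of that second step follows; it assumes the aggregated CSV file has been downloaded to the notebook runtime, and the connection string is a placeholder for the credentials of your PostgreSQL instance:

    import pandas as pd
    from sqlalchemy import create_engine

    # Read the CSV produced by IBM SQL Query and write it to a PostgreSQL table.
    df = pd.read_csv('weather-observation-temperatures.csv')
    engine = create_engine('postgresql://user:password@host:port/ibmclouddb')   # placeholder endpoint
    df.to_sql('weather_observation_temperatures', engine, if_exists='replace', index=False)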

    However, in situations where the transformations to be performed are very complex, calling e.g. for a solution expressed in a procedural or functional language rather than SQL, other technologies like Apache Spark come in very handy – using either Apache Spark RDDs or data frames. We shall therefore briefly show how to use Apache Spark to implement the ETL process, but for simplicity we will keep the transformation in this example so simple that we can reuse the SQL query from section 4 and only change the target from a CSV file to a PostgreSQL database.

    In this section of the recipe you will need to create a new Jupyter notebook using a Spark for Python runtime system. The code for extracting the data using Apache Spark can be generated by IBM Watson Studio, provided that there is a data connection and data asset available for the data in the project. The connection to IBM Cloud Object Storage and a data asset in the weather data bucket must therefore be established first. Once this is done it is quite easy to create a Spark SQL data frame, submit the query and get hold of the resulting aggregated data. The loading of the data into the PostgreSQL database will finally be done using the SQLAlchemy toolkit.

    You will consequently need to go through the following steps:

    1. Create a connection to the COS weather data bucket.
    2. Add a (random) data set in that bucket to your project as a data asset.
    3. Download the notebook for aggregating weather observation data from GitHub.
    4. Add Apache Spark Python code to the notebook to read the weather observations in your weather data bucket.
    5. Run the commands for loading and transforming the weather data.
    6. Add URL endpoint (including credentials) for your instance of the PostgreSQL database.
    7. Run the commands that save the aggregated data frame to a database table in PostgreSQL.

     

    To create the data connection and connected data set do the following:

    1. Select the command Add to project from the toolbar.
    2. Select Connection in the ‘Choose asset type’ dialog.
    3. Select Cloud Object Storage in the New Connection dialog.
    4. In the New Connection Dialog for IBM Cloud Object Storage.
      07.01-Create-Connection
      1. Enter a name for the connection.
      2. For the Login URL paste in the endpoint for your weather data bucket (you can get this information in the endpoint section of the dashboard for your IBM Cloud Object Storage service, but make sure that you select the correct resiliency and location settings).
      3. For the API Key paste in the API key that you created for the bucket in the previous recipe (you can find this information in the credentials section of your IBM Cloud Object Storage service, but be sure to use the credentials that you created for the weather data bucket; in our case we named the credentials ‘WeatherDataWriter’).
      4. Click Create to create the new connection.
    5. In IBM Watson Studio you should now be able to see the new connection in the list of assets.

     

    Continue by adding a connected data set to the project:

    1. Select the command Add to project from the toolbar.
    2. Select Connected data in the ‘Choose asset type’ dialog.
    3. Click Select Source to select the source.
    4. Expand the selection until you have selected one of the weather observation files.
      07.02-Add-connected-data
    5. Click Select to select the object file.
    6. In the Add Asset from Connection dialog, enter a name for the asset (e.g. ‘COSWeatherDataObservationFile’).
    7. Invoke the Select command.
    8. In IBM Watson Studio you should now be able to see the new connected data in the list of assets.
      07.03-Assets

     

    To create and download the notebook from GitHub do the following:

    1. Select the command Add to project from the toolbar.
    2. Select Notebook in the ‘Choose asset type’ dialog.
    3. Select From URL in the New Notebook dialog.
    4. Copy the URL ‘https://github.com/EinarKarlsen/weather-data-demo/blob/master/Weather%20Temperature%20ETL%20-%20Spark%20SQL.ipynb‘ to the URL Field.
    5. This should set the name of the notebook as well (‘Weather Temperature ETL – Spark SQL’). If not, enter a name for the notebook.
    6. In the Select runtime field, select the runtime ‘Python 3.5 with Spark’.
    7. Click Create Notebook.

     

    Once the notebook has been imported, you will initially need to insert commands for reading the weather observation data into the first cell of the notebook, which is empty.

    07.04-Read-Weather-Data

     Fortunately, this is supported by commands of IBM Watson Studio:

    1. Select the first cell in the notebook.
    2. Select the Find and add data command in the toolbar
    3. Select Insert to code > Insert SparkSession setup for the data set containing the weather observations. This will insert all the code needed to read the single file (a rough sketch of the resulting read code is shown after this list).
    4. Change the COS URL in the last line of cell 1 so that the URL designates all files in the weather data bucket. This basically means that you will need to replace the designator for a single object with ‘/*’ (see screenshot above).
    5. In the next cell, assign the variables ‘cos’, ‘wsbucket’ and ‘df1’ to the relevant values used in cell 1.
    6. Run the first 3 cells to import the data and print out the schema.
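    A rough sketch of what the generated cell amounts to after the ‘/*’ change is shown below. The credential and configuration code produced by the insert-to-code command (typically based on the ibmos2spark/stocator helpers) is omitted, and the cos:// URL is a placeholder for the one generated for your bucket and connection:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read all objects in the weather data bucket as JSON (placeholder URL).
    cos_url = "cos://weather-data-demo.<service-name>/*"
    df_data_1 = spark.read.json(cos_url)
    df_data_1.printSchema()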

     

    The schema shows the various properties found in the files that were read. Notice that there may be corrupted records as well, e.g. files that can’t be interpreted as JSON.

    07.05-Data-Schema

    Next the data will need to be aggregated and grouped by location, date and time of day to compute minimum, average and maximum temperatures:

    07.06-Transforming-Weather-Data

    Notice that the query needs to compute the date from a timestamp (as in section 4) but that the expression using Spark SQL is slightly different from the expression used by IBM SQL Query. 
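    One plausible Spark SQL formulation of the aggregation from section 4 is sketched below; it assumes the data frame 'df_data_1' read above and may differ slightly from the expression used in the actual notebook:

    df_data_1.createOrReplaceTempView("weatherdata")

    aggregated = spark.sql("""
        SELECT obs_name,
               to_date(from_unixtime(valid_time_gmt)) AS obs_date,
               day_ind,
               ROUND(MIN(temp), 1) AS mintemp,
               ROUND(AVG(temp), 1) AS avgtemp,
               ROUND(MAX(temp), 1) AS maxtemp
        FROM weatherdata
        WHERE class = 'observation'
        GROUP BY obs_name, to_date(from_unixtime(valid_time_gmt)), day_ind
        ORDER BY obs_name, obs_date, day_ind
    """)
    aggregated.show(5)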

    Having transformed the data, we can store it in a database rather than in object storage. In this recipe we shall use PostgreSQL as the data warehouse. For more information on how to manage PostgreSQL on IBM Cloud we refer to the online documentation titled ‘About IBM Cloud Databases for PostgreSQL‘.

    There are several libraries that provide access to a PostgreSQL instance from Python. What is needed in this context is a library that can conveniently save a pandas data frame to a database table. How to do this is for example outlined in the YouTube video titled ‘Pandas Tutorial 14: Read Write Data From Database (read_sql, to_sql)‘ and in the online documentation for pandas SQL support. Both use the SQLAlchemy library, which will also be used in this recipe. There is also an existing notebook in the IBM Watson community titled ‘Access PostgreSQL with Python‘ that shows how to interact with PostgreSQL, but this notebook uses psycopg2.

    07.07-Loading-Weather-Data

    The notebook contains cells for setting the PostgreSQL credentials and endpoint, followed by a cell that creates the engine. Saving to a database is done with the data frame operation named ‘to_sql’, which takes the database table name and the engine as parameters. Additional parameters then specify that the existing table content (if any) shall be replaced, that the data frame index shall be ignored when saving and that the data shall be sent to PostgreSQL with a given chunk size. For testing purposes, we then follow up by reading the content of the database table and printing the first 5 records.
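    The load step described above can be sketched as follows (the 'aggregated' data frame is the one from the Spark SQL sketch above, and the endpoint variable and table name are placeholders; the real notebook assembles the connection string from the service credentials as described in the next steps):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(postgreSQLEndPoint)   # e.g. 'postgresql://user:password@host:port/ibmclouddb'

    temps = aggregated.toPandas()                # Spark DataFrame -> pandas data frame
    temps.to_sql('weather_temperatures', engine,
                 if_exists='replace',            # replace existing table content, if any
                 index=False,                    # do not store the data frame index
                 chunksize=500)                  # send the rows in chunks

    # Read the table back and show the first records for verification.
    pd.read_sql('SELECT * FROM weather_temperatures', engine).head()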

    To get the notebook to run on your data do the following:

    1. In IBM Cloud open up the dashboard for your PostgreSQL service.
    2. Scroll down and copy the ‘Public postgres endpoint’ to your notebook and assign it to the variable ‘postgreSQLEndPoint’. This URL contains placeholders for the user name and the password for the service.
    3. Remove (comment out) the ‘?sslmode=verify-full’ parameter at the end of the URL.
    4. Select the Service credentials tab.
    5. Open the credentials that you created in section 3 (or alternatively create new ones).
    6. Copy the variables ‘postgres.authentication.username’ and ‘postgres.authentication.password’ and insert them in the endpoint URL in place of the placeholders.
    7. Run cells 7-13, which load the data into PostgreSQL.

     

  8. Conclusion

    In this recipe we have shown various ways of transforming weather data using IBM SQL Query, pandas and Apache Spark SQL (running under the hood of IBM Analytics Engine).

    IBM SQL Query has the advantage that it is a lightweight, easy to use and cost-effective solution. It does not require a runtime cluster to be configured, which may consume resources even when idle, but adopts a paradigm more akin to that of Functions as a Service. Queries that are conveniently expressed in SQL are good use cases for IBM SQL Query, provided that the input and output source formats are supported. Moreover, it is old news that some computational problems are very hard to express using SQL, whereas they can be easily formulated using either procedural or functional languages. In such cases pandas can conveniently be used as a lightweight and cost-effective alternative to IBM SQL Query, since it only requires a runtime system for Python. The combination of Python and pandas is a proven solution for data wrangling purposes, but with the limitation that it does not address big data use cases. So the fallback position, either in case of complex transformations, special input and output sources or big data, would be to use Apache Spark – be it Apache Spark RDDs, data frames and/or Spark SQL. The disadvantage here is that it requires a cluster of compute nodes to be available, i.e. unlike IBM SQL Query, you will need to provision a cluster first and then run the Spark jobs once this has been done. There is a free plan available for IBM Analytics Engine that can be used for a limited time for test purposes, but in general one must expect – in the context of cloud offerings – that a cost will be incurred for the compute and storage resources set aside for the cluster, whether it is being used or not.

    The lessons learned from this recipe also have an impact on the architecture of the solution. It would actually be relevant to go back to IBM Functions and change the definition of the sequences dealing with multi-line JSON data structures from The Weather Company by including actions that transform the ingested data before it is sent to IBM Event Streams for storage in IBM Cloud Object Storage. If this were carefully done, the data would be stored in IBM Cloud Object Storage in the right format, IBM SQL Query could work directly on the data and there would be no need to transform it using e.g. notebooks and pandas. That, of course, would mean changing from an ELT approach to one looking more like ETL.

    It is also worth mentioning that it is possible to get IBM SQL Query to work with IBM Functions so that queries are executed in the context of actions. For more information, please see the ‘SQL Query Package‘ project on GitHub. This implies another relevant change to the architecture: we could trigger the aggregation jobs currently implemented with IBM SQL Query using IBM Functions.

  9. Acknowledgement

    The Jupyter notebook named ‘IBM Cloud SQL Query’ represents work by Torsten Steinbach from the IBM Cloud Data Platform team. I would also like to thank IBM Cloud support (Witchaya Keosung and Daniel Pittner) for pointing me to the workarounds for transforming the lightning data from a multi-line to a single-line JSON file using e.g. jq.
