Build machine learning models with or without code in a collaborative data science environment

Build models and visualize the results using IBM Cloud Pak for Data Jupyter Notebooks, AutoAI, and Embedded Dashboard on Amazon Web Services (AWS) Cloud

By

Manoj Jahgirdar,

Sharath Kumar RK

In this tutorial, you will build time-series machine learning models and visualize the results using IBM Cloud Pak for Data Jupyter Notebooks, AutoAI, and Embedded Dashboard on Amazon Web Services (AWS) Cloud. You will learn both code and no-code approaches to building models and visualizing the results.

When you have completed this tutorial, you will understand how to:

  • Build a state-of-the-art Long Short-Term Memory (LSTM) prediction model using IBM Cloud Pak for Data Jupyter Notebook
  • Visualize the actual vs. predicted values in IBM Cloud Pak for Data Cognos Dashboard Embedded
  • Build and compare different predictive models with the no-code approach in IBM Cloud Pak for Data using Watson AutoAI experiments

Flow

Flow diagram

  1. Pre-processed datasets are loaded into an Amazon S3 bucket.
  2. The datasets from the S3 bucket are read in Jupyter Notebooks.
  3. Different models are built and evaluated in Jupyter Notebooks and the final prediction data is returned to the S3 bucket.
  4. The datasets from the S3 bucket are copied into Watson Studio Project and loaded into AutoAI. Different models are built and compared in AutoAI with no code.
  5. The prediction data that's produced by the Jupyter Notebook models and stored in the S3 bucket is read by Cognos Dashboard Embedded to visualize the data in the form of an interactive dashboard.

Prerequisites

  1. Sign up for an AWS account.
  2. Deploy IBM Cloud Pak for Data 4.x on AWS.

Estimated time

Completing this tutorial should take about 1 hour.

Video

Watch this video for a brief overview of the concepts presented in this tutorial.

Steps

1. Set up an S3 bucket

1.1. Create an S3 bucket in AWS

Sign in to the AWS Management Console and open the Amazon S3 console.

1.2. Upload data to the S3 bucket
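If you prefer to script this step instead of using the console, the bucket creation and upload can be sketched with the AWS SDK for Python (boto3). The bucket name, region, and the assumption that your AWS credentials are already configured locally are placeholders for your own setup:

```python
DATASETS = [
    "ts-brussels-grouped.csv",   # Brussels time series
    "ts-wallonia-grouped.csv",   # Wallonia time series
    "RI-data-ML.csv",            # risk-index training data
]

def upload_datasets(bucket, files=DATASETS, region="us-east-1"):
    """Create the bucket (if it does not exist yet) and upload the pre-processed CSVs."""
    import boto3  # AWS SDK for Python; assumes credentials are configured locally
    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(Bucket=bucket)  # regions other than us-east-1 need a CreateBucketConfiguration
    for name in files:
        s3.upload_file(name, bucket, name)  # object key mirrors the local file name
    return [f"s3://{bucket}/{name}" for name in files]
```

Calling `upload_datasets("my-cpd-covid-bucket")` would mirror the console steps above for the three datasets used later in the tutorial.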

2. Set up a project in IBM Cloud Pak for Data

2.1. Create a project

  • To create a project in IBM Cloud Pak for Data, click on the hamburger menu and select All Projects.

    Select All projects

  • Click on New Project.

    • Select project type as Analytics project.
    • Click on Create a project from file.
    • Upload the cpd-project.zip file.
    • Enter a project name and click on Create.
  • After the project has been created, click on View project. You should see the overview of the project as shown below:

    Project overview

  • Click on the Assets tab and you should see Data and Notebooks.

2.2. Create a connection to S3

  • Click on Add to project and select Connection.

  • Select Amazon S3 for the connection type.

    • Enter the credentials to connect to your S3 bucket.
    • Click on Test connection and you should see a connection successful message if you have entered the correct credentials.
    • Click on Create.

      Create a connection to S3

  • When the connection has been created, you should see the connection in the Assets tab under Data assets. With this connection, you can access all the datasets present in your S3 bucket from your Cloud Pak for Data project.
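Behind the scenes, reading a dataset through this connection amounts to streaming a CSV object out of the bucket into a pandas DataFrame. The sketch below shows roughly what the generated "Insert to code" snippet does; the column names in the offline stand-in are illustrative, not the actual dataset schema:

```python
import io
import pandas as pd

def read_s3_csv(bucket, key, access_key, secret_key):
    """Stream a CSV object from S3 into a pandas DataFrame,
    roughly what the generated 'Insert to code' snippet does."""
    import boto3  # assumes the AWS SDK for Python is installed
    s3 = boto3.client("s3", aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    return pd.read_csv(body)

# Offline stand-in for one of the bucket's datasets, to show the shape of the result.
# The column names here are illustrative only:
sample = io.StringIO("DATE,TOTAL_CASES\n2021-01-01,120\n2021-01-02,135\n")
data_df_1 = pd.read_csv(sample)  # same variable name the generated snippet uses
```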

3. Code approach: Build prediction models with IBM Watson Studio

In the code approach, you will learn how to build two types of prediction models in Watson Studio Jupyter Notebooks. As a developer, you have full control over the model's hyperparameters and the training data in this section.

The section is divided into the following sub-sections:

3.1. About the notebooks

  • Click on the Assets tab and you should see the following notebooks:

    • Region-Brussels-LSTM.ipynb
    • Region-Wallonia-LSTM.ipynb
    • Region-All-Decision-Trees.ipynb
  • The LSTM notebooks build models that predict future COVID-19 cases for the Brussels and Wallonia regions, respectively. Both models are trained on the datasets in the S3 bucket, but with different hyperparameters.

  • The LSTM model for the Brussels region is built with the following hyperparameters:

    • train_test_split: 0.70
    • lookback: 30
    • hidden_layers: 2
    • units: 55, 100
    • dropouts: 0.15, 0.15
    • optimizer: adam
    • learning_rate: 0.001 (default)
    • epochs: 25
    • batch_size: 32
  • The LSTM model for the Wallonia region is built with the following hyperparameters:

    • train_test_split: 0.70
    • lookback: 30
    • hidden_layers: 2
    • units: 60, 100
    • dropouts: 0.15, 0.15
    • optimizer: adam
    • learning_rate: 0.001 (default)
    • epochs: 25
    • batch_size: 32
  • In addition, a decision tree notebook is used to build a model to predict the risk index for the Brussels, Flanders, and Wallonia regions.

  • Decision tree models are built with the following hyperparameters:

    • train_test_split: 0.70
    • max_depth: 4
    • min_samples_split: 2
    • min_samples_leaf: 1
    • criterion: entropy
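As a sketch, the Brussels LSTM hyperparameters above map onto a Keras model roughly as follows. The layer layout is an assumption inferred from the listed values (two hidden LSTM layers with per-layer units and dropouts), not the notebooks' exact code:

```python
BRUSSELS_HP = {
    "train_test_split": 0.70,
    "lookback": 30,
    "units": (55, 100),       # two hidden LSTM layers
    "dropouts": (0.15, 0.15),
    "epochs": 25,
    "batch_size": 32,
}

def build_lstm(hp=BRUSSELS_HP, n_features=1):
    """Stack two LSTM layers with dropout, matching the hyperparameters above."""
    from tensorflow.keras.models import Sequential  # assumes TensorFlow/Keras is installed
    from tensorflow.keras.layers import LSTM, Dropout, Dense
    model = Sequential([
        LSTM(hp["units"][0], return_sequences=True,
             input_shape=(hp["lookback"], n_features)),
        Dropout(hp["dropouts"][0]),
        LSTM(hp["units"][1]),
        Dropout(hp["dropouts"][1]),
        Dense(1),  # next-day case count
    ])
    # Adam's default learning rate is 0.001, matching the listed value
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
```

Swapping `units` to `(60, 100)` would give the Wallonia configuration; everything else is identical.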

3.2. Notebook 1: Predict future COVID-19 cases for the Brussels region with the LSTM model

In this section, you work with a popular open source machine learning algorithm, Long Short-Term Memory (LSTM). You use this time-series algorithm to build a model from historical data of total COVID-19 cases, and then use the trained model to predict future COVID-19 cases.

  • Find the Region-Brussels-LSTM.ipynb notebook and click on the edit button to open the notebook in edit mode: Notebook 1 - edit

  • The notebook should look something like this: Notebook 1 - preview

  • Before running the notebook, you need to add the S3 connection to the notebook.

    • Click on the third code cell in the notebook.
    • Click on the find and add data button in the upper right corner of the page.
    • Click on the Connections tab.
    • You should then see your connection variable. Click on Insert to code and select pandas DataFrame.
    • Select the ts-brussels-grouped.csv dataset from the connection variable. Notebook 1: Add the data connection
  • In the generated code snippet, verify that the name of the dataframe is data_df_1.

  • Click on Cell and select Run All to run the notebook. This may take some time. Run the notebook

  • When the notebook is completed, you should see the following in the notebook:

    • Current Trend of COVID-19 cases in Brussels
    • LSTM Model Accuracy
    • LSTM Model Loss
    • LSTM Model Prediction
  • Current Trend of COVID-19 cases in Brussels: This trend is shown in the following graph. Notebook 1 graph: COVID-19 cases in Brussels

  • LSTM Model Accuracy: You'll see that the Root Mean Squared Error (RMSE) values are almost the same for the training and test data, which indicates that the model generalizes well without overfitting or underfitting. Notebook 1: Model accuracy

  • LSTM Model Loss: You should see that the loss curves converge smoothly, with no sign of the vanishing gradient problem, which the LSTM architecture with this configuration mitigates. Notebook 1: Model loss

  • LSTM Model Prediction: As you can see, the model is able to catch the pattern in the data. Notebook 1: Model prediction

  • The following CSV files are generated from the notebook:

    • Brussels.csv: This dataframe contains the historical COVID-19 cases in Brussels.
    • brussels-actualVsPredicted.csv: This dataframe contains the actual and predicted COVID-19 cases in Brussels.
    • brussels-errorEvaluation.csv: This dataframe contains the error evaluation of the model.
    • brussels-next7Prediction.csv: This dataframe contains the predictions for the next 7 days of COVID-19 cases in Brussels.
  • These CSV files are stored in your S3 bucket as well as in the Data Assets in your Cloud Pak for Data project.

Note: These CSV files will be used to visualize the data in Watson Cognos Dashboard Embedded.

When you have successfully completed this section, you can move on to the next one.
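The two hyperparameters that drive the data preparation, the 30-day `lookback` window and the 7-day prediction horizon, can be sketched independently of Keras. Below, `predict_fn` is a placeholder for any trained one-step model; the windowing and the iterative rollout (feeding each prediction back in as input) are the general technique both LSTM notebooks rely on:

```python
import numpy as np

def make_windows(series, lookback=30):
    """Turn a 1-D series into (samples, lookback) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.array(X), np.array(y)

def forecast_next(predict_fn, history, lookback=30, steps=7):
    """Roll the model forward: each prediction is appended and fed back in."""
    window = list(history[-lookback:])
    out = []
    for _ in range(steps):
        nxt = float(predict_fn(np.array(window[-lookback:])))
        out.append(nxt)
        window.append(nxt)
    return out

# Toy demonstration with a dummy "model" that predicts last value + 1:
series = np.arange(40.0)
X, y = make_windows(series, lookback=30)
preds = forecast_next(lambda w: w[-1] + 1, series, lookback=30, steps=7)
```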

3.3. Notebook 2: Predict future COVID-19 cases for the Wallonia region with the LSTM model

As in the previous section, you use the LSTM time-series algorithm to build a model from historical data of total COVID-19 cases, this time for the Wallonia region, and then use the trained model to predict future cases.

  • Find the Region-Wallonia-LSTM.ipynb notebook and click on the edit button to open the notebook in edit mode: Notebook 2 - edit

  • The notebook should look something like this: Notebook 2 - preview

  • Before running the notebook, you need to add the S3 connection to the notebook.

    • Click on the third code cell in the notebook.
    • Click on the find and add data button in the upper right corner of the page.
    • Click on the Connections tab.
    • You should then see your connection variable. Click on Insert to code and select pandas DataFrame.
    • Select the ts-wallonia-grouped.csv dataset from the connection variable. Notebook 2: Add the data connection
  • In the generated code snippet, verify that the name of the dataframe is data_df_1.

  • Click on Cell and select Run All to run the notebook. This may take some time. Notebook 2: Run the notebook

  • When the notebook is completed, you should see the following in the notebook:

    • Current Trend of COVID-19 cases in Wallonia
    • LSTM Model Accuracy
    • LSTM Model Loss
    • LSTM Model Prediction
  • Current Trend of COVID-19 cases in Wallonia: This trend is shown in the graph: Notebook 2 graph: COVID-19 cases in Wallonia

  • LSTM Model Accuracy: You'll see that the Root Mean Squared Error (RMSE) values are almost the same for the training and test data, which indicates that the model generalizes well without overfitting or underfitting. Notebook 2: Model accuracy

  • LSTM Model Loss: You should see that the loss curves converge smoothly, with no sign of the vanishing gradient problem, which the LSTM architecture with this configuration mitigates. Notebook 2: Model loss

  • LSTM Model Prediction: As you can see, the model is able to catch the pattern in the data. Notebook 2: Model prediction

  • The following CSV files are generated from the notebook:

    • Wallonia.csv: This dataframe contains the historical COVID-19 cases in Wallonia.
    • wallonia-actualVsPredicted.csv: This dataframe contains the actual and predicted COVID-19 cases in Wallonia.
    • wallonia-errorEvaluation.csv: This dataframe contains the error evaluation of the model.
    • wallonia-next7Prediction.csv: This dataframe contains the predictions for the next 7 days of COVID-19 cases in Wallonia.
  • These CSV files are stored in your S3 bucket as well as in the Data Assets in your Cloud Pak for Data project.

Note: These CSV files will be used to visualize the data in Watson Cognos Dashboard Embedded.

When you have successfully completed this section, you can move on to the next one.
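The write-back step that both LSTM notebooks perform, storing the generated CSVs in the S3 bucket, can be sketched as follows. The bucket, key, and column names here are placeholders, not the notebooks' exact code:

```python
import io
import pandas as pd

def save_to_s3(df, bucket, key):
    """Serialize a results DataFrame to CSV in memory and put it in the S3 bucket."""
    import boto3  # assumes the AWS SDK for Python is installed and credentials are set
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

# Offline check of the serialization itself, with an illustrative schema:
preds = pd.DataFrame({"DATE": ["2021-06-01"], "PREDICTED_CASES": [42]})
csv_text = preds.to_csv(index=False)
```

A call like `save_to_s3(preds, "my-cpd-covid-bucket", "wallonia-next7Prediction.csv")` would produce one of the files listed above.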

3.4. Notebook 3: Risk Index Prediction with Decision Tree

In this lab exercise, you work with a popular machine learning algorithm, Decision Tree. You use this classification algorithm to build a model from historical data on a region and its total cases, and then use the trained Decision Tree to predict the risk index for that region.

  • Find the Region-All-Decision-Trees.ipynb notebook and click on the edit button to open the notebook in edit mode: Notebook 3 - edit

  • The notebook should look something like this: Notebook 3 - preview

  • Before running the notebook, you need to add the S3 connection to the notebook.

    • Click on the third code cell in the notebook.
    • Click on the find and add data button in the upper right corner of the page.
    • Click on the Connections tab.
    • You should then see your connection variable. Click on Insert to code and select pandas DataFrame.
    • Select the RI-data-ML.csv dataset from the connection variable. Notebook 3: Add the data connection
  • In the generated code snippet, verify that the name of the dataframe is data_df_1.

  • Click on Cell and select Run All to run the notebook. This may take some time. Notebook 3: Run the notebook

  • When the notebook is completed, you should see the following in the notebook:

    • Decision Tree Model Accuracy
    • Decision Tree Visualization
  • Decision Tree Model Accuracy: You can see that the accuracy of the model is 86.63%: Notebook 3: Decision Tree Model Accuracy

  • Decision Tree Visualization: You can see the visualized decision tree in the notebook. Notebook 3: Decision Tree Visualization

You have now successfully completed this section.
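A decision tree with the hyperparameters from section 3.1 can be sketched on synthetic stand-in data. The real notebook trains on RI-data-ML.csv; the feature and label construction below (a region id and a case count, with the risk index rising with case totals) is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Hypothetical stand-in for RI-data-ML.csv: region id (0-2) and daily case totals,
# with the risk index label derived from case-count thresholds.
cases = rng.integers(0, 2000, size=600)
region = rng.integers(0, 3, size=600)
risk = np.digitize(cases, [250, 750, 1500])  # 0 = low ... 3 = very high

X = np.column_stack([region, cases])
X_train, X_test, y_train, y_test = train_test_split(
    X, risk, train_size=0.70, random_state=42)  # matches train_test_split: 0.70

clf = DecisionTreeClassifier(
    max_depth=4,          # hyperparameters listed in section 3.1
    min_samples_split=2,
    min_samples_leaf=1,
    criterion="entropy",
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```

Because the synthetic labels follow clean thresholds, this toy tree scores higher than the 86.63% reported for the real data; the point is only to show how the listed hyperparameters plug into scikit-learn.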

4. No-code approach: Build prediction models with IBM Cloud Pak for Data using AutoAI

This section shows you how to build AI models without any code. After you've created the project in step 2.1, click Add to project in the upper right and select AutoAI experiment as the asset type:

Add AutoAI experiment to project

Create an AutoAI experiment by giving it a name and selecting the environment definition as 8vCPU and 32GB RAM:

Create AutoAI experiment

Add the data file and click on Select from project:

Add the data file

Then select the RI-data-ML.csv file from the project’s data assets:

Select the asset

Select No for Create a time series forecast?, because you are building a multi-class classifier rather than a forecasting model. Select Risk_Index as the option for What do you want to predict? and then click Run experiment.

Run experiment

It may take a couple of minutes to complete the experiment. You should then see Experiment completed on the right side of the screen:

Experiment completed

Review the 8 pipelines generated in the pipeline leaderboard:

Pipeline leaderboard

Select the first pipeline (Rank 1) and then choose the Save as option in the upper right:

Select pipeline

Under Save as, select Model and then click Create:

Save as model

You should then see the Saved model successfully message as below. Click on View in project:

Model saved message

Click on Promote to deployment space:

Promote to deployment space

Under Target space, select Create a new deployment space:

Create a new deployment space

Give the deployment space a name and click Create:

Name the deployment

After a minute or so, the deployment space should be created:

Deployment space created

Next, promote the deployment by clicking on Promote in the bottom right corner:

Promote deployment

You should then see a message stating that the model has been successfully promoted to the deployment space:

Successfully promoted

Click on the hamburger menu in the upper left and select Deployments:

Hamburger menu - Deployments

Click on the Predict-RI deployment space, which you created in a previous step:

Predict-RI deployment

Click on Assets and select the Random Forest Classifier model:

Select Random Forest Classifier model

Click on New deployment:

New deployment

Select Online for Deployment type, give the deployment a name, and click Create:

Name and create deployment

After a couple of minutes, you should see the status as Deployed:

Status - deployed

If you click model-deploy, you should see the endpoint and code snippets:

Model endpoint

Now let's do some predictions. Click on the Test option and input the data using the form or JSON format:

Input data predictions

Enter the input data using one or more samples (JSON). For a single sample, specify Brussels for REGION and 100 for Total_cases, and then click Add to list:

Single sample input data

You should see the Input list updated with the sample values. Click Predict to generate predictions:

Input list with sample values

As you can see, the predicted value in the Result section is 0. This means that the risk index is predicted as low for the input data for the Brussels region, with about 100 cases on a given day.
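The same prediction can be made programmatically against the deployment's scoring endpoint. Below is a sketch using the requests library; the endpoint URL, bearer token, and version date are placeholders you would take from the deployment's API reference page, and the field names mirror the Test form above:

```python
def build_payload(region, total_cases):
    """Mirror the Test form's fields in the scoring request body."""
    return {
        "input_data": [{
            "fields": ["REGION", "Total_cases"],
            "values": [[region, total_cases]],
        }]
    }

def score_risk(endpoint, token, region="Brussels", total_cases=100):
    """POST one sample to the deployed model and return the predicted risk index."""
    import requests  # assumes the requests library is installed
    resp = requests.post(
        endpoint,
        json=build_payload(region, total_cases),
        headers={"Authorization": f"Bearer {token}"},
        params={"version": "2021-06-01"},  # placeholder API version date
    )
    resp.raise_for_status()
    return resp.json()["predictions"][0]["values"][0][0]  # e.g. 0 means low risk
```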

You have now learned how to build AI predictive models without using any code, deploy the model, and generate predictions. Feel free to play around and get comfortable using AutoAI to generate accurate predictions.

5. Visualize the predictions in IBM Cloud Pak for Data Cognos Embedded Dashboard

In this section, you learn how to build responsive data visualizations in Watson Studio's Cognos Embedded Dashboard. You can build interactive charts, tables, graphs, and more in the Cognos Embedded Dashboard. You will use the data generated by the notebooks in the code approach to build the following visualizations:

  • Current trends and future predictions of COVID-19 cases by region
  • Model evaluation metrics, such as actual vs. predicted cases and model loss by region

The section is divided into two sub-sections:

5.1. Set up Cognos Embedded Dashboard

Create a new Cognos Embedded Dashboard

  • Before you get started, download the Covid-19-predictions-dashboard.zip dashboard file and extract the zip file.

  • In the Cloud Pak for Data project, click on Add to Project and select the Dashboard asset type. Dashboard asset type

  • Select Create a new dashboard from a local file:

    • Upload the extracted Covid-19-Predictions-Dashboard.json file.
    • Enter a name for the dashboard.
    • Click on Create.

Relink data assets to the dashboard

  • Once the dashboard is created, you should see a message saying Missing data asset (1/9). Missing data asset

  • To relink the missing data assets, click on Relink, select Data Assets, and then link each of the following datasets:

    • Brussels.csv
    • brussels-next7Prediction.csv
    • wallonia-next7Prediction.csv
    • brussels-actualVsPredicted.csv
    • brussels-errorEvaluation.csv
    • wallonia-actualVsPredicted.csv
    • wallonia-errorEvaluation.csv
    • Wallonia.csv

Relink missing data assets

Once all the assets are relinked, you will see the dashboard view:

Dashboard view of relinked assets

The next section provides more details about the dashboard.

5.2. Analyze Cognos Embedded Dashboard

The dashboard has two tabs: Trends and Model Evaluation.

  • The Trends tab has the following widgets for the Brussels and Wallonia regions:

    • Total Cases: Shows the total number of cases for the region.
    • Region Map: Shows the map of the region.
    • Current Trends: Shows the current trends for the region.
    • 7 Days Prediction: Shows the 7-day prediction for the region.

      Cognos dashboard - Trends tab

  • The Model Evaluation tab has the following widgets for the Brussels and Wallonia regions:

    • Actual vs. Prediction: Shows the actual vs. predicted values for the model of a particular region.
    • Model Loss: Shows the model loss for the model of the particular region.

      Cognos dashboard - Model Evaluation tab

The dashboard is interactive, so you can click on any data point from the dashboard to see the details change in real time.

Summary

In this tutorial, you learned how to build time-series and decision tree machine learning models on IBM Cloud Pak for Data Jupyter Notebooks, and visualize the results on IBM Cloud Pak for Data Embedded Dashboard on Amazon Web Services (AWS) Cloud using the code approach. You also learned how to build models and deploy them with AutoAI using the no-code approach.

License

This tutorial is licensed under the Apache License, Version 2.0. Separate third-party code objects invoked within this tutorial are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.0. For more information, read the Apache License FAQ.