IBM Cloud Pak for Data is a cloud-native and multiuser data science platform that assists the development of a complete machine learning workflow.
This tutorial uses a flight delay analysis use case (predicting which flights are likely to be cancelled or diverted) to demonstrate a range of Cloud Pak for Data features. The example was developed on an IBM POWER9 processor-based server, an enterprise system built for data and AI workloads and therefore a reliable on-premises infrastructure for running Cloud Pak for Data workloads.
The use case demonstrated in this tutorial involves the following tasks:
- Import the required data set and preprocess it using the Data Refinery tool
- Visualize and analyze the data set in Jupyter Notebook
- Train and compare machine learning models using AutoAI features
- Deploy the model in IBM Watson Machine Learning and run inferences
The intended audience for this tutorial includes data scientists and technical specialists who want to get practical experience with Cloud Pak for Data and discover its features.
All additional files mentioned in this tutorial are publicly available at the following link: https://github.com/IBM/Flight-delays-tutorial-Cloud-Pak-for-Data.
For an airline carrier, it is useful to analyze flight data to understand what variables can cause a flight delay or lead to cancellation. This analysis is essential to minimize such incidents, provide a better service, and optimize related costs (customer refunds).
We used this scenario in our lab, based on a public Kaggle dataset containing a record of all US flights for the month of January 2019. It contains 21 columns including information about flight origin and destination, departure and arrival times, airline carrier information, and whether the flight has been diverted or cancelled.
We will apply machine learning and see if some of these variables influence the flight outcome.
To carry out this experiment, which predicts which flights are likely to be cancelled or diverted using a public flight record data set, you need the following prerequisites.
- Access to a Cloud Pak for Data version 3.5 instance (we can provide access to one; contact me if needed)
- Basic knowledge of Python 3 and the pandas and Seaborn libraries, to understand the notebook
- Basic knowledge of data science:
  - Pre-processing, training, and inference
  - True and false positives, accuracy, and so on
- [Optional] A Python interpreter to make remote requests to the deployed model
The estimated time to complete this lab is around 1 hour and 30 minutes.
You need to perform the steps provided in the following sections to predict flight delays and cancellations.
Step 1 – Load data
For this experiment, we provide a (slightly preprocessed) version of the Kaggle data set.
Download the data set Flights_Jan2019.csv from the GitHub repository: https://github.com/IBM/Flight-delays-tutorial-Cloud-Pak-for-Data.
To create a new project in Cloud Pak for Data, in the left pane, click Projects -> All projects.
On the Projects page, click New Project.
On the New project page, click Analytics project and then click Create an empty project. Enter the project name as flight-delays and click Create.
In the Assets tab, click the browse link to import the data set you previously downloaded. After uploading, the data set is displayed in the Data section.
Step 2 – Preprocess the data
Let us now explore the data set and preprocess it using the Data Refinery tool.
In the Data section, click the data set that you have uploaded. The preview shows that it contains 15 columns and a sample of 1000 rows. Exploring the columns, notice that there are two columns for the origin airport (its ID and its code) and two similar columns for the destination airport. This information is redundant, so we can drop the ID columns using Data Refinery.
Click Refine to do this directly in the Data Refinery service. A preview of the data set appears, along with a flow of operations that can be applied.
Add an operation to that flow. Click Operation and then click Remove. Then select the ORIGIN_AIRPORT_ID column that you want to remove and click Apply to add the operation to the flow.
Repeat the same step to remove the DEST_AIRPORT_ID column.
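For readers who prefer to see the two Remove operations as code, here is a minimal sketch of the same idea using only the Python standard library. The ID column names come from the steps above; the ORIGIN and DEST code columns and the sample rows are assumptions made up for illustration, and this is not what Data Refinery runs internally.

```python
import csv
import io

# ID column names taken from the tutorial text.
REDUNDANT_COLUMNS = ("ORIGIN_AIRPORT_ID", "DEST_AIRPORT_ID")


def drop_columns(rows, columns):
    """Return copies of the row dicts without the given column names."""
    unwanted = set(columns)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


# Hypothetical extract standing in for Flights_Jan2019.csv;
# the ORIGIN and DEST column names and all values are assumptions.
sample_csv = io.StringIO(
    "ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST\n"
    "10397,ATL,11298,DFW\n"
    "12478,JFK,12892,LAX\n"
)

rows = list(csv.DictReader(sample_csv))
shaped_rows = drop_columns(rows, REDUNDANT_COLUMNS)
print(list(shaped_rows[0].keys()))
```

In a notebook, the pandas call `DataFrame.drop(columns=[...])` would achieve the same result on a real data frame.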
- Explore the other preprocessing possibilities in the Operation menu.
Optionally, configure the name of the output data set using the options in the Information panel.
Finally, apply the flow: click the save and run icon at the top right (see screenshot), and then click Save and create a job. Enter a name for the preprocessing job, click Next while retaining the default parameters (which optionally let you schedule the job for later), and click Create and run on the summary page.
Wait for the job to complete (1 to 2 minutes) and for the preprocessed data set, Flights-Jan2019_csv_shaped, to appear. To track its progress, navigate to the project home page and click the Jobs tab.
Step 3 – Visualize data
Using a Jupyter notebook, we now explore the data set in-depth, draw some visualizations, and apply further preprocessing.
- Download the flights-delays-notebook.ipynb notebook from the GitHub repository: https://github.com/IBM/Flight-delays-tutorial-Cloud-Pak-for-Data
Click Add to project and then select Notebook as the asset type.
Click the From file tab, and enter a name for the notebook. Then, upload the notebook and click Create.
Read the explanations and run the cells. Make sure you follow the instructions in red at the beginning of the notebook to load your new data set into the notebook.
After running the notebook, go back to the project home page: you should have a new Flights-Jan2019-Clean.csv data set.
These are samples of the plots you can generate by running the notebook:
Step 4 – Train AutoAI models
We can now train a model to predict the flight status: ONTIME, DIVERTED, or CANCELLED.
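To make the three classes concrete, the following sketch shows one plausible way to derive a single status label from two 0/1 flags. It assumes the raw data exposes columns named CANCELLED and DIVERTED; those names, the sample records, and the mapping itself are assumptions for illustration, and the notebook in the repository may build FLIGHT_STATUS differently.

```python
def flight_status(cancelled, diverted):
    """Map two 0/1 flags to one of the three classes used in this tutorial."""
    if cancelled:
        return "CANCELLED"
    if diverted:
        return "DIVERTED"
    return "ONTIME"


# Hypothetical records; CANCELLED and DIVERTED are assumed column names
# and the tail numbers are made up.
flights = [
    {"TAIL_NUM": "N101AA", "CANCELLED": 0, "DIVERTED": 0},
    {"TAIL_NUM": "N202BB", "CANCELLED": 1, "DIVERTED": 0},
    {"TAIL_NUM": "N303CC", "CANCELLED": 0, "DIVERTED": 1},
]

for flight in flights:
    flight["FLIGHT_STATUS"] = flight_status(flight["CANCELLED"], flight["DIVERTED"])

print([flight["FLIGHT_STATUS"] for flight in flights])
```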
On the project home page, click Add to project and select AutoAI Experiment as the asset type.
Enter a name for the experiment, retain the default compute configuration, and click Create.
You need to specify the data set to use. Click Select from project, select the latest Flights-Jan2019-Clean.csv file, and click Select asset.
Select FLIGHT_STATUS as the target column to predict.
Click Experiment settings and then click the Prediction tab.
- Make sure that the prediction type is Multiclass classification because we have three categories.
- Change the optimized metric. Because the data set is unbalanced, the default accuracy metric is not a good choice: a model that only ever predicts ‘ONTIME’ would be useless, yet it could still reach 97% accuracy. It is better to use the F1 Micro metric (an F1 score combines precision and recall).
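The accuracy problem described above can be reproduced on a toy example. The sketch below uses made-up labels with 97% of flights on time and scores a baseline that always predicts ONTIME. It reports accuracy, per-class F1, and the unweighted mean of the per-class scores (macro averaging), which is chosen here for illustration and is not necessarily how AutoAI computes its F1 metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for each label, computed from true/false positives and negatives."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Made-up class distribution: 97 on-time flights, 2 cancelled, 1 diverted.
y_true = ["ONTIME"] * 97 + ["CANCELLED"] * 2 + ["DIVERTED"]
# Baseline that ignores its input and always answers ONTIME.
y_pred = ["ONTIME"] * len(y_true)

f1_scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"per-class F1: {f1_scores}")
print(f"macro F1: {macro_f1:.2f}")
```

The baseline scores 0.97 on accuracy, while its F1 for CANCELLED and DIVERTED is 0 because it never finds a single flight in either class.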
Finally, because training takes a long time, reduce the number of models tested. In the algorithms to include, select the Decision Tree Classifier checkbox. This model is fast and works well for this use case.
Note that infrastructure matters here: performant hardware is crucial when multiple users run heavy workloads such as AutoAI. IBM Power Systems is a powerful hosting platform for Cloud Pak for Data because of its scalability and capacity to handle heavy workloads.
Click Save Settings.
Click Run experiment.
Depending on the cluster size and current load, the experiment duration might vary. You can expect around 10 minutes for the task to complete. Cloud Pak for Data splits the data set (90% for the training set and 10% for the testing set) and trains four decision tree classifiers.
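The 90%/10% split mentioned above can be pictured with a short sketch. It shuffles the rows with a fixed seed and holds out one tenth of them for testing. The function name, the seed, and the use of a simple random shuffle are assumptions; the source does not say how AutoAI selects the holdout rows.

```python
import random


def holdout_split(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the data set: 1000 row identifiers instead of real flights.
rows = list(range(1000))
train_rows, test_rows = holdout_split(rows)

print(len(train_rows), len(test_rows))
```

Evaluating on rows that the model never saw during training is what makes the scores reported in the next step meaningful.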
Step 5 – Assess model performance
After the AutoAI experiment is complete, you can see four trained models.
On the Model Evaluation page, click the best model. The models are likely to have almost the same F1 score: Cloud Pak for Data applies different feature engineering and hyperparameter optimization to each model, which seems to have little to no impact for this use case.
On the Feature Importance page, you can see which variables the AutoAI experiment found to have the most influence on the target column (in this example, the flight status). Surprisingly, the TAIL_NUM column, which contains the airplane tail number, has a large influence. Other influential variables include flight distance and destination airport.
If you are familiar with precision/recall curves, feel free to explore the other evaluation sections, but that is beyond the scope of this tutorial.
Step 6 – Deploy model
So far, we have prepared data and trained a machine learning model. The last step is to deploy it to be able to use it for new data.
First, we must save the best model as a project asset. To do so, on the best model page, click Save as, select Model as the asset type and click Create. Note that you can also export the model as a notebook, for users who want to see and modify the training code.
The model is now saved and visible on your project home page. Go back to that home page and click the model name.
On the next page, click Promote to deployment space. The Target space field is likely to be empty because you don’t have any deployment space yet. Click New space, enter a name for the space, and click Create. You can now select that new deployment space. Then, click Promote.
Your model is now associated with a deployment space. Finally, you need to deploy the model. In the left pane, click Deployments. Then click the deployment space name. Notice that your model is listed as an asset. Click the Deploy (small rocket) icon at the right of the model name (only visible when your mouse is over the model name). Select Online as the deployment type, enter a name for the model, and click Create.
Step 7 – Use model for inference
Now that our model is deployed, we can use it to infer flight status for new data.
Click the model name. You will see an endpoint that is used to make queries from outside Cloud Pak for Data.
Download the sample flights-delays-request.py script from the GitHub repository: https://github.com/IBM/Flight-delays-tutorial-Cloud-Pak-for-Data and modify it:
- Add your model URL (the endpoint above) in the
- Optionally, modify the data to predict in the
Run the script locally on your computer if you have a Python interpreter. Alternatively, you can copy the code into a notebook (that you can create as a project asset) and run it from inside the notebook.
Enter your username and password when prompted and notice that the script calls the model for inference. In this use case, the Cloud Pak for Data model returns a flight status (ONTIME, CANCELLED, or DIVERTED).
This makes it easy to integrate predictions from a Cloud Pak for Data model into existing code and to connect it with external services.
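The following sketch shows roughly what such a remote request can look like. It is not the flights-delays-request.py script from the repository. It assumes that you already hold a bearer token and that the deployment accepts a JSON body with an "input_data" list of "fields" and "values". The helper names, the ORIGIN, DEST, and DISTANCE field names, and the sample values are hypothetical; only TAIL_NUM appears in this tutorial.

```python
import json
import urllib.request


def build_scoring_payload(fields, rows):
    """Build a request body with one list of column names and one list of rows."""
    return {"input_data": [{"fields": list(fields), "values": [list(r) for r in rows]}]}


def score(endpoint_url, token, payload):
    """POST the payload to the deployment endpoint and return the parsed JSON reply.

    endpoint_url and token are placeholders that you must supply yourself.
    """
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical flight to score; field names other than TAIL_NUM are assumptions.
payload = build_scoring_payload(
    ["TAIL_NUM", "ORIGIN", "DEST", "DISTANCE"],
    [["N101AA", "ATL", "DFW", 731]],
)
print(json.dumps(payload, indent=2))
```

The sketch only builds and prints the payload. Calling `score(...)` requires the endpoint of your own deployment and a valid token, and the field list must match the columns the model was trained on.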
In this tutorial, we focused on the Cloud Pak for Data features rather than on real model assessment and improvement. Here are a few ideas for improvement, ranked by increasing difficulty, that you can experiment with to become more confident in using Cloud Pak for Data.
- You can run AutoAI with more model types and see if other models achieve better accuracy or show different results in terms of feature importance.
- We only used data from January 2019, but data from February 2019, January 2020 and February 2020 are also available at: https://www.kaggle.com/divyansh22/flight-delay-prediction and https://www.kaggle.com/divyansh22/february-flight-delay-prediction. You can download them, preprocess them the same way and see if you come up with a better model.
- Edit the Python script to run tests for the whole month of February 2019 (or alternatively, train a model on 2019 data and test on 2020 data) and see if you match the accuracy from Cloud Pak for Data. If not, it would mean that data from January 2019 does not help much in predicting the following months.
- Find other data sources and see if they are relevant. For example, flight issues are likely to be caused by external factors, such as weather events, employee strikes, and so on. You can, for example, find a weather forecast data set, add a ‘predicted weather’ column to the data set, and see if this helps explain the status of flights.
After completing this tutorial, you should be able to navigate a Cloud Pak for Data instance and understand and use some of its features. We showed how to conduct an end-to-end machine learning workflow that you can now reproduce on other data projects to quickly gain insights into your data and prototype models.
I hope you enjoyed this tutorial. For any questions, issues, or improvement ideas, feel free to contact me at email@example.com.