Flight delays are an inconvenience. Wouldn’t it be great to predict how likely a flight is to be delayed? You could remove uncertainty and let travelers plan ahead. Usually, the weather is to blame for delays. So I’ve crafted an analytics solution based on weather data and past flight performance.

This solution takes weather info from Weather Company Data for IBM Bluemix and combines it with flight history from flightstats.com to build a predictive model that can forecast delays. To load and combine all this data, we use our Simple Data Pipe open source tool to move it into a NoSQL Cloudant database. Then I use Spark MLlib to train predictive models using supervised learning algorithms and cross-validate them.

flight predict architecture

About Predictive Modeling

To create a solution that can make accurate predictions, we need to tease meaningful information out of our data to craft a predictive model that can make guesses about future events. We do this using our historical weather and flight data, which we divvy up into 3 parts:

  • the training set helps discover potentially predictive variables and relationships between them.
  • the test set assesses the strength of these relationships and improves them, shaping our model.
  • finally, the blind set validates the model.
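
As a rough illustration of the three-way split described above, here is a minimal sketch in plain Python. The 60/20/20 proportions, the function name split_three_ways, and the flight_id field are assumptions made for this example; the tutorial's connector prepares its own data sets.

```python
import random


def split_three_ways(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test, and blind sets.

    The default proportions are illustrative assumptions, not values
    taken from the tutorial.
    """
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]  # whatever is left over
    return training, test, blind


# Made-up records standing in for historical flights.
flights = [{"flight_id": i} for i in range(100)]
training_set, test_set, blind_set = split_three_ways(flights)
print(len(training_set), len(test_set), len(blind_set))
```

Keeping the blind set untouched until the very end is what makes it a fair final check: the model never sees those records while it is being shaped.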

Here’s the iterative flow:

Flow Diagram

Set Up a Flightstats Account

We get our historical data from flightstats.com, so you’ll need to create an account to get access to their data sets.

Save Time! If you don’t feel like walking through flightstats account setup but want to understand the analytics, you can use a sample database I created. Skip ahead to the Add Weather Company Data for IBM Bluemix service section.

  1. Sign up for a free developer account at FlightStats.com.
  2. Fill out the form and watch your email for the confirmation link (access to the APIs may take up to 24 hours).
  3. Once you get your access confirmation email, go to https://developer.flightstats.com/admin/applications and copy your Application ID and Application Key (you will need them in a few minutes).

    get flightstats keys

    Tip: While you’re here, you can also explore the flightstats APIs.



Deploy Simple Data Pipe

The Simple Data Pipe is a handy data movement tool our team created to help you get and combine JSON data for use where you need it. The fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too.

Using my sample credentials? In that case, you don’t need to import data with the pipe. Feel free to read and understand, but then skip ahead to: Add Weather Company Data for IBM Bluemix service.

Deploy to Bluemix

If you don’t already have a Bluemix account, you’re prompted to create one. You can sign up for a free trial.

If you would rather deploy manually, or have any issues, refer to the readme.

When deployment is done, leave the Deployment Succeeded page open. You’ll return here in a minute.

Add Weather Company Data for IBM Bluemix service

To work its magic, the flight predict connector that we’re about to install needs weather data. So add IBM’s Weather Company Data for IBM Bluemix service:

  1. Open a new browser window or tab, and in Bluemix, go to the top menu, and click Catalog.
    If you don’t yet have a Bluemix account, sign up for a free trial.
  2. In the Search box, type Weather, then click Weather Company Data for IBM Bluemix.

  3. Under app, click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. (If you’ll use my sample data, leave it unbound.)
  4. Choose a plan. You have 2 choices, depending upon how you’re following this tutorial:
    • If you’re completing all steps and importing your own data, then choose the Premium plan to ensure you’ll have enough authorized API calls to try out this app.
    • If you’ll use my sample data, choose the Free plan.
  5. Click Create.
  6. If you’re prompted to restage your app, do so by clicking Restage.

If you’re using my sample data, skip ahead to the Create an IPython notebook section.

Upgrade Cloudant Plan

The Cloudant service that comes bundled in the Simple Data Pipe app is set on the free Lite plan, which is too limited to import the data you need to complete this tutorial. To proceed, do one of the following:

  • If you want to run all parts of this tutorial yourself, upgrade Cloudant. To do so, go to your Bluemix dashboard. Find and open your Cloudant service, then click the Plan tab. Choose Standard. Follow prompts to restage and enter credit card details.
  • Ride for free! Follow along using our sample data. Just skip ahead to the Create an IPython notebook section.

Install Flightstats Connector

I created a custom connector for the Simple Data Pipe app that loads and combines historical flight data from flightstats.com with weather data from Weather Company Data for IBM Bluemix.

Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry.

  1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button.
  2. Click the package.json file to open it.
  3. Edit the package.json file to add the following line to the dependencies list:

    "simple-data-pipe-connector-flightstats": "*"

    Tip: Be sure to end the line above your new line with a comma and follow proper JSON syntax.

  4. From the menu, choose File > Save.
    Save changes
  5. Press the Deploy app button and wait for the app to deploy again.

    deploy button
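
If you prefer to double-check the package.json edit from step 3 before redeploying, the following sketch shows the same change made programmatically. It is only an illustration: the trimmed-down package.json content (including the express entry) and the add_dependency helper are hypothetical, and the real file in your app lists more entries.

```python
import json

CONNECTOR = "simple-data-pipe-connector-flightstats"


def add_dependency(package_json_text, name=CONNECTOR, version="*"):
    """Return package.json text with one more entry under "dependencies"."""
    # json.loads raises ValueError if the text is not valid JSON,
    # for example when a comma is missing between two entries.
    package = json.loads(package_json_text)
    package.setdefault("dependencies", {})[name] = version
    return json.dumps(package, indent=2)


# Hypothetical, trimmed-down package.json used only for this example.
original = '{"name": "simple-data-pipe", "dependencies": {"express": "*"}}'
print(add_dependency(original))
```

Parsing the file this way also catches the missing-comma mistake mentioned in the tip, because invalid JSON fails to load.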

Load the Data

We’ll load 2 sets of data: an initial set of flight data from 10 major airports, and a test set that the connector prepares for you.

Load initial data set

  1. Launch simple data pipe in one of the following ways:
    • In the code editor where you redeployed, go to the toolbar and click the Open button for your simple data pipe app.
      open url button
    • Or, in Bluemix, go to the top menu and click Dashboard. Find your Simple Data Pipe app and click its URL or the Open URL button.
      launch app
  2. In Simple Data Pipe, go to the menu on the left and click Create a New Pipe.
  3. Click the Type dropdown list, and choose Flight Stats.
    Type dropdown
    When you added a Flightstats connector earlier, you added the option you’re choosing now.
  4. In Name, enter training (or anything you want).

  5. If you want, enter a Description.

  6. Click Save and continue.

  7. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account.
  8. Click Connect to FlightStats.
    You see a You’re connected confirmation message.

  9. Click Save and continue.

  10. On the Filter Data screen, click the dropdown arrow and select Mega SubSet from 10 busiest airports. Then click Save and continue.

  11. Click Skip, to bypass scheduling.
  12. Click Run now.

    View your progress: If you want, you can watch the data load in progress. In a separate browser tab or window, open or return to Bluemix. Open your Simple Data Pipe app, go to the menu on the left, and click Logs.

    When the data’s done loading, you see a Pipe Run complete! message.

Load Test Set

Create a new pipe again to load test data.

  1. In your Simple Data Pipe app, click Create a new Pipe.
  2. In the Type dropdown, select Flight Stats.
  3. In Name enter test.
  4. If you want, enter a Description.
  5. Click Save and Continue.
  6. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account.
  7. Click Connect to FlightStats.
    You see a You’re connected confirmation message.
  8. Click Save and continue.
  9. On the Filter Data screen, click the dropdown arrow and select Test set. Then click Save and continue. As with the initial data set, skip scheduling and run the pipe.

Create an IPython notebook

Shortcuts: If you’ve opted to use my sample credentials, go through the following steps to create the notebook and run its commands. If you want to skip these notebook creation steps too, you can follow the rest of this tutorial by viewing this prebuilt notebook on Github: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb

Create a notebook on IBM’s Data Science Experience (DSX):

  1. Sign in or create a trial account on DSX.
  2. Create a new project (or select an existing project).

    On the upper right of the screen, click the + plus sign and choose Create project.

  3. Add a new notebook (From URL) within the project.
    1. Click add notebooks.
    2. Click From URL.
    3. Enter notebook name.
    4. Enter notebook URL: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
    5. Select the Spark Service.
    6. Click Create Notebook.

If prompted, select a kernel for the notebook. The notebook should successfully import.

Install Python package and add service credentials

Here, we install the Python library I created. It encapsulates helper APIs in a Python package, so the code you write inline in notebook cells stays short while the package performs most of the hard work. (See this library on GitHub.)

When you use a notebook in DSX, you run a cell by selecting it, then clicking the Run Cell (▸ icon) button. If you don’t see the Run Cell button and Jupyter toolbar, go to the toolbar and click the pencil icon (Edit).

  1. Run the first cell of the notebook, which contains the following commands:
    import training  # module contains APIs to train the models
    import run  # module contains APIs to run the models

Tip: An alternative method to install the package (not recommended for use in this tutorial) is to use pip:

!pip install --user --exists-action=w --egg git+https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git#egg=flightPredict

Compare these 2 ways of using helper Python packages:
 – SparkContext.addPyFile: Easy addition of a Python module file. Supports multiple module files via zip format. Recommended during development, where frequent code changes occur.
 – egg distribution package: pip install from a PyPI server or file server (like GitHub). Persistent install across sessions. Recommended in production.
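
To make the zip option concrete, here is a small sketch that bundles a helper module into a zip archive of the kind SparkContext.addPyFile accepts. The module name delay_helpers, the archive name helpers.zip, and the bundle_modules function are made up for this example. So that the sketch runs without Spark, it imports the module from the zip through sys.path, which only approximates what happens once the file is shipped to a cluster.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def bundle_modules(module_paths, zip_path):
    """Write the given .py files into the top level of a zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in module_paths:
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


# Build a throwaway helper module so the sketch is self-contained.
workdir = tempfile.mkdtemp()
helper_path = os.path.join(workdir, "delay_helpers.py")
with open(helper_path, "w") as f:
    f.write("def is_delayed(minutes):\n    return minutes > 0\n")

bundle = bundle_modules([helper_path], os.path.join(workdir, "helpers.zip"))

# In a notebook with a live SparkContext named sc, you would now call:
#   sc.addPyFile(bundle)
# Locally, putting the zip on sys.path makes its modules importable too.
sys.path.insert(0, bundle)
delay_helpers = importlib.import_module("delay_helpers")
print(delay_helpers.is_delayed(25))
```

Because the archive is rebuilt from source each time, this route suits the frequent code changes of development, while a pip install is the better fit once the code settles.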

Add credentials

Before your new notebook can work with flight and weather data, it needs access. To grant it, add your Cloudant and Weather service credentials to the notebook.

Using my sample credentials? Skip ahead to Step 4 and confirm that you see the following values:
cloudantHost: dtaieb.cloudant.com
cloudantUserName: weenesserliffircedinvers
cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5
weatherUrl: --insert your Weather URL here--
To get only your Weather Company Data URL, go to your Bluemix dashboard, open the service, and click Service Credentials. Click View credentials and copy the URL.

  1. In Bluemix, open your app’s dashboard.
  2. In the menu on the left, click Environment Variables.
  3. Copy credentials for Cloudant and/or Weather Company Data.


  4. Return to your notebook, and in the second cell, paste in your credentials, replacing the ones there. (If you’re just following along in the notebook, leave existing credentials in place, except for the weatherUrl.)
    enter svc creds

  5. Run that cell to import the Python modules the notebook uses and to connect to services.
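
If you would rather not copy values by hand, the credentials shown under Environment Variables can be pulled out of the JSON document with a few lines of Python. This is a sketch under assumptions: the service labels (cloudantNoSQLDB, weatherinsights), the key names, and every value in the sample are placeholders, so check them against what your own dashboard shows. The find_credentials helper is hypothetical and not part of the tutorial's package.

```python
import json


def find_credentials(vcap_services_json, label_fragment):
    """Return the credentials of the first service whose label contains
    label_fragment (case-insensitive), or None if nothing matches."""
    services = json.loads(vcap_services_json)
    for label, instances in services.items():
        if label_fragment.lower() in label.lower() and instances:
            return instances[0].get("credentials")
    return None


# Placeholder document shaped like the environment variables JSON.
# Labels, keys, and values are all made up for illustration.
sample = json.dumps({
    "cloudantNoSQLDB": [{"credentials": {
        "host": "example.cloudant.com",
        "username": "example-user",
        "password": "example-password"}}],
    "weatherinsights": [{"credentials": {
        "url": "https://user:pass@weather.example.com"}}],
})

cloudant = find_credentials(sample, "cloudant")
weather = find_credentials(sample, "weather")

# These names mirror the values the notebook's second cell expects.
cloudantHost = cloudant["host"]
cloudantUserName = cloudant["username"]
cloudantPassword = cloudant["password"]
weatherUrl = weather["url"]
print(cloudantHost, weatherUrl)
```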

Train the machine learning models

  1. Load the training set into a Spark SQL DataFrame.

    Within the next cell, make sure the training dbName is your dbname from Cloudant. (To find it, go to your Simple Data Pipe app dashboard, click the Cloudant tile, then click Launch. The Cloudant dashboard shows your dbname.)
    cloudant database name

    Then run the following code:

    %time cloudantdata = training.loadDataSet(dbName,"training")
    %time cloudantdata.printSchema()
    %time cloudantdata.count()
  2. Visualize classes in a scatter plot.

    Run the next 3 cells to plot delays based on factors like temperature, pressure, and wind speed. These plots are a good first step to check the distribution and possibly identify patterns.

  3. Load the training data as an RDD of LabeledPoint.

    Run the following code, which uses the Spark SQL connector to load the data and convert it into an RDD of LabeledPoint objects.

    trainingData = training.loadLabeledDataRDD("training")

  4. Train multiple classification models.

    Here we apply several machine-learning classification algorithms. To ensure accuracy of our predictions, we test the following different methods, and use cross-validation to choose the best one. Run the next few cells to train:

    • Logistic Regression Model
    • Naive Bayes Model
    • Decision Tree Model
    • Random Forest Model
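
To show what a labeled point is before it reaches these classifiers, here is a minimal sketch that runs without Spark. The namedtuple is a stand-in for pyspark.mllib.regression.LabeledPoint, which pairs a numeric label with a feature vector. The record field names, the on-time class, and the middle delay class are assumptions for illustration; the tutorial's training module defines the real mapping.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint (label, features),
# used here so the sketch runs without a Spark installation.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Hypothetical feature names; the real data set may use different fields.
FEATURES = ["temperature", "pressure", "wind_speed"]


def delay_class(delay_minutes):
    """Map a delay in minutes to a class index (assumed class layout)."""
    if delay_minutes <= 0:
        return 0  # on time (assumed class)
    if delay_minutes < 120:
        return 1  # delayed less than 2 hours
    if delay_minutes <= 240:
        return 2  # delayed between 2 and 4 hours (assumed class)
    return 3  # delayed more than 4 hours


def to_labeled_point(record):
    """Turn one flight record into a (label, features) pair."""
    features = [float(record[name]) for name in FEATURES]
    return LabeledPoint(float(delay_class(record["delay_minutes"])), features)


# Made-up record for demonstration only.
record = {"temperature": 3.5, "pressure": 1012, "wind_speed": 22,
          "delay_minutes": 95}
print(to_labeled_point(record))
```

Each classifier in the list consumes the same labeled points, which is what makes it practical to train several models and compare them.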

Test the models

  1. Load test data

    Make sure dbTestName is set to the test database name from Cloudant (check your Cloudant dashboard as you did in the preceding section). Then run the following code:

    testCloudantdata = training.loadDataSet(dbTestName,"test")
  2. Run Accuracy metrics

    Run the next cell to compare the performance of the models.

    accuracy metrics

  3. Run the next few cells to get confusion matrixes for each model.

    While the metrics table we just created can tell us which model performs well overall, the confusion matrixes let us see the performance of individual classes (like Delayed less than 2 hrs) and help us decide if we need more training data or if we need to change classes or other variables.

  4. Plot the distribution of your data with Histograms

    Run the code in cell 15 to refine classifications and see a bar chart. Each bar is a bin (group of data points). You can specify different numbers of bins to examine data distribution and identify outliers. This info, combined with the confusion matrix results, helps you quickly uncover issues with your data. Then you can fix them and create a better predictive model.

    bar chart
    If you see an extremely long tail here (lots of bins that yield few results), you may have a data distribution issue, which you could solve by tweaking your classes. For example, this graph prompted me to change Delayed more than 4 hours and Delayed less than 2 hours to shorter increments of: Delayed less than 13 minutes, Delayed between 13-41 minutes, and Delayed more than 41 minutes. Doing so improved accuracy and helped us include the most meaningful results in our model.

  5. Customize the training handler.

    Run the cell beneath the bar chart to provide the new classification and add day of departure as a new feature. This code also rebuilds the models and recomputes the accuracy metrics.
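
The refined classes and the confusion matrix can be tried out on a handful of made-up numbers. The sketch below uses the 13 and 41 minute thresholds from the text; how the boundary values themselves are assigned is an assumption, and the delays and predictions are invented purely to exercise the code. None of this is the notebook's own implementation.

```python
from collections import Counter

CLASSES = [
    "Delayed less than 13 minutes",
    "Delayed between 13-41 minutes",
    "Delayed more than 41 minutes",
]


def classify_delay(minutes):
    """Return the class index for a delay (boundary handling is assumed)."""
    if minutes < 13:
        return 0
    if minutes <= 41:
        return 1
    return 2


def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in range(n_classes)]
            for a in range(n_classes)]


def recall_per_class(matrix):
    """Fraction of each actual class that the model labeled correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Invented delays (minutes) and invented model predictions.
actual = [classify_delay(m) for m in [5, 20, 50, 10, 45, 30]]
predicted = [0, 1, 2, 1, 2, 0]

matrix = confusion_matrix(actual, predicted, len(CLASSES))
for name, row, recall in zip(CLASSES, matrix, recall_per_class(matrix)):
    print(name, row, recall)
```

Reading the matrix row by row shows which classes a model confuses with each other, which is the signal that prompts a change of class boundaries like the one described in step 4.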

Run the models

Now our predictive model is in place! Our app is working with enough accuracy to let flyers enter flight details and see the likelihood of a delay.

Run the final cell.

bos to aus

If you want, replace the flight details (in red) with info on an upcoming flight of yours and run it again to see if you’ll make it on time.


Predictive modeling is an art form and an intensely iterative process. It requires substantial data sets and a fast, flexible way to test and tweak approaches. Simple Data Pipe let us load the pertinent data into Cloudant. From there, we used IBM Analytics for Apache Spark to create a notebook for analysis and modeling. You saw how flexible a Python notebook can be. Using it in combination with APIs in my Python package let us leverage Spark MLlib to train predictive models and cross-validate fast and effectively.

Feel free to play with this code and extend it. For example, a great improvement for deploying this app in production would be to create a custom card for Google Now that automatically notifies a mobile user of impending flight delays and then proposes alternative flight routes using Freebird.

Also, read how we enhanced our Flight Predict notebook adding an interactive app and visualizations built using PixieDust, the open source Python helper library.
