Predict flight delays using big data and R4ML

Summary

In this developer code pattern, we use R4ML, a scalable R package running on IBM Watson™ Studio, to work through several machine-learning exercises. If you are unfamiliar with Watson Studio, it is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools such as RStudio, Jupyter Notebooks, and Spark to collaborate, share, and gather insight from their data.

Description

If you’re a data scientist who needs to train classification models at scale using a support vector machine (SVM), or to tune those models using cross-validation, you’ve come to the right place.

In the age of big data, enormous volumes of data are generated every day, and analyzing that data well matters for business results. Traditional data science tools, however, do not scale to big data, which is why frameworks like Apache Spark were created. R4ML runs on top of Apache Spark to bring that scalability to machine learning in R.

This pattern provides an SVM example to demonstrate the ease and power of R4ML in implementing scalable classification. R4ML also provides various out-of-the-box algorithms to experiment with. If you are new to R4ML, or want details on its functionality, support, documentation, and roadmap, please see the related links.
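To make the two ideas named above concrete (training an SVM classifier and tuning it with cross-validation), here is a minimal, single-machine sketch in plain Python. It is not R4ML's API and not the code from the pattern's notebook: every function name here (`make_toy_data`, `train_linear_svm`, `cross_validate`, `tune`) is hypothetical, the data is a synthetic toy set rather than airline records, and the optimizer is a simple stochastic subgradient method chosen for brevity. A scalable library would distribute the same computation across a cluster.

```python
import random


def make_toy_data(n_per_class=20, seed=0):
    """Two well-separated synthetic clusters with labels +1 and -1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            # Cluster centers at (2, 2) and (-2, -2), uniform noise in [-1, 1].
            X.append([2 * label + rng.uniform(-1, 1) for _ in range(2)])
            y.append(label)
    return X, y


def train_linear_svm(X, y, lam, lr=0.01, epochs=10, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularizer shrinks the weights on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Hinge loss is active: step toward classifying X[i] correctly.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(model, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def accuracy(model, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, lam, k=5):
    """Mean held-out accuracy over k folds for one regularization strength."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = train_linear_svm([X[i] for i in train], [y[i] for i in train], lam)
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k


def tune(X, y, grid=(0.001, 0.01, 0.1)):
    """Pick the regularization strength with the best cross-validated accuracy."""
    scores = {lam: cross_validate(X, y, lam) for lam in grid}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    X, y = make_toy_data()
    best_lam, scores = tune(X, y)
    model = train_linear_svm(X, y, best_lam)
    print("CV accuracy per lam:", scores)
    print("best lam:", best_lam, "training accuracy:", accuracy(model, X, y))
```

The point of the sketch is the structure: each candidate regularization strength is scored only on data the model did not see during training, and the final model is refit on all of the data with the winning value. On real, noisy data the candidate values would typically produce different scores, which is what makes the tuning step worthwhile.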

We will use the Airline On-Time Statistics and Delay Causes dataset from RITA. A 1-percent sample of the dataset is available from the American Statistical Association (ASA). All of the data is in the public domain. We will be using a subset of this dataset that ships with R4ML, but the pattern also works with the larger RITA dataset.

After you proceed through this pattern, you will understand how to:

  • Use Jupyter Notebooks to load, visualize, and analyze data.
  • Run Jupyter Notebooks in IBM Watson Studio.
  • Leverage R4ML to conduct preprocessing and exploratory analysis with big data.

Flow

  1. Load the provided notebook into IBM Watson Studio.
  2. The notebook interacts with an Apache Spark instance.
  3. A sample big data dataset is loaded into a Jupyter Notebook.
  4. R4ML, running atop Apache Spark, is used to perform machine learning.

Instructions

Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.