Big data preparation and exploration using R4ML, a scalable ML framework

Introduction

The “Perform big data preparation and exploration” code pattern is for anyone new to Watson™ Studio and scalable machine learning who wants to perform data exploration and data preparation tasks on big data using R4ML, which augments the capabilities of Apache SparkR.

This code pattern uses the Airline On-Time Statistics and Delay Causes dataset from RITA. A subset of this data ships with R4ML, and we will use that subset here; note that the code pattern also works with the full RITA dataset.

Scalable Data Analysis and Exploration

We live in the age of big data. Tons of data are generated every day, and it is important for analysts and data scientists to analyze it for optimal business results. However, traditional data science tools like R and Python-based scikit-learn do not scale to big data. Hence, frameworks like Apache Spark and Apache Hadoop were created. R4ML is one approach toward that goal.

R4ML explained

R4ML is built on top of Apache SparkR and Apache SystemML. SparkR represents big data as a distributed data frame, while SystemML represents it as a matrix.

R4ML is an open source scalable machine-learning framework built using Apache Spark and Apache SystemML, allowing R scripts to invoke custom algorithms developed in Apache SystemML. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same scripts. In addition, the R4ML package provides a number of useful new R functions that simplify common data cleaning and statistical analysis tasks.
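To make this concrete, here is a minimal bootstrap sketch (assuming R4ML and Spark are installed; the toy data frame and variable names are purely illustrative):

```r
# Minimal R4ML bootstrap (assumes R4ML is installed and can locate Spark,
# e.g. via the SPARK_HOME environment variable).
library(R4ML)

# Starting the session initializes SparkR and SystemML under the hood.
r4ml.session()

# Promote an ordinary local data.frame to a distributed r4ml.frame.
# The toy data here stands in for real big data.
df <- data.frame(height = c(62, 70, 74), weight = c(120, 180, 210))
hf <- as.r4ml.frame(df)

# r4ml.frame builds on SparkR's SparkDataFrame, so SparkR verbs work on it.
SparkR::showDF(hf)

# Shut the session down when finished.
r4ml.session.stop()
```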

Explore and analyze data

This code pattern is divided into two notebooks. We will follow the notebook R4ML introduction and exploratory analysis. In this tutorial-style notebook, we first provide an overview of the R4ML R package, its integration with Spark and SystemML, and its installation. The following topics will be covered:

  • Loading the big data set.
  • Uniformly sampling it so that we can do visual data exploration with the popular ggplot2 package (see the sketch after this list).
  • Exploring alternative ways of reaching the same conclusions using a scalable analytical approach.
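Condensed, those steps might look like the following sketch (the bundled airline sample and the DepDelay column name are assumptions based on the RITA schema; check the data shipped with your R4ML version):

```r
library(R4ML)
library(ggplot2)

r4ml.session()

# Load the airline sample shipped with R4ML into a distributed r4ml.frame.
# `airline` is assumed to be the name under which the sample is exposed;
# see data(package = "R4ML") for the exact name in your version.
air_hf <- as.r4ml.frame(airline)

# Uniformly sample roughly 10% of the rows so they fit in local memory.
samples <- r4ml.sample(air_hf, perc = c(0.1, 0.9))
air_small <- SparkR::as.data.frame(samples[[1]])

# Visual exploration with ggplot2 on the small local sample.
air_small$DepDelay <- as.numeric(air_small$DepDelay)  # raw column may be character
ggplot(air_small, aes(x = DepDelay)) +
  geom_histogram(binwidth = 15, na.rm = TRUE)
```

Sampling keeps the plotting step local and cheap while the full data set stays distributed; the same conclusions can then be cross-checked at full scale with the scalable analytical approach.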

Data preparation and dimensionality reduction

Typically, data scientists spend most of their time on data preparation and feature engineering, and one can expect the same with big data. R4ML provides many out-of-the-box utilities to help users with these tasks.

Data preparation

R4ML supports the following common data preparation methods; the corresponding R4ML options are given in parentheses.

  • NA removal (imputationMethod, imputationValues, omit.na): These options let one remove rows with missing data, or substitute missing values with a constant or, for numeric columns, the mean.
  • Binning (binningAttrs, numBins): One typical use case of binning: we have the heights of people in feet and inches, but we only care about three categories (e.g. short, medium, and tall).
  • Scaling and centering (scalingAttrs): Most algorithms become more predictive if the data is normalized (i.e. subtract the mean, then divide by the standard deviation).
  • Encoding (recode) (recodeAttrs): Since most machine-learning algorithms boil down to matrix-based linear algebra, categorical columns with an inherent order, like height or shirt_size, can be encoded as ordinal categorical numeric values with this option.
  • Encoding (OneHot or DummyCoding) (dummyCodeAttrs): This is useful when categorical columns have no inherent order, like a person’s race or the state they live in.

We will walk users through applying these steps on big data with an example.
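As a sketch of what that walkthrough looks like (column names follow the RITA airline schema, the transform path is illustrative, and the exact argument names should be verified against the r4ml.ml.preprocess documentation):

```r
# One pass over the common preparations above with r4ml.ml.preprocess.
# Column names and the transform path are illustrative.
prep <- r4ml.ml.preprocess(
  air_hf,
  transformPath  = "/tmp/airline.transform",  # where transform metadata is persisted
  recodeAttrs    = c("Origin", "Dest"),       # ordered categoricals to numeric codes
  dummyCodeAttrs = c("UniqueCarrier"),        # one-hot encode (no inherent order)
  binningAttrs   = c("CRSDepTime"),           # bin scheduled departure times
  numBins        = 4,
  scalingAttrs   = c("Distance"),             # subtract mean, divide by stddev
  omit.na        = c("DepDelay")              # drop rows with a missing departure delay
)
air_prep <- prep$data  # the transformed r4ml.frame (slot name assumed)
```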

Dimensionality reduction using scalable PCA

Dimensionality reduction Dimensionality reduction is choosing a basis or mathematical representation within which you can describe most, but not all, of the variance within your data, thereby retaining the relevant information while reducing the amount of information necessary to represent it. There are a variety of techniques for doing this, including but not limited to PCA, ICA, and matrix factorization. All of these take existing data and reduce it to the most discriminative components, letting you represent most of the information in your dataset with fewer, more discriminative features.

Why dimensionality reduction is useful In terms of performance, having data of high dimensionality is problematic because:

  • It can mean high computational cost to perform learning and inference.
  • It often leads to overfitting when learning a model, which means that the model will perform well on the training data but poorly on test data.

Dimensionality reduction addresses both of these problems, while (hopefully) preserving most of the relevant information in the data needed to learn accurate, predictive models.

Also note that, in general, visualizations of lower-dimension data and its interpretation are more straightforward and could be used for gaining insight into the data.

Running an example of dimensionality reduction using R4ML Finally, we will walk users through an example of how to run dimensionality reduction using R4ML’s scalable API r4ml.pca. We will discuss the results and use various visual and statistical techniques to verify that the PCA captures more than 90 percent of the variance with 50 percent of the features.
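In code, that workflow might be sketched as follows (the choice of k and the way the eigenvalues are read off the model are assumptions; consult the r4ml.pca documentation for the exact return shape):

```r
# Scalable PCA with r4ml.pca. PCA needs an all-numeric input, so the prepared
# r4ml.frame is first converted to an r4ml.matrix.
air_mat <- as.r4ml.matrix(air_prep)
pca_model <- r4ml.pca(air_mat, k = 6)

# Cumulative proportion of variance explained, assuming the model exposes the
# eigenvalues of the covariance matrix (slot name assumed).
eig <- pca_model@eigen.values
cum_var <- cumsum(eig) / sum(eig)
print(cum_var)  # verify the retained components capture > 90% of the variance
```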

Summary

Awesome job going through the blog! Now take this further or apply it to a different use case. I encourage you to apply the techniques discussed here to other big data sets, and I hope you’ll take the time to check out the code, follow along, and build upon it!