
Archived | Perform big data preparation and exploration

Archived content

Archive date: 2019-06-04

This content is no longer being updated or maintained. The content is provided “as is.” Given the rapid evolution of technology, some content, steps, or illustrations may have changed.


This developer code pattern uses R4ML, a scalable R package, running on IBM Watson Studio to perform various machine-learning exercises. Developers who are new to Watson Studio and scalable machine learning, and who are interested in using big data for data exploration and data preparation tasks, will learn how to use R4ML, which augments the capabilities of the Apache Spark R framework.


In this code pattern, we will use R4ML, a scalable R package, running on IBM Watson™ Studio to perform various machine-learning exercises. For users who are unfamiliar with it, Watson Studio is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools such as RStudio, Jupyter Notebooks, and Spark to collaborate, share, and gather insight from their data.

We live in the age of big data. Large volumes of data are generated every day, and analysts and data scientists need to analyze that data to produce business results. However, traditional data science tools such as R and the Python-based scikit-learn do not scale to big data on their own, which is why frameworks like Apache Spark and Apache Hadoop were created. R4ML is one approach to bringing that kind of scalability to R.

R4ML provides various out-of-the-box tools and a pre-processing utility for feature engineering. It also provides utilities for sampling data and for exploratory analysis. This pattern provides an end-to-end example that demonstrates the ease and power of using R4ML for data pre-processing and data exploration.
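To make the terms sampling, pre-processing, and exploratory analysis concrete, here is a minimal sketch in plain Python using only the standard library. It is not R4ML's API and does not run on Spark. The records, column names, and helper functions (`sample_rows`, `impute_mean`, `one_hot`, `summarize`) are hypothetical and exist only to illustrate the kind of steps the pattern walks through at a much larger scale.

```python
import random
import statistics

# Hypothetical toy records standing in for a large dataset.
# None marks a missing value.
ROWS = [
    {"carrier": "AA", "distance": 733.0, "delay": 12.0},
    {"carrier": "UA", "distance": 1846.0, "delay": None},
    {"carrier": "AA", "distance": 337.0, "delay": -3.0},
    {"carrier": "DL", "distance": 946.0, "delay": 45.0},
    {"carrier": "UA", "distance": 2475.0, "delay": 0.0},
    {"carrier": "DL", "distance": 594.0, "delay": None},
]


def sample_rows(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    mean = statistics.fmean(observed)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def summarize(rows, column):
    """Return basic descriptive statistics for a numeric column."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }


# Pre-processing: fill in missing delays, then encode the carrier column.
prepared = one_hot(impute_mean(ROWS, "delay"), "carrier")

# Exploratory analysis: summarize the cleaned numeric column.
delay_summary = summarize(prepared, "delay")

# Sampling: work on a reproducible random subset of the rows.
subset = sample_rows(ROWS, fraction=0.5, seed=42)
```

On a real big data workload these steps would be expressed through R4ML on top of Apache Spark rather than on in-memory Python lists; the sketch only shows the shape of the work.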

When you have completed this code pattern, you will understand how to:

  • Use Jupyter Notebooks to load, visualize, and analyze data.
  • Run Notebooks in IBM Watson Studio.
  • Leverage R4ML to conduct data preparation and exploratory analysis with big data.



  1. Load the provided notebook into IBM Watson Studio.
  2. The notebook interacts with an Apache Spark instance.
  3. A sample big data dataset is loaded into a Jupyter Notebook.
  4. R4ML, running atop Apache Spark, is used to perform data pre-processing and exploratory analysis.


Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.