Data Scientist Romeo Kienzler shows you how to code a machine learning workflow from scratch using the new DataFrames-based Spark ML pipelines with data from a recent Kaggle competition.

In this video:

Please find the notebooks used in this tutorial here

IBM Chief Data Scientist Romeo Kienzler demonstrates how to use the new DataFrames-based Spark ML pipelines (with data from a recent Kaggle competition on production line performance) to code a machine learning workflow from scratch. Romeo starts by showing you how to ingest the Kaggle data, then performs the ETL (extract, transform, load) process, using the Apache Parquet format and OpenStack Swift to store the data in Object Storage.

He uses common pre-processing techniques such as one-hot encoding and string indexing to demonstrate how to create the Spark ML pipeline. Finally, Romeo feeds the data into a random forest algorithm and illustrates how to evaluate the results.
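The two pre-processing steps can be illustrated in plain Python with no Spark dependency: string indexing maps each category to an integer (most frequent first, as Spark ML's StringIndexer does), and one-hot encoding turns that integer into a 0/1 vector. The helper names and toy data here are invented for illustration; in the webcast the same steps use Spark ML's StringIndexer and OneHotEncoder stages.

```python
def string_index(values):
    # Assign indices by descending frequency, ties broken alphabetically,
    # mirroring StringIndexer's default frequency ordering.
    order = sorted(set(values), key=lambda v: (-values.count(v), v))
    mapping = {v: i for i, v in enumerate(order)}
    return [mapping[v] for v in values], mapping


def one_hot(index, size):
    # A dense 0/1 vector with a single 1 at the category's index.
    return [1 if i == index else 0 for i in range(size)]


indices, mapping = string_index(["red", "blue", "red", "green"])
vectors = [one_hot(i, len(mapping)) for i in indices]
# indices == [0, 1, 0, 2]
```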

After the session, you will come away with a template you can use for your own data science projects. The session is run on the IBM Data Science Experience, so you can sign up and immediately replicate the example.

The Apache Spark open-source cluster-computing framework sports Spark ML, a package introduced in Spark 1.2 which provides a uniform set of high-level APIs that help developers create and tune practical machine learning pipelines. Spark ML represents a common machine learning workflow as a pipeline, a sequence of stages in which each stage is either a transformer or an estimator.

A simple text document processing workflow typically proceeds through stages like these, in order:

  1. Split the document text into words
  2. Convert the words into a numerical feature vector
  3. Develop a prediction model using the feature vectors and labels

A transformer is an abstraction which includes feature transformers and learned models; it implements a transform() method which converts one DataFrame into another. An estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data, by implementing a fit() method that accepts a DataFrame and produces a transformer.
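The transformer/estimator contract can be shown in a few lines of plain Python, with no Spark required: the estimator's fit() learns from data and returns a transformer, whose transform() maps one dataset to another. The class names here are illustrative, not part of any Spark API.

```python
class MeanCenterer:
    """The learned model: a transformer that subtracts a fixed mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Map one dataset into another, as transform() does in Spark ML.
        return [v - self.mean for v in values]


class MeanCentererEstimator:
    """The estimator: fit() trains on data and produces a transformer."""

    def fit(self, values):
        return MeanCenterer(sum(values) / len(values))


data = [1.0, 2.0, 3.0]
centered = MeanCentererEstimator().fit(data).transform(data)
# centered == [-1.0, 0.0, 1.0]
```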

Here is more information on how a Spark ML pipeline works.

The Spark DataFrames API, an extension to the RDD API and inspired by data frames in R and Python, was designed to support modern big data and data science applications. It is simply a distributed collection of data organized into named columns that can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Romeo Kienzler is a Senior Data Scientist and Deep Learning and AI Engineer for IBM Watson IoT and an IBM Certified Senior Architect who spends much of his waking life helping global clients solve their data analysis challenges. Romeo holds an MSc (ETH) in Computer Science with specialization in information systems, bioinformatics, and applied statistics from the Swiss Federal Institute of Technology. He is an Associate Professor of artificial intelligence, and his current research focus is on cloud-scale machine learning and deep learning using open source technologies including R, Apache Spark, Apache SystemML, Apache Flink, DeepLearning4J, and TensorFlow.

Resources for you

Perform a machine learning exercise

developerWorks Live Webcasts
Live coding demos, webinars, and “ask me anythings”.

Subscribe by email | Subscribe on YouTube

Apache Spark
Create algorithms to harness insight from complex data.
