IBM Chief Data Scientist Romeo Kienzler will demonstrate live on September 26 how to use the new DataFrames-based SparkML pipelines (with data from a recent Kaggle competition on production line performance) to code a machine learning workflow from scratch. Romeo will start by showing you how to ingest the Kaggle data, then perform the ETL (extract, transform, load) process using the Apache Parquet format and OpenStack Swift to store the data in ObjectStore.
He will demonstrate how to create the Spark ML pipeline using common pre-processing techniques such as one-hot encoding and string indexing.
Finally, Romeo will feed the data into an algorithm called RandomForest and illustrate how to evaluate the results.
After the session, you will come away with a template you can use for your own data science projects. The event will be conducted using the IBM Data Science Experience, so you can sign up and immediately replicate the example.
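The two pre-processing techniques mentioned above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Spark API: Spark's StringIndexer and OneHotEncoder perform these steps at cluster scale, and the sample station labels below are made up for the example.

```python
from collections import Counter

def string_index(values):
    # String indexing: map each distinct string category to an integer,
    # most frequent category first (mirroring how Spark's StringIndexer
    # assigns indices by frequency)
    ordered = [v for v, _ in Counter(values).most_common()]
    mapping = {v: i for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

def one_hot(index, size):
    # One-hot encoding: turn an integer index into a binary vector
    # with a single 1 at that position
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical categorical column, e.g. production line stations
stations = ["L0", "L1", "L0", "L2", "L0"]
indexed, mapping = string_index(stations)          # [0, 1, 0, 2, 0]
vectors = [one_hot(i, len(mapping)) for i in indexed]
```

One-hot encoding matters because most learning algorithms expect numeric inputs, and feeding them raw integer indices would impose a spurious ordering on the categories.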
The Apache Spark open-source cluster-computing framework sports Spark ML, a package introduced in Spark 1.2 which provides a uniform set of high-level APIs that help developers create and tune practical machine learning pipelines. Spark ML represents a common machine learning workflow as a pipeline, a sequence of stages in which each stage is either a transformer or an estimator.
A simple text document processing workflow typically proceeds through stages like these, in order:
- Split the document text into words
- Convert the words into a numerical feature vector
- Develop a prediction model using the feature vectors and labels
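The three stages above can be sketched in plain Python. This is not the Spark API (in Spark ML these roles are played by transformers such as Tokenizer and HashingTF and by an estimator such as a classifier), and the hashing and "model" below are deliberately toy versions, but the shape of the workflow is the same.

```python
def tokenize(doc):
    # Stage 1: split the document text into words
    return doc.lower().split()

def hash_features(words, num_features=16):
    # Stage 2: convert words into a fixed-length numerical feature
    # vector (toy deterministic hash: sum of character codes, mod size)
    vec = [0] * num_features
    for w in words:
        vec[sum(map(ord, w)) % num_features] += 1
    return vec

def fit(rows, labels):
    # Stage 3: develop a toy prediction "model" from feature vectors
    # and labels: remember the average feature vector per label and
    # predict the label whose average is closest
    centroids = {}
    for y in set(labels):
        group = [v for v, lbl in zip(rows, labels) if lbl == y]
        centroids[y] = [sum(col) / len(group) for col in zip(*group)]

    def predict(vec):
        return min(centroids, key=lambda y: sum(
            (a - b) ** 2 for a, b in zip(centroids[y], vec)))
    return predict

# Made-up two-document corpus for illustration
docs, labels = ["Spark is great", "bad slow code"], [1, 0]
rows = [hash_features(tokenize(d)) for d in docs]
predict = fit(rows, labels)
```

Note that fit() returns a function that makes predictions, which foreshadows the estimator/transformer distinction described next: training produces a new object that transforms data.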
A transformer is an abstraction that includes feature transformers and learned models; it implements a transform() method, which converts one schema RDD (resilient distributed dataset) into another. An estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data, by implementing a fit() method that accepts a schema RDD and produces a transformer.
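The transformer/estimator contract can be made concrete with a minimal sketch in plain Python. Spark's real classes operate on schema RDDs (and, in later releases, DataFrames); the mean-centering example below is invented purely to show the shape of the API: fit() trains on data and returns a fitted model, which is itself a transformer.

```python
class Transformer:
    # A transformer converts one dataset into another via transform()
    def transform(self, dataset):
        raise NotImplementedError

class Estimator:
    # An estimator trains on a dataset via fit() and produces a Transformer
    def fit(self, dataset):
        raise NotImplementedError

class MeanScaler(Estimator):
    # Illustrative estimator: learns the mean of a numeric column
    def fit(self, dataset):
        mean = sum(dataset) / len(dataset)
        return MeanCenterer(mean)  # the learned model is a Transformer

class MeanCenterer(Transformer):
    # Illustrative learned model: subtracts the learned mean
    def __init__(self, mean):
        self.mean = mean

    def transform(self, dataset):
        return [x - self.mean for x in dataset]

model = MeanScaler().fit([1.0, 2.0, 3.0])
centered = model.transform([1.0, 2.0, 3.0])  # [-1.0, 0.0, 1.0]
```

A pipeline, then, is just a sequence of such stages: each estimator is fit in turn, and the resulting transformers are chained to process new data.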
The Spark DataFrames API, an extension of the RDD API inspired by data frames in R and Python, was designed to support modern big data and data science applications. A DataFrame is simply a distributed collection of data organized into named columns; it can be constructed from a wide array of sources, such as structured data files, Hive tables, external databases, or existing RDDs.
Romeo Kienzler is a Senior Data Scientist and Deep Learning and AI Engineer for IBM Watson IoT and an IBM Certified Senior Architect who spends much of his waking life helping global clients solve their data analysis challenges. Romeo holds an MSc (ETH) in Computer Science with specialization in information systems, bioinformatics, and applied statistics from the Swiss Federal Institute of Technology. He is an Associate Professor of artificial intelligence, and his current research focus is on cloud-scale machine learning and deep learning using open-source technologies including R, Apache Spark, Apache SystemML, Apache Flink, DeepLearning4J, and TensorFlow.