Accelerate Generalized Linear Model training with Watson Machine Learning Accelerator and Snap ML
Drive online advertising click-through prediction with Watson Machine Learning Accelerator, SnapML, and AC922
IBM Watson™ Machine Learning Accelerator is a software solution that bundles IBM PowerAI, IBM Spectrum Conductor®, IBM Spectrum Conductor Deep Learning Impact, and support from IBM for the whole stack, including the open source deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end deep learning platform for data scientists. This includes complete lifecycle management: installation and configuration; data ingest and preparation; building, optimizing, and distributing the training model; and moving the model into production. Watson Machine Learning Accelerator is most valuable when you expand your deep learning environment to include multiple compute nodes. A free evaluation is available. See the prerequisites in our introductory tutorial: Classify images with Watson Machine Learning Accelerator.
IBM has developed an efficient, scalable machine learning library that enables fast training of various machine learning models. Using this library, clients can remove training time as the bottleneck for machine learning workloads, paving the way to a range of new applications. The Snap Machine Learning (Snap ML) library combines recent advances in machine learning systems and algorithms and uses GPUs to accelerate generalized linear models. This was made possible by innovations at the algorithmic level, and also by NVLink 2.0, the high-speed interconnect between GPUs and POWER9™ CPUs.
The importance of this state-of-the-art library is amplified by the fact that logistic regression, decision trees, and random forests are the three machine learning models that data scientists use most at work (2017 Kaggle Data Science Survey), and all are supported by Snap ML today.
Snap ML (PowerAI 1.6.0) currently supports the following models.
Generalized linear models:
- Logistic regression
- Linear regression
- Ridge regression
- Lasso regression
- Support vector machines (SVMs)
Tree-based models:
- Decision trees
- Random forest
Unique value proposition
Three main features distinguish Snap ML:
Distributed training — IBM has built the system as a data-parallel framework, enabling clients to scale out and train on massive data sets that exceed the memory capacity of a single machine, which is crucial for large-scale applications.
GPU acceleration — IBM has implemented specialized solvers designed to leverage the massively parallel architecture of GPUs while respecting the data locality in GPU memory to avoid large data transfer overhead. To make this approach scalable, IBM takes advantage of recent developments in heterogeneous learning to achieve GPU acceleration even if only a small fraction of the data can be stored in the accelerator memory.
Sparse data structures — Many machine learning data sets are sparse. Snap ML employs new optimizations for the algorithms when applied to sparse data structures.
All of this results in significantly reduced training times and the ability to handle terabyte-scale data sets.
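To make the data-parallel and sparse-data ideas concrete, here is a minimal sketch in plain Python. It is not Snap ML's solver or API. It assumes each example is stored as a {feature_index: value} dictionary holding only its nonzero features, splits the examples across a hypothetical number of workers, computes a logistic-regression gradient on each partition, and sums the partial results. The function names (sparse_dot, partial_gradient, data_parallel_gradient) and the toy data are illustrative only.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sparse_dot(row, weights):
    # Only the nonzero features of the row are visited.
    return sum(value * weights.get(index, 0.0) for index, value in row.items())


def partial_gradient(partition, weights):
    """Logistic-loss gradient over one worker's partition (labels in {0, 1})."""
    grad = defaultdict(float)
    for row, label in partition:
        error = sigmoid(sparse_dot(row, weights)) - label
        for index, value in row.items():
            grad[index] += error * value
    return grad


def data_parallel_gradient(examples, weights, num_workers):
    """Split examples across workers, then sum the per-worker gradients."""
    partitions = [examples[i::num_workers] for i in range(num_workers)]
    total = defaultdict(float)
    # In a real cluster each partition would be processed concurrently;
    # this sketch processes them one after another.
    for partition in partitions:
        for index, g in partial_gradient(partition, weights).items():
            total[index] += g
    return dict(total)


# Toy click data: (sparse features, clicked?)
examples = [
    ({1: 1.0, 7: 1.0}, 1),
    ({2: 1.0, 7: 1.0}, 0),
    ({1: 1.0, 3: 1.0}, 1),
    ({4: 1.0}, 0),
]
weights = {}  # all-zero model

single = data_parallel_gradient(examples, weights, num_workers=1)
split = data_parallel_gradient(examples, weights, num_workers=2)
```

Because the gradient is a sum over examples, the two-worker result matches the single-worker result. That property is what lets a data-parallel framework scale out without changing the model being trained.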
This is the third tutorial of the IBM Watson Machine Learning Accelerator education series. In it, we train a logistic regression classifier to predict clicks on advertisements using a 20-GB data set of online advertising click-through data, containing 45 million training examples and 1 million features. We show you how to accelerate logistic regression model training with the Snap ML library, and compare the performance with open source Spark ML. The tutorial consists of three parts:
Part 1 — Prepare the Criteo Kaggle data set
- Downloading and extracting the data set
- Creation of a train/test split using scikit-learn
Part 2 — Installation and configuration
- Creation of two Spark instance groups
- Installation and configuration of two Livy services on Watson Machine Learning Accelerator
Part 3 — Running logistic regression model
- Customization of a notebook package to include sparkmagic
- Connecting to a Watson Machine Learning Accelerator cluster from a notebook
- Training a logistic regression model to predict customer click-through rate with Spark ML and with IBM Watson Machine Learning Accelerator Snap ML
The end-to-end tutorial takes about two hours and includes about 30 minutes of model training, plus installation and configuration as well as driving the model through the GUI.
The tutorial requires access to a GPU-accelerated IBM Power Systems server model AC922 or S822LC. In addition to acquiring a server, there are multiple options to access Power Systems servers listed on the PowerAI Developer Portal.
Part 1: Prepare the Criteo Kaggle data set
Download the Criteo Kaggle competition data.
Extract the contents.
tar xzf criteo.kaggle2014.svm.tar.gz
Execute the following Python script to create the training/test files.
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.model_selection import train_test_split

# Load the Criteo training file as a sparse feature matrix and a label vector.
X, y = load_svmlight_file("criteo.kaggle2014.train.svm")

# Hold out 25% of the examples for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Write both splits in libsvm format with 1-based feature indices.
dump_svmlight_file(X_train, y_train, 'criteo.kaggle2014-train.libsvm', zero_based=False)
dump_svmlight_file(X_test, y_test, 'criteo.kaggle2014-test.libsvm', zero_based=False)
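As a result, two data sets are generated: criteo.kaggle2014-train.libsvm (75% of the examples) and criteo.kaggle2014-test.libsvm (the remaining 25%).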
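Both files use the svmlight/libsvm text format: one example per line, with a label followed by index:value pairs for the nonzero features only. Because the script passes zero_based=False, feature indices start at 1. The sketch below parses one such line in plain Python to show the layout. The helper name parse_libsvm_line and the sample line are made up for illustration, and the sketch assumes no qid field is present. Real files should be read with load_svmlight_file.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    # Drop an optional trailing comment, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Hypothetical example line: a clicked ad with three active features.
label, features = parse_libsvm_line("1 3:1 17:1 982341:1")
print(label, features)
```

Only three of the roughly 1 million possible features are written for this example, which is why the sparse format keeps the files manageable.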
Part 2: Installation and configuration
We will create two separate Spark instance groups:
- Livy-integration-notebook for GPU workload
- Livy-integration-notebook-CPU for CPU workload
Creation of Spark Instance Group #1 – Livy-integration-notebook
Enter the required fields and click Configuration.
Modify the following Spark properties.
Set spark.default.parallelism to the total number of available GPUs.
Select Additional Parameters and add the following parameters.
Configure resource groups and plans
Creation of Spark Instance Group #2 – Livy-integration-notebook-cpu
Enter the required fields and click Configuration.
Modify the following Spark properties. Set spark.default.parallelism to the total number of available CPU cores.
Select Additional Parameters and add the following parameters.
Install and configure two Livy services, SnapML-Livy and SnapML-Livy-CPU, on Watson Machine Learning Accelerator.
As a result, you should have two Livy instances up and running. Their output values (available on the Overview tab in the cluster management console) show the endpoint location as livy_URL.
Part 3: Training the logistic regression model
Download the sample notebook and load it into your favorite notebook environment. To access the Spark instance group by using the Apache Livy endpoint, you must load the client library, create a Livy session, and use it for the Spark job submission. The sparkmagic command helps to automate the process.
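For orientation, the sketch below shows roughly what sparkmagic automates, using the Apache Livy REST endpoints for creating a session, submitting a statement, and deleting the session. It only builds the HTTP requests and does not send them. The endpoint URL, the session ID, and the helper names are hypothetical, and authentication (which a Watson Machine Learning Accelerator cluster will likely require) is omitted.

```python
import json
from urllib.request import Request

JSON_HEADERS = {"Content-Type": "application/json"}


def create_session_request(livy_url, kind="pyspark"):
    """Build the request that asks Livy to start a new interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(livy_url.rstrip("/") + "/sessions",
                   data=body, headers=JSON_HEADERS, method="POST")


def statement_request(livy_url, session_id, code):
    """Build the request that submits a code snippet to an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    url = "{}/sessions/{}/statements".format(livy_url.rstrip("/"), session_id)
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


def delete_session_request(livy_url, session_id):
    """Build the request that releases the session and its resources."""
    url = "{}/sessions/{}".format(livy_url.rstrip("/"), session_id)
    return Request(url, method="DELETE")


LIVY_URL = "http://livy.example.com:8998"  # hypothetical endpoint

create_req = create_session_request(LIVY_URL)
run_req = statement_request(LIVY_URL, 0, "print(spark.version)")
delete_req = delete_session_request(LIVY_URL, 0)
```

Passing one of these requests to urllib.request.urlopen would send it. In the notebook, the %spark magics handle these calls, plus polling for session state, on your behalf.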
Train the logistic regression model to predict the customer click-through rate and distribute training across multiple CPU cores with Spark ML:
a. Load the Sparkmagic extension:
%load_ext sparkmagic.magics
b. Create a Livy CPU session by using the livy_URL value from the application instance:
%spark add -s cpu_session -l python -u <livy_URL> -a u -k config
c. After the Livy CPU session is created, you can launch the logistic regression model training to predict the customer click-through rate with Spark ML, running distributed across 33 CPU cores. This model has 1 million features. It trains on the 20-GB Criteo Kaggle 2014 training data set and runs inference on the 6-GB Criteo Kaggle 2014 test data set. The execution completes in 202.96 seconds.
d. Finally, the Livy session must be cleaned up to release the associated resource:
%spark delete -s cpu_session
Train a logistic regression model to predict the customer click-through rate and distribute training across GPUs with IBM Watson Machine Learning Accelerator Snap ML:
a. Create a Livy GPU session by using the livy_URL value from the application instance.
b. After the Livy GPU session is created, you can launch the logistic regression model training with Snap ML, running distributed across eight GPUs. Execution completes in 18 seconds.
The Snap ML library offers GPU acceleration and distributed computing capabilities that accelerate machine learning model training and enable handling large data sets. In our tutorial, we trained a logistic regression classifier to predict clicks on online advertisements using a 20-GB data set that consists of online advertising click-through data, containing 45 million training examples and 1 million features. This is a highly relevant application for companies serving ads on their websites and online bidding companies, responsible for billion-dollar revenues in today’s connected society.
Snap ML speeds up this training workload more than tenfold, cutting the execution time from 202.96 seconds (using Spark ML running on 33 CPU cores) to 18 seconds (using Snap ML running on eight GPUs). This substantially improves productivity and might even enable use cases such as online retraining of machine learning models to adapt to rapidly changing situations or business requirements.
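As a quick check of the arithmetic, the snippet below derives the speedup and the share of training time saved from the two timings reported in this tutorial. The variable names are illustrative.

```python
spark_ml_seconds = 202.96  # Spark ML on CPU cores, as reported in Part 3
snap_ml_seconds = 18.0     # Snap ML on GPUs, as reported in Part 3

speedup = spark_ml_seconds / snap_ml_seconds
time_saved = 1.0 - snap_ml_seconds / spark_ml_seconds

print(f"speedup: {speedup:.1f}x, training time reduced by {time_saved:.0%}")
# prints "speedup: 11.3x, training time reduced by 91%"
```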
Want to know more? Take a look at this video to learn what Snap ML technology is and how you can benefit from it.