Machine learning (ML) has been defined as the field of study that gives computers the capability to learn without being explicitly programmed. Machine learning provides us with tools to gain insights from data. In his blog, Jason Brownlee gives an intuitive picture of how to classify ML algorithms according to their learning style and their similarity in form or function. For example, by learning style, ML algorithms can be classified into supervised, unsupervised, and semi-supervised learning. Figure 1 shows various machine learning models grouped by their similarity in form and function.
Figure 1. Grouping machine learning models by similarity in form or function
Figure 1 is derived from the information provided in Jason Brownlee's blog.
A wide array of frameworks and libraries has emerged to enable people to solve machine learning problems without needing to worry about the implementations of the underlying algorithms. To mention a few of these frameworks: scikit-learn, H2O, RAPIDS, and IBM Snap Machine Learning (Snap ML). Interested readers can refer to published surveys to learn more about the landscape of available machine learning and deep learning frameworks. These frameworks come in various flavors to address the needs of enterprises and common users. Some frameworks allow for scaling up and can take advantage of the acceleration capabilities of the underlying hardware. In this blog, we compare two ML frameworks with acceleration capabilities: IBM Snap ML and NVIDIA RAPIDS cuML.
Figure 2. Kaggle ML and DS Survey 2019
Figure 2 (image source: page 18 of Kaggle's State of Data Science and Machine Learning 2019 report) shows the most commonly used data science methods in 2019, from a recent Kaggle survey.
Given the popularity of linear regression models, in this study we evaluate the performance of the linear models that are common to both Snap ML and cuML, namely Ridge, Lasso, and Logistic regression, on the Epsilon, Higgs, Price Prediction, and Taxi data sets. We have attempted to pick a collection of data sets with varying numbers of features and density/sparsity levels. The parameters of the linear models were set so that the training objective functions in Snap ML and cuML are identical, thus resulting in similar accuracy. To compare the two ML frameworks, we use training time as the performance metric.
The following table captures the characteristics of the data sets considered for this study.
Table 1. Data set characteristics
| Data set | Examples | Features | Type | Density (%) | Sparsity (%) |
| --- | --- | --- | --- | --- | --- |
| Price prediction preprocessed | 175000 | 5074 | Sparse | 0.3 | 99.7 |
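The density and sparsity columns in Table 1 follow directly from the number of non-zero entries in the data matrix. A minimal sketch (`density_sparsity` is a hypothetical helper, not part of any framework):

```python
def density_sparsity(n_nonzeros, n_rows, n_cols):
    """Return (density %, sparsity %) of a matrix, as reported in Table 1."""
    density = 100.0 * n_nonzeros / (n_rows * n_cols)
    return density, 100.0 - density

# Preprocessed Price Prediction: 175000 examples x 5074 features,
# with roughly 0.3% of entries non-zero.
d, s = density_sparsity(2663850, 175000, 5074)
print(round(d, 1), round(s, 1))  # 0.3 99.7
```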
Table 2 lists the hardware and the software used in the experimental setup.
Table 2. System configuration
| IBM Power System AC922 | Intel® Xeon® Gold 6150 |
| --- | --- |
| 40 cores (two 20-core chips), IBM POWER9™ with NVLink 2.0 | 36 cores (two 18-core chips) |
| 3.8 GHz, 1 TB memory | 2.70 GHz, 502 GB memory |
| One Tesla V100 GPU, 16 GB GPU memory | One Tesla V100 GPU, 16 GB GPU memory |
| Red Hat Enterprise Linux (RHEL) 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243 | Ubuntu 18.04.3 with CUDA 10.1.243 |
| nvidia-driver 418.67 | nvidia-driver 418.67 |
| Software: IBM pai4sk (WML CE 1.6.2; pai4sk 1.5.0, py36_1071.g5abf42e), scikit-learn 0.21.3, NumPy 1.16.5 | Software: cuML 0.10.0, NumPy 1.17.3, scikit-learn 0.21.3 |
Software setup and instructions
You can find the benchmarking script and the script to download and preprocess the data sets for both platforms at:
Figure 3. Speedup of Snap ML in comparison with cuML
Figure 3 shows the speedup of Snap ML in comparison with cuML for the Epsilon, Higgs, Taxi, and Price Prediction preprocessed data sets, where speedup = cuML training time divided by Snap ML training time. In Figure 3, we can observe that Snap ML performs very well on very sparse data sets such as Price Prediction and Taxi for the Ridge, Lasso, and Logistic regression algorithms. Snap ML has inherent support for sparse data sets, similar to scikit-learn. Snap ML also performs better than cuML on a dense data set such as Epsilon, which has a large number of features.
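The speedup metric used throughout Figure 3 can be written down directly. The timing numbers below are hypothetical, for illustration only:

```python
def speedup(cuml_seconds, snapml_seconds):
    """Speedup of Snap ML over cuML, as defined in this study:
    cuML training time divided by Snap ML training time."""
    return cuml_seconds / snapml_seconds

# Hypothetical example: if cuML trained a model in 12.0 s and Snap ML
# trained the same model in 3.0 s, Snap ML shows a 4x speedup.
print(speedup(12.0, 3.0))  # 4.0
```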
We also observe that cuML performs better for Ridge and Logistic regression on the Higgs data set, where the number of features is small. Note also that we had to preprocess the Price Prediction data set to obtain a version with a smaller number of features, because cuML fails with an out-of-memory error on the full Price Prediction data set.
cuML was unable to handle cases where the data set does not fit in the GPU memory or the runtime artifacts (temporary or additional memory consumed by the training algorithm) exceed the GPU memory. In Figure 4, we compare only Snap ML GPU (single-GPU-based training) with scikit-learn (CPU-based training) on the full Price Prediction data set with 17052 features and 1185328 instances.
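To see why the full data set is problematic for a purely GPU-resident framework, consider a rough back-of-the-envelope estimate. This is a sketch: `dense_footprint_gib` is a hypothetical helper that assumes a dense float32 copy of the data and ignores sparse storage and runtime artifacts:

```python
def dense_footprint_gib(n_rows, n_cols, bytes_per_value=4):
    """Approximate memory (GiB) needed for a dense float32 copy of a data set."""
    return n_rows * n_cols * bytes_per_value / 2**30

# Full Price Prediction data set: 1185328 instances x 17052 features.
# Densified as an ndarray, it would need roughly 75 GiB -- far more
# than the 16 GB of memory on a single Tesla V100.
print(dense_footprint_gib(1185328, 17052) > 16)  # True
```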
Figure 4. Snap ML versus scikit-learn training speedup on the full price prediction data set
We observe that Snap ML is up to 80 times faster than scikit-learn. Snap ML provides native support for out-of-core training, which is needed when the data set is too large to fit in the system's GPU memory. cuML, however, fails if the data set or runtime artifacts exceed the GPU memory.
For data sets that do not fit in the GPU memory, Snap ML is the clear winner. For dense data sets with a small number of features, cuML is the better candidate. For sparse data sets, however, Snap ML again outperforms both cuML and scikit-learn. Although the data sets chosen for this study do not cover the entire spectrum of possibilities, they give useful guidance on which ML framework data scientists and ML practitioners should use to optimize their ML pipelines.
- The results were obtained by averaging the best 5 of 10 runs.
- The results can vary with different data sets and model parameters.
- For cuML, the data was passed in the ndarray format. For Snap ML, the dense data sets were passed in the ndarray format and the sparse data sets in the CSR format.
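The measurement protocol in the first note above can be sketched as follows. This is a minimal illustration; `timed_training` and the toy workload are hypothetical stand-ins, not the actual benchmarking script:

```python
import time
import statistics

def timed_training(fit_fn, X, y, runs=10, keep_best=5):
    """Average the `keep_best` fastest of `runs` training runs.

    `fit_fn` stands in for any framework's training call, e.g. a wrapper
    around model.fit(X, y) for Snap ML, cuML, or scikit-learn.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fit_fn(X, y)
        timings.append(time.perf_counter() - start)
    return statistics.mean(sorted(timings)[:keep_best])

# Toy stand-in for a training routine, just to show the protocol:
avg_seconds = timed_training(lambda X, y: sum(x * x for x in X),
                             list(range(10000)), None)
print(avg_seconds >= 0.0)  # True
```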
References:
- Snap ML paper
- H2O
- RAPIDS (NVIDIA)