Benchmarking linear models of machine learning (ML) frameworks


Machine learning (ML) has been defined as the field of study that gives computers the capability to learn without being explicitly programmed [1]. It provides tools for gaining insights from data. In his blog [2], Jason Brownlee gives an intuitive picture of how ML algorithms can be classified by learning style and by similarity in form or function. For example, by learning style, ML algorithms can be classified into supervised, unsupervised, and semi-supervised learning. Figure 1 shows various machine learning models grouped by their similarity in form or function as defined in [2].

Figure 1. Grouping machine learning models by similarity in form or function

Figure 1 is derived from the information provided in [2].

A wide array of frameworks and libraries has emerged to let people solve machine learning problems without worrying about how the algorithms are implemented. Examples include scikit-learn [15], H2O [16], RAPIDS [17], and IBM® Snap Machine Learning (Snap ML) [11]. Interested readers can refer to the articles in [12], [13], and [14] to learn more about the landscape of available machine learning and deep learning frameworks. These frameworks come in various flavours to address the needs of enterprises and common users. Some of them can scale up and take advantage of the acceleration capabilities of the underlying hardware. In this blog, we compare the following two ML frameworks with acceleration capabilities:

  • Snap ML [11], which is part of IBM WML CE 1.6.2
  • cuML [7] from NVIDIA

You can find more information about Snap ML in the Snap ML paper [10] and in the blog [8] by Sumit Gupta. The Snap ML documentation is available at [9].

Figure 2. Kaggle ML and DS Survey 2019 [3]

Figure 2 (image source: page 18 of Kaggle's State of Data Science and Machine Learning 2019) shows the most commonly used data science methods in 2019, according to a recent Kaggle survey.

Given the popularity of linear regression models, this study evaluates the performance of the linear models that are common to Snap ML and cuML, namely Ridge, Lasso, and Logistic regression, on the Epsilon [4], Higgs [4], Price prediction [5], and Taxi [6] data sets. We attempted to pick data sets that vary in the number of features and in density or sparsity. The parameters of the linear models were set so that the training objective functions in Snap ML and cuML are identical, and therefore result in similar accuracy. To compare the two ML frameworks, we use training time as the performance metric.

The following table captures the characteristics of the data sets considered for this study.

Table 1. Data set characteristics
Data set | Rows/Instances | Columns/Features | Dense/Sparse | Density (%) | Sparsity (%)
Epsilon | 300000 | 2000 | Dense | 100 | 0
Higgs | 8250000 | 28 | Dense | 92.1 | 7.9
Taxi | 1600000 | 606 | Sparse | 3 | 97
Price prediction preprocessed | 175000 | 5074 | Sparse | 0.3 | 99.7
Price prediction | 1185328 | 17052 | Sparse | 0.1 | 99.9
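
The density and sparsity columns in Table 1 follow the usual definition: density is the share of non-zero cells among all rows × columns cells, and sparsity is its complement. The following is a minimal sketch of that calculation. The helper names and the toy matrix are hypothetical and are not taken from the benchmark scripts.

```python
def density_percent(matrix):
    """Return the percentage of non-zero cells in a list-of-rows matrix."""
    total_cells = sum(len(row) for row in matrix)
    if total_cells == 0:
        raise ValueError("matrix has no cells")
    non_zero = sum(1 for row in matrix for value in row if value != 0)
    return 100.0 * non_zero / total_cells


def sparsity_percent(matrix):
    """Sparsity is the complement of density."""
    return 100.0 - density_percent(matrix)


# Toy 3 x 4 matrix with 3 non-zero cells (illustrative data only).
toy = [
    [0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [3.0, 0.0, 0.0, 0.0],
]

print("density  %:", density_percent(toy))
print("sparsity %:", sparsity_percent(toy))
```

For real data sets, the non-zero count would normally come from the sparse matrix container itself rather than from scanning every cell.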

Experimental setup

Table 2 lists the hardware and the software used in the experimental setup.

Table 2. System configuration
IBM Power® System AC922 | Intel® Xeon® Gold 6150
40 cores (two 20c chips), IBM POWER9™ with NVLink 2.0 | 36 cores (two 18c chips)
3.8 GHz, 1 TB memory | 2.70 GHz, 502 GB memory
One Tesla V100 GPU, 16 GB GPU | One Tesla V100 GPU, 16 GB GPU
Red Hat Enterprise Linux (RHEL) 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243 | Ubuntu 18.04.3 with CUDA 10.1.243
nvidia-driver 418.67 | nvidia-driver 418.67
Software: IBM pai4sk: WML CE 1.6.2; pai4sk-1.5.0 (py36_1071.g5abf42e), scikit-learn 0.21.3, NumPy 1.16.5 | Software: cuML 0.10.0, NumPy 1.17.3, scikit-learn 0.21.3

Software setup and instructions

You can find the benchmarking script and the script to download and preprocess the data sets for both platforms at:


Results

Figure 3. Speedup of Snap ML in comparison with cuML

Figure 3 shows the speedup of Snap ML in comparison with cuML for the Epsilon, Higgs, Taxi, and Price Prediction preprocessed data sets, where speedup is the cuML training time divided by the Snap ML training time. Snap ML performs very well on very sparse data sets such as Price Prediction and Taxi for the Ridge, Lasso, and Logistic regression algorithms. Like scikit-learn, Snap ML has inherent support for sparse data sets. Snap ML also performs better than cuML on Epsilon, a dense data set with a large number of features.
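
The speedup metric used in Figure 3 can be sketched as follows. The function name is hypothetical and the timings are placeholder values chosen for illustration, not measurements from this study.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("training times must be positive")
    return baseline_seconds / candidate_seconds


# Placeholder timings (NOT measured values from the benchmark).
cuml_seconds = 12.0
snapml_seconds = 3.0

# A value above 1 means Snap ML trained faster; below 1 means cuML did.
print("speedup:", speedup(cuml_seconds, snapml_seconds))
```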

We also observe that cuML performs better for Ridge and Logistic regression on the Higgs data set, where the number of features is small. Note also that we had to preprocess the Price Prediction data set to obtain a version with fewer features, because cuML fails with an out-of-memory error on the full Price Prediction data set.

cuML was unable to handle cases where the data set does not fit in the GPU memory, or where the runtime artifacts (temporary or additional memory consumed by the training algorithm) exceed the GPU memory. Therefore, in Figure 4, we compare only Snap ML GPU (single-GPU-based training) with scikit-learn (CPU-based training) for the full Price Prediction data set with 17052 features and 1185328 instances.
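
A rough back-of-the-envelope estimate shows why a data set of this shape is a problem for a 16 GB GPU when it is stored densely. The sketch below assumes 4-byte (float32) values, 4-byte CSR indices, and decimal gigabytes; none of these are stated in this study, and the estimate ignores the extra working memory of the training algorithm. The function names are hypothetical.

```python
GPU_MEMORY_BYTES = 16 * 10**9  # 16 GB V100 from Table 2 (decimal GB assumed)


def dense_bytes(rows, cols, value_bytes=4):
    """Memory for a dense array, assuming float32 values."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density_percent, value_bytes=4, index_bytes=4):
    """Approximate memory for CSR: values + column indices + row pointers."""
    nnz = int(rows * cols * density_percent / 100.0)
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


# Shapes and densities taken from Table 1.
datasets = {
    "Price prediction (full)": (1185328, 17052, 0.1),
    "Price prediction preprocessed": (175000, 5074, 0.3),
}

for name, (rows, cols, density) in datasets.items():
    dense = dense_bytes(rows, cols)
    sparse = csr_bytes(rows, cols, density)
    fits = dense <= GPU_MEMORY_BYTES
    print(f"{name}: dense {dense / 1e9:.1f} GB, "
          f"CSR {sparse / 1e9:.2f} GB, dense fits in GPU: {fits}")
```

Under these assumptions, the full data set needs roughly 81 GB as a dense array, far more than the GPU offers, while the preprocessed version needs about 3.6 GB. The CSR representation of either data set is a small fraction of the dense size.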

Figure 4. Snap ML versus scikit-learn training speedup on the full price prediction data set

We observe that Snap ML is up to 80 times faster than scikit-learn. Snap ML provides native support for out-of-core training, which is needed when the data set is too large to fit in the system's GPU memory. cuML, however, fails if the data set or the runtime artifacts exceed the GPU memory.


Conclusion

For data sets that do not fit in the GPU memory, Snap ML is the clear winner. For dense data sets with a small number of features, cuML is the better candidate. For sparse data sets, however, Snap ML again outperforms both cuML and scikit-learn. Although the data sets chosen for this study do not cover the entire spectrum of possibilities, they give some guidance about which ML framework data scientists and ML practitioners should use to optimize their ML pipelines.


Notes

  • The results were obtained by averaging the best 5 of 10 runs.
  • The results can vary with different data sets and model parameters.
  • For cuML, the data was passed in the ndarray format. For Snap ML, the dense data sets were passed in the ndarray format and the sparse data sets in the CSR format.
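
The first note above (averaging the best 5 of 10 runs) can be sketched as a small timing harness. This is an illustration only, not the benchmarking script used for the study. It assumes that "best" means the shortest training times, and `dummy_train` is a hypothetical stand-in for a real model's training call.

```python
import time


def mean_of_best(durations, keep=5):
    """Average the `keep` shortest durations (assumed meaning of 'best')."""
    if len(durations) < keep:
        raise ValueError("not enough runs to keep the requested number")
    best = sorted(durations)[:keep]
    return sum(best) / keep


def benchmark(train_fn, runs=10, keep=5):
    """Time `train_fn` `runs` times and average the `keep` fastest runs."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        train_fn()
        durations.append(time.perf_counter() - start)
    return mean_of_best(durations, keep)


def dummy_train():
    """Hypothetical placeholder for a real training call such as model.fit(X, y)."""
    sum(i * i for i in range(10_000))


print(f"mean of best 5 of 10 runs: {benchmark(dummy_train):.6f} s")
```

Averaging only the fastest runs reduces the influence of one-off slowdowns such as warm-up or background system activity.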


References

[1] Machine Learning Definition

[2] Jason Brownlee’s blog

[3] 2019 Kaggle ML and DS Survey

[4] Epsilon and Higgs data sets

[5] Price Prediction data set

[6] Taxi data set source

[7] cuML

[8] Snap ML blog

[9] Snap ML documentation

[10] Snap ML paper

[11] Snap ML WML CE 1.6.2 Knowledge Centre

[12] Machine learning frameworks

[13] Deep learning frameworks

[14] List of machine learning and deep learning frameworks

[15] Scikit-learn

[16] H2O