As we start a new year in 2018, I am excited to see many innovative enterprises making agile forays into adopting Cognitive, AI, Machine Learning and Deep Learning techniques. Their agile adoption is enabled by open source tools such as TensorFlow, Caffe, Spark, Docker and Kubernetes. But I also see two technical inhibitors:
- A lack of trusted and tested binaries for new open source tools and techniques.
- Ever-growing dataset sizes and increasingly complex modeling techniques.
How fast can data scientists pick up new tools and techniques?
Python and R are the two languages most commonly used by data scientists (see the Kaggle survey). Popular open source frameworks such as TensorFlow, scikit-learn and xgboost are developed collaboratively, with source code available on GitHub and pre-built binary packages distributed from myriad sources: PyPI and Anaconda channels for Python modules, CRAN for R packages, Maven for Scala libraries. The agile practices used in these collaborative communities, such as continuous integration, also lead to rapid iterative releases (a new version every quarter or so is common). The problem arises when several different open source components are put together to create applications for training and deploying models: as newer component versions become available, there are often unforeseen problems in the overall application. This is a common cloud-native agile development problem. Data scientists need pre-built and continuously tested software stacks containing popular open source tools, so that they can focus on new mathematical modeling techniques, data wrangling and exploration. As shown in Figure 1, Data Science is indeed a team sport.
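One common way teams cope with this version drift is to pin and test specific component combinations. As a purely illustrative sketch (these version numbers are hypothetical, not the versions any DSX release ships), a pinned conda environment file might look like:

```yaml
# Hypothetical pinned environment for a reproducible data science stack.
# Version numbers below are illustrative only; a continuously tested
# stack validates such combinations before users ever see them.
name: ds-stack
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy=1.13.*
  - scikit-learn=0.19.*
  - pip:
      - xgboost==0.7
```

Pinning alone is not enough, of course; the value of an offering like DSX is that someone else continuously rebuilds and retests the whole combination as new versions appear.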
Where are the compute and data infrastructures?
Enterprise data scientists are also constrained by a lack of compute resources in the right places. Deep Learning neural networks and Machine Learning techniques require a lot of processing power and memory, particularly during training on larger datasets. For example, it took us almost an hour to complete a k-means clustering unsupervised learning analysis on a 15 GB dataset in a DSX Jupyter notebook container running on an Intel x86 Broadwell system (see Figure 3 below for more details). Newer types of hardware components such as GPUs can accelerate machine and deep learning techniques significantly; however, it is expensive to rent a GPU compute instance in a public cloud. Enterprise security restrictions often mean that datasets cannot be moved into public clouds, and moving large datasets (sized in GBs) over the public internet takes time and costs money. It is often a better idea to bring the compute (and the open source compute frameworks) to the data inside a private cloud.
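To make the workload concrete, here is a minimal k-means sketch in plain NumPy (not the tuned TensorFlow/Spark implementations used in the measurements above). The loop below, repeatedly computing all point-to-centroid distances over the whole in-memory dataset, is exactly the kind of iterative, memory-bandwidth- and compute-hungry kernel that benefits from more cores, faster memory, or GPUs as the dataset grows:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's-algorithm k-means: assign each point to its
    nearest centroid, then recompute centroids, for a fixed number
    of iterations."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Pairwise squared distances, shape (n_points, k) -- this is
        # the step that dominates time and memory traffic on large data.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

# Tiny synthetic example: two well-separated 2-D clusters.
np.random.seed(1)
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 10])
centroids, labels = kmeans(X, k=2)
```

At 15 GB, the distance computation alone touches the full dataset every iteration, which is why memory bandwidth and accelerators matter so much for this class of algorithm.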
IBM Data Science Experience (DSX) Local and PowerAI for the enterprise data scientist
IBM Data Science Experience (DSX) Local and PowerAI are enterprise software offerings from IBM for data scientists, built with open source components. We follow agile iterative development processes to continuously build and test new component versions, creating a stable out-of-the-box experience for data scientists within the secure confines of enterprise private networks. IBM Data Science Experience is built on Docker and Kubernetes and is also available as part of IBM's enterprise private cloud offering, IBM Cloud Private. This blog outlines some instructions for installing IBM Data Science Experience on IBM Cloud Private on POWER Systems.
DSX provides interactive Jupyter and Zeppelin notebooks and an RStudio server interface. IBM PowerAI provides deep learning frameworks such as TensorFlow and Caffe built for IBM POWER Systems with GPUs. DSX Local includes PowerAI, which is activated when DSX Local is installed on POWER systems with GPUs such as the Power Systems S822LC for HPC with GPU acceleration. Enterprise data scientists can easily pick up new open source tools and methods by using pre-built components in DSX (built to exploit GPUs) and notebooks created by other data scientists describing different approaches. Myriad data connectors are also available in DSX to pull data from different enterprise data sources into the main memory of the compute instances, for use with in-memory analytics frameworks such as Spark and scikit-learn.
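The connector pattern is simple: pull a dataset into main memory as a DataFrame, then hand it to an in-memory framework. The sketch below uses a fake in-memory CSV feed as a stand-in for a real connector (the actual DSX connector APIs are product-specific and not shown here):

```python
import io
import pandas as pd

# Stand-in for a DSX data connector: in a real notebook this would be
# a connector to Db2, HDFS, object storage, etc. A small fake CSV feed
# keeps the example self-contained.
csv_source = io.StringIO(
    "customer,region,spend\n"
    "a,east,120.0\n"
    "b,west,80.5\n"
    "c,east,200.0\n"
)

# Step 1: pull the dataset into main memory as a DataFrame.
df = pd.read_csv(csv_source)

# Step 2: run in-memory analytics on it (a simple aggregation here;
# in practice this is where Spark or scikit-learn would take over).
by_region = df.groupby("region")["spend"].sum()
```

Because everything after the pull happens in memory, the size of the compute instance's RAM, not the source system, becomes the limiting factor, which is the point made above about bringing compute to the data.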
IBM POWER systems configurations for DSX Local with PowerAI
IBM POWER systems, with their large number of hardware threads, high memory bandwidth and tightly integrated NVIDIA GPUs, are very well suited for running machine learning or deep learning computations using open source frameworks on large datasets in enterprise private cloud environments. There are several system models (including the IBM Power Systems S822LC for Big Data and the IBM S822LC for HPC, with GPUs and SSDs) available to set up small and large multi-user clusters for running DSX. We ran some performance tests on a few different system types and found that we could run TensorFlow k-means clustering on a 15 GB dataset in a DSX cluster on an IBM S822LC for HPC in a little more than 20 minutes, something which took almost an hour on a comparable Intel x86 Broadwell machine (see Figure 3 for more details).
We also found that four users can concurrently run the same k-means clustering algorithm on a 1 GB dataset on the same system while still leaving 16% of the CPU unutilized (see Figure 4 for more details). Four additional users can also each use a GPU in the IBM S822LC for HPC on the same dataset to complete the analysis almost 12 times faster than using the CPUs in a similar Intel x86 Broadwell system without GPUs (work is in progress for DSX to support GPUs on Intel x86 systems).
Get started with DSX on POWER
It is possible to get started using IBM Data Science Experience in the IBM public cloud or by using the Desktop edition. While the full performance advantages of running on IBM POWER systems cannot be exploited in a public cloud or on a desktop, these options do give a good sense of the productivity benefits of DSX. DSX can be installed on IBM POWER systems in private clouds using virtual machines, LPARs or bare metal servers running RHEL 7.2 or above, in these two configurations (DSX Local uses locally attached storage in the virtual machines or LPARs):
- A 3 or a 9 LPAR/VM PoC configuration on any IBM POWER8 system (see Figure 5).
- A 3 or 9 bare metal configuration using IBM S822LC for Big Data (a.k.a. Briggs) systems with one or more IBM S822LC for HPC (a.k.a. Minsky) systems.
Conclusion and Acknowledgements
This blog describes some of the productivity benefits enterprises can hope to see when using IBM POWER Systems with Data Science Experience and PowerAI. The performance results shown are preliminary; more work is in progress. The author would like to acknowledge the work of Theresa Xu from IBM Systems Performance in creating the performance results, and the slides created by IBM offering managers including Keshav Ranganathan and Alex Jones. The author would also like to thank IBMers Igor Khapov, Yulia Gaponenko, Konstantin Maximov, Ilsiyar Gaynutdinov, Ekaterina Krivtsova, Alanny Lopez, Suchitra Venugopal, Shilpa Kaul, Champakala Shankarappa and Anita Nayak for porting IBM DSX to POWER and integrating it with PowerAI.