A ‘fail fast’ solution for end-to-end machine learning
Learn about an end-to-end machine learning pipeline, which can be built using Pythonic frameworks, that allows you to fail fast at TeraScale data levels
Enterprise AI solutions are characterized by an end-to-end workflow that involves data sourcing, querying, ETL, feature engineering, and training machine learning algorithms. Did you know that such an end-to-end machine learning pipeline can be built using Pythonic frameworks, and that it allows you to fail fast at TeraScale data levels?
Big data versus AI conundrum
Training machine learning algorithms with large amounts of relevant data helps reduce overfitting and build more robust models. Consequently, handling big data in a distributed compute environment is an integral part of engineering enterprise AI solutions. Spark has become synonymous with big data processing, so let’s take a look at the traditional Spark-based workflow for AI solutions, as shown in Figure 1.
Figure 1. A typical Spark-based workflow for training machine learning algorithms
Currently, a major drawback of this workflow is that the Spark ecosystem doesn’t support seamless GPU integration. At TeraScale data levels and beyond, GPUs offer huge benefits for accelerating the entire machine learning pipeline, including data processing and feature engineering. Though several efforts are underway to facilitate easy GPU usage within Spark (Apache Arrow, Project Hydrogen, and others), there’s not yet a ready-to-use Spark API that effectively hides the complexity of GPU programming (and the architectural details thereof) from the user. Furthermore, any intermediate CPU operation in the pipeline requires moving data back and forth between CPU and GPU memory, which adds overhead.
However, newer machine learning frameworks (IBM Snap ML, TensorFlow, PyTorch, and others) have been highly successful at executing machine learning algorithms transparently on GPUs. The Pythonic ecosystem around big data (cuDF and Dask-cuDF) is also evolving rapidly to catch up with these machine learning frameworks in utilizing GPUs.
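To give a feel for what a Pythonic GPU DataFrame workflow looks like, here is a minimal sketch of a feature engineering step written against the pandas-style API that cuDF largely mirrors. It is an illustration only, not the pipeline described in this article: the column names and data are made up, and the code assumes cuDF is installed on a machine with a supported GPU, falling back to pandas on the CPU otherwise.

```python
# Assumption: cuDF is available on a GPU machine. If it is not installed,
# fall back to pandas, whose DataFrame API cuDF largely mirrors.
try:
    import cudf as xdf  # GPU DataFrame library
    BACKEND = "cudf"
except ImportError:
    import pandas as xdf  # CPU fallback with the same calls used below
    BACKEND = "pandas"

# Hypothetical toy data standing in for a much larger table.
transactions = xdf.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount": [10.0, 30.0, 5.0, 15.0, 50.0],
})

# Feature engineering: a boolean flag column and a per-customer aggregate.
transactions["is_large"] = transactions["amount"] > 20.0
# sort=True is passed explicitly because the two libraries use
# different defaults for ordering groupby results.
mean_amount = transactions.groupby("customer_id", sort=True)["amount"].mean()


def to_host(series):
    """Return a pandas Series regardless of which backend produced it."""
    return series.to_pandas() if BACKEND == "cudf" else series


result = to_host(mean_amount)
print(BACKEND, result.to_dict())
```

The same script runs unchanged on either backend, which is the point of the sketch: the user writes DataFrame code and does not touch GPU programming details.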
An alternative fail fast solution
Failing fast, that is, getting quick turnaround times for training workloads by adopting a flexible end-to-end machine learning pipeline, is critical to developing enterprise AI applications. Read about why the fail-fast methodology is relevant in AI application development.
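One common way to apply the fail-fast idea to a pipeline, sketched here as an assumption rather than as the approach used in this article, is to run every stage on a small sample first so that errors surface in seconds, and only then run on the full data. All names below (`StageError`, `run_pipeline`, `fail_fast`, and the toy stages) are hypothetical and use only the Python standard library.

```python
import time
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


class StageError(RuntimeError):
    """Raised when a pipeline stage fails; the message names the stage."""


def run_pipeline(data: Any, stages: List[Stage]) -> Any:
    """Run the stages in order, timing each one and stopping at the first error."""
    for name, fn in stages:
        start = time.perf_counter()
        try:
            data = fn(data)
        except Exception as exc:
            raise StageError(f"stage '{name}' failed: {exc}") from exc
        print(f"{name}: {time.perf_counter() - start:.4f}s")
    return data


def fail_fast(data: list, stages: List[Stage], sample_size: int = 100) -> Any:
    """Smoke-test the pipeline on a small sample, then run it on all the data."""
    run_pipeline(data[:sample_size], stages)
    return run_pipeline(data, stages)


# Hypothetical toy stages standing in for ETL, feature engineering, and training.
def etl(rows):
    return [r for r in rows if r is not None]  # drop missing rows


def features(rows):
    return [(x, x * x) for x in rows]  # add a squared feature


def train(rows):
    return sum(y for _, y in rows) / len(rows)  # "model" = mean of the feature


stages = [("etl", etl), ("features", features), ("train", train)]
model = fail_fast([1, 2, None, 3], stages, sample_size=2)
```

If a stage breaks, for example because the sample is empty after cleaning, `StageError` reports which stage failed before any time is spent on the full dataset.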
Figure 2 shows an end-to-end pipeline designed by IBM Systems Lab Services consultants using Pythonic frameworks as an alternative to a Spark-based workflow. It relies on open source tools and the IBM cognitive product line, including IBM Power Systems, IBM Spectrum Scale, and Snap ML, to leverage GPUs throughout the machine learning pipeline.
Figure 2. An end-to-end machine learning pipeline built using open source Pythonic tools and IBM cognitive product line
Similar to Spark, the workflow shown in Figure 2 seamlessly integrates data processing, querying, feature engineering, and machine learning algorithms. Additionally, it allows developers to fail fast by providing the following benefits over a Spark-based workflow:
- Unified GPU integration through an easy-to-use API that hides GPU programming complexity
- Shortened time to development
- Easy setup and maintenance of the popular machine learning frameworks on IBM Power Systems through conda
- GPU-optimized TeraScale machine learning training through IBM Snap ML
- High-bandwidth CPU-GPU data transfers using NVLink
- Interactive mode using Python notebooks
Contact IBM Lab Services today to get more information on how you can achieve your AI goals by designing the workflows appropriate for your enterprise.