Moving from data to insights to production is complicated, or it used to be. There is a new platform for hosting your trained machine learning models out in the cloud as a service, one that IBM relies on for its own Watson services: Watson Machine Learning (WML). Watson ML reduces the time and expertise needed to share insights from your models as API endpoints, while providing metrics for monitoring them.
The new code pattern, Create and deploy a scoring model to predict heartrate failure, demonstrates how to build an end-to-end application that utilizes a predictive machine learning model deployed into production. As you work through this example, you’ll wear a few different hats to accomplish this goal. First, as a data engineer using Python, Jupyter Notebooks, and Cloud Object Storage, you’ll learn how to import, explore, and clean data. Next, as a data scientist using Apache Spark, you’ll learn how to select features, build a training pipeline, and evaluate a machine learning model. You’ll also work as a developer, deploying and consuming this predictive model as an API endpoint.
This code pattern demystifies the steps and tools necessary to move insights into production. Let’s take a quick look at the different roles:
- Data engineer
- Data scientist
- Developer
The data engineer
You’re tasked with understanding, acquiring, and preparing data. During exploration, you discover initial insights, uncover subsets of the data, and form hypotheses about hidden patterns. You transform disparate data sources into a single format that modeling tools can use, selecting attributes, removing invalid data, and ensuring consistency across sources.
Importing and exploring data
The Data Science Experience, IBM’s Cloud IDE for data science, simplifies many of the tasks faced by data wranglers today. Built on top of open source tools such as Jupyter, it offers a familiar platform with some nice extras, like connectors for a variety of data sources. Within a notebook, it only takes a single click to import data from Cloud Object Storage into an Apache Spark DataFrame. The code is written for you: the connection is created and a variable is set for use throughout the notebook.
The data scientist
With a dataset in hand, the data scientist focuses on modeling and evaluating the model against the problem domain. Here data is further prepared and normalized for input into a machine learning algorithm for training. This additional preparation requires extensive understanding of machine learning algorithms and tools to ensure the input produces the best possible outcome. Extracting insights from data by creating a model is only one piece; the model needs to be continuously evaluated for accuracy as new data emerges, and validated against assumptions.
Building an ML pipeline
Pipelines have recently accelerated the discipline and adoption of machine learning by bundling the task of transforming raw input into a format a trained model understands. A pipeline is just what it sounds like: a series of tasks or transformations applied to data before it is piped into a trained model. A pipeline can take a set of features that are categorized and labeled with strings and convert them to integers. It can vectorize several features into a single feature, and perform an integer-to-string transformation. Because developers can create custom transformers to perform any number of actions, the possibilities are limitless. As data flows through a pipeline it undergoes several transformations, arriving at a format that can be passed to the machine learning model, itself another step within the pipeline.
Before pipelines, it was extremely difficult to quickly deploy models for consumption: each model required the app or the caller to perform all the transformations on the input data before calling the model. When you train a model with a pipeline, the transformations are baked into the model. There’s no need to perform any data transformations before or after calling the model.
After a model is selected and it’s determined that it achieves the defined business objectives, it’s usually thrown over the wall to a development team. Developers are asked to build the infrastructure to support the model, along with any applications an end user would use to gain insights from it. As you can imagine, this is a very diverse technology stack and requires a wide range of expertise.
Persisting and deploying the model as a service
The Watson Machine Learning service simplifies many of these infrastructure demands. It provides a hosting environment for trained models while exposing the API endpoints, so you can focus more on the user experience. Watson ML is accessible through API calls and a web-based dashboard for saving, deploying, and monitoring models.
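Once a model is deployed, consuming it is an ordinary authenticated HTTP call. The sketch below shows the general shape: the endpoint URL, token, and field names are placeholders, and the payload layout (a list of field names plus rows of values) is an assumption about the scoring schema; the WML dashboard shows the real endpoint and expected payload for your deployment.

```python
# Sketch of calling a deployed model's scoring endpoint over REST.
# URL, token, fields, and payload shape below are hypothetical placeholders.
import json


def build_payload(fields, rows):
    """Shape the scoring input as a list of field names plus value rows."""
    return {"fields": fields, "values": rows}


def score(endpoint_url, token, payload):
    """POST the payload to the deployed model and return its JSON response."""
    import requests  # third-party: pip install requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + token,
    }
    resp = requests.post(endpoint_url, headers=headers, data=json.dumps(payload))
    resp.raise_for_status()
    return resp.json()


payload = build_payload(["AGE", "BP", "SMOKER"], [[60, 110, "Y"]])
# result = score("https://<your-wml-endpoint>/online", "<token>", payload)
```

Note that the pipeline from earlier means this payload can carry raw feature values; the deployed model handles its own transformations.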
Put on your data engineer, data scientist, and developer hats, and accelerate your path to production and insights with the Create and deploy a scoring model to predict heartrate failure code pattern. Download the data, deploy the app, and revel in the simplicity of Watson ML and IBM’s Data Science Platform.