Acknowledgments: This blog was written by Fehmina Merchant, Sivakumar Anne, and Aleksandr Petrov. Special thanks to John J. Thomas and Allen Kliethermes for their valuable inputs.
(Note: Data Science Experience (DSX) is now Watson Studio. Although the name has changed and some images may show the previous name, the steps and processes in this post will still work.)
Machine learning is about building algorithms and models that learn from data in order to make accurate predictions. By incorporating machine learning technology, organizations can create intelligent applications that help avoid risks, identify opportunities, and make more insightful, data-driven decisions.
Machine learning is being leveraged today in a variety of use cases across various industries and it is opening up even greater opportunities for organizations as they amass more and more data. But one of the challenges companies are facing is the shortage of talent – we simply don’t have enough skilled people around with experience in working with complex machine learning algorithms and models. IBM wants to democratize and simplify machine learning and make it more accessible to a wider audience. With the new Watson Machine Learning GUI (coming soon) within IBM Data Science Experience, individuals at all skill levels will be able to leverage machine learning technology.
In this blog post, we want to give you a “glimpse” into how easy it is to work with the upcoming Watson Machine Learning GUI within IBM Data Science Experience. With this new wizard-based graphical user interface, we can build machine learning pipelines in no time without having to know complex machine learning algorithms and without writing a single line of code! Let’s see how it works in the context of a retail use case.
Machine learning for retail use case
We’ll use a widely adopted IBM dataset built on a fictitious retail company called “The Great Outdoors Company” as an example. The Great Outdoors Company sells camping and sports equipment through retail stores around the world. Over the last few months, the company has collected transactional sales data which captures the buying patterns of their customers. We’ll apply machine learning to this Great Outdoors Company sales dataset (GoSales_Tx_LogisticRegression.csv) to be able to predict if a customer is likely to buy a “tent” or not.
The sales dataset that we’ll be using for this tutorial consists of over 60,000 rows of observations about customers. Here’s what a small portion of the dataset looks like:
As you can see, each row in the dataset has five columns (we’re using a simplified version of the original dataset for this example). The first column, IS_TENT, has the value TRUE or FALSE, which tells us whether a customer has bought a tent. The rest of the columns correspond to the various attributes of a customer: GENDER, AGE, MARITAL_STATUS, and PROFESSION. Since we’re working with labeled data (meaning every row already records the outcome we want to predict, in this case IS_TENT), we’ll use supervised learning for this scenario.
In supervised machine learning, when you’re working with a labeled dataset, you typically need to specify a set of features and a target label. This enables the machine learning model to learn to use the features to predict the target label. In our example, we’ll use GENDER, AGE, MARITAL_STATUS, PROFESSION as our features and IS_TENT as the target label to predict.
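To make the split between features and target label concrete, here is a minimal Python sketch. The column names come from our dataset, but the three sample rows are made up for illustration and the helper `split_features_and_label` is a hypothetical name, not part of Watson Machine Learning.

```python
import csv
import io

# Hypothetical sample rows: the column names match the post,
# but the values are invented for illustration.
SAMPLE_CSV = """IS_TENT,GENDER,AGE,MARITAL_STATUS,PROFESSION
TRUE,M,27,Single,Professional
FALSE,F,45,Married,Retail
FALSE,M,39,Married,Sales
"""

FEATURE_COLUMNS = ["GENDER", "AGE", "MARITAL_STATUS", "PROFESSION"]
LABEL_COLUMN = "IS_TENT"


def split_features_and_label(row):
    """Return (features dict, label) for one CSV row."""
    features = {name: row[name] for name in FEATURE_COLUMNS}
    return features, row[LABEL_COLUMN]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
examples = [split_features_and_label(row) for row in rows]

for features, label in examples:
    print(features, "->", label)
```

Each example pairs what the model is allowed to see (the features) with what it should learn to predict (the label).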
Prerequisites to use Watson Machine Learning GUI
Before we can use the new Watson Machine Learning GUI to build our machine learning pipeline for the dataset, we have to take care of a few prerequisites within IBM Data Science Experience (DSX):
- If you don’t have access to Watson Machine Learning within DSX, sign up here: http://datascience.ibm.com/features#machinelearning
- Create a new project in DSX. Make sure you have at least one Apache Spark service instance and an object store container available for your project.
- Add the sales data set to the data assets for your project. Click on browse, select the data file (in our case, we’ll select our GoSales_Tx_LogisticRegression.csv dataset), click Open, and then click Apply to load the file.
Watson machine learning within DSX
With all the prerequisites in place, we’re now ready to give the Watson Machine Learning GUI a spin.
In machine learning, it is common to run a workflow consisting of a sequence of steps to process and learn from datasets. Watson Machine Learning represents such a workflow as a pipeline consisting of a series of steps to be run in a specific order. Guidance and automation are provided at each step of the pipeline.
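Conceptually, a pipeline is just an ordered list of steps where each step consumes the output of the previous one. The sketch below illustrates that idea in plain Python; the stage names and functions are stand-ins we made up, not what Watson Machine Learning runs internally.

```python
def run_pipeline(stages, data):
    """Apply each (name, function) stage in order, feeding each output to the next."""
    for name, stage in stages:
        data = stage(data)
    return data


# Hypothetical stand-in stages, used only to show ordered execution.
stages = [
    ("select", lambda rows: [r for r in rows if r]),              # drop empty rows
    ("prepare", lambda rows: [r.strip().upper() for r in rows]),  # normalize values
    ("count", lambda rows: len(rows)),                            # final step
]

result = run_pipeline(stages, [" true ", "", "false"])
print(result)
```

Because each stage depends on the previous one, reordering the list changes the result, which is why the order of steps matters.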
Let’s start by creating a Watson Machine Learning pipeline:
- From the project overview page, click on the + icon, and choose Create pipeline from the drop-down options. Type a name and description for the pipeline and click Create.
The Watson Machine Learning wizard-based GUI can now guide us through the end-to-end process of selecting and preparing the data, training and evaluating the model, and, finally, deploying the model so that we can make predictions.
Here’s a screenshot of the machine learning pipeline that was created for us. We’ll now walk through the steps in the pipeline in the next three sections.
Select data and prepare
We’ll start with the step to select our dataset for our machine learning pipeline. In our case, we’ll select the retail dataset that we have uploaded for our project as follows:
Select the GoSales_Tx_LogisticRegression.csv data asset and click Next.
Next, we’ll prepare our dataset by using the built-in transformation functions provided. Watson Machine Learning offers over 18 transformation functions out-of-the-box that can help with scaling, converting and modifying datasets. We’ll use some of these built-in transformations to quickly convert our features and label into a form that can be understood by the machine learning model downstream as follows:
- On the Prepare Data Set page, click Add a transformer. Name it TransformGender, select StringIndexer and click Configure. For the input column, select GENDER, and for the output column, type GENDER_CODE and click Save.
- Click Add a transformer for a second transformer. Name it TransformMStatus, select StringIndexer and click Configure. For the input column, select MARITAL_STATUS, and for the output column, type MARITAL_STATUS_CODE and click Save.
- Click Add a transformer to add a third transformer. Name it TransformProfession, select StringIndexer and click Configure. For the input column, select PROFESSION, and for the output column, type PROFESSION_CODE and click Save.
- Click Add a transformer to add a fourth transformer. Name it TransformIsTent, select StringIndexer and click Configure. For the input column, select IS_TENT, and for the output column, type label and click Save.
- Click Add a transformer one more time to create a feature vector. Name it FeatureVector, select VectorAssembler and click Configure. For the input columns, list all the feature output fields by typing [GENDER_CODE,AGE,MARITAL_STATUS_CODE,PROFESSION_CODE], and for the output column, type features and click Save.
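For readers curious about what these transformers do to the data, here is a rough pure-Python imitation. It is a sketch, not the Spark implementation: `fit_string_indexer` and `assemble_vector` are hypothetical helpers, the sample rows are invented, and the alphabetical tie-breaking is our own assumption. The one behavior taken from Spark is that StringIndexer, by default, assigns index 0 to the most frequent value.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to a numeric index; the most frequent value gets 0.0."""
    counts = Counter(values)
    # Descending frequency, then alphabetical (tie-break is an assumption of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble_vector(row, columns):
    """Collect the named numeric columns into a single list, in order."""
    return [float(row[c]) for c in columns]


# Invented sample rows using the post's column names.
rows = [
    {"IS_TENT": "TRUE", "GENDER": "M", "AGE": 27,
     "MARITAL_STATUS": "Single", "PROFESSION": "Professional"},
    {"IS_TENT": "FALSE", "GENDER": "F", "AGE": 45,
     "MARITAL_STATUS": "Married", "PROFESSION": "Retail"},
    {"IS_TENT": "FALSE", "GENDER": "M", "AGE": 39,
     "MARITAL_STATUS": "Married", "PROFESSION": "Sales"},
]

# Input column -> output column, mirroring the four StringIndexer steps.
INDEXED_COLUMNS = {
    "GENDER": "GENDER_CODE",
    "MARITAL_STATUS": "MARITAL_STATUS_CODE",
    "PROFESSION": "PROFESSION_CODE",
    "IS_TENT": "label",
}

for source, target in INDEXED_COLUMNS.items():
    mapping = fit_string_indexer([r[source] for r in rows])
    for r in rows:
        r[target] = mapping[r[source]]

# Mirrors the VectorAssembler step.
FEATURE_INPUTS = ["GENDER_CODE", "AGE", "MARITAL_STATUS_CODE", "PROFESSION_CODE"]
for r in rows:
    r["features"] = assemble_vector(r, FEATURE_INPUTS)

for r in rows:
    print(r["features"], r["label"])
```

After these steps, every row carries a numeric features vector and a numeric label, which is the form the estimator in the next step expects.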
At this point, our pipeline will look like the screenshot below with all our transformations configured.
Let’s click Next to advance to the training step in the pipeline.
Train, select model and evaluate
We are now ready to train a machine learning model on the features and label that we prepared in the previous step. Our objective is to classify customers into two groups: those who are likely to buy a tent and those who are not. This is a binary classification problem, so we’ll use a binary classification algorithm, Logistic Regression, as the estimator to apply to our dataset. Here are the steps to perform:
- On the Train Model page, for the label column, select label.
- Add an estimator and select Logistic_regression, and click Configure.
- In the Model name box, type
- For the features column, select features.
- For the prediction column, type
- For the raw prediction column, type
- For the probability column, type
- Click Save.
Here’s a screenshot of how things will look at this point:
We can start the model training process by clicking Next. Watson Machine Learning will automatically use a portion of our dataset to train the model and reserve the rest for evaluation. When training completes, click Next to select the model. On the Select model page, click the model we just created, and then click Next. On the Evaluate model page, click Evaluate.
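If you’re wondering what happens behind the Next button, the sketch below shows the general idea of training and evaluating a logistic regression model in plain Python. Everything here is an assumption made for illustration: the data is synthetic (younger customers are labeled as tent buyers), the split is 80/20, and the model is fitted with simple batch gradient descent. The post doesn’t say which split ratio or optimizer Watson Machine Learning uses.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability that the label is 1 for one feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and a bias with plain batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_probability(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Synthetic data (an assumption): customers under 40 are labeled as tent buyers.
rng = random.Random(0)
data = []
for _ in range(200):
    age = rng.uniform(18, 70)
    label = 1.0 if age < 40 else 0.0
    data.append(([(age - 44.0) / 15.0], label))  # roughly standardized age

# Hold out 20% of the rows for evaluation (the ratio is an assumption).
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

weights, bias = train_logistic_regression(
    [features for features, _ in train],
    [label for _, label in train],
)

correct = sum(
    (predict_probability(weights, bias, features) >= 0.5) == (label == 1.0)
    for features, label in test
)
accuracy = correct / len(test)
print("held-out accuracy:", accuracy)
```

Evaluating on rows the model never saw during training is what makes the accuracy figure a fair estimate of how it will behave on new customers.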
Deploy the machine learning model and predict
We’re now ready to deploy our machine learning model and start performing real-time predictions. Here are the steps to perform:
- Click Next to go to the Deploy model page.
- On the Deploy model page, select the Realtime deployment type, type a deployment name, and in the Average requests box, type 15, and in the Peak requests box, type
- Click Deploy. When model deployment is complete, note the scoring endpoint for future reference. We can only have one deployed model per pipeline.
- Test the model prediction on the Predict page.
The Predict page should look like the following screen:
In this case, what we’re able to predict is that a single professional male of age 27 has a high probability of buying a tent. To test other scenarios, we can type different values in the input fields for a customer and click Predict.
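Outside the GUI, an application would send the same customer attributes to the scoring endpoint. This post doesn’t document the request format, so the sketch below only builds a plausible JSON body and does not call any service. The `fields`/`values` layout and the helper name `build_scoring_payload` are assumptions for illustration; check the service documentation for the actual schema.

```python
import json


def build_scoring_payload(gender, age, marital_status, profession):
    """Build a JSON request body for one customer (layout is an assumption)."""
    body = {
        "fields": ["GENDER", "AGE", "MARITAL_STATUS", "PROFESSION"],
        "values": [[gender, age, marital_status, profession]],
    }
    return json.dumps(body)


# The customer from the example above: a single professional male, age 27.
payload = build_scoring_payload("M", 27, "Single", "Professional")
print(payload)
```

Note that the raw column values are sent, not the numeric codes; the deployed pipeline applies the same transformations we configured earlier before scoring.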
Although this was a simple use case, it showcases how easily we can use the Watson Machine Learning GUI to apply machine learning to our dataset and predict customers’ buying preferences. A company like Great Outdoors can leverage these insights to increase its revenue by targeting likely tent buyers with promotions and deals. And this is just the beginning!
We hope that this post gives you a “glimpse” into what capabilities we will be introducing with Watson Machine Learning GUI within DSX in the near future. With the new Watson Machine Learning GUI (coming soon), data scientists and developers of all skill levels will be able to quickly leverage machine learning to gain valuable insights from their data, even if they don’t want to write a single line of code!