Develop, train, and deploy a spam filter model on Hortonworks Data Platform using Watson Studio Local
Use natural language processing and machine learning to identify spam
Watson Studio Local is now part of IBM Cloud Pak for Data. Learn more about Cloud Pak for Data.
The focus of this code pattern is to provide an easy-to-follow set of examples that detail how a user might integrate the Hortonworks Data Platform (HDP) with Watson Studio Local.
In this blog post we will:
- Describe what the new code pattern does.
- Provide a brief overview of HDP and Watson Studio Local.
- Explain how a user can train and deploy a model leveraging the compute power and data storage in HDP using Watson Studio Local.
What’s a Spam filter?
A spam filter is a type of classification model. The one in this code pattern determines whether a given SMS text message is spam or ham (a legitimate message).
We start by examining and processing real-life labeled data. In our case, we used a publicly available dataset from kaggle.com. The dataset contains over 5,000 messages, each tagged as spam or ham. Using natural language processing and machine learning algorithms, we take you through the process of building and training our spam filter classification model.
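To make the idea concrete, here is a minimal sketch of a spam/ham classifier in plain Python. It is an illustration only: it assumes a bag-of-words Naive Bayes approach with Laplace smoothing, which is not necessarily the algorithm used in the code pattern's notebooks, and the four training messages are made up rather than taken from the Kaggle dataset.

```python
import math
import re
from collections import Counter

LABELS = ("spam", "ham")


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count words per label from an iterable of (label, text) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the higher log-probability for the message."""
    vocab = set().union(*(word_counts[label] for label in LABELS))
    total_messages = sum(label_counts.values())
    scores = {}
    for label in LABELS:
        n_words = sum(word_counts[label].values())
        # Start from the prior probability of the label.
        score = math.log(label_counts[label] / total_messages)
        for word in tokenize(text):
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy messages for illustration; not rows from the Kaggle dataset.
training = [
    ("spam", "WIN a free prize now, call today"),
    ("spam", "Free entry, claim your cash prize"),
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Call me when you get home"),
]
word_counts, label_counts = train(training)

print(classify("claim your free prize", word_counts, label_counts))
print(classify("see you at lunch", word_counts, label_counts))
```

With a real dataset you would hold out part of the labeled messages to measure accuracy, and a library implementation would replace the hand-rolled counting.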
A brief intro to Watson Studio Local and HDP
As mentioned in the title, this code pattern made use of Watson Studio Local and HDP. We’ll briefly describe both in this section before diving into how we used the two together.
What is Watson Studio Local?
IBM Watson Studio Local is an out-of-the-box, on-premises solution for data scientists and data engineers. It addresses the entire data science life cycle and provides an environment where data scientists can work with a variety of tools such as Spark, R, Python, and Anaconda, all integrated to work together in a productive, collaborative experience. For users who need to keep their data in-house because of GDPR or other data privacy requirements, Watson Studio Local makes it possible to perform complex data science work within the security of a private network.
Aside from running notebooks, Watson Studio Local also provides projects for multi-tenancy and collaboration, identity hooks for LDAP, an admin console for management, a community tab for finding sample content, and integration with GitHub and GitHub Enterprise. Oh, and it’s deployable to IBM Cloud Private.
What is the Hortonworks Data Platform?
Hortonworks Data Platform (HDP) is a widely used, massively scalable platform for storing, processing, and analyzing large volumes of data. HDP is used in a variety of industries, from medical to insurance to financial; you can see various HDP solutions on the Hortonworks website. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari. Check out the image below to see what Apache Hadoop projects go into any given HDP release. More detail can be found for each of these projects by going to Apache Hadoop’s documentation.
For our code pattern, we focused on three components: Apache Spark, HDFS, and Livy.
Apache Spark is where the SMS text message data is first loaded; its machine learning library (MLlib) is then used to train a classification model.
HDFS is used to store the project data sets, which Spark processes in a distributed fashion to perform both standard and ML transformations.
Apache Livy is a key feature of the Hadoop Integration service that enables easy interaction with a Spark cluster over a REST interface. The Hadoop Integration service is a component of Watson Studio Local and is installed on the edge node of the HDP cluster.
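As a rough illustration of what "interaction over a REST interface" means, the sketch below builds the two HTTP requests that plain Apache Livy expects: one to start a PySpark session (`POST /sessions`) and one to run a snippet of code in it (`POST /sessions/{id}/statements`). The host name, session id, and code snippet are placeholders. The requests are only constructed, not sent, and the code pattern's notebooks may well go through Watson Studio Local's own helpers and authentication rather than raw REST calls like these.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the Livy URL exposed for your cluster.
LIVY_URL = "http://livy.example.com:8998"


def _post_json(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_session_request(base_url=LIVY_URL):
    """Request that asks Livy to start an interactive PySpark session."""
    return _post_json(f"{base_url}/sessions", {"kind": "pyspark"})


def build_statement_request(session_id, code, base_url=LIVY_URL):
    """Request that asks Livy to run a code snippet in an existing session."""
    return _post_json(f"{base_url}/sessions/{session_id}/statements", {"code": code})


session_req = build_session_request()
# Session id 0 is a placeholder; a real id comes back in Livy's session response.
statement_req = build_statement_request(0, "sc.parallelize(range(10)).count()")

print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
```

To actually submit a request you would pass it to `urllib.request.urlopen` (plus whatever authentication your cluster requires) and then poll the statement until Livy reports it as finished.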
How we integrate the Watson Studio Local and HDP platforms
In our code pattern, we provide three different examples of how to train and deploy a Spam Filter Model. In each example, you will learn first how to develop the model locally in Watson Studio Local, and then remotely by leveraging the HDP cluster via the Hadoop Integration service.
The HDP remote integration provides two major advantages:
- You are not limited to the compute and storage available in Watson Studio Local when building the model, because you can leverage the resources of the HDP cluster.
- You aren’t required to move or copy the data from the HDP cluster to Watson Studio Local, so you can train the model where the data lives.
How can I get started?
Try the code pattern out by going directly to our GitHub repo. The code pattern walks you through configuring HDP, setting up the Python libraries, running the notebook, and interpreting the results.
Keep an eye on IBM Code for more Watson Studio related patterns!