by Vinodh Mohan, Rich Hagarty | Published November 2, 2018
Analytics, Artificial intelligence, Data Science, Machine Learning, Python
The focus of this code pattern is to provide an easy-to-follow set of examples that detail how a user might integrate the Hortonworks Data Platform (HDP) with Watson Studio Local.
In this blog post we will:
A spam filter is a type of classification model that determines whether a given SMS text message is spam or ham (a legitimate message). In our code pattern, we build such a filter.
We start by examining and processing real-life training data. In our case, we used a publicly available dataset from kaggle.com. The dataset contains over 5,000 messages, each tagged as spam or ham. Using natural language processing and machine learning algorithms, we take you through the process of building and training our Spam Filter classification model.
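To make the idea of a spam/ham classifier concrete, here is a minimal sketch in plain Python. It is not the code pattern's implementation (the pattern trains its model with Spark); it swaps in a small hand-written naive Bayes classifier so the example runs without any cluster. The four sample messages are made up for illustration and are not taken from the Kaggle dataset, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count word frequencies per label from (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the higher naive Bayes log-score.

    Laplace (add-one) smoothing keeps unseen words from zeroing out a score.
    """
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)
        for token in tokenize(text):
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up sample messages (assumption: for illustration only).
SAMPLE = [
    ("spam", "WINNER! Claim your free prize now"),
    ("spam", "Free entry: text WIN to claim cash prize"),
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Call me when you get home"),
]

word_counts, label_counts = train(SAMPLE)
print(classify("claim your free cash", word_counts, label_counts))
print(classify("see you at lunch", word_counts, label_counts))
```

A real filter would train on the full labeled dataset and hold out part of it to measure accuracy; the sketch only shows the tokenize, train, and classify steps in miniature.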
As mentioned in the title, this code pattern made use of Watson Studio Local and HDP. We’ll briefly describe both in this section before diving into how we used the two together.
IBM Watson Studio Local is an out-of-the-box on-premises solution for data scientists and data engineers. It addresses the entire data science life cycle and provides an environment where data scientists can work with a variety of tools such as Spark, R, Python, and Anaconda, all integrated to work together in a productive collaborative experience. Whether the driver is GDPR or another data privacy concern, Watson Studio Local suits users who want to perform complex data science work within the security of their private network.
Aside from running notebooks, Watson Studio also provides projects for multi-tenancy and collaboration, identity hooks for LDAP, an admin console for management, a community tab for finding sample content, integration with GitHub and GitHub Enterprise, oh, and it’s deployable to IBM’s popular IBM Cloud Private.
Hortonworks Data Platform (HDP) is a widely popular, massively scalable platform for storing, processing, and analyzing large volumes of data. HDP is used in a variety of industries, from medical to insurance to financial; see the Hortonworks website for various HDP solutions. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper, and Ambari. Check out the image below to see what Apache Hadoop projects go into any given HDP release. More detail on each of these projects can be found in Apache Hadoop’s documentation.
For our code pattern, we focused on three components: Apache Spark, HDFS, and Livy.
Apache Spark is where the SMS text message data is first loaded; its machine learning library (MLlib) is then used to train a classification model.
HDFS stores the project data sets, which Spark processes in a distributed fashion to perform standard and ML transformations.
Apache Livy is a key feature of the Hadoop Integration service that enables easy interaction with a Spark cluster over a REST interface. The Hadoop Integration service is a component of Watson Studio Local and is installed on the edge node of the HDP cluster.
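As a rough illustration of what "interaction with a Spark cluster over a REST interface" looks like, the sketch below builds the two basic Livy requests: one that asks for an interactive PySpark session and one that submits a code statement to it. The host name, session ID, and helper function names are hypothetical, and the requests are only constructed, not sent. The Hadoop Integration service may add its own authentication and routing in front of Livy, so treat this as the shape of a plain Livy call rather than what the code pattern's notebooks do.

```python
import json
import urllib.request

# Hypothetical Livy endpoint (assumption); 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def build_session_request(base_url, kind="pyspark"):
    """Build (but do not send) the POST that asks Livy for a new session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_statement_request(base_url, session_id, code):
    """Build (but do not send) the POST that runs code in an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_req = build_session_request(LIVY_URL)
# Session ID 0 is a placeholder; a real ID comes back in Livy's response.
statement_req = build_statement_request(LIVY_URL, 0, "print(spark.version)")

for req in (session_req, statement_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Against a reachable Livy server, each request could be sent with urllib.request.urlopen(req), and the JSON response to the first call would carry the session ID to use in the second.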
In our code pattern, we provide three different examples of how to train and deploy a Spam Filter Model. In each example, you will learn first how to develop the model locally in Watson Studio Local, and then remotely by leveraging the HDP cluster via the Hadoop Integration service.
The HDP remote integration provides two major advantages:
Try the code pattern out by going directly to our GitHub repo. The code pattern will walk the user through configuring HDP, Python library setup, running the notebook, and lastly interpreting the results.
Want to see the notebook results directly? Use NBViewer to view one of our code pattern notebooks, for example, this one that invokes Spark on our remote HDP cluster.
Keep an eye on IBM Code for more Watson Studio related patterns!