Extend Watson text classification  

Use the Python NLTK toolkit and IBM DSX to achieve the desired text classification results

Watson™ Natural Language Understanding requires multiple documents for training in order to obtain good results. In new subject domains, there is often limited time to create those training documents. In such a scenario, the approach suggested in this developer journey augments the results from Natural Language Understanding with a simple input configuration JSON file prepared by a domain expert. This approach gives accurate results without the need for training documents.

By Vishal Chahal, Balaji Kadambi

Overview

In this journey, we show you how to use the Watson Natural Language Understanding (NLU) service and IBM Data Science Experience (DSX) to augment text classification results when no historical training data is available. A configuration JSON document prepared by a domain expert is taken as input by IBM DSX and can be modified iteratively to obtain better results and deeper insight into the text content.
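
To make the idea concrete, the snippet below sketches what such a configuration might look like and how a notebook could load it. The domain, class names, keywords, and patterns are hypothetical placeholders, not the exact schema shipped with the journey.

    # Hypothetical shape of the input configuration JSON; the real schema is
    # whatever the domain expert defines for the documents being analyzed.
    import json

    sample_config = '''
    {
      "domain": "insurance",
      "keywords": {
        "policy_attribute": ["premium", "deductible", "coverage"],
        "claim_attribute": ["claimant", "adjuster", "settlement"]
      },
      "regex_patterns": {
        "policy_number": "POL-[0-9]{6}",
        "date": "[0-9]{2}/[0-9]{2}/[0-9]{4}"
      }
    }
    '''

    config = json.loads(sample_config)
    print(config['keywords']['policy_attribute'])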

When you have completed this journey, you will understand how to:

  • Create and run a Jupyter Notebook in DSX.
  • Use DSX Object Storage to access data and configuration files.
  • Use the NLU API to extract metadata from a document in Jupyter Notebooks (a minimal call is sketched after this list).
  • Extract and format unstructured data using simplified Python functions.
  • Use a configuration file to build configurable and layered classification grammar.
  • Combine grammatical classification with regex patterns from a configuration file to classify word tokens.
  • Store the processed output JSON in DSX Object Storage.
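
As a minimal example of the NLU call, the cell below sends text to the service and requests entities and keywords. It assumes the watson_developer_cloud Python SDK; the credentials are placeholders, and the features requested in the journey's own notebook may differ.

    # Minimal NLU request from a notebook cell; replace the placeholder
    # credentials with your own NLU service credentials from Bluemix.
    from watson_developer_cloud import NaturalLanguageUnderstandingV1
    from watson_developer_cloud.natural_language_understanding_v1 import (
        Features, EntitiesOptions, KeywordsOptions)

    nlu = NaturalLanguageUnderstandingV1(
        version='2017-02-27',
        username='YOUR_NLU_USERNAME',
        password='YOUR_NLU_PASSWORD')

    document_text = ('The quarterly premium for policy POL-123456 '
                     'is due on 01/05/2017.')

    response = nlu.analyze(
        text=document_text,
        features=Features(entities=EntitiesOptions(),
                          keywords=KeywordsOptions()))

    for keyword in response['keywords']:
        print(keyword['text'], keyword['relevance'])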

Flow

  1. Documents that require analysis are stored in IBM Object Storage for Bluemix®.
  2. The Python code retrieves the document content from Object Storage along with the configuration JSON.
  3. The document contents are sent to Watson NLU and a response is obtained.
  4. The Python Natural Language Toolkit (NLTK) module is used to parse the document and generate keywords, POS tags, and chunks based on tag patterns (see the first sketch after this list).
  5. The configuration JSON is read, and the domain-specific keywords and attributes are classified (see the second sketch after this list).
  6. The response from NLU is augmented with the results from Python code.
  7. The final document classification and attribution is stored in Object Storage for further consumption and processing.
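
The first sketch below illustrates step 4 with NLTK: tokenize the text, tag parts of speech, and chunk with a tag-pattern grammar. The noun-phrase grammar shown is illustrative; the journey's notebook defines its own patterns.

    # Tokenize, POS-tag, and chunk a document with NLTK (step 4 of the flow).
    import nltk

    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    text = 'The quarterly premium for policy POL-123456 is due on 01/05/2017.'
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)

    # Chunk noun phrases: an optional determiner, any adjectives, then nouns.
    grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(pos_tags)

    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        print(' '.join(word for word, tag in subtree.leaves()))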

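The second sketch covers steps 5 through 7: classify tokens against the configuration's keyword lists and regex patterns, merge the result with the NLU response, and serialize the augmented JSON. It reuses the hypothetical config, pos_tags, and response objects from the earlier sketches, and a local file write stands in for the upload to Object Storage.

    # Classify tokens with the domain configuration and merge the result
    # with the NLU response (steps 5-7 of the flow).
    import json
    import re

    classified = {}
    for token, tag in pos_tags:
        # Keyword classes from the configuration file.
        for cls, words in config['keywords'].items():
            if token.lower() in words:
                classified.setdefault(cls, []).append(token)
        # Regex classes from the configuration file.
        for cls, pattern in config['regex_patterns'].items():
            if re.match(pattern, token):
                classified.setdefault(cls, []).append(token)

    # Augment the NLU output with the configuration-driven classes.
    augmented = {
        'nlu_keywords': response['keywords'],
        'domain_classification': classified,
    }

    # In DSX, this JSON would be written back to Object Storage; a local
    # file stands in for that step here.
    with open('classified_output.json', 'w') as f:
        json.dump(augmented, f, indent=2)
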
Components

IBM Data Science Experience

Analyze data in a configured and collaborative environment.

Bluemix Object Storage

An unstructured cloud data store for building and delivering cost-effective apps and services with high reliability and fast time to market.

Watson Natural Language Understanding

A service that analyzes text to extract metadata from content, such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles.

Jupyter Notebook

An open source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.

Technologies

Analytics

Finding patterns in data to derive information.

Data Science

Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.

Python

Python is a programming language that lets you work quickly and integrate systems more effectively.

Related Links

NLTK

A leading platform for building Python programs to work with human language data.