Extend Watson text classification  

Use the Python Natural Language Toolkit (NLTK) and IBM DSX to achieve the desired text classification results

By Vishal Chahal, Balaji Kadambi


Watson™ Natural Language Understanding requires multiple documents for training in order to obtain good results. In new subject domains, there is limited time to create multiple training documents. In such a scenario, the approach suggested in this developer journey augments the results from Natural Language Understanding with a simple input configuration JSON file, which can be prepared by a domain expert. This approach gives accurate results without the need for training documents.


In this journey, we show you how to use the Watson Natural Language Understanding (NLU) service and IBM Data Science Experience (DSX) to augment text classification results when no historical data is available. A configuration JSON document prepared by a domain expert is taken as input by IBM DSX. The configuration JSON document can be modified to obtain better results and insights into the text content.
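For illustration, a minimal configuration document might map domain-specific classes to keywords, regex patterns, and a chunk grammar. The schema, class names, keywords, and patterns below are hypothetical; the actual structure is whatever the domain expert defines for the subject domain.

```python
import json

# Hypothetical configuration prepared by a domain expert.
# Class names, keywords, and patterns are illustrative only.
sample_config = {
    "domain": "loan_agreements",
    "keywords": {
        "borrower": ["borrower", "debtor"],
        "lender": ["lender", "bank", "creditor"],
    },
    "regex_patterns": {
        "amount": r"\$\s?\d[\d,]*(\.\d{2})?",
        "date": r"\d{1,2}/\d{1,2}/\d{4}",
    },
    # Tag pattern used later for NLTK chunking.
    "chunk_grammar": "NP: {<DT>?<JJ>*<NN.*>+}",
}

# In the journey, this file would be read from DSX Object Storage
# rather than defined inline in the notebook.
print(json.dumps(sample_config, indent=2))
```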

When you have completed this journey, you will understand how to:

  • Create and run a Jupyter Notebook in DSX.
  • Use DSX Object Storage to access data and configuration files.
  • Use the NLU API to extract metadata from a document in Jupyter Notebooks (see the sketch after this list).
  • Extract and format unstructured data using simplified Python functions.
  • Use a configuration file to build configurable and layered classification grammar.
  • Combine grammatical classification and regex patterns from a configuration file to classify word tokens into domain-specific classes.
  • Store the processed output JSON in DSX Object Storage.
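As one illustration of the NLU step, the following sketch calls the service from a notebook using the ibm-watson Python SDK. The API key, service URL, version date, and feature selection are placeholders, and the journey's own notebook may use a different SDK version or feature set.

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials; in DSX these would normally come from the
# service credentials associated with the project.
authenticator = IAMAuthenticator("YOUR_NLU_APIKEY")
nlu = NaturalLanguageUnderstandingV1(version="2022-04-07", authenticator=authenticator)
nlu.set_service_url("YOUR_NLU_SERVICE_URL")

document_text = "IBM Watson Natural Language Understanding analyzes text to extract metadata."

# Request keywords and entities; other features can be added as needed.
response = nlu.analyze(
    text=document_text,
    features=Features(
        keywords=KeywordsOptions(limit=10),
        entities=EntitiesOptions(limit=10),
    ),
).get_result()

print(response["keywords"])
```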

Flow

  1. Documents that require analysis are stored in IBM Cloud Object Storage.
  2. The Python code retrieves the document content from Object Storage along with the configuration JSON.
  3. The document contents are sent to Watson NLU and a response is obtained.
  4. The Python Natural Language Toolkit (NLTK) module is used to parse the document and generate keywords, POS tags, and chunks based on tag patterns (see the sketch after this list).
  5. The configuration JSON is read, and the domain-specific keywords and attributes are classified.
  6. The response from NLU is augmented with the results from Python code.
  7. The final document classification and attribution are stored in Object Storage for further consumption and processing.
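To make step 4 concrete, here is a minimal NLTK sketch that tokenizes a piece of text, applies POS tagging, and extracts noun-phrase chunks with a regex tag pattern. The sample sentence and the hard-coded grammar are illustrative; in the journey the tag patterns come from the configuration JSON.

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "The borrower shall repay the loan amount of $5,000 by 12/31/2025."

# Tokenize and POS-tag the document text.
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

# Simple noun-phrase tag pattern; in practice this would be read
# from the configuration JSON rather than hard-coded.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the noun-phrase chunks as candidate keywords.
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(chunks)
```

The extracted chunks can then be matched against the keywords and regex patterns in the configuration JSON (steps 5 and 6) before the augmented result is written back to Object Storage.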

Related Links


  • NLTK: A leading platform for building Python programs to work with human language data.