Extend Watson text classification
Use the Python Natural Language Toolkit (NLTK) and IBM Watson Studio (formerly IBM Data Science Experience, DSX) to achieve the desired text classification results
Watson Natural Language Understanding requires multiple training documents to produce good results. In new subject domains, there is often too little time to create them. For that scenario, the approach suggested in this code pattern augments the results from Natural Language Understanding with a simple input configuration JSON file, which can be prepared by a domain expert. The aim is to obtain accurate results without the need for training documents.
In this pattern, we show you how to use the Watson Natural Language Understanding (NLU) service and IBM Watson Studio to augment text classification results when no historical data is available. A configuration JSON document prepared by a domain expert is taken as input by IBM Watson Studio. The configuration JSON document can be modified to obtain better results and insights into the text content.
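To make the idea of a configuration-driven classifier more concrete, here is a minimal sketch in Python. The configuration layout (the `classes` key, the class names, and the regex patterns) and the helpers `load_rules` and `classify_text` are hypothetical and chosen only for illustration; the configuration format actually used by the pattern may differ.

```python
import json
import re

# Hypothetical configuration a domain expert might write: each class name
# maps to a list of regex patterns. This schema is an assumption.
CONFIG_JSON = r"""
{
  "classes": {
    "Date": ["\\b\\d{4}-\\d{2}-\\d{2}\\b"],
    "Amount": ["\\$\\d+(?:\\.\\d{2})?"],
    "Party": ["\\b(?:Buyer|Seller)\\b"]
  }
}
"""


def load_rules(config_text):
    """Parse the configuration and compile every regex pattern once."""
    config = json.loads(config_text)
    return {
        name: [re.compile(pattern) for pattern in patterns]
        for name, patterns in config["classes"].items()
    }


def classify_text(text, rules):
    """Return a mapping of class name to the matching strings found in text."""
    results = {}
    for name, patterns in rules.items():
        matches = []
        for pattern in patterns:
            matches.extend(pattern.findall(text))
        if matches:
            results[name] = matches
    return results


rules = load_rules(CONFIG_JSON)
sample = "The Buyer paid $1200.50 to the Seller on 2018-03-14."
print(classify_text(sample, rules))
```

Because the rules live in the JSON document rather than in the code, a domain expert could add or refine a class by editing the configuration alone and rerunning the notebook.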
When you have completed this pattern, you will understand how to:
- Create and run a Jupyter Notebook in Watson Studio.
- Use Watson Studio Object Storage to access data and configuration files.
- Use the NLU API to extract metadata from a document in Jupyter Notebooks.
- Extract and format unstructured data using simplified Python functions.
- Use a configuration file to build configurable and layered classification grammar.
- Use a combination of grammatical classification and regex patterns from a configuration file to classify word tokens into classes.
- Store the processed output JSON in Watson Studio Object Storage.
The pattern works as follows:
- Documents that require analysis are stored in IBM Cloud Object Storage.
- The Python code retrieves the document content from Object Storage along with the configuration JSON.
- The document contents are sent to Watson NLU and a response is obtained.
- The Python Natural Language Toolkit (NLTK) module is used to parse the document and generate keywords, POS tags, and chunks, based on tag patterns.
- The configuration JSON is read, and the domain-specific keywords and attributes are classified.
- The response from NLU is augmented with the results from Python code.
- The final document classification and attribution is stored in Object Storage for further consumption and processing.
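The augmentation step in the flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pattern's actual code: `nlu_response` is a hand-written stand-in shaped like an NLU analysis result (no service call is made), `classified` stands in for the output of the configuration-driven classification, and the `augment_nlu_response` helper and the `source` field are hypothetical.

```python
import copy
import json


def augment_nlu_response(nlu_response, classified):
    """Return a copy of the NLU response with config-derived entities added.

    `classified` maps a class name to the list of strings found for it.
    Entities already reported by NLU (same type and text) are not repeated.
    """
    augmented = copy.deepcopy(nlu_response)  # leave the original untouched
    entities = augmented.setdefault("entities", [])
    seen = {(entity["type"], entity["text"]) for entity in entities}
    for class_name, texts in classified.items():
        for text in texts:
            if (class_name, text) not in seen:
                # "source" is a hypothetical marker for config-derived results.
                entities.append(
                    {"type": class_name, "text": text, "source": "config"}
                )
                seen.add((class_name, text))
    return augmented


# Stand-in for a response from the NLU service (assumed shape).
nlu_response = {
    "language": "en",
    "keywords": [{"text": "Seller", "relevance": 0.9}],
    "entities": [{"type": "Person", "text": "Alice", "relevance": 0.8}],
}

# Stand-in for the results of the configuration-driven classification.
classified = {"Date": ["2018-03-14"], "Person": ["Alice"]}

augmented = augment_nlu_response(nlu_response, classified)
print(json.dumps(augmented, indent=2))
```

Working on a deep copy keeps the original service response available for comparison, which can help when tuning the configuration.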
Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.