by Vishal Chahal and Balaji Kadambi
Text classification is used to derive insights from unstructured text. Natural Language Processing (NLP) toolkits classify text based on the standard corpus of data on which they were trained. To classify unstructured text in a new domain, a toolkit has to be retrained with data from that domain. A common problem is that a new domain has too few documents, or none at all, to train the toolkit.
The new “Extend Watson text classification” developer journey demonstrates how IBM Watson™ NLU text classification output can be augmented with domain-specific configuration files to achieve the desired classification results. The journey uses the IBM Data Science Experience environment with the Python NLTK toolkit to process the configuration file and augment the response from Watson NLU.
As part of the journey code, classification rules based on keyword tagging and regex tagging are available in the configuration file. The configuration file is extensible, so more domain-specific classification rules can be added while the processing logic stays generic. Check out the “Extend Watson text classification” journey, which includes demos, code, and more.
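To make the idea of keyword and regex tagging concrete, here is a minimal sketch in Python. It is not the journey's actual code: the configuration layout (`CONFIG`), the function names `tag_text` and `augment`, and the simplified shape of the NLU response (a dict with an `entities` list) are all assumptions chosen for illustration, and it uses only the standard `re` module rather than NLTK.

```python
import re

# Hypothetical configuration; the journey's real file format may differ.
CONFIG = {
    "keywords": {
        "Payment": ["invoice", "refund"],
        "Shipping": ["delivery", "courier"],
    },
    "regex": {
        "Date": r"\b\d{4}-\d{2}-\d{2}\b",
        "Email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    },
}


def tag_text(text, config):
    """Return a list of {"type", "text"} tags found by the configured rules."""
    tags = []
    # Keyword rules: whole-word, case-insensitive matches.
    for tag_type, words in config.get("keywords", {}).items():
        for word in words:
            pattern = r"\b" + re.escape(word) + r"\b"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                tags.append({"type": tag_type, "text": match.group(0)})
    # Regex rules: each pattern is applied as written in the configuration.
    for tag_type, pattern in config.get("regex", {}).items():
        for match in re.finditer(pattern, text):
            tags.append({"type": tag_type, "text": match.group(0)})
    return tags


def augment(nlu_response, text, config):
    """Merge rule-based tags into a copy of an NLU-style response dict."""
    augmented = dict(nlu_response)
    existing = list(nlu_response.get("entities", []))
    augmented["entities"] = existing + tag_text(text, config)
    return augmented
```

Because the rules live in data rather than in code, supporting a new domain in this sketch means adding entries to the configuration; `tag_text` and `augment` stay unchanged.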