Discover archetypes in your data records

Summary

In this code pattern, learn how to use IBM® Watson™ services and Jupyter Notebooks to find meaningful archetypes in your records and classify new records against this set of archetypes.

Description

Systems of records are ubiquitous in the world around us, ranging from music playlists, job listings, and medical records to customer service calls and GitHub issues. An archetype is formally defined as a pattern, or model, from which all things of the same type are copied. More informally, you can think of archetypes as categories, classes, or topics.

When we read through a set of these records, our minds naturally group the records into some collection of archetypes. For example, we might sort a song collection into easy listening, classical, or rock. This manual process is practical for a small number of records. However, large systems can have millions of records, so we need an automated way to process them. Additionally, without prior knowledge of the records, we might not know beforehand which archetypes exist in them, so we also need a way to discover meaningful archetypes that can be adopted. Because records are often in the form of unstructured text, such automated processing needs to be able to understand natural language. Watson Natural Language Understanding, coupled with statistical techniques, can help you to:

  • Discover meaningful archetypes in your records
  • Classify new records against this set of archetypes

In this code pattern, we use a medical dictation data set to show the process. The data is provided by ezDI and includes 249 actual medical dictations that have been anonymized.

When you have completed this code pattern, you will understand how to:

  • Work with the Watson Natural Language Understanding service through API calls
  • Work with the IBM Cloud Object Storage service through the SDK to hold data and results
  • Perform statistical analysis on the results from Watson Natural Language Understanding
  • Explore the archetypes through graphical interpretation of the data in a Jupyter Notebook or a web interface
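
As a rough illustration of the first point, the sketch below builds (but does not send) an HTTP request for the Natural Language Understanding analyze endpoint using only the Python standard library. The service URL, API key, version date, feature limits, sample sentence, and the build_nlu_request helper are all placeholders or assumptions, not values taken from this pattern. The notebook in the repository may well use the Watson SDK instead.

```python
import base64
import json
import urllib.request

# Placeholder values: substitute the credentials of your own service instance.
NLU_URL = (
    "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com"
    "/instances/YOUR_INSTANCE_ID"
)
NLU_APIKEY = "YOUR_API_KEY"
NLU_VERSION = "2022-04-07"  # assumed API version date


def build_nlu_request(text, url=NLU_URL, apikey=NLU_APIKEY, version=NLU_VERSION):
    """Build a POST request asking NLU for keywords, entities, and concepts."""
    body = {
        "text": text,
        "features": {
            "keywords": {"limit": 20},  # limits are illustrative choices
            "entities": {"limit": 20},
            "concepts": {"limit": 20},
        },
    }
    # API keys are sent as HTTP basic auth with the literal user name "apikey".
    token = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{url}/v1/analyze?version={version}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Made-up sentence standing in for one dictation.
request = build_nlu_request("The patient reports chest pain and shortness of breath.")

# To actually call the service (requires valid credentials):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it makes it possible to inspect the payload before any credentials are involved.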

Flow

  1. The user downloads the custom medical dictation data set from ezDI and prepares the text data for processing.
  2. The user interacts with the Watson Natural Language Understanding service through the provided application user interface or the Jupyter Notebook.
  3. The user runs a series of statistical analyses on the results from Watson Natural Language Understanding.
  4. The user uses the graphical display to explore the archetypes that the analysis discovers.
  5. The user classifies a new dictation by providing it as input and sees which archetype it is mapped to.
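
Steps 3 to 5 can be sketched in miniature as follows. This is a toy stand-in, assuming keyword results shaped like a list of text and relevance pairs, and it uses a simple k-means-style loop with cosine similarity. It is not necessarily the statistical technique used in the repository, and the records, seeds, and helper names are hypothetical.

```python
import math
from collections import defaultdict


def to_vector(keywords):
    """Turn keyword results into a sparse {term: relevance} vector."""
    return {kw["text"].lower(): kw["relevance"] for kw in keywords}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a non-empty list of sparse vectors term by term."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def classify(vector, archetypes):
    """Return the index of the archetype most similar to the vector."""
    return max(range(len(archetypes)), key=lambda i: cosine(vector, archetypes[i]))


def discover_archetypes(vectors, seeds, iterations=10):
    """Assign each record to its closest archetype, then re-average, repeatedly."""
    archetypes = [dict(vectors[i]) for i in seeds]
    for _ in range(iterations):
        groups = defaultdict(list)
        for vec in vectors:
            groups[classify(vec, archetypes)].append(vec)
        # Keep the previous archetype if no record was assigned to it.
        archetypes = [
            centroid(groups[i]) if groups[i] else archetypes[i]
            for i in range(len(archetypes))
        ]
    return archetypes


# Made-up keyword results for four records (not real dictation data).
records = [
    [{"text": "chest pain", "relevance": 0.9}, {"text": "ecg", "relevance": 0.7}],
    [{"text": "chest pain", "relevance": 0.8}, {"text": "hypertension", "relevance": 0.6}],
    [{"text": "knee pain", "relevance": 0.9}, {"text": "fracture", "relevance": 0.8}],
    [{"text": "fracture", "relevance": 0.7}, {"text": "x-ray", "relevance": 0.6}],
]
vectors = [to_vector(r) for r in records]
archetypes = discover_archetypes(vectors, seeds=[0, 2])

# Classify a new, unseen record against the discovered archetypes.
new_record = [{"text": "Fracture", "relevance": 0.9}, {"text": "cast", "relevance": 0.5}]
print(classify(to_vector(new_record), archetypes))  # 1, the fracture-related archetype
```

With real data, each archetype's highest-weighted terms give a quick way to label it before exploring it graphically.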

Instructions

Find the detailed steps for this pattern in the README file. The steps show you how to:

  1. Clone the repository.
  2. Create IBM Cloud services.
  3. Download and prepare the data.
  4. Run the Jupyter Notebook.
  5. Run the web user interface.