Identify information in document images

Summary

If you’re interested in high-performing image classification methodology, this code pattern is for you. We will extract text using optical character recognition (OCR), use the IBM Watson™ Natural Language Understanding API from Jupyter Notebooks to extract entities from documents, and use a configuration file to build a configurable, layered classification grammar.

Description

Let’s recap the use case described in “Image classification using convolutional neural networks.” We considered use cases where application forms are submitted with supporting documents. For example, application forms for rental agreements and purchase agreements require supporting documents such as an ID or passport. These documents, along with the completed application forms, are digitally scanned and stored. To process the applications further, the documents must be recognized and classified, and the relevant information retrieved from the application forms. Doing this manually is cumbersome and error-prone. This code pattern provides a methodology that processing systems can use to extract and identify information from scanned images.

This code pattern covers the following aspects:

  • Classifying images to separate out the application form documents
  • Extracting text from application form documents
  • Identifying entities (information) in application form documents and using configuration files to determine what the application form is for
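
As a rough illustration of the last aspect, the sketch below picks a form type by counting configured keywords in the extracted text. The configuration layout, the form-type names, and the `guess_form_type` helper are all assumptions made for this example; the pattern's own configuration format may look quite different.

```python
import json

# Hypothetical configuration: form type -> keywords that suggest it.
# In practice this would be loaded from a configuration file.
CONFIG_JSON = """
{
  "form_types": {
    "rental_agreement": ["rent", "tenant", "landlord", "lease"],
    "purchase_agreement": ["purchase", "buyer", "seller", "sale price"]
  }
}
"""


def guess_form_type(text, config):
    """Return the form type whose keywords occur most often, or None."""
    lowered = text.lower()
    # Plain substring counting: simple, but "rent" would also match "parent".
    scores = {
        form_type: sum(lowered.count(keyword) for keyword in keywords)
        for form_type, keywords in config["form_types"].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


config = json.loads(CONFIG_JSON)
sample_text = "This lease is made between the landlord and the tenant."
form_type = guess_form_type(sample_text, config)
print(form_type)
```

Keeping the keywords in configuration rather than in code means a new form type can be supported by editing the file alone.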

After completing this pattern, you will have learned how to:

  • Extract text using OCR
  • Extract entities from documents using the IBM Watson Natural Language Understanding API and Jupyter Notebooks
  • Use a configuration file to build a configurable, layered classification grammar
  • Combine grammatical classification with regex patterns from a configuration file to extract information
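
To make the regex part of the last point concrete, here is a minimal sketch that reads field patterns from a JSON configuration and applies them to extracted text. The field names, the patterns, and the `extract_fields` helper are hypothetical, and the grammatical classification step is left out; treat it as one possible shape, not the pattern's actual implementation.

```python
import json
import re

# Hypothetical configuration: field name -> regex with one capture group.
# The raw string keeps the doubled backslashes that JSON requires.
CONFIG_JSON = r"""
{
  "fields": {
    "applicant": "Applicant:\\s*([A-Z][a-z]+ [A-Z][a-z]+)",
    "date": "\\b(\\d{2}/\\d{2}/\\d{4})\\b",
    "phone": "\\b(\\d{3}-\\d{3}-\\d{4})\\b"
  }
}
"""


def extract_fields(text, config):
    """Apply each configured pattern and keep the first captured match."""
    found = {}
    for field, pattern in config["fields"].items():
        match = re.search(pattern, text)
        if match:
            found[field] = match.group(1)
    return found


config = json.loads(CONFIG_JSON)
sample_text = "Applicant: Jane Doe. Date: 01/02/2018. Phone: 555-123-4567."
fields = extract_fields(sample_text, config)
print(fields)
```

Fields whose pattern finds no match are simply omitted from the result, so callers can tell "not found" apart from an empty value.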

We will use Python, Jupyter Notebooks, the Natural Language Toolkit (NLTK) for Python, the Watson Natural Language Understanding API, and IBM Cloud Object Storage.

Flow

  1. The code pattern identifies the application form document image.
  2. Text is extracted from the image by running Python code in Jupyter Notebooks in Watson Studio.
  3. The extracted text is stored in IBM Cloud Object Storage.
  4. Python code running in Jupyter Notebooks pulls the text from IBM Cloud Object Storage.
  5. Entities are extracted from the text using the Watson Natural Language Understanding service.
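
Once step 5 returns, the entities still have to be organized before they can be used. The sketch below groups entities by type from a response-shaped dictionary. The sample response is hand-written for illustration and is not real service output; its `type`, `text`, and `relevance` fields are assumed to mirror the `entities` list the service returns. The `group_entities` helper and its `min_relevance` threshold are likewise assumptions.

```python
from collections import defaultdict

# Hand-written sample shaped like an entities result; not real service output.
sample_response = {
    "entities": [
        {"type": "Person", "text": "Jane Doe", "relevance": 0.95},
        {"type": "Location", "text": "Austin", "relevance": 0.61},
        {"type": "Person", "text": "John Roe", "relevance": 0.12},
    ]
}


def group_entities(response, min_relevance=0.5):
    """Group entity texts by type, skipping low-relevance entries."""
    grouped = defaultdict(list)
    for entity in response.get("entities", []):
        if entity.get("relevance", 0.0) >= min_relevance:
            grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


entities_by_type = group_entities(sample_response)
print(entities_by_type)
```

Filtering on relevance is only one way to reduce noise from OCR errors; a suitable threshold would need to be tuned on real documents.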

Instructions

Please see the README for detailed instructions.