Extract relevant text content from an image and derive insights

Summary

In some business scenarios, information from specific sections or areas of a scanned document needs to be extracted for further processing. For example, a real estate company might scan newspaper pages to extract the individual classifieds, or a bank might scan and upload the supporting documents for a loan.

It’s tedious to manually go through the document and extract the required sections, especially when there are thousands of such pages. What if you could programmatically extract information from different sections in a document while simultaneously gaining insights about those sections?

This code pattern shows you how to derive insights from scanned documents where the needed information is scattered in various sections or layouts.

Description

In this code pattern, you will follow a methodology to gain relevant insights from scanned documents. You will see how to pre-process scanned images to find relevant sections, extract the relevant text, feed the extracted text to the Watson Language Translator service for language translation, and then use the Watson Natural Language Understanding service to find key insights in the text.

This code pattern shows you how to use Appsody stacks to build the required microservices and deploy them to a Red Hat OpenShift cluster on IBM Cloud. A master application deployed on Watson Studio orchestrates the microservices that process and extract information from the scanned documents, as sketched below.
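
From the notebook's point of view, each microservice is just an HTTP endpoint. The following is a minimal sketch using the Python requests library; the route URL and endpoint path are hypothetical and depend on how the Appsody services are exposed on your OpenShift cluster.

```python
import requests

# Hypothetical OpenShift route for the image pre-processor microservice;
# substitute the route created for your own deployment.
PREPROCESSOR_URL = "http://image-preprocessor-myproject.example.com/preprocess"

def invoke_preprocessor(image_path):
    """Send a scanned page to the pre-processor service and return its
    JSON response (for example, the names of the cropped section images)."""
    with open(image_path, "rb") as image_file:
        response = requests.post(PREPROCESSOR_URL, files={"file": image_file})
    response.raise_for_status()
    return response.json()
```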

After completing this code pattern, you will understand how to:

  • Containerize OpenCV, Tesseract, and an IBM Cloud Object Storage client using an Appsody stack and deploy them on a Red Hat OpenShift cluster on IBM Cloud.
  • Pre-process images to separate them into different sections using OpenCV (sketched after this list).
  • Use Tesseract to extract text from an image (sketched after this list).
  • Use Watson Language Translator to translate the text from Hindi to English.
  • Use Watson Natural Language Understanding to derive insights from the text.
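
A minimal sketch of the pre-processing step with OpenCV 4.x: threshold the page, dilate so each classified becomes one connected blob, then crop the bounding box of each sufficiently large contour. The kernel size and minimum area are assumptions you would tune for your scans.

```python
import cv2

def split_into_sections(image_path, min_area=5000):
    """Split a scanned page into candidate classified sections by
    cropping the bounding boxes of large connected regions."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Invert so text and borders become foreground for contour detection.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate to merge the characters within a section into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    dilated = cv2.dilate(binary, kernel)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    sections = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h >= min_area:  # skip noise and tiny fragments
            sections.append(image[y:y + h, x:x + w])
    return sections
```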
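And a sketch of the OCR step with the pytesseract wrapper; it assumes the Tesseract binary and the Hindi language pack (tesseract-ocr-hin) are installed.

```python
import pytesseract
from PIL import Image

def extract_text(section_image_path):
    """Run Tesseract OCR on one cropped classified image;
    lang='hin' selects the Hindi traineddata."""
    return pytesseract.image_to_string(Image.open(section_image_path),
                                       lang="hin")
```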

Flow

Architecture flow

  1. The classifieds image is stored in IBM Cloud Object Storage, and the Jupyter notebook execution is triggered.
  2. The object storage operations microservice is invoked (see the sketch after this list).
  3. The classifieds image is retrieved from Object Storage.
  4. The image pre-processor service is invoked. The different sections in the image are identified and extracted into separate images, each containing a single classified.
  5. The individual classified image is sent to the text extractor service, where the address text is extracted.
  6. The extracted address text is sent to Watson Language Translator, where the content is translated to English.
  7. The translated English text is sent to Watson Natural Language Understanding, where the entities of interest are extracted to generate the required insights (both Watson calls are sketched below).
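
Steps 1 through 3 amount to reading the image out of a bucket. A minimal sketch with the ibm_boto3 client; the endpoint, credentials, bucket, and object key are placeholders for your own Cloud Object Storage instance.

```python
import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials and endpoint for your own COS instance.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_COS_APIKEY",
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Download the scanned classifieds page for processing.
obj = cos.get_object(Bucket="classifieds-bucket", Key="classifieds.png")
image_bytes = obj["Body"].read()
```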
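Steps 6 and 7 map to two calls against the Watson services. A minimal sketch with the ibm-watson Python SDK; the API keys and service URLs are placeholders from your own service credentials, and hi-en is the Hindi-to-English translation model.

```python
from ibm_watson import LanguageTranslatorV3, NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

translator = LanguageTranslatorV3(
    version="2018-05-01",
    authenticator=IAMAuthenticator("YOUR_TRANSLATOR_APIKEY"))
translator.set_service_url("YOUR_TRANSLATOR_URL")

nlu = NaturalLanguageUnderstandingV1(
    version="2019-07-12",
    authenticator=IAMAuthenticator("YOUR_NLU_APIKEY"))
nlu.set_service_url("YOUR_NLU_URL")

def derive_insights(hindi_text):
    """Translate extracted Hindi text to English, then pull out the
    entities that Watson NLU recognizes in it."""
    translation = translator.translate(
        text=hindi_text, model_id="hi-en").get_result()
    english = translation["translations"][0]["translation"]
    analysis = nlu.analyze(
        text=english,
        features=Features(entities=EntitiesOptions())).get_result()
    return english, analysis["entities"]
```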

Included components

  • IBM Cloud account: IBM Cloud is a set of cloud computing services for business.
  • Jupyter Software: Project Jupyter is a nonprofit organization created to “develop open-source software, open-standards, and services for interactive computing across dozens of programming languages”.
  • Appsody CLI: Appsody enables you to quickly build and deploy cloud-native applications.
  • Red Hat OpenShift Container Platform: OpenShift offers a consistent hybrid cloud foundation for building and scaling containerized applications.
  • Cloud: Accessing computer and information technology resources through the Internet.
  • Containers: Virtual software objects that include all the elements that an app needs to run.
  • Python 3: Python is an interpreted, high-level, general-purpose programming language.

Next steps

Check out the rest of the content in this series: