Enrich multimedia files using services

Summary

Multimedia files are increasingly essential for any type of web communication, whether marketing, instructional, or entertainment. This pattern shows you how to use the IBM Watson® Node.js SDK to create a web UI app that uses speech-to-text conversion, tone analysis, natural language understanding, and visual recognition processing to enrich multimedia files.

Description

If you use the web – and who doesn’t? – you know that multimedia files are now essential for building an audience. Whether you’re retailing, marketing, instructing, or entertaining, a flat web page is no longer an option. You need audio and video.

Most developers know how to include multimedia content in their apps. But anyone with a lot of video content knows how difficult it is to quickly perform a granular search and pull data from those video files. What’s the essential information covered in the video? How do you find related videos? Can you quickly provide recommendations for other videos to a user? A developer who knows how to quickly search and derive information from video content will have an edge on the pack.

This pattern will help you do more with multimedia. It shows you how to use the IBM Watson Node.js SDK to create a web UI app that includes speech-to-text conversion, tone analysis, natural language understanding, and visual recognition processing to enrich multimedia files. By performing visual recognition every few seconds, you can find information in the video faster and make it readily available for any purpose.
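As a rough illustration of sampling a video at a fixed interval, the sketch below computes the playback offsets at which a frame could be captured. The function name `captureOffsets` is hypothetical, and the 10-second default simply mirrors the interval mentioned in the Flow section; this is not the pattern's actual code.

```javascript
// Hypothetical helper: lists the offsets (in seconds) at which to grab a frame.
// Offsets start at 0 and stop before the end of the video.
function captureOffsets(durationSeconds, intervalSeconds = 10) {
  if (!(intervalSeconds > 0)) {
    throw new RangeError("intervalSeconds must be positive");
  }
  const offsets = [];
  for (let t = 0; t < durationSeconds; t += intervalSeconds) {
    offsets.push(t);
  }
  return offsets;
}

console.log(captureOffsets(35)); // [ 0, 10, 20, 30 ]
```

Each offset would then correspond to one frame sent for visual recognition.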

You’ll create two Node.js apps. The first app processes multimedia files using the IBM Watson Speech to Text, Tone Analyzer, Natural Language Understanding, and Visual Recognition services, plus a Cloudant NoSQL database. This multimedia processor extracts enriched data from the media files and stores the results in the NoSQL DB. The second is a web UI app that displays the enriched data, enabling you to view the results in real time as the media file is played back in a series of time segments.
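To make the idea of time-segmented playback concrete, here is a minimal sketch of how a UI might pick the enriched record that matches the current playback position. The record fields (`start`, `end`, `transcript`) and the function `findSceneAt` are assumptions for illustration; the real document schema is defined by the pattern's code, not here.

```javascript
// Hypothetical enriched records, as a UI might load them from the database.
// Times are in seconds; each scene covers the half-open range [start, end).
const scenes = [
  { start: 0, end: 12.5, transcript: "Welcome to the demo." },
  { start: 12.5, end: 30, transcript: "Let's look at the results." },
];

// Returns the scene covering the given playback time, or null if none does.
function findSceneAt(sceneList, playbackSeconds) {
  const match = sceneList.find(
    (scene) => playbackSeconds >= scene.start && playbackSeconds < scene.end
  );
  return match || null;
}

console.log(findSceneAt(scenes, 15).transcript); // Let's look at the results.
```

In a browser, a function like this could be called from the video element's `timeupdate` event with the element's `currentTime` value.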

When you complete this pattern, you will understand how to:

  • Create Node.js apps that extract and display enriched data from multimedia files using Watson services
  • Use Watson Speech to Text to extract text from video files
  • Use Watson Tone Analyzer to detect emotion in a conversation
  • Identify entities with Watson Natural Language Understanding
  • Extract classifications, faces, and words from video files using Watson Visual Recognition
  • Store enriched data in a Cloudant NoSQL DB

If you’re looking to work on dynamic multimedia content and separate yourself from the development pack, this pattern is for you.

Flow

  1. A multimedia file is passed into the Media Processor enrichment process.
  2. The Watson Speech to Text service transcribes the audio to text. The text is broken up into scenes based on a timer, a change in speaker, or a significant pause in speech.
  3. The Watson Natural Language Understanding service pulls out keywords, entities, concepts, and taxonomy for each scene.
  4. The Watson Tone Analyzer service extracts top emotions and social and writing tones for each scene.
  5. The Watson Visual Recognition service takes a screen capture every 10 seconds and creates a “moment.” Classifications, faces, and words are extracted from each screenshot.
  6. All scenes and moments are stored in the Cloudant NoSQL database.
  7. The app UI displays stored scenes and moments.
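The scene-splitting rule in step 2 can be sketched as follows. This is a simplified illustration, not the pattern's implementation: the input shape (one object per word with `text`, `start`, `end`, and `speaker`), the function name `splitIntoScenes`, and the thresholds (a 30-second timer and a 2-second pause) are all assumptions chosen for the example.

```javascript
// Assumed input: one entry per recognized word, with times in seconds.
// This shape is hypothetical, not the Speech to Text response format.
function splitIntoScenes(words, { maxSceneSeconds = 30, pauseSeconds = 2 } = {}) {
  const scenes = [];
  let current = null;  // scene being built
  let previous = null; // previously processed word

  for (const word of words) {
    const startsNewScene =
      current === null ||
      word.speaker !== previous.speaker ||         // change in speaker
      word.start - previous.end >= pauseSeconds || // significant pause
      word.end - current.start > maxSceneSeconds;  // timer expired

    if (startsNewScene) {
      current = {
        start: word.start,
        end: word.end,
        speaker: word.speaker,
        text: word.text,
      };
      scenes.push(current);
    } else {
      // Extend the open scene with this word.
      current.end = word.end;
      current.text += " " + word.text;
    }
    previous = word;
  }
  return scenes;
}

const words = [
  { text: "Hello", start: 0.0, end: 0.4, speaker: 0 },
  { text: "there", start: 0.5, end: 0.9, speaker: 0 },
  { text: "Hi", start: 1.0, end: 1.2, speaker: 1 },
  { text: "Anyway", start: 4.0, end: 4.6, speaker: 1 },
];

for (const scene of splitIntoScenes(words)) {
  console.log(`${scene.start}-${scene.end} speaker ${scene.speaker}: ${scene.text}`);
}
```

With the sample input, the change in speaker and the 2.8-second pause each start a new scene, so the four words are grouped into three scenes. Each scene's text could then be passed to the language and tone services described in steps 3 and 4.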

Instructions

Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.