Build a custom speech-to-text model with speaker diarization capabilities


In this code pattern, learn how to train custom language and acoustic models for speech to text, then use them to transcribe audio recordings of a meeting or classroom into speaker-diarized output. The inputs are a corpus file and the audio recordings.


One feature of the IBM® Watson™ Speech to Text service is speaker diarization: the ability to detect the different speakers in an audio file. This code pattern demonstrates that capability in a Python Flask runtime. It trains a custom language model with a corpus text file, which teaches the model ‘Out of Vocabulary’ words, and a custom acoustic model with the audio files, which adapts the model to the speakers' accents.

After completing the code pattern, you will understand how to:

  • Train a custom language model with a corpus file
  • Train a custom acoustic model with audio files from an IBM Cloud Object Storage bucket
  • Transcribe the audio files from the bucket and get speaker-diarized text output
  • Store the transcript in the bucket
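
To make "speaker-diarized text output" concrete, here is a minimal sketch that groups recognized words into speaker turns. It assumes the JSON shape the Speech to Text service returns when speaker labels are requested: each result carries word `timestamps` as `[word, start, end]`, and a top-level `speaker_labels` list carries `from`, `to`, and `speaker` fields. The function name `format_speaker_turns` is hypothetical and is not taken from the code pattern's repository.

```python
def format_speaker_turns(response):
    """Group recognized words into consecutive speaker turns.

    `response` is assumed to be the parsed JSON of a recognition request
    made with speaker labels enabled.
    """
    # Map each word's start time to the speaker the service assigned to it.
    speaker_at = {
        label["from"]: label["speaker"]
        for label in response.get("speaker_labels", [])
    }

    turns = []  # list of (speaker, [words]) in spoken order
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Use the first (highest-ranked) alternative of each result.
        for word, start, _end in alternatives[0].get("timestamps", []):
            speaker = speaker_at.get(start, "unknown")
            if turns and turns[-1][0] == speaker:
                turns[-1][1].append(word)  # same speaker keeps talking
            else:
                turns.append((speaker, [word]))  # a new turn starts

    return [f"Speaker {speaker}: {' '.join(words)}" for speaker, words in turns]
```

Words are matched to speakers by their start time, so a word with no matching label is attributed to "unknown" rather than dropped.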


Custom speech-to-text model diarization flow

  1. The user uploads a corpus file to the application.
  2. The extracted audio from the previous code pattern is retrieved from IBM Cloud Object Storage.
  3. The corpus file as well as the extracted audio are uploaded to the Watson Speech To Text service to train the custom model.
  4. The downloaded audio file from the previous code pattern is transcribed with the custom speech-to-text model, and the text file is stored in IBM Cloud Object Storage.
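
The training and transcription steps of this flow can be sketched as follows. This is an illustration under stated assumptions, not the code pattern's actual implementation: `stt` is assumed to be an authenticated `SpeechToTextV1` client from the `ibm-watson` Python SDK, the inputs are already-open binary file objects (retrieving them from IBM Cloud Object Storage is left out), and the model names, the base model `en-US_BroadbandModel`, and the `audio/flac` content type are placeholder choices. The helper names `wait_for_status` and `train_and_transcribe` are hypothetical.

```python
import time


def wait_for_status(get_model, customization_id, wanted, poll_seconds=10):
    """Poll a custom model until it reports the wanted status."""
    while True:
        status = get_model(customization_id).get_result()["status"]
        if status == wanted:
            return
        if status == "failed":
            raise RuntimeError(f"Custom model {customization_id} failed")
        time.sleep(poll_seconds)


def train_and_transcribe(stt, corpus_file, training_audio, meeting_audio,
                         base_model="en-US_BroadbandModel",
                         content_type="audio/flac"):
    """Train custom language and acoustic models, then transcribe with both."""
    # Custom language model: learns out-of-vocabulary words from the corpus.
    lang_id = stt.create_language_model(
        "custom-language", base_model).get_result()["customization_id"]
    stt.add_corpus(lang_id, "corpus", corpus_file)
    wait_for_status(stt.get_language_model, lang_id, "ready")
    stt.train_language_model(lang_id)
    wait_for_status(stt.get_language_model, lang_id, "available")

    # Custom acoustic model: adapts to the speakers in the recordings.
    acoustic_id = stt.create_acoustic_model(
        "custom-acoustic", base_model).get_result()["customization_id"]
    for index, audio in enumerate(training_audio):
        stt.add_audio(acoustic_id, f"audio-{index}", audio,
                      content_type=content_type)
    wait_for_status(stt.get_acoustic_model, acoustic_id, "ready")
    stt.train_acoustic_model(acoustic_id, custom_language_model_id=lang_id)
    wait_for_status(stt.get_acoustic_model, acoustic_id, "available")

    # Transcribe with both custom models and request speaker labels.
    return stt.recognize(
        audio=meeting_audio,
        content_type=content_type,
        model=base_model,
        language_customization_id=lang_id,
        acoustic_customization_id=acoustic_id,
        speaker_labels=True,
    ).get_result()
```

Training runs asynchronously on the service side, which is why the sketch polls each model's status before moving to the next step. The returned dictionary is what would then be formatted and stored as a text file in IBM Cloud Object Storage.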


Find detailed instructions in the README file. The steps explain how to:

  1. Clone the GitHub repository.
  2. Create the Watson Speech to Text service.
  3. Add the credentials to the application.
  4. Deploy the application.
  5. Run the application.

This code pattern is part of the Extracting insights from videos with IBM Watson use case series, which showcases a solution for extracting meaningful insights from videos using the Watson Speech to Text, Watson Natural Language Processing, and Watson Tone Analyzer services.