Train a speech-to-text model


This code pattern explains how to create a custom Watson Speech to Text model for handling specialized domain data. To improve the accuracy of the service, the code pattern uses transfer learning by training the existing model with new data from the medical industry.


The Watson Speech to Text service is among the best in the industry. However, like other cloud speech services, it was trained on general conversational speech for general use, so it might not perform well in specialized domains such as medicine, law, or sports. To improve the accuracy of the speech-to-text service, you can use transfer learning by training the existing AI model with new data from your domain.

In this code pattern, we use a medical speech data set to illustrate the process. The data is provided by ezDI and includes 16 hours of medical dictation as both audio files and text transcriptions.

When you have completed this code pattern, you will understand how to:

  • Prepare audio data and transcription text for training a speech-to-text model
  • Work with the Watson Speech to Text service through API calls
  • Train a custom speech-to-text model with a data set
  • Enhance the model with continuous user feedback
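
As a rough illustration of the first objective, the sketch below turns raw transcription lines into a plain-text corpus with one sentence per line, which is the general shape of a corpus file for a custom language model. The function names `normalize_line` and `build_corpus` and the cleanup rules are assumptions made for this example; they are not the code pattern's actual preparation script.

```python
import re
from typing import Iterable


def normalize_line(text: str) -> str:
    """Collapse runs of whitespace and trim both ends of one transcription line."""
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(transcripts: Iterable[str]) -> str:
    """Return corpus text with one normalized sentence per line.

    Empty lines are dropped. Hypothetical helper for illustration only.
    """
    lines = []
    for raw in transcripts:
        line = normalize_line(raw)
        if line:
            lines.append(line)
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    # Made-up sample sentences, not taken from the ezDI data set.
    sample = [
        "  The patient   is stable. ",
        "",
        "Follow up\tin two weeks.",
    ]
    print(build_corpus(sample), end="")
```

The resulting text can be written to a file and submitted as training data. Any domain-specific cleanup, such as expanding abbreviations, would go in `normalize_line`.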


Figure: Flow diagram for customizing and training your own speech-to-text model

  1. The user downloads the custom data set and prepares the audio and text data for training.
  2. The user sets up access to the Watson Speech to Text service by configuring the credentials.
  3. The user runs training on the batch of data through the provided application GUI or the command line.
  4. The user interactively tests the new custom speech model by speaking phrases into the computer microphone and verifying the text transcription returned from the model.
  5. If the text transcription is not correct, the user can make corrections and resubmit the updated data for training.
  6. Several users can work on the same custom model at the same time.
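
Steps 2 and 3 of the flow can be sketched as three HTTP requests to the service's customization interface: create a custom language model, add a corpus, and start training. The example below only builds the requests with the Python standard library and does not send them. The helper names, the example base model `en-US_BroadbandModel`, the `text/plain` content type for the corpus, and the placeholder URL and key are assumptions for illustration. The code pattern's own application may use a different client, and it also trains an acoustic model, which is not shown here.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def _auth_header(api_key: str) -> str:
    """Basic auth value for an IAM API key (user name is the literal 'apikey')."""
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def create_model_request(base_url, api_key, name, base_model):
    """Build (but do not send) the request that creates a custom language model."""
    body = json.dumps({"name": name, "base_model_name": base_model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/customizations",
        data=body,
        method="POST",
        headers={
            "Authorization": _auth_header(api_key),
            "Content-Type": "application/json",
        },
    )


def add_corpus_request(base_url, api_key, customization_id, corpus_name, corpus_text):
    """Build the request that uploads corpus text to an existing custom model."""
    url = (
        f"{base_url}/v1/customizations/{quote(customization_id)}"
        f"/corpora/{quote(corpus_name)}"
    )
    return urllib.request.Request(
        url,
        data=corpus_text.encode("utf-8"),
        method="POST",
        # Content type is an assumption made for this sketch.
        headers={"Authorization": _auth_header(api_key), "Content-Type": "text/plain"},
    )


def train_request(base_url, api_key, customization_id):
    """Build the request that starts training of the custom model."""
    return urllib.request.Request(
        f"{base_url}/v1/customizations/{quote(customization_id)}/train",
        data=b"",
        method="POST",
        headers={"Authorization": _auth_header(api_key)},
    )


if __name__ == "__main__":
    # Placeholder values; substitute the URL and API key from your service credentials.
    req = create_model_request(
        "https://example.invalid", "YOUR_API_KEY", "medical", "en-US_BroadbandModel"
    )
    print(req.get_method(), req.full_url)
```

Each request could be sent with `urllib.request.urlopen(req)`. In practice the corpus must finish processing before training is started, so a real script would check the model status between the second and third calls.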


Find the detailed steps for this pattern in the README file. The steps show you how to:

  1. Clone the repo.
  2. Create IBM Cloud services.
  3. Configure the credentials.
  4. Download and prepare the data.
  5. Train the models.
  6. Transcribe your dictation.
  7. Correct the transcription.
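
The last step, correcting the transcription, feeds new training data back into the model. A minimal sketch of that feedback loop follows: it compares what the model returned with what the user corrected and keeps only the sentences that changed, so they can be submitted as additional training text. The function name `collect_corrections` and the pair-based input format are hypothetical and are not taken from the code pattern's application.

```python
from typing import Iterable, List, Tuple


def _squash(text: str) -> str:
    """Collapse whitespace so that spacing differences do not count as corrections."""
    return " ".join(text.split())


def collect_corrections(results: Iterable[Tuple[str, str]]) -> List[str]:
    """Return corrected sentences that differ from the model's transcription.

    Each item in `results` is a (transcribed, corrected) pair of strings.
    """
    corrections = []
    for transcribed, corrected in results:
        fixed = _squash(corrected)
        if fixed and fixed != _squash(transcribed):
            corrections.append(fixed)
    return corrections


if __name__ == "__main__":
    # Made-up example pairs for illustration.
    pairs = [
        ("the patient has high per tension", "the patient has hypertension"),
        ("follow up in two weeks", "follow up in two weeks"),
    ]
    for sentence in collect_corrections(pairs):
        print(sentence)
```

Only the first pair yields a correction here, because the second transcription already matches the user's text.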