Overview

Skill Level: Intermediate

In this tutorial, we will walk through all the steps required to use the IBM Watson Speech to Text acoustic model customization service to improve the accuracy of the Speech to Text service.

Ingredients

 

  1. Download all required files here. You will see two files: a tar.gz file, which we will use for our training data, and a .wav file, which we will use to test our models. Download both files.
  2. Sign up for or log in to your IBM Cloud account.
  3. Provision the IBM Watson Speech to Text service by following the instructions here.
  4. Once the service is provisioned, go to the “Service Credentials” tab to get your username and password for the service. Save these credentials locally. You will need them below.

Step-by-step

  1. Run test audio files through the standard IBM Watson Speech to Text service

    For this tutorial, we will use cURL to decode the audio files with our Watson Speech to Text service. There are various other ways to call the service, either in real time or for batch processing. More information can be found in our documentation here.

    Step 1: Open Terminal or cygwin tool

    cURL is typically available on Linux-based systems. If you use a Windows machine, you can download ‘cygwin’, which gives you a Linux-like command window on Windows.

    For Mac, open up your “Terminal” application. For Windows, open up the ‘cygwin’ application. cd into the home directory where you downloaded the two files described in the Ingredients section above.
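    For example, if you saved both files to your Downloads folder (a hypothetical location; adjust the path to wherever you actually put them), you would run:

    /*cd ~/Downloads*/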

    Step 2: Save your IBM Watson Speech to Text Service credentials

    For ease of use, create a variable to save your username and password that you recorded earlier.

    /*export CREDS="<your username:password>"*/

     

    Step 3: Use the API to decode the audio files

    The cURL command to decode a single audio file (of type WAV) using the US English Narrowband model is shown below. Note the following:

    1) The filename should be the file you are testing.

    2) The model you refer to should align with the audio file you are transcribing. Here we are using the en-US_NarrowbandModel. If your audio file is sampled at 16 kHz or higher, we recommend you use the en-US_BroadbandModel instead (an example appears after the response below).

    Request

    /*curl -u $CREDS -X POST -H "Content-Type: audio/wav" --header 
    "Transfer-Encoding: chunked" --data-binary @129_162.wav
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel"*/

     

    Response 

    /*
    {

       "results": [

          {

             "alternatives": [

                {

                   "confidence": 0.825,

                   "transcript": "Jenny had always wanted a pair of her own she wanted him to take care of someone to be
    there when she came back home and most of all and animals to be best friends with Jenny would constantly ask her dad to
    buy her a pet but I heard that it always you fuse Jenny then you having a pet was a big responsibility he did not think
    Jimmy was ready to own a pet at ten years old and that was exactly what he told her when she asked for one one would be
    ready Jenny asks when they weigh into your twelve Jenny that as a dad replied this was good news for Jenny all she had
    to do now was to be patient "

                }

             ],

             "final": true

          }

       ],

       "result_index": 0
    */

     

    OK, excellent. You’ve just decoded your first audio file using the IBM Watson Speech to Text service!
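    Note: had the audio been sampled at 16 kHz or higher, the only change would be the model query parameter. The same request against the Broadband model would look like this (shown for illustration; the sample file in this tutorial is narrowband audio):

    /*curl -u $CREDS -X POST -H "Content-Type: audio/wav" --header 
    "Transfer-Encoding: chunked" --data-binary @129_162.wav
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_BroadbandModel"*/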

  2. Gather Audio data to train your custom acoustic model

    One of the most important steps in designing a highly effective custom acoustic model is gathering relevant data.

    The key is to use as much representative audio as you can get your hands on to create the custom acoustic model. The more closely your training audio matches your use case, the greater the accuracy improvements you will see. Although you can create a custom acoustic model with as little as 10 minutes of audio, we highly recommend that you use at least 10 to 20 hours of audio to create the custom acoustic model.

    For example:

    • If your use case is transcribing audio for call center analytics, try to get recorded audio of your customers speaking to agents, preferably from the call center you are analyzing.
    • If your use case is real-time transcription of audio for closed captioning of media files, try to get previously recorded media files of the same genre and use them to build out your custom acoustic model.

     

    Note: If you already have transcriptions for the training audio, we recommend you first build out the language model as described here and then build out the acoustic model as described in this tutorial. As you build the custom acoustic model, we will show you below how to tie the two together. For further information and more sample code, see the detailed documentation here. It is also highly recommended that you get the audio transcribed so you can see the full potential of customization.

     

  3. Create custom Acoustic Model

    Now that you have gathered your audio data, the next step is to create an empty custom acoustic model as the starting point for building it out.

    To create a custom acoustic model, use the cURL command below. Provide 1) the name of the model, 2) a description of the model, and 3) the base model you are building against.

    /*curl -X POST -u $CREDS --header "Content-Type: application/json" 
    --data "{\"name\": \"Acoustic model Test\", \"base_model_name\": \"en-US_NarrowbandModel\",
    \"description\": \"Acoustic model Test Description\"}"
    "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations"*/

     

    In the response you will get the GUID (customization ID) of the newly created, empty custom acoustic model. Save this GUID as a variable, since you will need it in the steps that follow.
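    For reference, the response body is small; with a placeholder value, it looks roughly like this:

    /*
    {
       "customization_id": "<your acoustic customization ID>"
    }
    */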

    /*export AM_CUSTID="ID that was just returned"*/

     

  4. Build your custom Acoustic Model

    Now that you have an empty custom acoustic model, we will go through all the steps required to train it.

     

    Step 1: Add training data to your custom acoustic model. For this tutorial we will use the training data set that is here. Note that this archive contains only about 10 minutes of audio. We are using it now for training purposes, but in real use cases you want at least 10 hours of audio to train the initial model and see a good accuracy improvement.

    /*curl -X POST -u $CREDS --header "Content-Type: application/gzip" 
    --header "Contained-Content-Type: audio/wav;rate=8000"
    --data-binary @latin-train-10min.tar.gz
    "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID/audio/audio1"*/

     

    Step 2: Monitor your request. 

    /*curl -X GET -u $CREDS 
    https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID/audio/audio1*/
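    The response includes a status field for the uploaded resource: it reports being_processed while the service is still analyzing the archive and ok once the data is ready for training. If you would rather not re-run the command by hand, a simple polling sketch (assumptions: a 30-second check interval is acceptable and the response text can be searched with grep) might look like:

    /*
    # Poll the audio resource until the service has finished processing it
    while curl -s -X GET -u $CREDS \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID/audio/audio1" \
      | grep -q "being_processed"; do
      echo "Audio still being processed..."
      sleep 30
    done
    echo "Audio resource is ready for training."
    */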

     

    Step 3: Now that your data is added to the model, you can start training it.

    If you have transcriptions, we strongly recommend that before you start the training process, you first create a custom language model and then combine it with this custom acoustic model. Attaching a language model to the acoustic model is optional but highly recommended.

    Using a custom language model to train a custom acoustic model is effective only if the custom language model was built with direct transcriptions of the audio data or contains words from the same domain as the audio. If the audio contains many out-of-vocabulary (OOV) words, it is wise to use a custom language model during training, even if the custom language model merely adds a list of custom words. For more information, see Creating a custom language model.
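    The guide linked above covers custom language models in detail; purely for orientation, the overall flow looks roughly like the sketch below (the corpus filename transcriptions.txt and the model name are hypothetical, and the endpoints follow the Speech to Text language customization API):

    /*
    # 1) Create an empty custom language model against the same base model
    curl -X POST -u $CREDS --header "Content-Type: application/json" \
      --data "{\"name\": \"Language model Test\", \"base_model_name\": \"en-US_NarrowbandModel\", \"description\": \"Language model Test Description\"}" \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations"

    # 2) Save the returned customization_id, then add your transcriptions as a corpus
    export LM_CUSTID="ID that was just returned"
    curl -X POST -u $CREDS --data-binary @transcriptions.txt \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$LM_CUSTID/corpora/corpus1"

    # 3) Train the custom language model
    curl -X POST -u $CREDS \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$LM_CUSTID/train"
    */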

    Use the optional custom_language_model_id query parameter of the POST /v1/acoustic_customizations/{customization_id}/train method to train your custom acoustic model with a custom language model, as in the example below. Pass the GUID of the acoustic model with the customization_id parameter and the GUID of the custom language model with the custom_language_model_id parameter. Both models must be owned by the service credentials passed with the request.

    To train the acoustic model without any language model, use this command:

    /*curl -X POST -u $CREDS 
    "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID/train"*/

    To train the acoustic model with a language model, use this command:

    /*curl -X POST -u $CREDS
    "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID/train?custom_language_model_id=<your_prebuilt_language_model_id>"*/

     

    Step 4: Check whether the training is complete and the model is available using this command. Training times vary; in general, training takes about twice the duration of the training audio. For example, if you are training with 10 hours of audio, training could take up to 20 hours.

    /*curl -X GET -u $CREDS 
    https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID*/

     

    When you see "status": "available", as shown below, your model is ready to be used.

    /*
    {  
    "owner": "<your GUUID>",  
    "base_model_name": "en-US_NarrowbandModel",  
    "customization_id": "<your customization ID>",  
    "versions": ["en-US_NarrowbandModel.v2017-11-15"],  
    "created": "2018-02-22T19:46:37.997Z",  
    "name": "<custom model name>",   
    "description": "<custom model description>",  
    "progress": 100,  
    "language": "en-US",   
    "status": "available"
    } */
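    If you have the jq tool installed, you can trim the status check down to just the two fields you care about (a convenience sketch, not a required step):

    /*curl -s -X GET -u $CREDS \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/acoustic_customizations/$AM_CUSTID" \
      | jq '{status, progress}'*/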

     

  5. Re-run the audio using the custom Acoustic Model

    Now that we have built the custom acoustic model, let's re-run the test audio using it and see the results.

    /*curl -u $CREDS -X POST -H "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" 
    --data-binary @129_162.wav
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&acoustic_customization_id=$AM_CUSTID"*/

    Response

    /*
    {

       "results": [

          {

             "alternatives": [

                {

                   "confidence": 0.922,

                   "transcript": "Jenny had always wanted a pair of her own she wanted him to take care of someone to be
    there when she came back home and most of all and animals to be best friends with Jenny would constantly ask her dad
    to buy her a pet but I heard that it always you fuse Jenny then you having a pet was a big responsibility he did not
    think Jenny was ready to own a pet at ten years old and that was exactly what he told her when she asked for one one
    would be ready Jenny asks when they weigh into your twelve Jenny that as a dad replied this was good news for Jenny
    all she had to do now was to be patient "

                }

             ],

             "final": true

          }

       ],

       "result_index": 0
    */
  6. Compare Results

    As you can see, in this hypothetical case with only 10 minutes of training audio, although we did not see any change in the text output, our confidence improved dramatically from 82.5% to 92.2%. Imagine the improvements you will see when you create custom acoustic models with 10, 20, or 50 hours of audio!
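    If you want to compare runs side by side without reading through the full JSON, you can pipe either request through jq to pull out only the transcript and its confidence (assuming jq is installed; otherwise read the fields from the raw response):

    /*curl -s -u $CREDS -X POST -H "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" \
      --data-binary @129_162.wav \
      "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&acoustic_customization_id=$AM_CUSTID" \
      | jq '.results[].alternatives[] | {confidence, transcript}'*/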

     

     

     

     

  7. Conclusion

    We hope you enjoyed this how-to guide. Let us know if you have any questions in the comments below. For further help, here are some key pointers.

     
