Overview

Skill Level: Any

In this tutorial, we will walk through all the steps required to use the IBM Watson Speech to Text language model customization service to improve the accuracy of the Speech to Text service.

Ingredients

  1. Create an account on Bluemix here
  2. Provision the "Speech to Text" service and copy the credentials for your record (username:password)
  3. Make sure you can run cURL commands on your machine. Mac users can run them in a Terminal window; Windows users can download Cygwin (https://www.cygwin.com/)
  4. Download all required files here

Step-by-step

  1. Create Bluemix Account and provision Speech to Text service

    1. Create an account on Bluemix here
    2. Click on “Add Service” icon to Add Watson Speech to Text service
    3. Provision the “Speech to Text” service
    4. Go to the “Service Credentials” tab to get your username and password for the service

      [Screenshots: adding the Speech to Text service, provisioning it, and the Service Credentials tab]

  2. Run test audio files through the standard Speech to Text service and store the output

    For this test, we will use the audio file located here. Please download this audio file to your desktop. Note: you will need a login to access this folder.

    Now, we will run the audio file using our base IBM Watson Speech to Text service. 

    The easiest way to decode your audio files is by using cURL. This tool is typically available for Linux-based systems. If you use a Windows machine, you can download ‘cygwin’ which allows you to open a Linux-like command window in your Windows machine.

    The cURL command to decode a single audio file (of type WAV) using the US English Broadband model is shown below:

    Request

    /*
    curl --user username:password -X POST -H "Content-Type: audio/wav" --header
    "Transfer-Encoding: chunked" --data-binary @audio/audio_001.wav
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?
    continuous=true&model=en-US_BroadbandModel"
    */

    Response


    /*
    {
       "results": [
          {
             "alternatives": [
                {
                   "confidence": 0.887,
                   "transcript": "what is rough some disease "
                }
             ],
             "final": true
          }
       ],
       "result_index": 0
    }
    */
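
    The recognize response can be post-processed in a few lines of code. A minimal sketch in Python, using the sample response above (the helper name `best_transcript` is our own, not part of the Watson SDK):

```python
import json

# Sample /v1/recognize response from the base model (copied from the tutorial above)
response_text = """
{
   "results": [
      {
         "alternatives": [
            {
               "confidence": 0.887,
               "transcript": "what is rough some disease "
            }
         ],
         "final": true
      }
   ],
   "result_index": 0
}
"""

def best_transcript(response_json):
    """Return (transcript, confidence) for the top alternative of each final result."""
    data = json.loads(response_json)
    pairs = []
    for result in data.get("results", []):
        if result.get("final") and result.get("alternatives"):
            top = result["alternatives"][0]
            pairs.append((top["transcript"].strip(), top.get("confidence")))
    return pairs

print(best_transcript(response_text))
# [('what is rough some disease', 0.887)]
```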


  3. Gather text data to create custom language model

    The key is to anticipate everything a user could say, put those sentences into a file, and send it to the LM customization service; STT performance will then greatly improve for your new task. But how do you create this corpus of sentences? Ideally, the best sentences are those spoken by real users of the app, but you can bootstrap the LM from the following sources of data.

    1. Use your company’s existing data
      • Human-human dialog – where the use case is customers speaking to an agent and you are transcribing the conversations
      • Human-computer dialog – where the use case is a customer speaking to a digital agent (Watson). *Note: the key difference is that humans speak differently when talking to another human than to a digital agent.
    2. Developer Designed:
      • Use your dialog / conversation workspace to design a custom Language Model. Use the dialog flow, intents, and entities, including various examples that you use to design anticipated conversations.
      • Here is a script that will help you extract Conversation workspace data to create a custom speech language model.
    3. Text Data Collection:
      • You can also write out all anticipated conversations and create a custom model directly from them. It is better to have multiple people (real users) help you create the anticipated questions, to minimize bias.
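
    Source 2 above can be sketched in a few lines: flatten every intent example in a Conversation workspace export into one sentence per line of a corpus file. The tiny workspace below is hypothetical and only illustrates the `intents` → `examples` → `text` shape of an export:

```python
# A minimal sketch: turn a Conversation workspace export into corpus lines.
# The workspace dict below is a made-up example for illustration.
workspace = {
    "intents": [
        {"intent": "ask_disease", "examples": [
            {"text": "what is Refsum disease"},
            {"text": "what is Hydranencephaly"},
        ]},
        {"intent": "ask_prognosis", "examples": [
            {"text": "what is the prognosis for Adrenoleukodystrophy"},
        ]},
    ]
}

def workspace_to_corpus(ws):
    """Flatten every intent example into one sentence per line for a corpus file."""
    lines = []
    for intent in ws.get("intents", []):
        for example in intent.get("examples", []):
            lines.append(example["text"])
    return "\n".join(lines)

corpus = workspace_to_corpus(workspace)
print(corpus)
```

    Writing `corpus` to a file such as `healthcare.txt` gives you exactly the kind of corpus used in the next step.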
  4. Create custom language model

    Step 1. Create an empty custom model

    /*
    curl -X POST -u username:password --header "Content-Type: application/json" --data
    "{\"name\": \"Example Custom model\", \"base_model_name\": \"en-US_BroadbandModel\",
    \"description\": \"Example custom language model\"}"
    "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations"
    */

    A customization id (GUID) will be returned in the response.

    Step 2. Set up environment variables

    /*
    export CREDS="your username:password"
    export custID="your customization_id"
    */

    Step 3. Add the corpus file to the custom model (see here for more details about adding corpus files)

    We will use the downloaded corpus file named ‘healthcare.txt’ and assign it to the corpus named ‘corpus1’

    /*
    Request:
    curl -u $CREDS -X POST --data-binary @healthcare.txt
    https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID/corpora/corpus1

    Response: you will get a blank response here
    {}
    */

    Step 4. Check the status to make sure corpus analysis is complete. The status should change from ‘being_processed’ to ‘analyzed’

    /*
    curl -u $CREDS -X GET https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID/corpora
    {"corpora": [{
    "out_of_vocabulary_words": 0,
    "total_words": 0,
    "name": "corpus1",
    "status": "being_processed"
    }]}
    [wait a few seconds, then try again; continue until the status changes, as shown below]
    curl -u $CREDS -X GET https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID/corpora
    {"corpora": [{
    "out_of_vocabulary_words": 6,
    "total_words": 10617,
    "name": "corpus1",
    "status": "analyzed"
    }]}
    */
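
    The wait-and-retry loop above can be expressed as a small polling helper. In this sketch the canned responses below stand in for successive GET calls to the corpora endpoint (a real client would issue the HTTP request inside `fetch` and sleep between tries):

```python
import json

# Canned corpus-status responses standing in for successive GET .../corpora calls
responses = iter([
    '{"corpora": [{"name": "corpus1", "status": "being_processed", "total_words": 0, "out_of_vocabulary_words": 0}]}',
    '{"corpora": [{"name": "corpus1", "status": "being_processed", "total_words": 0, "out_of_vocabulary_words": 0}]}',
    '{"corpora": [{"name": "corpus1", "status": "analyzed", "total_words": 10617, "out_of_vocabulary_words": 6}]}',
])

def poll_until_analyzed(fetch, max_tries=10):
    """Call fetch() until the first corpus reaches status 'analyzed'.
    (No time.sleep() here for brevity; a real loop would pause between tries.)"""
    for _ in range(max_tries):
        corpus = json.loads(fetch())["corpora"][0]
        if corpus["status"] == "analyzed":
            return corpus
    raise TimeoutError("corpus never reached 'analyzed'")

corpus = poll_until_analyzed(lambda: next(responses))
print(corpus["out_of_vocabulary_words"])
# 6
```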

    Step 5. After the corpus is analyzed, check if the system found any OOV words.

    /*
    curl -u $CREDS -X GET
    https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID/words?sort=count
    */

    Sample JSON Response

    /*
    {"words": [
       {
          "display_as": "PE",
          "sounds_like": [
             "P. E.",
             "PE"
          ],
          "count": 31,
          "source": ["corpus1"],
          "word": "PE"
       },
       ...
       {
          "display_as": "Echocardiography",
          "sounds_like": ["Echocardiography"],
          "count": 27,
          "source": ["corpus1"],
          "word": "Echocardiography"
       }, ...
    ]}
    */
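
    Because the most frequent OOV words have the biggest impact, it helps to review their sounds_like entries in descending order of count. A sketch, using an abbreviated copy of the sample response above (the helper name `frequent_oov` is our own):

```python
import json

# Abbreviated words listing, copied from the sample response above
words_json = """
{"words": [
   {"display_as": "PE", "sounds_like": ["P. E.", "PE"], "count": 31,
    "source": ["corpus1"], "word": "PE"},
   {"display_as": "Echocardiography", "sounds_like": ["Echocardiography"], "count": 27,
    "source": ["corpus1"], "word": "Echocardiography"}
]}
"""

def frequent_oov(raw, min_count=10):
    """Return OOV words at or above min_count, most frequent first,
    so the most impactful sounds_like entries can be reviewed first."""
    words = json.loads(raw)["words"]
    return sorted(
        (w for w in words if w["count"] >= min_count),
        key=lambda w: w["count"], reverse=True,
    )

for w in frequent_oov(words_json):
    print(w["word"], w["count"], w["sounds_like"])
```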

    Step 6. Train the custom model. This is the final step, which builds the actual custom model. You MUST complete this step before you can use the custom model for recognition. If this step is skipped, using the custom model at recognition time will lead to an error!

    This is done via a POST API as shown below. Since this step is time-consuming, the user needs to poll the state of the custom model until the model status goes from ‘training’ to ‘available’. See example below:

    /*
    >> Start training
    Request:
    curl -u $CREDS -X POST -H "Content-type: application/json" --data "{}"
    https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID/train

    Response: You will get an empty response
    {}

    >> Check model status *(wait for status to change from 'training' to 'available')*

    curl -u $CREDS -X GET https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID
    {
    "owner": "3315e1ef-f26d-43be-b573-da087cb542ca",
    "base_model_name": "en-US_NarrowbandModel",
    "customization_id": "24400140-ecbc-11e6-9f7b-9dd7346ffae2",
    "created": "2017-02-06T22:32:28.756Z",
    "name": "Testing with Humana corpus",
    "description": "Testing Humana",
    "progress": 0,
    "language": "en-US",
    "status": "training"
    }
    ...
    curl -u $CREDS -X GET
    https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/$custID
    {
    "owner": "3315e1ef-f26d-43be-b573-da087cb542ca",
    "base_model_name": "en-US_NarrowbandModel",
    "customization_id": "24400140-ecbc-11e6-9f7b-9dd7346ffae2",
    "created": "2017-02-06T22:32:28.756Z",
    "name": "Testing with Humana corpus",
    "description": "Testing Humana",
    "progress": 100,
    "language": "en-US",
    "status": "available"
    }
    */

    At this point, the custom model is ready to be used. In the next section, we will test the custom model to see how well it improves recognition accuracy.

  5. Use custom language model on your audio

    /*
    curl --user $CREDS -X POST -H "Content-Type: audio/wav" --header "Transfer-Encoding: chunked"
    --data-binary @audio/audio_001.wav
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?
    continuous=true&model=en-US_BroadbandModel&customization_id=$custID"

    {
       "results": [
          {
             "alternatives": [
                {
                   "confidence": 0.999,
                   "transcript": "what is Refsum disease "
                }
             ],
             "final": true
          }
       ],
       "result_index": 0
    }
    */
  6. Compare Results

    Here are the results without using the custom model for the three test audio files:

    /*
    audio_001.wav: "transcript": "what is rough some disease "
    audio_002.wav: "transcript": "what is the prognosis for Adreno leukodystrophy "
    audio_003.wav: "transcript": "what is hydrants have fully "
    */

    Here are the results using the custom model for the three test audio files:

    /*

    audio_001.wav: "transcript": "what is Refsum disease "
    audio_002.wav: "transcript": "what is the prognosis for Adrenoleukodystrophy "
    audio_003.wav: "transcript": "what is Hydranencephaly "

    */
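
    A simple way to quantify the improvement is word error rate (WER) against a reference transcript. A minimal sketch; here we take "what is Refsum disease" as the reference for audio_001.wav, which is an assumption based on the custom-model output above:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

# Assumed reference text for audio_001.wav
reference = "what is Refsum disease"
base = "what is rough some disease"
custom = "what is Refsum disease"

print(round(wer(reference, base), 2))  # 0.5 (one substitution + one insertion over 4 words)
print(wer(reference, custom))          # 0.0
```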


4 comments on "How to use IBM Watson Speech to Text Language Model Customization service"

  1. AArunAnnamalai September 01, 2017

    This is a really good post

  2. Michelle Teodoro November 05, 2017

    Almost all the time the phrases come out wrong, with different words than were spoken. Even using https://speech-to-text-demo.mybluemix.net/ , I’m getting misspelled words.

    Does anyone know how I can improve that? How do I get the output to match exactly what was spoken?

  3. Michelle Teodoro November 05, 2017

    Manually it works, but when the service goes through Node.js, the words are misspelled. Does anyone have a suggestion?

    $ curl --user user:password -X POST -H "Content-Type: audio/wav" --header "Transfer-Encoding: chunked" --data-binary @brian.wav "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?continuous=true&model=en-US_BroadbandModel"
    {
    "results": [
    {
    "alternatives": [
    {
    "confidence": 0.978,
    "transcript": "hi I'm Brian one of the available high quality text to speech voices "
    }
    ],
    "final": true
    },
    {
    "alternatives": [
    {
    "confidence": 0.932,
    "transcript": "select download not to install my voice "
    }
    ],
    "final": true
    }
    ],
    "result_index": 0,
    "warnings": [
    "Unknown arguments: continuous."
    ]
    }

  4. dineshpapineni January 12, 2018

    This is an excellent article. I have a small edit suggestion. The link provided in fourth point under the header Ingredients is broken. Maybe you should change the access level on box to public.
