Converting speech or audio to text has all kinds of uses and can provide applications with a wide range of advanced capabilities. Imagine you’re running a call center that handles thousands of simultaneous calls. You’d like to identify and analyze various trends, such as whether the callers are having problems with a particular product or feature, or if the callers sound frustrated or unhappy about something.

You might also be looking to identify and determine the frequency of repeated words in the call center conversations. It’s vital for businesses to be able to analyze that kind of information. For example, if you identified that callers sound frustrated and the word “broken” is continually repeated, you can take actions to improve the user experience. You might first quickly teach the support team how to help with this particular problem and offer a solution or a workaround. Next, you can fix or improve the product that is repeatedly breaking.

Almost any audio can be converted to text, and that text can then be analyzed for the trends and insights that matter to you. One tool that you can use to analyze text is the Watson Tone Analyzer service.

Speech to Text API

How do you transcribe audio to text? One option is to use the Watson Speech to Text API. The API is easy to use: you point it at an audio file and get back the transcribed text as JSON, along with some additional metadata. That’s the simplest and fastest way to use the API.

You can also use more advanced features, such as uploading a custom model. A custom model helps the service transcribe audio from a specific domain. For example, let’s say you need to transcribe audio from the medical field. The audio might use field- or domain-specific words (such as disease names) that the out-of-the-box API might not fully understand. By uploading a custom model, you can teach the API to transcribe better and, most importantly, correctly.

Let’s look at a few examples; I’ll cover a custom model in a future blog post.
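
As a quick preview, here’s a rough sketch of creating a custom language model with the customization API. The model name, base model, and description below are only illustrative; check the Speech to Text API reference for the current endpoint details before relying on them:

curl -X POST -u {username}:{password} \
 --header "Content-Type: application/json" \
 --data '{"name": "Medical model", "base_model_name": "en-US_BroadbandModel", "description": "Custom model for medical terms"}' \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations"

The call returns a customization_id, which you can then use to add domain-specific words to the model and to reference the model in recognition requests.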

Creating a Speech to Text service

In this section, we’ll create a new Watson Speech to Text service.

  1. Register for a free IBM Cloud account or sign into your existing account.
  2. Go to the Services Catalog.
  3. From the left menu, click AI.
  4. Locate and click the Speech to Text API box.
  5. On the next page you’ll see the service name, which you can change if you want. Click Create to create a Speech to Text service.

When the service is created you’ll see the following page. You can click Show to display the service credentials.

Speech to Text service
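
If you prefer the command line to the web console, you can also create the service with the Cloud Foundry CLI that IBM Cloud uses. This is only a sketch; the catalog service name (speech_to_text) and plan name (lite) are assumptions on my part, so run cf marketplace first to confirm the exact values:

# List the available plans for the Speech to Text service (names assumed)
cf marketplace -s speech_to_text

# Create a service instance and generate credentials for it
cf create-service speech_to_text lite my-speech-to-text
cf create-service-key my-speech-to-text my-credentials

# Display the username and password for the new instance
cf service-key my-speech-to-text my-credentials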

Now that you’ve created a service, it’s time to try it!

Running the Speech to Text service

The fastest way to run the service is from a command line using the cURL program, which we’ll do next. Keep in mind that Watson offers 10 SDKs for various languages. You can see and try the SDKs on the IBM Watson APIs GitHub page.

You first need an audio file. For testing you can download this sample file.

From a terminal window, navigate to the directory where you saved the file and run the following cURL command. You need to replace the username and password with the information from your service. You can see this information by clicking Show on the service page.

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @audio-file.flac \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

This example doesn’t use any extra parameters; I’ll show a request with a few optional parameters after we look at the output.

You should see the following output:

{
 "results": [
    {
       "alternatives": [
          {
             "confidence": 0.889,
             "transcript": "several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday "
          }
       ],
       "final": true
   }
 ],
 "result_index": 0
}

  • The transcript field contains the text that was transcribed.
  • The confidence field is the service’s confidence in the transcript, on a scale of 0 to 1. The closer the number is to 1, the more confident the service is that the transcription is correct.
  • The alternatives field might show alternative transcriptions. There are none in this example, but you can request them with a query parameter, as shown below.
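
Here’s a sketch of the same request with a couple of optional query parameters added: max_alternatives asks for up to three alternative transcripts, and timestamps asks for per-word timing information (see the Speech to Text API reference for the full list of parameters):

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @audio-file.flac \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?max_alternatives=3&timestamps=true"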

If you want to try another example, you can download a longer audio file from Wikimedia and then run the command again (note that I have renamed the file):

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @Tim.oga \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

The output in this case would be:

 {
    "results": [
       {
          "alternatives": [
             {
                "confidence": 0.845, 
                "transcript": "what is not a replacement for the web the web continues but when you think of the files on your computer the the documents into them the email messages and letters and things are things that you can put on the web now but then but their data files on their light kept calendars and downloaded spread sheets and things which you can't really put on the web because if you put a condom on the way we have to put it up on this document and "
             }
          ], 
          "final": true
       }, 
       {
          "alternatives": [
             {
                "confidence": 0.787, 
                "transcript": "with a computer you got all the things you need to do with data live with the kind of the need to get a look at in a debut month you need to compare the other calendars and see what you're doing at same time so the problem is that the moment that the data that's out there isn't in a form that we can actually post as it and use it so we not using it powerful enough and it's sort of in Dayton form for day to day life but also its but also for scientists and people use lots of data "
             }
           ],
           "final": true
       }
    ], 
    "result_index": 0
}

The output from this longer call has two result blocks, each with its own transcript and a slightly different confidence score.
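
If you only care about the transcribed text, you can pipe the JSON response through a tool such as jq and pull out just the transcript fields:

curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/flac" \
 --data-binary @Tim.oga \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize" \
 | jq -r '.results[].alternatives[0].transcript'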

The cURL command is a developer’s best friend for running and testing APIs. But, if you want to use a more visual interface, I recommend you download and install the Postman program.

This is how Postman looks running the same request as above:

Postman client

Note that I don’t have the username and password in the URL. The username and password are entered on the Authorization tab (just below the service URL). Because Speech to Text uses Basic Authentication, once you enter the username and password, you can switch to the Headers tab and see the generated Authorization header value there.
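
If you’re wondering what Postman generates for that header: Basic Authentication is simply the username and password joined by a colon and Base64-encoded. You can build the same header yourself and pass it to cURL instead of the -u option:

# Base64-encode the credentials (this is what the -u option does for you)
echo -n "{username}:{password}" | base64

# Send the request with an explicit Authorization header
curl -X POST \
 --header "Authorization: Basic {base64-encoded-credentials}" \
 --header "Content-Type: audio/flac" \
 --data-binary @audio-file.flac \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"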

These examples are fun to try, but let’s have a look at a more real-world example.

Voice recording and transcription with Nexmo

Nexmo is a Communication as a Service platform that offers services such as Voice, Messaging, and Authentication to make it easy to build applications with built-in communication.

Michael Heap, Nexmo Developer Advocate, published a very nice tutorial on how to record calls with the Nexmo Voice API and then transcribe the calls with the Speech to Text API. I encourage you to read the post and try the Voice API.

Here’s a short excerpt; you can then use the link below to jump to the complete blog post:

As part of our Voice API offering, Nexmo allows you to record parts (or all) of a call and fetch the audio once the call has completed. Today, we’re happy to announce a new enhancement to this functionality: split recording. Split recording makes common tasks such as call transcription even easier.

When split recording is enabled, the downloaded recording will contain participant A (let’s call her Alice) in the left channel, and participant B (let’s call him Bob) in the right channel. This allows you to work with the audio from a single participant easily.

In this post, we’re going to walk through a simple use case. Alice calls the bank to find out information about her account, and Bob is the customer support agent who answers the call.
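
To give you an idea of how split recording and Speech to Text fit together: once you have the stereo recording, you can separate the two channels and transcribe each participant on their own. The following is only a sketch, assuming a stereo file named recording.wav and the SoX audio tool; the Nexmo post walks through the complete flow:

# Extract the left channel (Alice) and the right channel (Bob) into separate files
sox recording.wav alice.wav remix 1
sox recording.wav bob.wav remix 2

# Transcribe one participant at a time
curl -X POST -u {username}:{password} \
 --header "Content-Type: audio/wav" \
 --data-binary @alice.wav \
 "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"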

I hope you find this post helpful. Make sure you sign up for a free IBM Cloud account to try these steps and then jump to the Nexmo blog to learn more. Stay tuned for future blog posts on how to customize models, and don’t hesitate to reach out to me if you have any questions!

Continue reading the Nexmo blog.

   
