Speech recognition technology has made enormous strides over the last five years. For example, it is now possible to use speech to input text into smartphones with very high accuracy. This is a critical usability feature given the difficulty of typing on tiny keyboards.

However, such applications give the mistaken impression that speech recognition is now a “solved problem”. Nothing could be further from the truth! Speech recognition accuracy on casually spoken speech – for example, speech in conversations and meetings – is still dismally low. Even state-of-the-art technology from top laboratories can have difficulty getting more than 50% of the words correct in such challenging environments.

IBM Watson is proud to announce a major advance in the transcription of conversational speech. Watson researcher George Saon, along with colleagues Jeff Kuo and Steve Rennie, built a system capable of very low error rates on a popular scientific benchmark that consists of telephone conversations – the NIST Switchboard corpus (“EvalSet-2”). Furthermore, they achieved this using only publicly available data (details available on request) to train the underlying models. The performance of our new system – an 8% word error rate – is 36% better than previously reported external results.
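For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below illustrates that standard definition only; it is not the scoring tool used for the benchmark, and the function names and sample sentences are made up for illustration. The second helper assumes that "36% better" means a relative reduction in error rate, which the post does not spell out; under that assumption, an 8% rate would correspond to a baseline of 8 / (1 - 0.36) = 12.5%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline.

    Assumption: this is one common reading of "X% better"; the post
    does not define the term.
    """
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the benchmark.
    reference = "so when can i get a transcript of the call"
    hypothesis = "so when can i get transcript of a call"
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")

    # Illustrative numbers only: a 12.5% baseline versus the reported 8%.
    print(f"Relative reduction: {relative_reduction(12.5, 8.0):.0%}")
```

In the sample pair, one word is dropped and one is substituted against ten reference words, so two errors are counted. Note that the metric can exceed 100% when the hypothesis contains many inserted words.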

The performance breakthrough was enabled by applying new advances in deep learning to both acoustic modeling and language modeling (see Watson’s earlier blog for background information) on top of Watson’s existing state-of-the-art speech recognition system. However, human performance has been measured to be about 4% word error rate on this task, so there is still plenty of room for improvement!

Please see Saon’s paper, just released on arXiv, for technical details.

32 comments on "IBM Watson announces breakthrough in Conversational Speech Transcription"

  1. Congrats! Very cool.

  2. These results prove that IBM remains at the cutting edge of speech technologies. Well done!

  3. Markus Nussbaum-Thom May 27, 2015

    Wow, this is a huge improvement!!

  4. Michelle Unger May 28, 2015

    Well done team! Thanks for all the effort that went into this.

  5. Exciting progress with this technology!

  6. This is awesome. So when can I hit #8 on my conference call and get an email afterwards with a reasonable transcription of the discussion? (I don’t need something to distinguish between speakers, though it’s a nice to have…just need the raw searchable text.)

  7. This is interesting news. What does it mean that the system is 36% better than previous results? Is the error rate 36% lower? In other words, is the current 8% rate being compared to a previous error rate of ~ 10.9%?

  8. This is only going to get better and faster. Imagine if speakers wore lapel mics captured by audio mixers, with the audio saved clear and distinct at radio or TV broadcast quality. How much more would accuracy increase today, say in a courtroom environment, where presumably there is more control over court decorum?

  9. Wonderful work. Technology enablement for physically challenged. I fractured my hand two weeks ago and realize the importance of this critical work. Bring this to the masses please.

  10. Very interested to see how this capability could support people recovering from Aphasia, due to stroke or brain injury.

    • Interesting you mention this. Another person I know brought up the same question. I imagine what could be done is to recognize speech in a conversation and combine that with predictive language modeling to prompt possible alternatives for words that the user might be temporarily blocked on. Whether or not the predictions would be valuable would of course have to be tested.

  11. This is incredibly exciting, well done! Is this rolled out to the Watson Cloud Services speech to text platform? It would be really helpful to me for a new application I am working on. Any advice/timeline on if this will be generally usable or how I can use it would be really appreciated! I am very excited about this.

    Thanks so much,

    • Thanks for your interest. We continue to work hard on advancing the underlying technologies and making them available as quickly as possible. Please keep following our posts for new developments!

  12. Michael,
    Hi. Do you think it would work for recordings of a distant speaker as well? I have quite a few lecture recordings that I would like to transcribe (though not only English).

  13. Judd Campbell September 21, 2015

    can you have a person from sales contact me?

  14. Max Tahir December 31, 2015

    It’s what to expect from such great talent. I am honored to have been part of your team.

  15. James March 11, 2016

    If I want to recognize speech that consists of old “Shakespeare” English, i.e., someone reading from an old poetry book, what is the best solution to the fact that these models are trained on contemporary English?

  16. What ratio of power or energy does Watson use to achieve an 8% error rate compared to the amount of caloric energy used by humans to achieve a 4% error rate? How soon will Watson’s energy consumption be less than human energy consumption, with a better yield of correctness than the 96% of human capabilities?

    • Jeb,

      We have always taken the view that our initial goal is to devise a set of technologies that can achieve the desired target performance. We have always been able to significantly reduce computation costs after the fact. For example, IBM’s Synapse is a very low-power, highly parallel architecture that may be suitable down the line to run such algorithms.
