Last year we announced a major milestone in English conversational speech recognition: a system that achieved an 8% word error rate (WER) on a very popular benchmark called the Switchboard database. The IBM Watson team of Tom Sercu, Steven Rennie, Jeff Kuo, and myself is pleased to report a new record performance of 6.9% WER on the same task.

To put this result in perspective, back in 1995 a “high-performance” IBM recognizer achieved a 43% error rate. Spurred by a series of DARPA-sponsored speech recognition evaluations in the late ’90s and early ’00s, our system improved steadily and won the 2004 EARS Rich Transcription evaluation with a WER of 15.2%. Most recently, the advent of deep neural networks was critical in helping us achieve the 8% and 6.9% results. The ultimate goal is to reach or exceed human accuracy, which is estimated to be around 4% WER on this task.
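For readers unfamiliar with the metric, WER counts the minimum number of word substitutions, insertions, and deletions needed to turn the system’s hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch (illustrative only, not the scoring code used in the actual evaluations):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance between hypothesis and
    reference, normalized by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete, insert = d[i - 1][j] + 1, d[i][j - 1] + 1
            d[i][j] = min(substitute, delete, insert)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words, i.e. roughly 16.7%.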

This 6.9% error rate was made possible by technological improvements in both acoustic and language modeling. On the acoustic side, we use a fusion of two powerful deep neural networks that predict context-dependent phones from the input audio. The models were trained on 2,000 hours of publicly available transcribed audio from the Switchboard, Fisher, and CallHome corpora.

The first model is a recurrent neural network [1] that has a memory of past acoustic-phonetic events. We have improved this model since last year by replacing the commonly used sigmoid nonlinearity with a maxout activation function [2], which implements spatial pooling of neurons from the preceding layer. In contrast to sigmoid neurons, maxout neurons, trained with a novel form of annealed dropout that we introduced in [3], specialize early in training at detecting relevant features.
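The pooling idea is simple: a maxout unit takes the maximum over a small group of linear pre-activations, so the network learns piecewise-linear activation shapes rather than applying a fixed sigmoid. A toy sketch, with pool size and values invented purely for illustration:

```python
def maxout(preactivations, pool_size):
    """Maxout activation: split a layer's linear pre-activations into
    consecutive groups of `pool_size` units and keep each group's maximum."""
    assert len(preactivations) % pool_size == 0, "layer width must divide evenly"
    return [max(preactivations[i:i + pool_size])
            for i in range(0, len(preactivations), pool_size)]
```

Here `maxout([0.2, -1.3, 0.7, 0.1], 2)` yields `[0.2, 0.7]`: each output neuron reports the strongest of its two inputs, which is the spatial pooling described above.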

Our second model, called a very deep convolutional neural network (CNN), has its origins in image classification [4]. Speech can be viewed as an image if we consider the spectral representation of the audio signal, with the two dimensions being time and frequency. In contrast to the classic CNN architectures employed in our previous system [5], which have only one or two convolutional layers with large (typically 9-by-9) kernels, our very deep CNN [6] has up to ten convolutional layers with small 3-by-3 kernels that preserve the dimensionality of the input. By stacking many of these convolutional layers with rectified linear unit (ReLU) nonlinearities before pooling layers, the same receptive field is created with fewer parameters and more nonlinearity. These two models, which differ radically in architecture and input representation, show good complementarity, and their combination leads to additional gains over the best individual model.
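The parameter savings from small kernels are easy to check with back-of-the-envelope arithmetic: k stacked 3-by-3 convolutions (stride 1) cover the same window as one kernel of width k·(3−1)+1, but with fewer weights. A sketch of that counting argument (the channel count is an arbitrary example, not a figure from the paper):

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field (in one dimension) of stacked stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1

def conv_weights(kernel, channels):
    """Weight count of one conv layer with `channels` in and out (bias ignored)."""
    return kernel * kernel * channels * channels

# Four stacked 3x3 layers cover the same 9x9 window as a single 9x9 kernel,
# with far fewer weights (64 channels assumed purely for illustration):
deep = 4 * conv_weights(3, 64)   # 4 * 9 * 64 * 64 = 147,456 weights
shallow = conv_weights(9, 64)    # 81 * 64 * 64 = 331,776 weights
```

And the stack applies a ReLU after every layer, so the deeper version is not only cheaper but also more nonlinear, as noted above.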

On the language modeling side, we use a sequence of language models (LMs) that are progressively more refined. The baseline is an n-gram LM estimated on a variety of publicly available corpora, such as Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. The hypotheses obtained by decoding with this LM are reranked with an exponential class-based language model called model M [7]. The M stands for medium, meaning that this model is in the “Goldilocks” region of language models: neither too big nor too small, but just right. Lastly, we rescore the candidate sentences with a neural network LM [8] to obtain the final output.
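This multi-pass strategy can be pictured as reranking an n-best list: each candidate sentence’s first-pass score is interpolated with the score of a stronger (but slower) LM, and the best-scoring candidate wins. A schematic sketch; the weights, scores, and toy scoring function here are invented for illustration, not taken from our system:

```python
def rerank(nbest, strong_lm_score, lm_weight=0.5):
    """Rerank an n-best list of (sentence, first_pass_log_score) pairs by
    interpolating the first-pass score with a stronger LM's log score,
    returning the highest-scoring sentence."""
    def combined(candidate):
        sentence, first_pass = candidate
        return (1 - lm_weight) * first_pass + lm_weight * strong_lm_score(sentence)
    return max(nbest, key=combined)[0]

# Toy example: the acoustically tempting hypothesis loses once the LM weighs in.
nbest = [("wreck a nice beach", -1.5), ("recognize speech", -2.0)]
toy_lm = lambda s: -1.0 if s == "recognize speech" else -5.0
best = rerank(nbest, toy_lm)
```

Setting `lm_weight=0` recovers the first-pass ranking, which is a handy sanity check when tuning the interpolation weight on held-out data.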

We are currently working on integrating these technologies into IBM Watson’s state-of-the-art speech to text service. By exposing our acoustic and language models to increasing amounts of real-world data, we expect to bridge the gap in performance between the “lab setting” and the deployed service.

Please refer to our paper [9], released on arXiv, for additional details.

1. G. Saon, H. Soltau, A. Emami, and M. Picheny, “Unfolded recurrent neural networks for speech recognition”, in Proc. Interspeech, 2014.

2. I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks”, arXiv preprint arXiv:1302.4389, 2013.

3. S. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of deep networks”, in Spoken Language Technology (SLT) IEEE Workshop, 2014.

4. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, CoRR arXiv:1409.1556, 2014.

5. G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, “The IBM 2015 English conversational telephone speech recognition system”, arXiv preprint arXiv:1505.05899, 2015.

6. T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, “Very deep multilingual convolutional neural networks for LVCSR”, Proc. ICASSP, 2016.

7. S. F. Chen, “Shrinking exponential language models”, in Proc. NAACL-HLT, 2009.

8. H.-K. J. Kuo, E. Arisoy, A. Emami, and P. Vozila, “Large scale hierarchical neural network language models”, in Proc. Interspeech, 2012.

9. G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, “The IBM 2016 English conversational telephone speech recognition system”, arXiv preprint arXiv:1604.08242, 2016.

20 comments on “Recent Advances in Conversational Speech Recognition”

  1. Dimitrios Dimitriadis April 28, 2016

    Good job, guys, pushing ASR performance to new highs!

  2. Zach Tomlinson April 29, 2016

    Just fascinating…thanks for sharing your work with us. Personally, this type of sharing is the best way someone like me – having little/no CS experience – can begin to understand what our services are REALLY doing and, more importantly, gain a bit of depth on how they do it! Cheers, Z

    • George Saon April 29, 2016

      Thanks for the kind words Zach, I’m glad you liked it.

      • Philippe Comte May 04, 2016

        Hello,
        That’s great news! Are those improvements already available in the Speech to Text service on Bluemix?
        Which audio formats do you use? FLAC, OGG, AAA?
        How do you get the punctuation right for the transcribed text?

        Thank you.

        • George Saon May 04, 2016

          Hi Philippe,

          Thanks for the interest.
          The improvements are not yet integrated into the Bluemix STT service (we’re working hard on getting them in as fast as we can). Supported audio formats are FLAC, WAV, PCM and OGG I believe. Punctuation is currently done based on silence duration.

  3. From 43% to 8% word error rate. Good job!

    Looking forward to the next steps.

  4. Great work, George. This is an exciting new improvement.
    One concern people have raised is the per-minute cost of STT. Does the new approach add significant computational overhead to the improved solution? Seems like it would. Will that significantly impact cost?

    • George Saon May 04, 2016

      Thanks Steven.
      There is usually a tradeoff between speed (or computation) and accuracy, so a more accurate solution will generally require more computation. That being said, we have to keep the computation in check so that STT can still run in real-time. Therefore I don’t expect the improvements to impact cost significantly but, to be honest, I have no knowledge of our pricing strategy.

  5. Tom Grey May 06, 2016

    This is good news, George — from 8% to 6.9% WER… 5 more years to get under 4%? We’ve been waiting for HAL for so long …

    Is it possible to have Watson get trained for better accuracy for a specific voice? For instance, by taking all the videos Ginny has made, and using their transcriptions to train Watson to recognize … (his?) master’s voice. I would have expected 200 hours of voice-specific training to allow a 4% rate already. For a VR personal assistant, higher single-voice accuracy after training is far more attractive to me, and many if not most others, so that I can be comfy talking to my own little Tommy J digital helper.

    I’m also wondering whether there is more discussion about personification of Watson, where previously there was a desire to avoid it, but the customers seem more comfy having it.

    Finally, I think teaching Watson to teach English would be the best real world method to rapidly increase the variety of STT inputs with immediate correction.

    • George Saon May 06, 2016

      Thanks for the great comments/questions Tom. Alright, let’s get started …
      In [5] we claim that we can reach human parity on the Switchboard task in about a decade provided that we can collect a lot more training data (say 200,000 hours instead of 2000). The time horizon could be shortened with some unexpected advances in machine learning.

      You’re right that we could achieve 4% today on a native US English speaker if we had 200 hours of accurately transcribed speech collected in a relatively clean environment. The problem is that we want to reach 4% across a variety of speakers who are unseen in the training data. This is a much harder problem and requires unsupervised speaker adaptation techniques, where we iteratively refine the hypotheses based on previously decoded outputs (sort of like a feedback loop, if you wish).

      There is a lot of discussion about personification (or customization) of Watson STT and we are actively working on it. As for Watson teaching English, never thought of that 🙂 I think it would have a hard time figuring out whether the student pronounced the sentence correctly but with a strong accent or whether he made a mistake.

  6. Iain Hart May 23, 2016

    6.9% error rate sounds remarkable.
    Just to understand better where this is headed, besides rockets to Mars, perhaps you have a list of targets that you want to achieve. A list of Real World issues to be solved, and if so, maybe you could publicise the link. My apologies if I missed it somewhere.

    • George Saon May 23, 2016

      Thanks Iain.
      For the time being, the list has only one item: achieve a 4% word error rate on this task by 2025. Obviously we would also like our system to be more robust to noise, foreign accents, distant microphones, domain changes etc. but we haven’t set any concrete targets yet. As for practical applications, being able to hold a natural conversation with Watson on a variety of topics seems like a reasonable one.

  7. Jordi Thomas Rubio July 11, 2016

    Amazing work, George. Anyway, in a recent engagement with a client in Spain we failed to compete in a Speech-to-Text comparison with a competitor’s API because our error rate was, by far, higher. Are we addressing the Spanish language the same way as English?

    • George Saon July 11, 2016

      Thanks Jordi. Unfortunately, we had very little data for building the Spanish model. We rolled out customization capabilities for English and I expect that to happen for Spanish as well at some point. If your client has some application-specific text data, the recognition performance is likely to improve.

  8. Vickie Dorris October 12, 2016

    More work to do… On the Microsoft Blog, Microsoft Research has announced that their AI efforts have hit a new milestone, achieving an industry-leading score of 6.3% word error rate on a standardized speech recognition test, the Switchboard speech recognition task.

    “Our best single system achieves an error rate of 6.9% on the NIST 2000 Switchboard set. We believe this is the best performance reported to date for a recognition system not based on system combination. An ensemble of acoustic models advances the state of the art to 6.3% on the Switchboard test data,” the scientists noted in a research paper.

    • George Saon October 19, 2016

      Yes, we are aware of the latest MSR results. We will try to catch up with them.

  9. Melroy van den Berg October 19, 2016

    Oh no! Microsoft now has a 5.9% word error rate. IBM, you can beat them, I have faith in you guys!
