Last year we announced a major milestone in English conversational speech recognition: a system that achieved an 8% word error rate (WER) on a very popular benchmark called the Switchboard database. The IBM Watson team, composed of Tom Sercu, Steven Rennie, Jeff Kuo, and me, is pleased to report a new record performance of 6.9% WER on the same task.
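For readers unfamiliar with the metric: word error rate is the number of word substitutions, deletions, and insertions in the recognizer's output, divided by the number of words in the reference transcript. As a minimal illustration (not part of our system), WER can be computed from the word-level edit distance:

```python
# Minimal WER computation: Levenshtein distance between word sequences,
# normalized by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.17
```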
To put this result in perspective, back in 1995 a “high-performance” IBM recognizer achieved a 43% error rate. Spurred by a series of DARPA-sponsored speech recognition evaluations in the late ’90s and early ’00s, our system improved steadily and won the 2004 EARS Rich Transcription evaluation with a WER of 15.2%. Most recently, the advent of deep neural networks was critical in helping us achieve the 8% and 6.9% results. The ultimate goal is to reach or exceed human accuracy, which is estimated to be around 4% WER on this task.
This 6.9% error rate has been made possible by technological improvements in both acoustic and language modeling (please refer to https://developer.ibm.com/watson/blog/2015/02/09/ibm-watson-now-brings-cognitive-speech-capabilities-developers/ for background information about speech recognition systems). On the acoustic side, we use a fusion of two powerful deep neural networks that predict context-dependent phones from the input audio. The models were trained on 2000 hours of publicly available transcribed audio from the Switchboard, Fisher and CallHome corpora.
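To give a rough sense of what "fusion" means here, a common scheme is to combine the per-frame state posteriors of the two acoustic models before decoding. The sketch below assumes a simple weighted average of log-posteriors with placeholder shapes and weights; it illustrates the idea rather than the exact combination used in our system.

```python
import numpy as np

# Stand-ins for per-frame posteriors over context-dependent phone states
# produced by the two acoustic models (real systems use thousands of states).
num_frames, num_states = 300, 32
rng = np.random.default_rng(0)
rnn_post = rng.dirichlet(np.ones(num_states), size=num_frames)
cnn_post = rng.dirichlet(np.ones(num_states), size=num_frames)

# Weighted average of log-posteriors (a log-linear combination),
# renormalized so each frame again sums to one.
w = 0.5  # assumed fusion weight
log_fused = w * np.log(rnn_post) + (1.0 - w) * np.log(cnn_post)
fused = np.exp(log_fused - log_fused.max(axis=1, keepdims=True))
fused /= fused.sum(axis=1, keepdims=True)

# The fused posteriors would then feed the decoder in place of either
# model's individual outputs.
print(fused.shape)  # (300, 32)
```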
The first model is a recurrent neural net [1] that has a memory of past acoustic-phonetic events. This model has been improved since last year by replacing the commonly used sigmoid nonlinearity with a maxout activation function [2], which implements spatial pooling of neurons from the preceding layer. In contrast to sigmoid neurons, maxout neurons trained with the novel form of annealed dropout we introduced in [3] specialize early in training on detecting relevant features.
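To make these two ingredients concrete, here is an illustrative NumPy sketch of a maxout activation and a linearly annealed dropout schedule; the group size, initial rate, and schedule are assumptions for demonstration, not the settings used in our models.

```python
import numpy as np

def maxout(x, num_pieces=2):
    # Maxout activation: group the linear units of the previous layer into
    # sets of `num_pieces` and keep only the maximum of each group.
    batch, width = x.shape
    return x.reshape(batch, width // num_pieces, num_pieces).max(axis=2)

def annealed_dropout_rate(epoch, num_epochs, initial_rate=0.5):
    # Annealed dropout: start with a high dropout rate and decay it
    # (here linearly) toward zero as training proceeds.
    return max(0.0, initial_rate * (1.0 - epoch / num_epochs))

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 512 * 2))  # pre-activations: 512 maxout units, 2 pieces each
a = maxout(h)                          # shape (8, 512)

p = annealed_dropout_rate(epoch=3, num_epochs=20)
mask = rng.random(a.shape) >= p        # keep a unit with probability 1 - p
a_dropped = a * mask / (1.0 - p)       # inverted dropout scaling
print(a.shape, round(p, 3))
```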
Our second model, called a very deep convolutional neural net (or CNN), has its origins in image classification [4]. Speech can be viewed as an image if we consider the spectral representation of the audio signal, with the two dimensions being time and frequency. As opposed to the classic CNN architectures employed in our previous system [5], which have only one or two convolutional layers with large (typically 9-by-9) kernels, our very deep CNN [6] has up to ten convolutional layers with small 3-by-3 kernels that preserve the dimensionality of the input. By stacking many of these convolutional layers with rectified linear unit (ReLU) nonlinearities before pooling layers, the same receptive field is covered with fewer parameters and more nonlinearity. These two models, which differ radically in architecture and input representation, show good complementarity, and their combination leads to additional gains over the best individual model.
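The design principle can be illustrated with a short PyTorch sketch: two 3-by-3 convolutions with padding (so the time-frequency dimensions are preserved) and ReLU nonlinearities, followed by a single pooling layer. The input size and channel counts are placeholders, not the actual configuration from [6].

```python
import torch
import torch.nn as nn

# A log-mel spectrogram treated as a one-channel image:
# (batch, channels, frequency, time), with illustrative sizes.
x = torch.randn(4, 1, 40, 100)

# Two stacked 3x3 convolutions cover the same 5x5 receptive field as one
# large kernel, but with fewer parameters and an extra nonlinearity;
# the real system stacks up to ten such convolutional layers.
block = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
)

print(block(x).shape)  # torch.Size([4, 64, 20, 50])
```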
On the language modeling side, we use a sequence of language models (LMs) that are progressively more refined. The baseline is an n-gram LM estimated on a variety of publicly available corpora such as Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. The hypotheses obtained by decoding with this LM are reranked with an exponential class-based language model called model M [7]. The M stands for medium, meaning that this model is in the “Goldilocks” region of language models: it’s neither too big nor too small, it’s just right. Lastly, we rescore the candidate sentences with a neural network LM [8] to obtain the final output.
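Conceptually, each rescoring stage operates on an n-best list of hypotheses from the first decoding pass, combining the acoustic score with the language model log-probabilities log-linearly and keeping the highest-scoring sentence. The toy example below uses invented scores and interpolation weights purely to show the mechanics.

```python
# n-best hypotheses with per-hypothesis scores (all values invented):
# (hypothesis, acoustic log-score, model M log-prob, neural net LM log-prob)
nbest = [
    ("i want to fly to boston", -120.3, -14.2, -13.1),
    ("i won to fly to boston",  -118.9, -19.8, -18.7),
    ("i want to fly to austin", -121.0, -14.9, -14.0),
]

# Log-linear combination with assumed interpolation weights.
ac_weight, m_weight, nn_weight = 1.0, 8.0, 8.0

def total_score(entry):
    _, acoustic, model_m, nn_lm = entry
    return ac_weight * acoustic + m_weight * model_m + nn_weight * nn_lm

best_hypothesis = max(nbest, key=total_score)[0]
print(best_hypothesis)  # "i want to fly to boston"
```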
We are currently working on integrating these technologies into IBM Watson’s state-of-the-art speech to text service. By exposing our acoustic and language models to increasing amounts of real-world data, we expect to bridge the gap in performance between the “lab setting” and the deployed service.
Please refer to our paper [9], which was released on arXiv, for additional details.
1. G. Saon, H. Soltau, A. Emami, and M. Picheny, “Unfolded recurrent neural networks for speech recognition”, in Proc. Interspeech, 2014.
2. I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks”, arXiv preprint arXiv:1302.4389, 2013.
3. S. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of deep networks”, in Spoken Language Technology (SLT) IEEE Workshop, 2014.
4. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.
5. G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, “The IBM 2015 English conversational telephone speech recognition system”, arXiv preprint arXiv:1505.05899, 2015.
6. T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, “Very deep multilingual convolutional neural networks for LVCSR”, in Proc. ICASSP, 2016.
7. S. F. Chen, “Shrinking exponential language models”, in Proc. NAACL-HLT, 2009.
8. H.-K. J. Kuo, E. Arisoy, A. Emami, and P. Vozila, “Large scale hierarchical neural network language models”, in Proc. Interspeech, 2012.
9. G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, “The IBM 2016 English conversational telephone speech recognition system”, arXiv preprint arXiv:1604.08242, 2016.