Speech recognition technology made enormous strides over the last five years. For example, it is now possible to use speech to input text into smartphones with very high accuracy. This is a critical usability feature given the difficulties that are involved in inputting text on tiny keyboards.
However, such applications give the mistaken impression that speech recognition is now a “solved problem”. Nothing could be further from the truth! Speech recognition accuracy on casually spoken speech – for example, speech in conversations and meetings – is still dismally low. Even state-of-the-art technology from top laboratories can have difficulty getting more than 50% of the words correct in such challenging environments.
IBM Watson is proud to announce a major advance in the transcription of conversational speech. Watson researcher George Saon, along with colleagues Jeff Kuo and Steve Rennie, built a system capable of very low error rates on a popular scientific benchmark consisting of telephone conversations – the NIST Switchboard corpus (“EvalSet-2”). Furthermore, they achieved this by using only publicly available data (details available on request) to train the underlying models. The performance of our new system – an 8% word error rate – is 36% better than previously reported external results.
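To make the 8% figure concrete: word error rate (WER) is the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of reference words. The sketch below is purely illustrative (it is not IBM's evaluation code, and real scoring tools also apply text normalization before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words; substitutions,
    # insertions, and deletions each cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of six reference words -> WER of 1/6, about 16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

The “36% better” figure is a relative reduction in error rate: dropping from, say, a 12.5% WER to an 8% WER is an improvement of (12.5 − 8) / 12.5 = 36%.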
The performance breakthrough was enabled by applying new advances in deep learning to both acoustic modeling and language modeling (see Watson’s earlier blog for background information) on top of Watson’s existing state-of-the-art speech recognition system. However, human performance has been measured at about a 4% word error rate on this task, so there is still plenty of room for improvement!
Please see Saon’s paper, just released on arXiv, for technical details.