We are pleased to take our first steps in bringing the ability to recognize speech (“Speech to Text”) and produce speech (“Text to Speech”) to IBM Watson developers. These services allow you to build applications that take speech as input and return speech as output. They use exactly the same programming models as our other cognitive services, and we’ve made them available to the development community through the Watson Developer Cloud.
The services were created by a team of speech researchers and developers in IBM’s Watson Group. I have worked at IBM in speech for over 30 years. Back in 1980, I heard that IBM was trying a really novel approach of using data-driven statistical methods to do speech recognition and was getting the best results in the field. When I first joined IBM, I had the privilege of working with giants in the field like Fred Jelinek, Lalit Bahl, and Bob Mercer. The technical skills of my colleagues still amaze me, and I consider myself very lucky to work with such a great team.
With speech services, you can build a spoken conversational interface to Watson, or produce transcripts from running speech that can be processed by other Watson services, such as Machine Translation, Question and Answer, and Relationship Extraction. Speech is so compelling that even last year our ecosystem partners had already started experimenting with speech interfaces to Watson; see
– Majestyk/Elemental Path
– GenieMD (about 8:20 into the presentation)
for some examples.
The two speech technologies we are now exposing via the cloud are “Speech to Text” and “Text to Speech”. I’ll say a few words about each.
“Speech to Text”
IBM is a pioneer in this area, with initial efforts going back to IBM’s “Shoebox” recognizer in the early 1960s. Speech to text, commonly called “speech recognition,” is a deceptively complex problem. People convert speech into a string of internalized words so immediately that they do not appreciate the huge variability that exists in a speech signal. Early attempts assumed that speech could be recognized by rules, but such systems were unsuccessful. Scientists at IBM in the 1970s realized that statistical modeling principles could be applied that learned from lots of data, revolutionizing the field and setting the stage for today’s machine learning technology explosion.
A good overview of IBM’s early work in this area can be found in [1]. Since then, IBM has continued to produce a stream of groundbreaking research, as well as a set of speech products for dictation (“ViaVoice”), telephony speech recognition (“WebSphere Voice Server”), and embedded applications (“eVV”, embedded ViaVoice). In 2009, IBM was awarded the IEEE Corporate Innovation Award for its long-term contributions to the field.
The basic components of a speech recognizer are a “feature extractor”, an “acoustic model”, a “language model”, and a “speech engine”. The feature extractor extracts critical features from the speech signal to simplify recognition. The acoustic model describes how different words are realized as sequences of features from the feature extractor. The language model assigns probabilities to different words and strings of words. For example, although “Austin” and “Boston” sound alike, “Austin, Texas” is a much more likely phrase than “Boston, Texas”. The speech engine itself combines information from the feature extractor, the acoustic model, and the language model to arrive at the best sequence of words. Finally, the output of the recognizer is sometimes used to update the models (“adaptation”), which often results in improved performance. For more technical information on how a speech recognition system operates, see [2] for the basics, [3] for more recent developments, and [4] for IBM’s most recent published results on a popular benchmark task.
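The way the engine weighs acoustic and language evidence can be sketched with the “Austin, Texas” example above. Every number below is invented purely for illustration (a real engine searches over phone-level hypotheses with far richer models), but the combination rule is the classic one: pick the word string maximizing log P(audio | words) + log P(words).

```python
import math

# Toy sketch of how a speech engine combines evidence. All scores are
# made-up illustrative numbers, not output from any real model.

# Acoustic model scores: "Austin" and "Boston" sound alike, so the audio
# supports both hypotheses almost equally.
acoustic_log_prob = {
    "Austin, Texas": math.log(0.40),
    "Boston, Texas": math.log(0.38),
}

# Language model scores: "Austin, Texas" is a far likelier word string.
language_log_prob = {
    "Austin, Texas": math.log(0.010),
    "Boston, Texas": math.log(0.0001),
}

def decode(hypotheses):
    """Return the hypothesis maximizing log P(audio | words) + log P(words)."""
    return max(hypotheses, key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

print(decode(["Austin, Texas", "Boston, Texas"]))  # -> Austin, Texas
```

Even though the acoustic scores are nearly tied, the language model tips the decision decisively toward “Austin, Texas”, which is exactly why both models are needed.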
Speech recognition systems are trained from lots of data. A system only trained on short phrases will tend to do poorly on long sentences and vice versa. A system trained on strings of numbers will not work well on strings of letters. A system for medical dictation will not work well on news broadcasts. High-performing speech systems are typically trained on thousands of hours of speech and hundreds of millions of words of text from the domain of interest.
Since we expect a wide variety of uses for Watson, the Speech to Text system is pretty generic. It should work reasonably well on common conversational interactions but may have room for improvement with tasks that have very specialized vocabularies. Our long-term goal is to build a system that can learn over time. We hope you will try our system and give us feedback about how you plan to use it so it can be continuously improved.
“Text to Speech”
Text-to-speech (TTS) is the generation of synthesized speech from text. Our goal is to make synthesized speech as intelligible, natural and pleasant to listen to as human speech and have it communicate just as meaningfully.
We have developed a novel TTS system, built on IBM’s successful work in data-driven methodologies (described above) for speech recognition. Our system obtains its parameters through automated training on a few hours of speech data, which is acquired by recording a specially prepared script. During synthesis, very small segments of recorded human speech are optimally selected and concatenated together to produce the synthesized speech. The system also uses sophisticated text processing technology to disambiguate pronunciations (e.g., is “St” pronounced “Street”, as in “Main Street”, or “Saint”, as in “Saint Peter”) and machine learning techniques to predict prosody. For a more detailed description of how this works, see [5]. The same TTS system was the “Voice of Watson” when it played and won the Jeopardy! game back in 2011. For an article on how this was done see .
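The “St” ambiguity gives a feel for what the text-processing stage has to do. The snippet below is a toy heuristic for just this one case, not IBM’s actual text-normalization logic, which uses much richer context and machine-learned models:

```python
import re

# Toy disambiguation of "St" before synthesis. Illustrative only: a real
# TTS front end handles many more abbreviations and uses learned models.

def expand_st(text):
    # "St"/"St." immediately before a capitalized name reads as "Saint".
    text = re.sub(r'\bSt\.?\s+(?=[A-Z])', 'Saint ', text)
    # "St"/"St." following another word reads as "Street".
    text = re.sub(r'(?<=\w)\s+St\.?(?=\s|[,.]|$)', ' Street', text)
    return text

print(expand_st("Main St."))   # -> Main Street
print(expand_st("St. Peter"))  # -> Saint Peter
```

Cases like “St. Louis St.” show why simple rules run out of steam quickly and statistical context modeling takes over.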
We have initially put out two systems that should produce high-quality output for general text inputs in both English and Spanish. Hard-to-pronounce inputs like unusual names (“Picheny” 🙂) or acronyms (“WYSIWYG”, typically pronounced “wih-zee-wig”) may present challenges. In the future we plan to enable user customization of pronunciations, in addition to continuous work on improving TTS quality and naturalness.
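One way the planned pronunciation customization could look from a developer’s point of view is a user-supplied lexicon applied before synthesis. This is purely a hypothetical sketch — the lexicon format and helper below are not a real Watson API — using the one respelling the post itself gives:

```python
# Hypothetical user-supplied pronunciation overrides (not a real Watson API).
custom_lexicon = {
    "WYSIWYG": "wih-zee-wig",  # respelling given in the post
}

def apply_lexicon(text, lexicon):
    """Swap known-hard tokens for phonetic respellings before synthesis."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())

print(apply_lexicon("A WYSIWYG editor", custom_lexicon))  # -> A wih-zee-wig editor
```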
Hillary Clinton was fond of saying, “It takes a village to raise a child.” Analogously, we hope that you, the “village” of developers, will help improve Watson’s abilities in processing speech and make Watson more useful to you in getting your work done. We look forward to “speaking” with you!
[1] F. Jelinek. “The Development of an Experimental Discrete Dictation Recognizer.” Proceedings of the IEEE 73.11 (1985): 1616–1624.
[2] M. Padmanabhan and M. Picheny. “Large-Vocabulary Speech Recognition Algorithms.” Computer 35.4 (2002): 42–50.
[3] M. Picheny et al. “Trends and Advances in Speech Recognition.” IBM Journal of Research and Development 55.5 (2011).
[4] H. Soltau, G. Saon, and T. N. Sainath. “Joint Training of Convolutional and Non-Convolutional Neural Networks.” Proc. ICASSP, 2014.
[5] A. Rosenberg, R. Fernandez, and B. Ramabhadran. “‘What is… Dengue Fever?’: Modeling and Predicting Pronunciation Errors in a Text-to-Speech System.” INTERSPEECH, 2011.