IBM Watson is a system for reasoning over unstructured information. Initially, all of this information came in as text, all interactions were typed or GUI-based, and results were presented back to the user via a GUI. No hands-free, no spoken interactions.

We are pleased to take our first steps in bringing the ability to recognize speech (“Speech to Text”) and produce speech (“Text to Speech”) to IBM Watson developers. These services allow you to build applications that can take speech as input and return speech as output. They use exactly the same programming model as our other cognitive services, and we’ve made them available to the development community through the Watson Developer Cloud.

The services were created by a team of speech researchers and developers in IBM’s Watson Group. I have worked at IBM in speech for over 30 years. Back in 1980 I heard that IBM was trying a really novel approach, using data-driven statistical methods for speech recognition, and was getting the best results in the field. When I first joined IBM, I had the privilege of working with giants of the field like Fred Jelinek, Lalit Bahl, and Bob Mercer. The technical skills of my colleagues still amaze me, and I consider myself very lucky to be able to work with such a great team.

With speech services, you can build a spoken conversational interface to Watson, or produce transcripts from running speech that can be processed by other Watson services, such as Machine Translation, Question and Answer, and Relationship Extraction. Speech is so compelling that our ecosystem partners had already started experimenting with speech interfaces to Watson last year; see

Majestyk/Elemental Path
GenieMD (about 8:20 into the presentation)

for some examples.

The two speech technologies we are now exposing via the cloud are “Speech to Text” and “Text to Speech”. I’ll say a few words about each.

“Speech to Text”

IBM is a pioneer in this area, with initial efforts going back to IBM’s “Shoebox” recognizer in the early 1960s. Speech to text, commonly called “speech recognition”, is a deceptively complex problem. People convert speech into a string of internalized words so effortlessly that they do not appreciate the huge variability that exists in a speech signal. Early attempts assumed that speech could be recognized by rules, but such systems were unsuccessful. Scientists at IBM in the 1970s realized that statistical models, learned from large amounts of data, could be applied instead, revolutionizing the field and setting the stage for today’s machine learning technology explosion.

A good overview of IBM’s early work in this area can be found in [1]. Since then, IBM has continued to produce a stream of groundbreaking research, as well as a set of speech products for dictation (“ViaVoice”), telephony speech recognition (“WebSphere Voice Server”), and embedded applications (“eVV” – embedded ViaVoice). In 2009 IBM was awarded the IEEE Corporate Innovation Award [2] for its long-term contributions to the field.

ASR Block Diagram

The basic components of a speech recognizer are a “feature extractor”, an “acoustic model”, a “language model”, and a “speech engine”. The feature extractor extracts the critical features from the speech signal to simplify recognition. The acoustic model describes how different words are realized as sequences of features from the feature extractor. The language model assigns probabilities to different words and strings of words. For example, although “Austin” and “Boston” sound alike, “Austin, Texas” is a much more likely phrase than “Boston, Texas”. The speech engine combines information from the feature extractor, the acoustic model, and the language model to arrive at the best sequence of words. Finally, the output of the recognizer is sometimes used to update the models (“adaptation”), which often improves performance. For more technical information on how a speech recognition system operates, see [3] for the basics, [4] for more recent developments, and [5] for IBM’s most recent published results on a popular benchmark task.
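Conceptually, the engine picks the word string W that maximizes P(A|W) · P(W), the product of the acoustic score and the language-model score. Here is a toy Python sketch of that decision for the Austin/Boston example; all the probabilities are invented for illustration, and a real engine searches over a vast lattice of hypotheses rather than two strings:

```python
import math

# Invented scores for an ambiguous utterance ending in "Texas": the
# acoustic model slightly prefers "Boston", but the language model
# knows "Austin, Texas" is a far more likely phrase.
acoustic_log_prob = {"Austin Texas": math.log(0.45),
                     "Boston Texas": math.log(0.55)}
language_log_prob = {"Austin Texas": math.log(0.020),
                     "Boston Texas": math.log(0.0001)}

def decode(hypotheses):
    """Return the hypothesis maximizing log P(A|W) + log P(W)."""
    return max(hypotheses,
               key=lambda w: acoustic_log_prob[w] + language_log_prob[w])

print(decode(["Austin Texas", "Boston Texas"]))  # Austin Texas
```

Even though the acoustics lean toward “Boston”, the language model’s strong preference for “Austin, Texas” dominates the combined score, which is exactly the behavior described above.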

Speech recognition systems are trained from large amounts of data. A system trained only on short phrases will tend to do poorly on long sentences, and vice versa. A system trained on strings of numbers will not work well on strings of letters. A system for medical dictation will not work well on news broadcasts. High-performing speech systems are typically trained on thousands of hours of speech and hundreds of millions of words of text from the domain of interest.

Since we expect a wide variety of uses for Watson, the Speech to Text system is pretty generic. It should work reasonably well on common conversational interactions but may have room for improvement with tasks that have very specialized vocabularies. Our long term goal is to build a system that can learn over time. We hope you will try our system and give us feedback about how you plan to use it so it can be continuously improved.  

“Text to Speech”

Text-to-speech (TTS) is the generation of synthesized speech from text. Our goal is to make synthesized speech as intelligible, natural and pleasant to listen to as human speech and have it communicate just as meaningfully.

We have developed a novel TTS system, built on IBM’s successful work in data-driven methodologies for speech recognition (described above). Our system obtains its parameters through automated training on a few hours of speech data, acquired by recording a specially prepared script. During synthesis, very small segments of recorded human speech are optimally selected and concatenated to produce the synthesized speech. The system also uses sophisticated text processing technology to disambiguate pronunciations (e.g., is “St” pronounced “Street”, as in “Main Street”, or “Saint”, as in “Saint Peter”?) and machine learning techniques to predict prosody. For a more detailed description of how this works, see [6]. The same TTS system was the “Voice of Watson” when Watson played and won Jeopardy! back in 2011. For an article on how this was done, see [7].
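To make the “St” disambiguation concrete, here is a minimal Python sketch. The function name and the context rule are invented for illustration; a real TTS front end uses trained text-normalization models rather than a single hand-written heuristic like this:

```python
def expand_st(tokens):
    """Expand the abbreviation 'St' using neighboring words as context.

    A hand-written heuristic standing in for the machine-learned text
    normalization a real TTS front end would use: 'St' following a
    capitalized word is read 'Street'; otherwise it is read 'Saint'.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok in ("St", "St."):
            if i > 0 and tokens[i - 1][:1].isupper():
                out.append("Street")   # e.g. "Main St" -> "Main Street"
            else:
                out.append("Saint")    # e.g. "St Peter" -> "Saint Peter"
        else:
            out.append(tok)
    return " ".join(out)

print(expand_st("Main St".split()))   # Main Street
print(expand_st("St Peter".split()))  # Saint Peter
```

A rule this simple breaks quickly (consider “St Louis St”), which is why the production system learns such decisions from data instead of relying on fixed rules.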

We have initially put out two systems that should produce high-quality output for general text input in both English and Spanish. Hard-to-pronounce items like unusual names (“Picheny” 🙂 ) or acronyms (“WYSIWYG”, typically pronounced “wih-zee-wig”) may present challenges. In the future we plan to enable user customization of pronunciations, in addition to continuing work on improving TTS quality and naturalness.

An Invitation

Hillary Clinton was fond of saying “It takes a village to raise a child”. Analogously, we hope that you – the “village” of developers – will help improve Watson’s abilities in processing speech and make Watson more useful to you in getting your work done. We look forward to “speaking” with you!

[1] Jelinek, Frederick. “The Development of an Experimental Discrete Dictation Recognizer.” Proceedings of the IEEE 73.11 (1985): 1616-1624.
[3] Padmanabhan, Mukund, and Michael Picheny. “Large-vocabulary speech recognition algorithms.” Computer 35.4 (2002): 42-50.
[4] Picheny, Michael, et al. “Trends and advances in speech recognition.” IBM Journal of Research and Development 55.5 (2011): 2-1.
[5] Soltau, Hagen, George Saon, and Tara N. Sainath. “Joint Training of Convolutional and Non-Convolutional Neural Networks.” Proc. ICASSP (2014).
[7] Rosenberg, Andrew, Raul Fernandez, and Bhuvana Ramabhadran. “‘What is… Dengue Fever?’: Modeling and Predicting Pronunciation Errors in a Text-to-Speech System.” INTERSPEECH (2011).

16 comments on “IBM Watson now brings cognitive speech capabilities to developers”

  1. Truly amazing technology! Thank you guys for opening this up to developers, we’ve already started putting it to use as an alternative to human voice-overs at

  2. Alexandre Rademaker February 17, 2015

    How hard would it be to add Portuguese support?

    • There is always some work involved in adding a new language. We would be interested in hearing if you are speaking generically or you have a specific application in mind. Please feel free to contact me offline to discuss further.

  3. Does the speech to text only support English? If not, is there a list of supported languages? I’m looking for Swedish, by the way.


  5. Manish Yadav January 12, 2016

    Can it recognise Indian-accent English, or is it possible to train it to recognise Indian-accent English?

    • The system is tailored to native speakers of American English. It does handle some forms of accented speech. You should try the service for your use case and see if the performance meets your needs.

  6. Sean Faulkner March 04, 2016

    Watson should be the new operating system that runs on all devices, like Iron Man / Tony Stark’s JARVIS, so it will make all our lives easier.



  9. Sir, does Speech to Text recognize different accents?

  10. Sir, I am using the Speech-to-Text API. Does it recognize different accents?
