Chatbots are AI’s answer to better customer service at lower cost. Add speech capability, and the customer service experience is transformed again. Watson Assistant is a conversational AI solution for businesses; Watson Speech to Text and Watson Text to Speech are speech APIs that convert speech to text and text to speech. By combining the three, you can create a voice-enabled chatbot. In this blog, I explain the details of building a web-based voice bot. You can also follow along with the code pattern and build your own chatbot.

The basic flow is:

  1. The user speaks through the web UI.
  2. The user’s voice is recorded using an appropriate web-based recorder.
  3. The recording is passed to the Watson Speech to Text service on the cloud where it is transcribed into text.
  4. The text is then passed to Watson Assistant, which identifies the intents and entities to give an appropriate response.
  5. This text response is passed to Watson Text to Speech, which provides the audio response in a natural voice.
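The steps above can be sketched as a single server-side pipeline. The function below is a minimal illustration, not the code pattern itself: the three Watson calls are passed in as callables (hypothetical stand-ins for the real SDK or REST calls) so the flow is easy to follow.

```python
def voice_bot_turn(audio_bytes, speech_to_text, assistant, text_to_speech):
    """One conversational turn: user audio in, bot audio out.

    The three callables are hypothetical placeholders for the actual
    Watson Speech to Text, Assistant, and Text to Speech calls.
    """
    user_text = speech_to_text(audio_bytes)    # step 3: transcribe the recording
    reply_text = assistant(user_text)          # step 4: intents/entities -> response
    reply_audio = text_to_speech(reply_text)   # step 5: synthesize the reply
    return reply_text, reply_audio
```

In the real application each callable would wrap a call to the corresponding Watson service, with credentials and error handling around it.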

[Figure: end-to-end flow of the voice-enabled chatbot]

The client-side code is written primarily in JavaScript and jQuery; the server side is a Flask application, which makes all of the REST API calls to the different Watson services.

The user clicks the mic image on the screen to start recording, then clicks it again to stop recording and mark the end of the voice stream. I used a third-party script built on the Web Audio API to record the audio; its code and the associated documentation can be found at this link.

After the audio recording is obtained, a WebSocket connection is opened to establish a persistent link with the Speech to Text service. Unlike the REST interface, the WebSocket interface provides a single-socket, full-duplex communication channel: the client sends requests and audio to the service and receives results asynchronously over one connection. (The audio can also be transcribed through the REST API.)
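The shape of that WebSocket exchange is worth sketching. Per the service documentation, the client opens the recognize endpoint, sends a JSON "start" message describing the audio, streams binary audio chunks, and finishes with a "stop" message; treat the exact parameter names here as assumptions to check against your service version.

```python
import json

def start_message(content_type="audio/webm", interim_results=True):
    """Build the JSON text frame that begins a recognition request.

    Sent first over the WebSocket, before any binary audio frames.
    """
    return json.dumps({
        "action": "start",
        "content-type": content_type,   # format of the audio you will stream
        "interim_results": interim_results,  # receive partial transcripts
    })

# Sent after the last audio chunk to signal the end of the stream.
STOP_MESSAGE = json.dumps({"action": "stop"})
```

Between the start and stop frames, the recorded audio is sent as binary messages, and transcription results arrive as JSON text messages on the same socket.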

The transcription result is passed to Watson Assistant, again through a REST API call. Based on the intents and entities it identifies, the service returns an appropriate text response in JSON format.
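That round trip amounts to building a small JSON payload and pulling the reply text back out of the response. The helpers below assume a v1-style Assistant message body; the exact field names may differ in your API version, so verify them against the service reference.

```python
def build_message_payload(user_text, context=None):
    """Request body for the Assistant message call.

    The context dict carries conversation state between turns.
    """
    return {"input": {"text": user_text}, "context": context or {}}

def extract_reply(response_json):
    """Pull the bot's text reply out of an Assistant response.

    The response also carries the identified intents and entities,
    which are ignored here for brevity.
    """
    return " ".join(response_json.get("output", {}).get("text", []))
```

In the Flask app, the payload would be POSTed to the workspace's message endpoint with the service credentials, and the returned context stored for the next turn.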

The text is passed to the Watson Text to Speech service through a REST API, which synthesizes speech in the selected language and voice. The delivery can also be shaped using the appropriate SSML (Speech Synthesis Markup Language) tags. Thus, the user gets an audio-based response.
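As a small example of that last step, the reply text can be wrapped in SSML before being sent to Text to Speech. The `<prosody>` element used here is standard SSML; check which tags the voice you choose actually supports before relying on them.

```python
from xml.sax.saxutils import escape

def to_ssml(text, rate="default", pitch="default"):
    """Wrap plain text in a minimal SSML document with prosody hints.

    The text is XML-escaped so characters like & don't break the markup.
    """
    return (
        '<speak version="1.0">'
        f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        "</speak>"
    )
```

The resulting string is sent as the `text` field of the synthesize request in place of the plain reply.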