Tutorial

Convert speech to text, and extract meaningful insights from data

Combine Watson Speech to Text with the watson_nlp library to transcribe speech data and get insights from that data.

The IBM Watson Speech to Text Service is a speech recognition service that offers many functions, such as text recognition, audio preprocessing, noise removal, background noise separation, and semantic sentence conversion. It lets you convert speech into text by using AI-powered speech recognition and transcription.

In this tutorial, walk through the steps of starting a Watson Speech to Text Service, connecting to it through Docker in your local system, preprocessing a speech data set, and using the Watson Speech to Text Service to transcribe speech data. The tutorial also shows you how to extract meaningful insights from data by combining the functions of the Watson Speech to Text Service with the watson_nlp library, a common library for natural language processing, document understanding, translation, and trust.

Prerequisites

To follow this tutorial, you must have:

  • Docker (or Podman) installed on your local machine
  • An IBM entitlement key, obtained from the container software library
  • Git, for cloning the sample code repository
  • A Python environment, such as a Jupyter Notebook, for running the transcription and analysis code

Note: Podman provides a Docker-compatible command-line front end. Unless otherwise noted, all of the Docker commands in this tutorial should work for Podman if you simply alias the Docker CLI with the alias docker=podman shell command.

Steps

Step 1. Set up the environment

Step 1.1. Log in to the IBM Entitled Registry

The IBM Entitled Registry contains various container images for the Watson Speech to Text Service. After you obtain the entitlement key from the container software library, you can log in to the registry with the key and pull the container images to your local machine. Use the following command to log in to the registry.

echo $IBM_ENTITLEMENT_KEY | docker login -u cp --password-stdin cp.icr.io

Step 1.2. Clone the sample code repository

  1. Clone the sample code repository.

    git clone https://github.com/ibm-build-lab/Watson-Speech.git
    
  2. Go to the directory that contains the sample code for this tutorial.

    cd Watson-Speech/single-container-stt
    

Step 1.3. Build the container image

Use the provided Dockerfile to build a container image that includes two pretrained models (en-us-multimedia and fr-fr-multimedia), which support two languages: English (en_US) and French (fr_FR). You can add models for other languages by updating the provided Dockerfile as well as the env_config.json and sessionPools.yaml files in the chuck_var directory.

docker build . -t speech-standalone

Step 1.4. Run the container to start the service

Start the Watson Speech to Text Service by running a container from the image that you built in the previous step.

docker run --rm --publish 1080:1080 speech-standalone

The service runs in the foreground. You can now access it from a notebook or from your local machine.
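Before moving on, you can confirm that the service is reachable from Python. The following is a minimal sketch that queries the service's models endpoint; it assumes the default port mapping (1080) used in the docker run command above.

    # Quick reachability check for the locally running Speech to Text service
    import requests

    models_url = 'http://localhost:1080/speech-to-text/api/v1/models'
    response = requests.get(models_url)
    print(response.status_code)   # 200 means the service is up
    print(response.json())        # lists the speech models available in the container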

Step 2. Watson Speech to Text analysis

Step 2.1. Data loading and setting up the service

  1. Import and initialize some helper libraries that are used throughout the tutorial.

    # requests, json, and os are used later for calling the service and loading files
    import json
    import os
    import requests
    from matplotlib import pyplot as plt
    import IPython.display as ipd
    import librosa
    import pandas as pd
    import soundfile as sf
    %matplotlib inline
    
  2. Load the voice data.

     file_name = './Sample_dataset/harvard.wav'
    
  3. Create a custom function to plot the waveform (amplitude over time) and play the audio.

    def print_plot_play(fileName, text=''):
        x, Fs = librosa.load(fileName, sr=None)
        print('%s Fs = %d, x.shape = %s, x.dtype = %s' % (text, Fs, x.shape, x.dtype))
        plt.figure(figsize=(10, 5))
        plt.plot(x, color='blue')
        plt.xlim([0, x.shape[0]])
        plt.xlabel('Time (samples)')
        plt.ylabel('Amplitude')
        plt.tight_layout()
        plt.show()
        ipd.display(ipd.Audio(data=x, rate=Fs))
    

    (Image: waveform plot and audio player for the loaded sample file)

  4. Set up the parameters for using the Watson Speech to Text Service.

    # Setting up the headers for post request to service 
    headers = {"Content-Type": "audio/wav"}
    # Setting up params
    params ={'model':'en-US_Multimedia'}
    speech_to_text_url ='http://localhost:1080/speech-to-text/api/v1/recognize?'
    
  5. Create a function to get the values from the Watson Speech to Text Service.

    def getTextFromSpeech(headers, params, file_name):
        # Post the audio file to the service and return the raw JSON response as text
        with open(file_name, 'rb') as audio_file:
            r = requests.post(speech_to_text_url, headers=headers, params=params, data=audio_file)
        return r.text
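To confirm that the service and the helper function work end to end, you can transcribe the sample file that you loaded earlier. This is a minimal usage sketch that reuses the headers, params, and file_name objects defined above.

    # Transcribe the sample file and print the raw JSON returned by the service
    result = getTextFromSpeech(headers, params, file_name)
    print(result)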
    

Step 2.2. Speech data processing

Step 2.2.1. Background audio suppression
  1. Load the speech data, and plot its waveform.

    back_audio ='./Sample_dataset/samples_audio-files_11-ibm-culture-2min.wav'
    print_plot_play(back_audio, text='WAV file: ')
    

    (Image: waveform plot of the audio file that contains background noise)

  2. Create a custom function to print the transcripts from the service's JSON response.

    def show_result(result):
        json_obj = json.loads(result)
        results_data = json_obj['results']
        for result1 in results_data:
            for transcript in result1['alternatives']:
                print("Transcript ---  ", transcript['transcript'])
    
  3. Remove background noise from the data by passing the background_audio_suppression parameter in the request.

    params ={'model':'en-US_Telephony',"background_audio_suppression":"0.5"}
    result = getTextFromSpeech(headers,params,back_audio)
    show_result(result)
    

    (Image: transcript output with background audio suppression applied)

    You can see that after suppressing background audio, the service returns a cleaner, processed transcript.
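For comparison, you can transcribe the same noisy file without the background_audio_suppression parameter. This is an illustrative sketch; the params_plain and result_plain names are not from the original notebook.

    # Transcribe the same file without background audio suppression for comparison
    params_plain = {'model': 'en-US_Telephony'}
    result_plain = getTextFromSpeech(headers, params_plain, back_audio)
    show_result(result_plain)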

Step 2.2.2. Speech audio parsing
  1. Use the end_of_phrase_silence_time parameter for speech audio parsing.

    params = {'model': 'en-US_Multimedia', 'end_of_phrase_silence_time': '0.2'}
    result = getTextFromSpeech(headers, params, file_name)
    show_result(result)
    

    (Image: transcript output with end_of_phrase_silence_time applied)

    You can see that after speech audio parsing, the service returns a cleaner, processed transcript.
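To see how the silence threshold affects segmentation, you can compare a couple of values of end_of_phrase_silence_time. This is an illustrative sketch; the values shown (the documented default of 0.8 seconds and the 0.2 seconds used above) and the params_cmp name are assumptions, not the original notebook's code.

    # Compare transcripts produced with different end-of-phrase silence times
    for silence_time in ("0.8", "0.2"):
        params_cmp = {'model': 'en-US_Multimedia', 'end_of_phrase_silence_time': silence_time}
        print("end_of_phrase_silence_time =", silence_time)
        show_result(getTextFromSpeech(headers, params_cmp, file_name))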

Step 2.2.3. Speaker labels
  1. Set the speaker_labels parameter to find the number of speakers in the speech data.

    params ={'model':'en-US_Telephony',"speaker_labels":"true"}
    speaker_audio = './Sample_dataset/samples_audio-files_07-ibm-earnings-2min.wav'
    result_speaker = getTextFromSpeech(headers,params,speaker_audio)
    
  2. Create a custom function that identifies the speakers and labels each transcript segment with its speaker.

    def get_speaker_data(result_speaker):
        json_obj = json.loads(result_speaker)
        results_data = json_obj['results']
        speaker_data = json_obj['speaker_labels']
        speaker_dict = []
        # Collapse consecutive labels from the same speaker into one turn with a start and end time
        i = 0
        for speaker in speaker_data:
            if i == 0:
                temp_speaker = speaker['speaker']
                start_time = speaker['from']
                end_time = speaker['to']
            elif temp_speaker == speaker['speaker']:
                end_time = speaker['to']
            else:
                speaker_dict.append({'Speaker': temp_speaker, 'start_time': start_time, 'end_time': end_time})
                temp_speaker = speaker['speaker']
                start_time = speaker['from']
                end_time = speaker['to']
            i = i + 1
        speaker_dict.append({'Speaker': temp_speaker, 'start_time': start_time, 'end_time': end_time})
        # Match each transcript segment to the speaker turn that covers its end time
        for result1 in results_data:
            data = result1['alternatives']
            for time in data:
                i = 0
                for t in time['timestamps']:
                    if i == 0:
                        start_time = t[1]
                    elif i == len(time['timestamps']) - 1:
                        end_time = t[2]
                    i = i + 1
                for speaker in speaker_dict:
                    if speaker['end_time'] >= end_time:
                        print("Speaker ", speaker['Speaker'], "  ", time['transcript'])
                        break

    get_speaker_data(result_speaker)

    (Image: transcript output labeled by speaker)

    You can see the speakers in the transcript after adding the speaker_labels parameter to the API call.

Step 2.2.4. Response formatting and filtering

The Watson Speech to Text Service provides features that you can use to parse transcription results. You can format a final transcript to include more conventional representations of certain strings and to include punctuation. You can redact sensitive numeric information from a final transcript.

  1. Use the smart_formatting parameter to get conventional results.

    params = {'model': 'en-US_Telephony', 'smart_formatting': 'true', 'background_audio_suppression': '0.5'}
    result = getTextFromSpeech(headers, params, back_audio)
    show_result(result)
    

    (Image: transcript output with smart formatting applied)

    You can see that with smart_formatting set to true, dates, times, numbers, email addresses, and punctuation are formatted in their conventional forms. Response formatting and filtering therefore helps you get a cleaner, more readable transcript.
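As mentioned at the start of this step, the service can also redact sensitive numeric information from a final transcript. The following is a minimal sketch that uses the redaction parameter from the public Watson Speech to Text API; support for redaction varies by model, so verify that it applies to the models in your container before relying on it.

    # Redact sensitive numeric strings from the final transcript (model support varies)
    params_redact = {'model': 'en-US_Telephony', 'redaction': 'true'}
    result_redact = getTextFromSpeech(headers, params_redact, back_audio)
    show_result(result_redact)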

Step 3. Microphone recognition

To record voice in real time, this tutorial uses the open source SpeechRecognition and PyAudio (v0.2.12) Python libraries.

  1. Install the open source libraries.

    • pip install SpeechRecognition or pip3 install SpeechRecognition from the terminal, or !pip3 install SpeechRecognition from a Jupyter Notebook
    • brew install portaudio (on macOS, to install the PortAudio dependency that PyAudio requires)
    • pip install pyaudio or pip3 install pyaudio from the terminal, or !pip3 install pyaudio from a Jupyter Notebook
  2. Use a microphone to record the audio.

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Say something!")
        audio1 = r.listen(source)
    
  3. Use the Watson Speech to Text Service to transcribe the recorded audio.

    wav_data = audio1.get_wav_data(
        convert_rate=None if audio1.sample_rate >= 8000 else 8000,  # audio samples must be at least 8 kHz
        convert_width=2  # audio samples should be 16-bit
    )
    ipd.display(ipd.Audio(wav_data))

    r = requests.post(speech_to_text_url, headers=headers, params=params, data=wav_data)
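As a quick check, you can print the transcript that the service returns for the recorded clip. This is a minimal sketch that reuses the show_result helper defined in Step 2.2.1.

    # Print the transcript of the microphone recording (reuses show_result from Step 2.2.1)
    show_result(r.text)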
    

Step 4. Transcribe customer calls and extract meaningful insights by using the watson_nlp library

You can use the Watson Speech to Text Service to transcribe calls from customer care centers. You can then use these transcripts to extract insights with the watson_nlp library.

  1. Load the customer care call data. The data is available in the same Watson Speech GitHub repo.

    path = "./conusmer_speech_data"
    call_center_list = os.listdir(path)
    print(call_center_list)
    
  2. Create a function to combine the transcripts into one document.

    def get_result(result):
        output =""
        json_obj = json.loads(result)
        results_data = json_obj['results']
        for result1 in results_data:
            for transcript in result1['alternatives']:
                output = output+" "+transcript['transcript']
        return output
    
  3. Process all call center voice data, and create a list of documents.

    call_center_text_list=[]
    for file_name in call_center_list:
        result = getTextFromSpeech(headers,params,path+"/"+file_name)
        call_center_text_list.append(get_result(result))
    
  4. Load the relevant models from the watson_nlp library.

    import watson_nlp
    noun_phrases_model = watson_nlp.load(watson_nlp.download('noun-phrases_rbr_en_stock'))
    keywords_model = watson_nlp.load(watson_nlp.download('keywords_text-rank_en_stock'))
    syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
    
  5. Extend the stop words list to filter out the common stop words from analysis.

    stop_words = list(wnlp_stop_words)
    stop_words.extend(["gimme", "lemme", "cause", "'cuz", "imma", "gonna", "wanna", 
                       "gotta", "hafta", "woulda", "coulda", "shoulda", "howdy","day"])
    
  6. Remove the stop words, and lowercase the text in the transcripts.

    # Document-level preprocessing: remove stop words and strip common placeholder characters
    def clean(doc):
        stop_free = " ".join([word.replace('X', '').replace('/', '') for word in doc.split() if word.lower() not in stop_words])
        return stop_free
    
  7. Extract the keywords and phrases from the transcribed document.

    def extract_keywords(text):
        # Run the Syntax and Noun Phrases models
        syntax_prediction = syntax_model.run(text, parsers=('token', 'lemma', 'part_of_speech'))
        noun_phrases = noun_phrases_model.run(text)
        # Run the keywords model
        keywords = keywords_model.run(syntax_prediction, noun_phrases, limit=5)  
        keywords_list =keywords.to_dict()['keywords']
        key_list=[]
        for i in range(len(keywords_list)):
            key_list.append(keywords_list[i]['text'])
        return {'Complaint data':text,'Phrases':key_list}
    

    (Image: dataframe of call transcripts and extracted phrases)

  8. Remove unigrams and bigrams from the data set, and plot the most frequent phrases (a sketch of this step follows the list).

    (Image: bar chart of the most frequent phrases)

    You can see that some of the most frequent phrases in the recorded calls were 'accurate information customers' and 'concern file alert equifacts'. You can use these types of insights to understand pain points and major areas for improvement. For example, the customer service team can create self-service content or direct support to help customers, rather than first having to determine who needs to speak with the customer. A customer call that includes the words 'loan', 'mortgage', 'loan servicing', or 'loan payment issues' could be sent to the loan department for resolution.
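The full implementation of this step is in the repository notebook; the following is a minimal sketch of one way to do it, reusing the clean and extract_keywords helpers defined above. The three-word threshold and the plot styling are illustrative choices, not the notebook's exact code.

    # Clean each transcript, extract key phrases, and collect the results in a dataframe
    phrase_rows = [extract_keywords(clean(doc)) for doc in call_center_text_list]
    transcript_df = pd.DataFrame(phrase_rows)

    # Keep phrases with three or more words (drop unigrams and bigrams), then plot the most frequent ones
    all_phrases = [p for phrases in transcript_df['Phrases'] for p in phrases if len(p.split()) > 2]
    phrase_counts = pd.Series(all_phrases).value_counts().head(10)
    phrase_counts.plot(kind='barh', figsize=(10, 5), title='Most frequent phrases')
    plt.show()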

Conclusion

This tutorial walked you through the steps of starting a Watson Speech to Text Service, connecting to it through Docker in your local system, preprocessing the speech data set, and using the Watson Speech to Text Service to transcribe speech data. This tutorial also showed you how to extract meaningful insights from data by combining the functions of the Watson Speech to Text Service with the watson_nlp library. To try out the service, work through the Watson Speech To Text Analysis notebook.

For more examples of using embeddable AI, see the IBM Developer Embeddable AI page.