Overview
This model converts speech to text. It takes a short (~5 second), single-channel WAV file containing English-language speech as input and returns a string containing the predicted transcription.
The model expects 16 kHz audio; if the input has a different sample rate, it will be resampled to 16 kHz. Note that resampling will likely reduce the model's accuracy.
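To avoid the internal resampling, you can convert audio to the expected format ahead of time. A minimal sketch using ffmpeg (ffmpeg is not part of this project, and the file names are placeholders):

```bash
# Convert an arbitrary audio file to a 16-bit, 16 kHz, mono WAV file
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
```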
The code for this model comes from Mozilla’s Project DeepSpeech and is based on Baidu’s Deep Speech research paper.
Model Metadata
Domain | Application | Industry | Framework | Training Data | Input Data Format |
---|---|---|---|---|---|
Audio | Speech Recognition | General | TensorFlow | Mozilla Common Voice | Audio (16 bit, 16 kHz, mono WAV file) |
References
- Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition”, arXiv:1412.5567
- Mozilla DeepSpeech
Licenses
Component | License | Link |
---|---|---|
Model Github Repository | Apache 2.0 | LICENSE |
Model Weights | Mozilla Public License 2.0 | Mozilla DeepSpeech |
Model Code (3rd party) | Mozilla Public License 2.0 | DeepSpeech LICENSE |
Test assets | Various | Samples README |
Options available for deploying this model
This model can be deployed using the following mechanisms:
Deploy from Docker Hub:
To run the docker image, which automatically starts the model serving API, run:
```bash
docker run -it -p 5000:5000 codait/max-speech-to-text-converter
```
This will pull a pre-built image from Docker Hub (or use an existing image if already cached locally) and run it. If you'd rather check out and build the model locally, you can follow the run-locally steps below.
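Once the container is up, you can sanity-check the API before sending any audio. MAX models typically serve a Swagger UI at the root URL and expose a metadata endpoint alongside the prediction endpoint; a quick check:

```bash
# Should return JSON describing the model (id, name, license, etc.)
curl http://localhost:5000/model/metadata
```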
Deploy on Red Hat OpenShift:
Follow the instructions for the OpenShift web console or the OpenShift Container Platform CLI in this tutorial and specify `codait/max-speech-to-text-converter` as the image name.
Deploy on Kubernetes:
You can also deploy the model on Kubernetes using the latest docker image on Docker Hub.
On your Kubernetes cluster, run the following commands:
```bash
kubectl apply -f https://raw.githubusercontent.com/IBM/max-speech-to-text-converter/master/max-speech-to-text-converter.yaml
```
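You can then watch the rollout until the model pod is running (standard kubectl, not specific to this model):

```bash
# Wait for the pod to reach the Running state
kubectl get pods --watch
```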
A more elaborate tutorial on how to deploy this MAX model to production on IBM Cloud can be found here.
The model will be available internally at port `5000`, but can also be accessed externally through the `NodePort`.
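To find the externally exposed port, inspect the service. A minimal sketch, assuming the service created by the YAML above is named `max-speech-to-text-converter` (verify the name with `kubectl get services`):

```bash
# The PORT(S) column shows the internal:NodePort mapping, e.g. 5000:3xxxx/TCP
kubectl get service max-speech-to-text-converter
```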
Locally: follow the instructions in the model README on GitHub.
Example Usage
You can test or use this model in the following ways:
Test the model using cURL
Once deployed, you can test the model from the command line. For example, if running locally:

```bash
curl -F "audio=@samples/8455-210777-0068.wav" -X POST http://localhost:5000/model/predict
```

You should see a JSON response like this:

```json
{"status": "ok", "prediction": "your power is sufficient i said"}
```
Test the model in a serverless app
You can use this model in a serverless application by following the instructions in the Leverage deep learning in IBM Cloud Functions tutorial.
Resources and Contributions
If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions here.