TensorFlow Speech Commands


The TensorFlow Speech Commands dataset contains 65,000 one-second audio clips covering 30 common words. 20 of the words are core words with numerous examples per speaker, while the other 10 are auxiliary words with fewer examples per speaker (in many cases just a single example). The audio clips were collected by Google and recorded by volunteers in uncontrolled locations around the world. Because recording conditions were not controlled, the samples contain varying amounts of background noise and a range of speaker accents, which increases the diversity of the dataset.

The core words in this dataset are “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, “Zero”, “One”, “Two”, “Three”, “Four”, “Five”, “Six”, “Seven”, “Eight”, and “Nine”. The auxiliary words are “Bed”, “Bird”, “Cat”, “Dog”, “Happy”, “House”, “Marvin”, “Sheila”, “Tree”, and “Wow”. The core words can be used to train a speech commands classifier, while the auxiliary words are included to help algorithms distinguish the commands from unrecognized words.
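One possible way to use this split is to give each core word its own class and fold every auxiliary word into a single catch-all class. The sketch below illustrates that idea only; the "unknown" class name, the lowercase spelling of the words, and the label_for_word helper are assumptions made for this example and are not part of the dataset.

```python
# Word lists copied from the dataset description (lowercased for matching).
CORE_WORDS = [
    "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go",
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
]
AUXILIARY_WORDS = [
    "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila", "tree", "wow",
]

# Hypothetical catch-all class for words the classifier should not act on.
UNKNOWN_LABEL = "unknown"

# One class per core word, plus the catch-all class at the end.
CLASS_NAMES = CORE_WORDS + [UNKNOWN_LABEL]
CLASS_INDEX = {name: index for index, name in enumerate(CLASS_NAMES)}


def label_for_word(word: str) -> int:
    """Map a spoken word to a class index (hypothetical helper)."""
    normalized = word.strip().lower()
    if normalized in CORE_WORDS:
        return CLASS_INDEX[normalized]
    if normalized in AUXILIARY_WORDS:
        return CLASS_INDEX[UNKNOWN_LABEL]
    raise ValueError(f"{word!r} is not one of the 30 dataset words")
```

With this mapping, label_for_word("Yes") selects the first core class, while label_for_word("Marvin") falls into the catch-all class, so a classifier trained on these labels sees 21 classes in total.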

In addition to the 30 words, the dataset includes a collection of background noise audio files. The background noise is either recorded or mathematically generated, and it can be mixed into the word clips to further vary the dataset.
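A minimal sketch of such mixing follows. It assumes that a clip and a noise recording have already been decoded into lists of float samples in the range [-1.0, 1.0] at the same sample rate. The mix_background function and its noise_volume parameter are hypothetical names chosen for this example, and only the Python standard library is used.

```python
import random


def mix_background(clip, noise, noise_volume=0.1, rng=None):
    """Add a randomly positioned, scaled slice of noise to a clip.

    clip and noise are sequences of float samples in [-1.0, 1.0]
    (an assumption of this sketch). Returns a new list of samples.
    """
    rng = rng or random.Random()
    if len(noise) < len(clip):
        raise ValueError("noise must be at least as long as the clip")

    # Pick a random window of the noise that is as long as the clip.
    start = rng.randint(0, len(noise) - len(clip))
    window = noise[start:start + len(clip)]

    # Scale the noise and add it to the clip sample by sample.
    mixed = [sample + noise_volume * n for sample, n in zip(clip, window)]

    # Clamp so the result stays inside the valid sample range.
    return [max(-1.0, min(1.0, value)) for value in mixed]
```

Drawing a different noise window and a different noise_volume for each training example is one simple way to get several noisy variants out of the same word clip.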

Dataset Metadata

Format: WAV
License: CC BY 4.0
Domain: Audio
Number of Records: 65,000 WAV files
Size: 1.49 GB