IBM Debater® Recorded Debating #1

Overview

Engaging in a competitive debate requires Project Debater to effectively rebut arguments raised by the human opponent. The system must listen to an argumentative speech in real time, understand the main arguments, and produce persuasive counter-arguments.

The nature of the argumentation domain and the characteristics of competitive debates make the understanding of such spoken content challenging. Expressed ideas often span multiple non-consecutive sentences and many arguments are alluded to rather than explicitly stated. Further difficulty stems from the requirement to identify and rebut the most important parts of a speech that is several minutes long. This contrasts with today’s conversational agents, which aim at understanding a single functional command from short inputs. The goal of this dataset is to form a basis for the development of listening comprehension algorithms in this challenging setting.

Release #1 of the dataset contains 60 recorded speeches on 16 controversial topics, and details the recording process.

The recorded debates are provided in various formats:

  • The recorded audio (wav files)
  • Text produced from the audio using an automatic speech recognition (ASR) system (text files)
  • A manually corrected transcript of the ASR text, created by expert annotators (text files)

Both the ASR and transcript texts are given in their raw form, designating also the time within the audio in which each utterance was said, and in another “NLP-friendly” clean version containing only the spoken words.
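
Because the clean ASR text and the clean manual transcript contain only the spoken words, one natural use is to measure how far the ASR output is from the reference. The following is a minimal sketch of a word error rate (WER) computation using word-level edit distance. The function names and the lowercase, whitespace-based normalization are assumptions made for illustration; they are not part of the dataset or of any tooling shipped with it.

```python
def normalize(text):
    """Lowercase and split on whitespace (an assumed, simplistic normalization)."""
    return text.lower().split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the
    number of words in the reference."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

To apply it to a speech, read the clean manual transcript and the clean ASR text for the same recording into strings and pass them as `reference` and `hypothesis` respectively. A real evaluation would likely need more careful normalization (punctuation, numerals, hesitations) than this sketch applies.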

Get this Dataset

| Data Description | Zipped File Name |
| --- | --- |
| Full (Original) Dataset, 1.75 GB | recorded-debating-1.tar.gz |
| Sample Dataset, 213.6 MB | sample.tar.gz |

Dataset Metadata

| Field | Value |
| --- | --- |
| Format | WAV, TXT |
| License | CC-BY-SA 3.0 |
| Domain | Natural Language Processing |
| Number of Records | 60 speeches |
| Data Split | NA |
| Size | 1.6 GB |
| Author | Shachar Mirkin, Michal Jacovi, Tamar Lavee, Hong-kwang Kuo, Samuel Thomas, Leslie Sager, Lili Kotlerman, Elad Venezian, Noam Slonim |
| Dataset Origin | [IBM Research](https://www.research.ibm.com/artificial-intelligence/project-debater/) |
| Dataset Version Update | Version 2 – June 29, 2020; Version 1 – August 3, 2019 |
| Data Coverage | 60 recorded speeches on 16 controversial topics, with details of the recording process |
| Business Use Case | Debate – Develop systems that engage in a competitive debate and effectively rebut arguments raised by a human opponent. Government – Analyze sentiment of political topics and conversations. |

Dataset Archive Content

| File or Folder | Description |
| --- | --- |
| .wav folder | Audio files (speeches) |
| .wav.asr folder | Raw ASR transcripts |
| .wav.asr.txt folder | Post-processed (“clean”) ASR transcripts |
| .trs folder | Manual transcripts (Transcriber’s format) |
| .trs.txt folder | Processed (clean) manual transcripts (“references”) |
| LICENSE.txt | Plaintext version of the CC-BY-SA 3.0 license |
| README.txt | Text file with the file names, file folders and description |
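
Since each speech appears in several forms, a common first step is to collect all files belonging to the same speech. The sketch below assumes that the five content types can be told apart by the file suffixes listed above (`.wav`, `.wav.asr`, `.wav.asr.txt`, `.trs`, `.trs.txt`) and that files of one speech share the same base name; both points are assumptions, so check README.txt in the archive for the actual layout. The helper names `split_suffix` and `group_by_speech` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes from the archive listing, longest first, so that
# "x.wav.asr.txt" is not mistaken for "x.wav.asr" or "x.wav".
SUFFIXES = [".wav.asr.txt", ".wav.asr", ".trs.txt", ".wav", ".trs"]


def split_suffix(filename):
    """Return (speech_id, suffix), or None if no known suffix matches."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return None


def group_by_speech(root):
    """Map each speech id to a dict of {suffix: path} found under root."""
    groups = defaultdict(dict)
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        parts = split_suffix(path.name)
        if parts is not None:
            speech_id, suffix = parts
            groups[speech_id][suffix] = path
    return dict(groups)
```

Calling `group_by_speech` on the directory where the archive was extracted returns one entry per speech; files such as LICENSE.txt and README.txt match none of the suffixes and are skipped.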

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook

Citation

@InProceedings{MIRKIN18.66,
author = {Shachar Mirkin and Michal Jacovi and Tamar Lavee and Hong-Kwang Kuo and Samuel Thomas and Leslie Sager and Lili Kotlerman and Elad Venezian and Noam Slonim},
title = "{Recorded Debating Speeches}",
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018}
}
  • Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.