IBM Debater® Claim Sentences Search

Overview

Claims are short phrases that an argument aims to prove. The goal of the Claim Sentence Search task is to detect sentences containing claims in a large corpus, given a debatable topic or motion. The dataset contains the results of the q_mc query (sentences that mention a given topic, as described in the paper), 1.49M sentences in total. It also includes a claim sentence test set made up of the 2.5k top-predicted sentences of our model, along with their labels. All sentences were retrieved from Wikipedia 2017.

Dataset Metadata

Field                   Value
Format                  CSV
License                 CC-BY-SA 3.0
Domain                  Natural Language Processing
Number of Records       1,492,080 records
Data Split              844,303 train records
                        319,513 validation records
                        328,264 test records
Size                    571 MB
Author                  Ran Levy, Ben Bogin
Data Origin             IBM Research
Dataset Version Update  Version 1 – August 20, 2018
Data Coverage           Wikipedia May 2017 dump
Business Use Case       Retail
                        The dataset can be used to train a model to recommend books based on the book description data.
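
As a quick sanity check on the metadata, the three split sizes should add up to the total number of records. The sketch below only restates the counts listed in the table; the names `SPLITS`, `TOTAL_RECORDS`, and `split_fractions` are illustrative and not part of the dataset.

```python
# Record counts copied from the metadata table (names are illustrative).
SPLITS = {"train": 844_303, "validation": 319_513, "test": 328_264}
TOTAL_RECORDS = 1_492_080


def split_fractions(splits):
    """Return each split's share of the summed record count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    # The split sizes should account for every record in the dataset.
    assert sum(SPLITS.values()) == TOTAL_RECORDS
    for name, fraction in split_fractions(SPLITS).items():
        print(f"{name}: {fraction:.1%}")
```

With these counts the split is roughly 57% train, 21% validation, and 22% test.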

Dataset Archive Content

File or Folder         Description
readme_mc_queries.txt  Readme of the claim sentence search results
readme_test_set.txt    Readme of the test set
q_mc_train.csv         Sentences retrieved by the q_mc query on 70 train topics
q_mc_heldout.csv       Sentences retrieved by the q_mc query on 30 heldout topics
q_mc_test.csv          Sentences retrieved by the q_mc query on 50 test topics
test_set.csv           Top predictions of our system along with their labels
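
A minimal sketch for taking a first look at one of the CSV files, using only the Python standard library. It assumes the archive has been extracted to the working directory and that each file is UTF-8, comma-delimited, and begins with a header row; none of this is confirmed here, so check the readme files first. The helper names `preview_csv` and `count_records` are hypothetical, and no column names are assumed.

```python
import csv
import os
from itertools import islice


def preview_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # assumption: the first row is a header
        rows = list(islice(reader, n_rows))
    return header, rows


def count_records(path):
    """Count data rows (header excluded) by streaming through the file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


if __name__ == "__main__":
    path = "q_mc_train.csv"  # assumed location after extracting the archive
    if os.path.exists(path):
        header, rows = preview_csv(path)
        print("columns:", header)
        for row in rows:
            print(row)
        print("records:", count_records(path))
```

Streaming with `csv.reader` keeps memory use low, which helps since the full dataset is several hundred megabytes.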

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started.

Citation

@inproceedings{levy-etal-2018-towards,
title = "Towards an argumentative content search engine using weak supervision",
author = "Levy, Ran and
Bogin, Ben and
Gretz, Shai and
Aharonov, Ranit and
Slonim, Noam",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
month = aug,
year = "2018",
address = "Santa Fe, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/C18-1176",
pages = "2066--2081",
abstract = "Searching for sentences containing claims in a large text corpus is a key component in developing an argumentative content search engine. Previous works focused on detecting claims in a small set of documents or within documents enriched with argumentative content. However, pinpointing relevant claims in massive unstructured corpora, received little attention. A step in this direction was taken in (Levy et al. 2017), where the authors suggested using a weak signal to develop a relatively strict query for claim{--}sentence detection. Here, we leverage this work to define weak signals for training DNNs to obtain significantly greater performance. This approach allows to relax the query and increase the potential coverage. Our results clearly indicate that the system is able to successfully generalize from the weak signal, outperforming previously reported results in terms of both precision and coverage. Finally, we adapt our system to solve a recent argument mining task of identifying argumentative sentences in Web texts retrieved from heterogeneous sources, and obtain F1 scores comparable to the supervised baseline.",
}
  • Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.