IBM Debater® Mention Detection Benchmark

Overview

A large, high-quality benchmark dataset for mention detection. The goal of mention detection is to map entities and concepts mentioned in text to the correct concept in a knowledge base. The benchmark annotates both named entities and other types of entities across different types of text, ranging from clean text taken from Wikipedia to noisy spoken data. It was built through a highly controlled crowdsourcing process to ensure its quality. It contains 3,000 sentences, with a total of 6,375 mentions in the Wikipedia sentences and 6,239 mentions in the spoken sentences.

Dataset Metadata

Format: ANN
License: CC-BY-SA 3.0
Domain: Natural Language Processing
Number of Records: 3,000 sentences, with 6,375 mentions in the Wikipedia sentences and 6,239 mentions in the spoken sentences
Data Split: Train – 1,500 sentences with their mentions; Test – 1,500 sentences with their mentions
Size: 1.8 MB
Author: Yosi Mass, Lili Kotlerman
Data Origin: IBM Research
Dataset Version Update: Version 1 – January 25, 2018
Data Coverage: This dataset contains 3,000 sentences discussing different topics, taken from Wikipedia articles (1,000), cleaned manual transcriptions (1,000), and the output of an automated speech recognition engine (1,000). Some of the topics covered are in Debatabase.
Business Use Case: News & Entertainment. This dataset can be used to power content recommendation for news articles, shows, and so on.

Dataset Archive Content

README.txt: Readme of the mention detection dataset
topics.csv: Contains the topic files that were associated with the sentences
data: Directory containing 6 folders for the 3 datasets
attribution: Contains attribution files for the Wikipedia sentences, with pointers to the articles from which they were taken
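
The metadata lists the annotation format as ANN. The following is a minimal sketch of how such annotations might be read in Python, assuming the files follow the brat-style standoff convention (one text-bound annotation per line: an ID, then a label with start and end character offsets, then the covered text, separated by tabs). That layout is an assumption and should be checked against README.txt in the archive. The `Mention` class, the `parse_ann` function, the `Mention` label, and the sample string are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    ann_id: str  # annotation ID, e.g. "T1"
    label: str   # annotation label (hypothetical in the sample below)
    start: int   # start character offset
    end: int     # end character offset
    text: str    # surface text covered by the mention


def parse_ann(content: str) -> List[Mention]:
    """Parse text-bound annotation lines, ASSUMING brat-style standoff format."""
    mentions = []
    for line in content.splitlines():
        # Only lines whose ID starts with "T" are text-bound annotations;
        # notes, relations, and blank lines are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        ann_id, label_and_span, text = parts
        fields = label_and_span.split()
        # Skip anything that is not a single contiguous "label start end" span.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        label, start, end = fields
        mentions.append(Mention(ann_id, label, int(start), int(end), text))
    return mentions


# Made-up sample content, not taken from the dataset.
sample = (
    "T1\tMention 0 6\tFrance\n"
    "T2\tMention 22 27\tParis\n"
    "#1\tAnnotatorNotes T1\tan illustrative note\n"
)
mentions = parse_ann(sample)
for m in mentions:
    print(m.ann_id, m.start, m.end, m.text)
```

To apply this to the archive, each annotation file under the data directory could be read as a string and passed to `parse_ann`. The folder layout is not described here, so any file paths would need to be taken from README.txt.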

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook

Citation

@misc{mass2018did,
      title={What did you Mention? A Large Scale Mention Detection Benchmark for Spoken and Written Text},
      author={Yosi Mass and Lili Kotlerman and Shachar Mirkin and Elad Venezian and Gera Witzling and Noam Slonim},
      year={2018},
      eprint={1801.07507},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
  • IBM Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.