Overview
A large, high-quality benchmark dataset for mention detection. The goal of mention detection is to map entities and concepts mentioned in text to the correct concept in a knowledge base. The benchmark annotates both named entities and other types of entities, across different types of text ranging from clean text taken from Wikipedia to noisy spoken data. It was built through a highly controlled crowdsourcing process to ensure its quality. There are 3,000 sentences in total, with 6,375 mentions in the Wikipedia sentences and 6,239 mentions in the spoken sentences.
Dataset Metadata
Field | Value |
---|---|
Format | ANN |
License | CC-BY-SA 3.0 |
Domain | Natural Language Processing |
Number of Records | 3,000 sentences, with 6,375 mentions in the Wikipedia sentences and 6,239 mentions in the spoken sentences |
Data Split | Train: 1,500 sentences and their mentions; Test: 1,500 sentences and their mentions |
Size | 1.8MB |
Author | Yosi Mass, Lili Kotlerman |
Data Origin | IBM Research |
Dataset Version Update | Version 1 – January 25, 2018 |
Data Coverage | This dataset contains 3,000 sentences discussing different topics: 1,000 taken from Wikipedia articles, 1,000 from cleansed manual transcriptions, and 1,000 from the output of an automated speech recognition engine. Some of the topics covered are in Debatabase. |
Business Use Case | News & Entertainment: This dataset can be used to power content recommendation for news articles, shows, and so on. |
Dataset Archive Content
File or Folder | Description |
---|---|
README.txt | Readme of the mention detection dataset |
topics.csv | Contains the topic files that were associated with the sentences |
data | Data directory containing 6 folders for the 3 datasets |
attribution | Contains attribution files for the Wikipedia sentences, with pointers to the articles from which they were taken |
Data Glossary and Preview
Click here to explore the data glossary, sample records, and additional dataset metadata.
Use the Dataset
This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook
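As a rough starting point outside the notebook, the sketch below reads mention annotations from ANN content. It assumes the ANN files follow the brat standoff convention for text-bound annotations (an ID starting with `T`, then a tab, a label with character offsets, another tab, and the covered text); the README.txt in the archive is the authority on the actual layout, which may differ. The `Mention` type, the `parse_ann` helper, and the sample string are hypothetical and made up for illustration; they are not part of the dataset.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """One text-bound annotation (hypothetical container, not from the dataset)."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets
    text: str


def parse_ann(content: str) -> List[Mention]:
    """Parse text-bound annotations, ASSUMING brat standoff layout."""
    mentions = []
    for line in content.splitlines():
        # Only lines whose ID starts with "T" are text-bound annotations;
        # notes, relations, and blank lines are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, label_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = label_and_offsets.partition(" ")
        spans = []
        # Discontinuous mentions separate their fragments with ";".
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        mentions.append(Mention(ann_id, label, spans, text))
    return mentions


# Made-up sample in the assumed layout (not taken from the dataset).
sample = (
    "T1\tMention 0 6\tCanada\n"
    "T2\tMention 18 25\tcountry\n"
    "#1\tAnnotatorNotes T1\tillustrative note\n"
)
mentions = parse_ann(sample)
for m in mentions:
    print(m.ann_id, m.label, m.spans, m.text)
```

To use this on the archive, the content of each ANN file under the data directory would be read as a string and passed to `parse_ann`; the folder and file names are not specified here, so they are left to the README.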
Citation
@misc{mass2018did,
title={What did you Mention? A Large Scale Mention Detection Benchmark for Spoken and Written Text},
author={Yosi Mass and Lili Kotlerman and Shachar Mirkin and Elad Venezian and Gera Witzling and Noam Slonim},
year={2018},
eprint={1801.07507},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Related Links
- IBM Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.