SimpleQuestions Relation Detection


The SimpleQuestions Relation Detection dataset is a set of relation extraction annotations derived from the SimpleQuestions dataset. Entries appear in the same order as the questions in the SimpleQuestions dataset, and each entry consists of three tab-separated fields: gold_relations \t negative_relation_pool \t question. Relation ids are resolved through a separate file, relation.2M.list, in which the ids are indexed starting at 1. The dataset is split into train, validation, and test sets that match the split used by the SimpleQuestions data.
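The entry format described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset distribution: the names RelationExample and parse_line are hypothetical, and it assumes that multiple relation ids inside a field are separated by whitespace and that any non-numeric token in an id field can be skipped.

```python
from typing import List, NamedTuple


class RelationExample(NamedTuple):
    """One entry: gold relation ids, negative pool ids, question text."""
    gold_relations: List[int]
    negative_pool: List[int]
    question: str


def _to_ids(field: str) -> List[int]:
    # Assumption: ids are whitespace-separated; non-numeric tokens are skipped.
    return [int(tok) for tok in field.split() if tok.isdigit()]


def parse_line(line: str) -> RelationExample:
    """Split a line into the three tab-separated fields named in the format."""
    gold, pool, question = line.rstrip("\n").split("\t", 2)
    return RelationExample(_to_ids(gold), _to_ids(pool), question)


# Illustrative line only (made up for this sketch, not copied from the data).
example = parse_line("3\t7 3 12\twhat is an example question ?\n")
```

The ids returned here are 1-based positions in relation.2M.list, so they need that file to be turned into relation names.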

Dataset Metadata

| Field | Value |
| --- | --- |
| Format | TSV |
| License | CDLA-Permissive |
| Domain | Natural Language Processing |
| Number of Records | 108,442 questions |
| Data Split | 77,524 training questions; 10,309 validation questions; 20,609 test questions |
| Size | 7.7 MB |
| Dataset Origin | Original SimpleQuestions dataset from Facebook Research, derived annotations by IBM Research |
| Dataset Version Update | Version 1 – May 07, 2020 |
| Data Coverage | Randomized facts from the Freebase knowledge base |
| Business Use Case | Linguistics: Train a relation extraction model that can be used to build a family tree graph from autobiographical text. |

Dataset Archive Content

| File or Folder | Description |
| --- | --- |
| train.replace_ne.withpool | Questions in the training subset |
| valid.replace_ne.withpool | Questions in the validation subset |
| test.replace_ne.withpool | Questions in the testing subset |
| relation.2M.list | Relation id mappings |
| LICENSE.txt | Plaintext version of the CDLA-Permissive license |
| README.txt | Text file with the file names and description |
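Because the relation ids start at 1 rather than 0, an off-by-one error is easy to make when loading relation.2M.list. The sketch below shows one way to build the lookup. It assumes the file holds one relation name per line and that the line number is the id; the function names are hypothetical and the layout should be checked against README.txt.

```python
from typing import Dict, Iterable


def build_relation_index(lines: Iterable[str]) -> Dict[int, str]:
    """Map 1-based relation ids to relation names (assumes one name per line)."""
    return {i: line.strip() for i, line in enumerate(lines, start=1)}


def load_relation_index(path: str) -> Dict[int, str]:
    """Read the mapping file from disk, e.g. path="relation.2M.list"."""
    with open(path, encoding="utf-8") as handle:
        return build_relation_index(handle)
```

With the mapping in hand, load_relation_index("relation.2M.list")[1] would return the name on the first line of the file.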

Data Glossary and Preview

A data glossary, sample records, and additional dataset metadata are available for exploration.

Use the Dataset

This dataset is complemented by a completed data exploration notebook to help you get started.


 title={Improved Neural Relation Detection for Knowledge Base Question Answering},
 author={Yu, Mo and Yin, Wenpeng and Hasan, Kazi Saidul and dos Santos, Cicero and Xiang, Bing and Zhou, Bowen},
 booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},