Overview
The Wikipedia Oriented Relatedness Dataset (WORD) is a new type of concept-relatedness dataset, composed of 19,276 pairs of Wikipedia concepts. It is the first human-annotated dataset of Wikipedia concepts, and it serves two purposes. First, it can serve as a benchmark for evaluating concept-relatedness methods. Second, it can be used as supervised training data for developing new concept-relatedness prediction models. Compared with term-relatedness datasets, its advantages include built-in disambiguation, since each concept is a Wikipedia concept rather than a bare term, and a wealth of meaningful multiword terms.
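To make the benchmark use concrete, the sketch below scores a hypothetical relatedness method against human annotations using Spearman rank correlation. The choice of metric is an assumption made for illustration, not an evaluation protocol stated in this document, and the `human` and `predicted` lists are made-up values rather than records from WORD.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (not taken from WORD).
human = [0.9, 0.1, 0.5, 0.7]      # hypothetical human relatedness scores
predicted = [0.8, 0.2, 0.4, 0.9]  # hypothetical model scores for the same pairs
print(spearman(human, predicted))
```

A value close to 1 means the method orders concept pairs much as the annotators did. Rank correlation is used here because it does not require the method's scores to be on the same scale as the human scores.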
Dataset Metadata
| Field | Value |
|---|---|
| Format | CSV |
| License | CC-BY-SA 3.0 |
| Domain | Natural Language Processing |
| Number of Records | 19,276 concept pairs |
| Data Split | NA |
| Size | 3.4 MB |
| Dataset Origin | IBM Project Debater |
| Dataset Version Update | Version 1 – June 06, 2017 |
| Data Coverage | Random concept pairs based on 38,552 randomly selected Wikipedia articles |
| Business Use Case | Automated Customer Service: Train a chatbot to label the concept type of a user query and compare it with the list of concepts the chatbot is capable of discussing. |
Dataset Archive Contents
| File or Folder | Description |
|---|---|
| AnnotationGuidelines.docx | The labeling task guidelines used to label concept pairs |
| WORD.csv | Raw data |
| LICENSE.txt | Terms of Use |
| README.txt | Explains dataset information |
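As a starting point for working with the raw data, here is a minimal sketch that reads a CSV into a header and a list of rows using Python's standard `csv` module. The column layout of `WORD.csv` is not described here, so the sample below uses hypothetical column names (`concept_1`, `concept_2`, `score`) and made-up rows. Inspect the real header before relying on any field.

```python
import csv
import io


def load_pairs(source):
    """Read CSV text from a file-like object and return (header, rows)."""
    reader = csv.reader(source)
    header = next(reader, [])            # first line is treated as the header
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows


# Stand-in data: column names and values are hypothetical, not from WORD.csv.
sample = io.StringIO(
    "concept_1,concept_2,score\n"
    "Coffee,Tea,0.8\n"
    "Coffee,Granite,0.1\n"
)
header, rows = load_pairs(sample)
print(header)
print(len(rows))
```

To apply this to the actual file, pass an open file handle instead of the in-memory sample, for example `open("WORD.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption.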
Data Glossary and Preview
Click here to explore the data glossary, sample records, and additional dataset metadata.
Use the Dataset
This dataset is complemented by a data exploration notebook to help you get started.
Related Links
- Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.
Citation
@inproceedings{dor2018semantic,
title={Semantic Relatedness of Wikipedia Concepts--Benchmark Data and a Working Solution},
author={Ein-Dor, Liat and Halfon, Alon and Kantor, Yoav and Levy, Ran and Mass, Yosi and Rinott, Ruty and Shnarch, Eyal and Slonim, Noam},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},
year={2018}
}