IBM Debater® Wikipedia Oriented Relatedness – IBM Developer

The Wikipedia Oriented Relatedness Dataset (WORD) is a new type of concept-relatedness dataset, composed of 19,276 pairs of Wikipedia concepts. It is the first human-annotated dataset of Wikipedia concepts, and it serves two purposes. First, it can act as a benchmark for evaluating concept-relatedness methods. Second, it can be used as supervised data for developing new models for concept-relatedness prediction. Compared with its term-relatedness counterparts, the dataset has two advantages: built-in disambiguation and a wealth of meaningful multiword terms.
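
To make the benchmark use concrete, here is a minimal sketch of how a relatedness method's predictions might be compared against human scores using Spearman rank correlation. The choice of Spearman correlation is an assumption made for illustration, not necessarily the metric used by the dataset authors, and the function names (`average_ranks`, `pearson`, `spearman`) are hypothetical helpers written for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(human_scores, predicted_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(human_scores), average_ranks(predicted_scores))
```

A method whose predictions order the concept pairs exactly as the annotators did scores close to 1.0, and a method that reverses the order scores close to -1.0. The sketch does not handle the degenerate case where all scores in one list are identical, which would cause a division by zero.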

Dataset Metadata

Field                    Value
Format                   CSV
License                  CC-BY-SA 3.0
Domain                   Natural Language Processing
Number of Records        19,276 concept pairs
Data Split               NA
Size                     3.4 MB
Dataset Origin           IBM Project Debater
Dataset Version Update   Version 1 – June 06, 2017
Data Coverage            Random concept pairs based on 38,552 randomly selected Wikipedia articles
Business Use Case        Automated Customer Service: Train a chatbot to label a user query's concept type and compare it with the list of available concepts the chatbot is capable of discussing.
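
The customer-service use case above can be sketched as a lookup: given a concept extracted from a user query, pick the most related concept among those the chatbot supports. This is only an illustrative sketch. The function `best_supported_concept`, the `threshold` parameter, and the toy scores are all assumptions made for this example; the scores are invented and are not taken from WORD. In practice, `relatedness` would be a model trained or evaluated on the dataset.

```python
# Made-up toy scores for illustration only; NOT values from the WORD dataset.
TOY_SCORES = {
    frozenset(["Refund", "Money back guarantee"]): 0.9,
    frozenset(["Refund", "Shipping"]): 0.3,
}


def toy_relatedness(concept_a, concept_b):
    """Hypothetical stand-in for a trained concept-relatedness model."""
    if concept_a == concept_b:
        return 1.0
    return TOY_SCORES.get(frozenset([concept_a, concept_b]), 0.0)


def best_supported_concept(query_concept, supported_concepts, relatedness, threshold=0.5):
    """Return the supported concept most related to the query concept.

    Returns None when nothing scores at or above the (assumed) threshold,
    so the chatbot can fall back to a default answer.
    """
    scored = [(relatedness(query_concept, c), c) for c in supported_concepts]
    if not scored:
        return None
    best_score, best_concept = max(scored, key=lambda pair: pair[0])
    return best_concept if best_score >= threshold else None
```

With the toy scores, a query about "Refund" maps to "Money back guarantee", while a query concept with no related supported concept returns `None`.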

Dataset Archive Contents

File or Folder              Description
AnnotationGuidelines.docx   The labeling task guidelines used to label concept pairs
WORD.csv                    Raw data
LICENSE.txt                 Terms of Use
README.txt                  Explains dataset information
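
As a starting point for working with the raw data, the following sketch reads concept pairs from a CSV file using Python's standard `csv` module. The column names `concept_1`, `concept_2`, and `score` are assumptions made for illustration; the actual header of WORD.csv should be checked in README.txt, and the names can be passed in as arguments.

```python
import csv
import io


def read_concept_pairs(csv_file, first_col="concept_1", second_col="concept_2",
                       score_col="score"):
    """Read (concept, concept, score) tuples from an open CSV file.

    The default column names are hypothetical; pass the real header names
    from WORD.csv once they are known.
    """
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        pairs.append((row[first_col], row[second_col], float(row[score_col])))
    return pairs


# Tiny in-memory sample with invented rows, used in place of the real file.
sample = io.StringIO("concept_1,concept_2,score\nParis,France,0.9\n")
sample_pairs = read_concept_pairs(sample)
```

To read the real file, open it with `open("WORD.csv", newline="", encoding="utf-8")` and pass the file object to `read_concept_pairs`, together with the actual column names.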

Data Glossary and Preview

The data glossary, sample records, and additional dataset metadata are available from the dataset's page on IBM Developer.

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started:

  • Project Debater: Project Debater is the first AI system that can debate humans on complex topics. Its goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.


title={Semantic Relatedness of Wikipedia Concepts--Benchmark Data and a Working Solution},
author={Ein-Dor, Liat and Halfon, Alon and Kantor, Yoav and Levy, Ran and Mass, Yosi and Rinott, Ruty and Shnarch, Eyal and Slonim, Noam},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},