Overview

Wikipedia is a very popular source of encyclopedic knowledge that provides highly reliable articles in a variety of domains. This richness and popularity have created a strong motivation among NLP researchers to develop relatedness measures between Wikipedia concepts. In this paper, we introduce WORD (Wikipedia Oriented Relatedness Dataset), a new type of concept relatedness dataset composed of 19,276 pairs of Wikipedia concepts. This is the first human-annotated dataset of Wikipedia concepts, and its purpose is twofold. On the one hand, it can serve as a benchmark for evaluating concept-relatedness methods. On the other hand, it can be used as supervised data for developing new models for concept relatedness prediction. Among the advantages of this dataset over its term-relatedness counterparts are its built-in disambiguation solution and its richness in meaningful multiword terms. Based on this benchmark, we developed a new tool, named WORT (Wikipedia Oriented Relatedness Tool), for measuring the level of relatedness between pairs of concepts.

Dataset Metadata

Format: CSV
License: CC-BY-SA 3.0
Domain: Natural Language Processing
Number of Records: 19,276
Size: 3.4 MB
Originally Published: June 01, 2017

Example Records

# source article URI,concept 1,concept 2,score,concept1 URI,concept2 URI,Train/Test
https://en.wikipedia.org/wiki/Organic_food,Organic farming,Organic food,1,https://en.wikipedia.org/wiki/Organic_farming,https://en.wikipedia.org/wiki/Organic_food,Test
https://en.wikipedia.org/wiki/Video_game_controversies,Video game development,Filmmaking,0.1,https://en.wikipedia.org/wiki/Video_game_development,https://en.wikipedia.org/wiki/Filmmaking,Train
https://en.wikipedia.org/wiki/Multiculturalism,Yugoslav Partisans,Asia,0,https://en.wikipedia.org/wiki/Yugoslav_Partisans,https://en.wikipedia.org/wiki/Asia,Train
https://en.wikipedia.org/wiki/School_voucher,Sweden,Michelle Rhee,0,https://en.wikipedia.org/wiki/Sweden,https://en.wikipedia.org/wiki/Michelle_Rhee,Train
https://en.wikipedia.org/wiki/Intact_dilation_and_extraction,Pain,Henrietta Lacks,0,https://en.wikipedia.org/wiki/Pain,https://en.wikipedia.org/wiki/Henrietta_Lacks,Train
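
The records above can be read with a standard CSV parser. The following is a minimal sketch, not an official loader: it assumes the file has exactly the seven columns shown in the header line, that the header line starts with `#`, and that any concept name containing a comma is quoted. The names `ConceptPair`, `load_pairs`, and `SAMPLE` are hypothetical and introduced only for illustration; `SAMPLE` holds three rows copied from the example records so the snippet runs without the full file.

```python
import csv
import io
from dataclasses import dataclass

# Three rows copied from the example records; a real run would open the CSV file instead.
SAMPLE = """\
# source article URI,concept 1,concept 2,score,concept1 URI,concept2 URI,Train/Test
https://en.wikipedia.org/wiki/Organic_food,Organic farming,Organic food,1,https://en.wikipedia.org/wiki/Organic_farming,https://en.wikipedia.org/wiki/Organic_food,Test
https://en.wikipedia.org/wiki/Video_game_controversies,Video game development,Filmmaking,0.1,https://en.wikipedia.org/wiki/Video_game_development,https://en.wikipedia.org/wiki/Filmmaking,Train
https://en.wikipedia.org/wiki/Multiculturalism,Yugoslav Partisans,Asia,0,https://en.wikipedia.org/wiki/Yugoslav_Partisans,https://en.wikipedia.org/wiki/Asia,Train
"""


@dataclass
class ConceptPair:
    """One annotated pair of Wikipedia concepts (hypothetical container type)."""
    source_uri: str
    concept1: str
    concept2: str
    score: float
    concept1_uri: str
    concept2_uri: str
    split: str  # "Train" or "Test", as given in the last column


def load_pairs(handle):
    """Parse rows from an open text handle, skipping blank lines and the '#' header."""
    pairs = []
    for row in csv.reader(handle):
        if not row or row[0].startswith("#"):
            continue
        # Assumes exactly seven fields per row; unquoted commas in names would break this.
        source, c1, c2, score, uri1, uri2, split = row
        pairs.append(ConceptPair(source, c1, c2, float(score), uri1, uri2, split))
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE))
train = [p for p in pairs if p.split == "Train"]
test = [p for p in pairs if p.split == "Test"]
print(len(train), "train pairs,", len(test), "test pair(s)")
```

To read the full dataset, the same function could be called with a file handle opened with `newline=""` and `encoding="utf-8"`; the actual file name is not given here, so it is left to the reader.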

Citation

@inproceedings{dor2018semantic,
title={Semantic Relatedness of Wikipedia Concepts--Benchmark Data and a Working Solution},
author={Ein-Dor, Liat and Halfon, Alon and Kantor, Yoav and Levy, Ran and Mass, Yosi and Rinott, Ruty and Shnarch, Eyal and Slonim, Noam},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},
year={2018}
}

  • Project Debater: Project Debater is the first AI system that can debate humans on complex topics. Its goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.