
IBM Debater® Concept Abstractness


Abstractness quantifies the degree to which an expression denotes an entity that can be directly perceived by human senses. As an example, the word “feminism” is usually perceived as abstract, but the word “screwdriver” is associated with a concrete meaning. The Concept Abstractness dataset contains 300K Wikipedia concepts automatically rated for their degree of abstractness.

Dataset Metadata

Field Value
Format CSV
License CC-BY-SA 3.0
Domain Natural Language Processing
Number of Records 300,000 words or phrases
Data Split 100,000 each of Unigrams, Bigrams and Trigrams
Size 3.6 MB
Author Ella Rabinovich, Benjamin Sznajder, Artem Spector, Ilya Shnayderman, Ranit Aharonov, David Konopnicki, Noam Slonim
Dataset Origin IBM Research – Project Debater
Dataset Version Update Version 1.0.2 – 2018-10-27
Data Coverage 300K concepts from Wikipedia, each a word or phrase of one to three words.
Business Use Case Document Understanding: automatically tag document titles with their degree of abstractness. This is useful when maintaining a catalogue of documents and can aid recommender or retrieval systems.
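
The title-tagging use case can be sketched as a simple lookup: split a title into one- to three-word phrases, keep those that appear in the dataset, and aggregate their scores. The snippet below is a minimal sketch under stated assumptions, not the method used by Project Debater. The `SCORES` values are made up for illustration, the direction of the scale (higher meaning more abstract) is assumed, and averaging is just one possible aggregation.

```python
from statistics import mean

# Hypothetical scores for illustration only; real values come from the dataset files.
# Assumption: a higher score means a more abstract concept.
SCORES = {
    "feminism": 0.75,
    "screwdriver": 0.25,
    "machine learning": 0.5,
}


def title_ngrams(title, max_n=3):
    """Yield every lowercase word n-gram of length 1..max_n in a title."""
    words = title.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def tag_title(title, scores):
    """Return the mean score of the known concepts in a title, or None if none match."""
    matched = [scores[g] for g in title_ngrams(title) if g in scores]
    return mean(matched) if matched else None


print(tag_title("Feminism and machine learning", SCORES))
print(tag_title("Quarterly report", SCORES))
```

Whitespace tokenization and exact lowercase matching are simplifications; real titles would likely need punctuation handling and matching against the dataset's own casing.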

Dataset Archive Contents

File or Folder Description
predictions_unigrams.csv Concepts and abstractness scores for unigrams (single-word concepts)
predictions_bigrams.csv Concepts and abstractness scores for bigrams (two-word concepts)
predictions_trigrams.csv Concepts and abstractness scores for trigrams (three-word concepts)
LICENSE.txt Terms of Use
README.txt Description of files and the data
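
For readers who download the archive directly, the prediction files can be read with Python's standard `csv` module. The following is a rough sketch, assuming each row holds a concept in the first column and a numeric score in the second; the actual column layout is described in README.txt and may differ, so `concept_col` and `score_col` are illustrative parameters. The sample rows are made up.

```python
import csv
import io


def load_scores(lines, concept_col=0, score_col=1):
    """Build a concept -> score dict from CSV rows.

    Rows that are too short, or whose score cell is not numeric
    (for example a header row), are skipped.
    """
    scores = {}
    for row in csv.reader(lines):
        if len(row) <= max(concept_col, score_col):
            continue
        try:
            scores[row[concept_col]] = float(row[score_col])
        except ValueError:
            continue  # header or malformed row
    return scores


# Made-up sample standing in for a file such as predictions_unigrams.csv.
# With the real file, pass an open handle instead:
#   with open("predictions_unigrams.csv", newline="", encoding="utf-8") as f:
#       scores = load_scores(f)
sample = io.StringIO("concept,score\nfeminism,0.75\nscrewdriver,0.25\n")
scores = load_scores(sample)
print(scores)
```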

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a data exploration Python notebook to help you get started:

Quick access in Python (requires the pardata package from PyPI):

$ pip install pardata

import pardata
data = pardata.load_dataset('concept_abstractness')
  • Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.


@article{DBLP:journals/corr/abs-1809-01285,
    author    = {Ella Rabinovich and
                 Benjamin Sznajder and
                 Artem Spector and
                 Ilya Shnayderman and
                 Ranit Aharonov and
                 David Konopnicki and
                 Noam Slonim},
    title     = {Learning Concept Abstractness Using Weak Supervision},
    journal   = {CoRR},
    volume    = {abs/1809.01285},
    year      = {2018},
    url       = {http://arxiv.org/abs/1809.01285},
    archivePrefix = {arXiv},
    eprint    = {1809.01285},
    timestamp = {Fri, 05 Oct 2018 11:34:52 +0200},
    biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1809-01285},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}