Overview

Over the last decades, the influence of psycholinguistic properties of words on cognitive processes has become a major topic of scientific inquiry. Among the most studied psycholinguistic attributes are concreteness, familiarity, imagery, and average age of acquisition. Abstractness quantifies the degree to which an expression denotes an entity that cannot be directly perceived by human senses. For example, the word “feminism” is usually perceived as abstract, whereas the word “screwdriver” is associated with a concrete meaning.

We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The released dataset contains 300K Wikipedia concepts automatically rated for their degree of abstractness.

Dataset Metadata

Format:                CSV
License:               CC-BY-SA 3.0
Domain:                Natural Language Processing
Number of Records:     Unigrams (100,000 concepts)
                       Bigrams (100,000 concepts)
                       Trigrams (100,000 concepts)
Size:                  3.6 MB
Originally Published:  October 27, 2018

Example Records

Concept        Score
a baby story        0.330729888
a beautiful lie        0.237211137
labor rights        0.789824506
neck deep        0.158579799
inauguration        0.354591237
markdown        0.211181579
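
As a rough illustration of working with records in this shape, the sketch below parses concept/score pairs and ranks them by score. It assumes a comma-delimited file with a header row named `Concept,Score`; the actual delimiter and header of the released CSV may differ. The helper names `load_scores` and `rank_by_score` are hypothetical, and the sample rows are copied from the table above.

```python
import csv
import io

# Sample rows copied from the "Example Records" table.
# The header names and the comma delimiter are assumptions about the CSV layout.
SAMPLE_CSV = """Concept,Score
a baby story,0.330729888
a beautiful lie,0.237211137
labor rights,0.789824506
neck deep,0.158579799
inauguration,0.354591237
markdown,0.211181579
"""


def load_scores(handle):
    """Read (concept, score) rows from an open text handle into a dict."""
    reader = csv.DictReader(handle)
    return {row["Concept"]: float(row["Score"]) for row in reader}


def rank_by_score(scores):
    """Return concepts ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


scores = load_scores(io.StringIO(SAMPLE_CSV))
ranking = rank_by_score(scores)
print(ranking[0], scores[ranking[0]])    # highest-scoring sample concept
print(ranking[-1], scores[ranking[-1]])  # lowest-scoring sample concept
```

On these six sample rows, "labor rights" ranks highest and "neck deep" lowest. To read a downloaded file instead of the inline sample, the same `load_scores` function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.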

Citation

@article{DBLP:journals/corr/abs-1809-01285,
    author    = {Ella Rabinovich and
                 Benjamin Sznajder and
                 Artem Spector and
                 Ilya Shnayderman and
                 Ranit Aharonov and
                 David Konopnicki and
                 Noam Slonim},
    title     = {Learning Concept Abstractness Using Weak Supervision},
    journal   = {CoRR},
    volume    = {abs/1809.01285},
    year      = {2018},
    url       = {http://arxiv.org/abs/1809.01285},
    archivePrefix = {arXiv},
    eprint    = {1809.01285},
    timestamp = {Fri, 05 Oct 2018 11:34:52 +0200},
    biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1809-01285},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

• Project Debater: Project Debater is the first AI system that can debate humans on complex topics. Its goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.