Overview

Text clustering is a widely studied NLP problem. Clustering can be applied to texts at different levels, from single words to full documents, and can vary with respect to the clustering goal. In thematic clustering, the aim is to group together texts that discuss the same theme.

Thematic clustering of sentences is important for various use cases. For example, in multi-document summarization, one often extracts sentences from multiple documents that should be organized into meaningful sections and paragraphs. Similarly, within the emerging field of computational argumentation, arguments may be scattered across a wide set of articles, and they require further thematic organization to generate a compelling argumentative narrative.

Evaluating thematic clustering methods requires a ground-truth dataset of clustered sentences. Unfortunately, sentence clustering is considered a very difficult task for humans. As a result, there is no standard human-annotated sentence clustering dataset.

In the dataset “Thematic Clustering of Sentences”, sentences are annotated with their thematic clusters. This annotation makes it possible to evaluate sentence clustering methods. The dataset was generated automatically by leveraging the partition of Wikipedia articles into sections. The sentences of each article are the clustered objects, and their partition into sections is the ground truth for their thematic clustering. The dataset contains 692 articles, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.
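As an illustration of how section labels can serve as ground truth, the sketch below scores a predicted clustering of one article's sentences against its section assignment using the Rand index (the fraction of sentence pairs on which the two clusterings agree). The metric is an illustrative choice, not necessarily the one used by the dataset's authors, and the section titles and predicted cluster ids are made up for the example.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Return the fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two items
    together, or both keep them apart.
    """
    n = len(true_labels)
    if n != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if n < 2:
        raise ValueError("need at least two sentences to form a pair")

    agreements = sum(
        (true_labels[i] == true_labels[j])
        == (predicted_labels[i] == predicted_labels[j])
        for i, j in combinations(range(n), 2)
    )
    total_pairs = n * (n - 1) // 2
    return agreements / total_pairs


# Hypothetical toy article with four sentences: section titles act as the
# ground-truth clusters, and `predicted` is an invented system output.
sections = ["History", "History", "Athletics", "Athletics"]
predicted = [0, 0, 1, 0]
score = rand_index(sections, predicted)
```

Because the score only compares pairs, the predicted cluster ids do not need to match the section titles by name; any relabeling of the same partition gives the same value.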

Dataset Metadata

Format: CSV
License: CC-BY-SA 3.0
Domain: Natural Language Processing
Number of Records: 692
Size: 10.6 MB

Example Records

# article,sentence,section title,article link
Moeller High School,"Moeller's student-run newspaper, The Crusader, is consistently recognized as being one of the top in the region.", School publications, https://en.wikipedia.org/wiki/Moeller_High_School
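
A minimal sketch of reading records in this layout and grouping sentences by article and section follows. It assumes the four-column order shown in the header line above, a header line that starts with `#`, and optional spaces after the commas; the `load_clusters` helper and the in-memory `SAMPLE` string are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the CSV file, copied from the example record above.
SAMPLE = '''# article,sentence,section title,article link
Moeller High School,"Moeller's student-run newspaper, The Crusader, is consistently recognized as being one of the top in the region.", School publications, https://en.wikipedia.org/wiki/Moeller_High_School
'''


def load_clusters(handle):
    """Group sentences as {article: {section title: [sentences]}}."""
    clusters = defaultdict(lambda: defaultdict(list))
    # skipinitialspace drops the blank that follows some commas in the sample.
    reader = csv.reader(handle, skipinitialspace=True)
    for row in reader:
        # Skip blank lines and the commented header line.
        if not row or row[0].startswith("#"):
            continue
        # Assumes exactly four fields per record; other shapes raise ValueError.
        article, sentence, section, _link = (field.strip() for field in row)
        clusters[article][section].append(sentence)
    return clusters


clusters = load_clusters(io.StringIO(SAMPLE))
```

For the real data, the same function can be given a file object opened with `newline=""` and an explicit encoding, as the `csv` module documentation recommends.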

Citation

@inproceedings{dor2018learning,
title={Learning Thematic Similarity Metric from Article Sections Using Triplet Networks},
author={Ein-Dor, Liat and Mass, Yosi and Halfon, Alon and Venezian, Elad and Shnayderman, Ilya and Aharonov, Ranit and Slonim, Noam},
booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
pages={49--54},
year={2018}
}
  • Project Debater: Project Debater is the first AI system that can debate humans on complex topics. The goal is to help people build persuasive arguments and make well-informed decisions. This dataset contributed to training the models in Project Debater.