Forum Summarization


The crawled web pages were preprocessed: all pages belonging to the same thread were identified and parsed to extract the thread's structural units and their associated metadata (title, posts, user IDs, etc.). Stemming was performed using Porter’s stemmer, and stop words were removed using the general list of 429 stop words from the Onix Text Retrieval Toolkit.
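The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names `group_by_thread` and `preprocess`, the `(thread_id, page_no, text)` page layout, and the short `STOP_WORDS` set are all assumptions made for the example. The real 429-word Onix list is not reproduced here, and the stemmer is left as a pluggable argument.

```python
import re
from collections import defaultdict

# Placeholder stop words for illustration only; the source uses the
# 429-word Onix list, which is not reproduced here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "were", "was"}


def group_by_thread(pages):
    """Group (thread_id, page_no, text) tuples into threads.

    Returns a dict mapping each thread ID to its page texts,
    ordered by page number. The tuple layout is hypothetical.
    """
    threads = defaultdict(list)
    for thread_id, page_no, text in pages:
        threads[thread_id].append((page_no, text))
    return {
        tid: [text for _, text in sorted(items, key=lambda item: item[0])]
        for tid, items in threads.items()
    }


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Lowercase and tokenize text, drop stop words, then stem each token.

    `stem` defaults to the identity function; a Porter stemmer can be
    passed in instead.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in stop_words]
```

To approximate the stemming step, a Porter stemmer can be supplied through the `stem` argument, for example `nltk.stem.PorterStemmer().stem` if NLTK is installed.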

Dataset Metadata

Format:
License: CC BY-SA 4.0
Domain: Natural Language Processing
Number of Records: 113,277 discussion threads; 25 queries
Size: 104 MB (compressed)


author="Sumit Bhatia
and Prakhar Biyani
and Prasenjit Mitra",
title="Classifying User Messages For Managing Web Forum Data",
booktitle="International Workshop on the Web and Databases",