
Forum Summarization


The crawled web pages were preprocessed: all pages belonging to the same thread were grouped and parsed to identify their structural units and associated metadata (title, posts, user IDs, etc.). Stemming was performed with Porter’s stemmer, and stop words were removed using the general 429-word stop word list from the Onix Text Retrieval Toolkit.
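The stop word removal and stemming steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `STOP_WORDS` is a tiny hypothetical subset rather than the 429-word Onix list, `naive_stem` is a placeholder suffix stripper rather than Porter's algorithm, and removing stop words before stemming is an assumption, since the source does not state the order.

```python
import re

# Illustrative subset only; the dataset used the 429-word Onix list,
# which is not reproduced here.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "my", "it", "after", "on"}


def naive_stem(word):
    """Placeholder suffix stripper. This is NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in stop_words]


print(preprocess("The installer is failing after the updates"))
```

To get closer to the described setup, a real Porter stemmer can be passed in through the `stem` parameter, for example `nltk.stem.PorterStemmer().stem` if NLTK is installed, and `stop_words` can be replaced with the full Onix list.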

Dataset Metadata

| Field | Value |
| --- | --- |
| Format | XML |
| License | CC BY-SA 4.0 |
| Domain | Natural Language Processing |
| Number of Records | 113,277 discussion threads; 25 queries |
| Data Split | NA |
| Size | 104 MB (compressed) |
| Dataset Origin | IBM Research |
| Dataset Version | Version 2 – September 12, 2019; Version 1 – July 16, 2019 |
| Data Coverage | Randomly sampled 100 threads from the dataset of discussion threads from Ubuntu Forums used in previous research. |
| Business Use Case | Social media moderation: this dataset can help improve information extraction and intelligent assistance techniques. |

Dataset Archive Contents

| File or Folder | Description |
| --- | --- |
| Gold_Summaries | Summaries created by two human annotators. |
| Scored_Posts | A list of posts ranked by their similarity to the gold summaries. For details, please refer to the original paper. |
| LICENSE.txt | Terms of use. |
| README.md | Explains data collection, processing details, and steps for splitting the dataset. |
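To make the idea behind Scored_Posts concrete, the sketch below ranks posts by their similarity to a gold summary. It assumes bag-of-words cosine similarity purely for illustration; the similarity measure actually used is described in the original paper and may differ. The function names and the `posts` mapping of post IDs to text are hypothetical and do not reflect the archive's file layout.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_posts(posts, gold_summary):
    """Return (post_id, score) pairs, most similar to the gold summary first."""
    gold = bag_of_words(gold_summary)
    scored = [(post_id, cosine(bag_of_words(text), gold)) for post_id, text in posts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical thread with three posts and a short gold summary.
example_posts = {
    "p1": "reinstall grub from live cd",
    "p2": "thanks a lot",
    "p3": "grub",
}
for post_id, score in rank_posts(example_posts, "reinstall grub"):
    print(post_id, round(score, 3))
```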

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by starter notebooks that will help you get started:


author="Sumit Bhatia
and Prakhar Biyani
and Prasenjit Mitra",
title="Classifying User Messages For Managing Web Forum Data",
booktitle="International Workshop on the Web and Databases",