Overview
The crawled web pages were preprocessed: all pages belonging to the same thread were identified and parsed into their structural units and associated metadata (title, posts, user IDs, etc.). Stemming was performed with Porter’s stemmer, and stop words were removed using the general 429-word stop word list from the Onix Text Retrieval Toolkit.
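The stop word removal and stemming steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `STOP_WORDS` is a tiny hypothetical stand-in for the 429-word Onix list, `naive_stem` is a simplified suffix stripper standing in for Porter's stemmer, and the tokenization rule is assumed.

```python
import re

# Hypothetical, tiny stop list for illustration only.
# The dataset used a 429-word list, which is not reproduced here.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "it", "my"}


def naive_stem(word):
    """Stand-in for Porter's stemmer: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in stop_words]


print(preprocess("The drivers are failing to load on my laptop"))
# → ['driver', 'fail', 'load', 'laptop']
```

The stemmer is passed in as a parameter so that a real Porter implementation can replace the stand-in. For example, with NLTK installed, `nltk.stem.PorterStemmer().stem` can be supplied as the `stem` argument.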
Dataset Metadata
Field | Value |
---|---|
Format | XML |
License | CC BY-SA 4.0 |
Domain | Natural Language Processing |
Number of Records | 113,277 discussion threads; 25 queries |
Data Split | NA |
Size | 104 MB (compressed) |
Dataset Origin | IBM Research |
Dataset Version | Version 2 (September 12, 2019); Version 1 (July 16, 2019) |
Data Coverage | 100 threads randomly sampled from the dataset of Ubuntu Forums discussion threads used in previous research. |
Business Use Case | Social media moderation: this dataset can help improve information extraction and intelligent assistance techniques. |
Dataset Archive Contents
File or Folder | Description |
---|---|
Gold_Summaries | Summaries created by 2 human annotators. |
Scored_Posts | A list of posts ranked by their similarity to the gold summaries. For details, please refer to the original paper. |
LICENSE.txt | Terms of use |
README.md | Explains data collection, processing details, and steps for splitting the dataset |
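To illustrate the idea behind Scored_Posts (posts ranked by similarity to a gold summary), here is a minimal sketch. It assumes cosine similarity over raw term counts, which may differ from the measure actually used; the original paper remains the reference. The function names and the in-memory `posts` dictionary are hypothetical and do not reflect the archive's file layout.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def rank_posts(posts, summary):
    """Return (post_id, score) pairs sorted by descending similarity to the summary."""
    summary_tokens = summary.lower().split()
    scored = [
        (pid, cosine(text.lower().split(), summary_tokens))
        for pid, text in posts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical thread with three posts and one gold summary.
posts = {
    "p1": "reinstall the wifi driver",
    "p2": "thanks that worked",
    "p3": "which wifi card do you have",
}
for post_id, score in rank_posts(posts, "reinstall wifi driver to fix the card"):
    print(post_id, round(score, 3))
```

In practice, the posts and summaries would first go through the same stemming and stop word removal as the rest of the corpus, so that surface variants of a word count as matches.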
Use the Dataset
Starter notebooks accompany this dataset to help you get started.
Citation
@conference{Bhatia2012,
author="Sumit Bhatia
and Prakhar Biyani
and Prasenjit Mitra",
title="Classifying User Messages For Managing Web Forum Data",
booktitle="International Workshop on the Web and Databases",
year="2012",
pages="13-18"
}