

The dataset consists of 100 discussion threads crawled from Ubuntu Forums. Each message in a thread is assigned one dialog label from the following eight classes: question, repeat question, clarification, further details, solution, positive feedback, negative feedback, junk.
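
For classifier training, the eight labels are usually mapped to integer ids. The following is a minimal sketch, assuming the label strings are spelled as in the list above; the spelling used inside the XML files may differ, so check it against the README. The names `DIALOG_LABELS`, `LABEL_TO_ID`, and `encode_label` are illustrative and not part of the dataset.

```python
# The eight dialog classes as listed in the dataset description.
# Assumption: the label strings stored in the XML files may be spelled
# differently; adjust this list after inspecting the data.
DIALOG_LABELS = [
    "question",
    "repeat question",
    "clarification",
    "further details",
    "solution",
    "positive feedback",
    "negative feedback",
    "junk",
]

# Map each label to a stable integer id (its position in the list).
LABEL_TO_ID = {label: i for i, label in enumerate(DIALOG_LABELS)}


def encode_label(raw: str) -> int:
    """Normalize case and whitespace, then return the label's integer id."""
    key = " ".join(raw.strip().lower().split())
    if key not in LABEL_TO_ID:
        raise ValueError(f"unknown dialog label: {raw!r}")
    return LABEL_TO_ID[key]
```

Raising on unknown labels, rather than silently mapping them to junk, makes spelling mismatches visible early.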

Dataset Metadata

Field Value
Format XML
License CC BY 4.0
Domain Natural Language Processing
Number of Records 529 messages
Data Split NA
Size 104 MB (compressed)
Author Sumit Bhatia, Prakhar Biyani, Prasenjit Mitra
Dataset Origin IBM Research, India
Dataset Version Version 2 – September 12, 2019; Version 1 – July 16, 2019
Data Coverage The dataset consists of 100 discussion threads crawled from Ubuntu Forums discussions
Business Use Case Social media moderation – This dataset can help train a model that classifies comments on forums or social media platforms, which in turn supports moderating discussions on those platforms.

Dataset Archive Contents

File or Folder Description
Ubuntu folder This folder contains .xml files which are discussion threads crawled from Ubuntu Forums Discussions
LICENSE.txt Terms of Use
README.md Explains data collection, processing details, and steps for splitting dataset
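
A first look at the archive can be sketched as follows. The XML schema is not described here, so this example makes no assumptions about tag or attribute names: it parses each `.xml` file with the standard library and tallies the element tags it finds. The function name `summarize_threads` is hypothetical, and the folder path passed to it is an assumption about where the archive was extracted.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def summarize_threads(folder):
    """Parse every .xml file directly under `folder` and tally tag names.

    Returns (number_of_files_parsed, Counter of element tags).
    Files that are not well-formed XML are reported and skipped.
    """
    tag_counts = Counter()
    n_files = 0
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as err:
            print(f"skipping {path.name}: {err}")
            continue
        n_files += 1
        # root.iter() walks the root element and all of its descendants.
        for elem in root.iter():
            tag_counts[elem.tag] += 1
    return n_files, tag_counts
```

Calling `summarize_threads("Ubuntu")` from the directory where the archive was extracted should report how many thread files parsed and which tags occur. That is usually enough to locate the elements holding the message text and the dialog label before writing a real loader.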

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by starter notebooks that will help you get started:


author="Sumit Bhatia
and Prakhar Biyani
and Prasenjit Mitra",
title="Identifying the Role of Individual User Messages in an Online Discussion and its Applications in Thread Retrieval",
journal="Journal of the Association for Information Science and Technology",