Overview
The dataset consists of 100 discussion threads crawled from Ubuntu Forums. Each message in each thread is assigned one dialog label from the following eight classes: question, repeat question, clarification, further details, solution, positive feedback, negative feedback, junk.
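
The eight classes can be pinned down in code so that label strings are validated before training. The following is a minimal sketch, not part of the dataset's own tooling: the names `DialogLabel`, `parse_label`, and `label_distribution` are hypothetical, and the exact spelling of labels inside the XML files is an assumption, which is why the parser normalizes case, underscores, and whitespace.

```python
from collections import Counter
from enum import Enum


class DialogLabel(Enum):
    """The eight dialog classes listed in the overview."""
    QUESTION = "question"
    REPEAT_QUESTION = "repeat question"
    CLARIFICATION = "clarification"
    FURTHER_DETAILS = "further details"
    SOLUTION = "solution"
    POSITIVE_FEEDBACK = "positive feedback"
    NEGATIVE_FEEDBACK = "negative feedback"
    JUNK = "junk"


def parse_label(raw: str) -> DialogLabel:
    """Map a raw label string to a DialogLabel.

    Assumption: labels in the files may differ in case, spacing, or use
    underscores instead of spaces, so all of these are normalized first.
    Raises ValueError for anything outside the eight classes.
    """
    normalized = " ".join(raw.replace("_", " ").split()).lower()
    return DialogLabel(normalized)


def label_distribution(raw_labels):
    """Count how often each class occurs in an iterable of raw label strings."""
    return Counter(parse_label(raw) for raw in raw_labels)
```

Calling `label_distribution` on the labels of all messages gives a quick check for class imbalance before any model is trained.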
Dataset Metadata
Field | Value |
---|---|
Format | XML |
License | CC BY 4.0 |
Domain | Natural Language Processing |
Number of Records | 529 messages |
Data Split | NA |
Size | 104 MB (compressed) |
Author | Sumit Bhatia, Prakhar Biyani, Prasenjit Mitra |
Dataset Origin | IBM Research, India |
Dataset Version | Version 2 – September 12, 2019; Version 1 – July 16, 2019 |
Data Coverage | The dataset consists of 100 discussion threads crawled from Ubuntu Forums discussions |
Business Use Case | Social Media moderation – This dataset can help train a model to classify comments on forums or social media platforms and help moderate discussions on such platforms. |
Dataset Archive Contents
File or Folder | Description |
---|---|
Ubuntu folder | Contains the .xml files, which are discussion threads crawled from Ubuntu Forums |
LICENSE.txt | Terms of use |
README.md | Explains data collection, processing details, and steps for splitting the dataset |
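
Because the element names used inside the thread files are not documented here, a reasonable first step after unpacking the archive is to inspect the XML structure rather than assume it. The sketch below uses only the Python standard library; the function name `summarize_threads` is hypothetical, and the folder path passed to it is whatever location the `Ubuntu` folder was extracted to.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def summarize_threads(folder):
    """Parse every .xml file directly inside `folder`.

    Returns a tuple (parsed_file_count, tag_counts), where tag_counts is a
    Counter of element tag names seen across all successfully parsed files.
    No particular schema is assumed; malformed files are skipped.
    """
    tag_counts = Counter()
    parsed = 0
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        parsed += 1
        # root.iter() walks the root element and all of its descendants
        tag_counts.update(elem.tag for elem in root.iter())
    return parsed, tag_counts
```

For example, `summarize_threads("Ubuntu")` should report how many thread files were parsed and which tags occur, which shows where message text and dialog labels are stored before any extraction code is written.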
Use the Dataset
This dataset is complemented by starter notebooks that help you get started.
Citation
@article{ahu61This,
author="Sumit Bhatia
and Prakhar Biyani
and Prasenjit Mitra",
title="Identifying the Role of Individual User Messages in an Online Discussion and its Applications in Thread Retrieval",
journal="Journal of the Association for Information Science and Technology",
volume="67",
year="2015",
pages="276-288",
}