COVID-19 Questions


The dataset contains categorised questions that the public frequently asked during the COVID-19 pandemic. It was created to ramp up a dialogue system that answers such questions. The dataset is made publicly available here in the hope of further promoting research on semantic utterance classification for goal-oriented dialogue systems.

Dataset Metadata

Field               Value
Format              TSV
License             CDLA-Sharing
Domain              Natural Language Processing
Number of Records   844
Size                49KB
Author              Naama Tepper, Esther Goldbraich
Dataset Origin      IBM
Dataset Version     Version 1 – Oct 1, 2020
Data Coverage       COVID-19 related enquiries
Business Use Case   COVID-19 chatbot

Dataset Archive Content

File or Folder           Description
LICENSE.txt              Terms of Use
covid_19_questions.tsv   Full version of raw dataset.

Data Glossary and Preview

For a full view of this dataset’s metadata, data glossary, and a set of sample records, click the "Preview the dataset" button displayed above or follow the link here.

Use the Dataset

This dataset is complemented by starter notebooks that will help you get started:

Quick access in Python (requires the pardata pypi package):

$ pip install pardata

import pardata
data = pardata.load_dataset('covid19_questions')
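
If you have downloaded the archive instead, the TSV file can also be read with the Python standard library alone. The following is a minimal sketch, not an official loader. It assumes that covid_19_questions.tsv is UTF-8 encoded and starts with a header row; the column names are not listed on this page, so the helper names `read_tsv` and `count_by_column` and any column you pass to them are illustrative. Quote handling is disabled so that quotation marks inside a question are kept as literal text.

```python
import csv
from collections import Counter
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row.

    Assumption: the first line of the file is a header row.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def count_by_column(rows, column):
    """Count how many records share each value of the given column."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    tsv_path = Path("covid_19_questions.tsv")  # file name from the archive listing
    if tsv_path.exists():
        rows = read_tsv(tsv_path)
        print(len(rows), "records")
        # Inspect the real column names before relying on any of them.
        print("columns:", list(rows[0].keys()) if rows else [])
```

If the header assumption holds, the record count printed here should be in line with the 844 records listed in the metadata. Once you know which column holds the question category, `count_by_column(rows, "<that column>")` gives a quick view of the class balance.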


  title={Balancing via Generation for Multi-Class Text Classification Improvement},
  author={Tepper, Naama and Goldbraich, Esther and Zwerdling, Naama and Kour, George and Anaby-Tavor, Ateret and Carmeli, Boaz},
  journal={Findings of EMNLP 2020},