Note: this dataset is hosted on a third-party site and not on the Data Asset Exchange. Clicking on the “Get this dataset” link above will direct you to physionet.org.

This dataset contains medical information and requires the user to complete a training course before accessing the dataset.

Natural Language Inference (NLI) is one of the critical tasks for understanding natural language. The objective of NLI is to determine whether a given hypothesis can be inferred from a given premise. NLI systems have made significant progress over the years, and the task has gained popularity since the release of datasets such as the Stanford Natural Language Inference (SNLI) corpus (Bowman et al. 2015) and Multi-NLI (Nangia et al. 2017).

We introduce MedNLI, a dataset annotated by doctors performing a natural language inference task, grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain (e.g. SNLI), and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.

Dataset Metadata

Format: JSON Lines
License: Special Access
Domain: Medical
Number of Records: Training (11,232 pairs), Development (1,395 pairs), Test (1,422 pairs)
Size: 14 MB
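
Because the dataset is distributed as JSON Lines (one JSON object per line), each split can be read with the Python standard library alone. The sketch below is a minimal, hedged illustration rather than an official loader: the field name gold_label and the file name in the usage note are assumptions modeled on the SNLI convention, so check them against the files you download from physionet.org.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


def label_counts(path: Path, label_field: str = "gold_label") -> Counter:
    """Count how often each label value occurs in one split.

    `label_field` is an assumed field name (SNLI-style); records that
    lack it are counted under "<missing>".
    """
    return Counter(
        record.get(label_field, "<missing>") for record in read_jsonl(path)
    )
```

For example, label_counts(Path("train.jsonl")) would return the label distribution of a training split, where train.jsonl is a placeholder for the actual file name. Comparing the total count against the figures in the metadata above is a quick sanity check that a download is complete.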