MedNLI

Overview

Note: this dataset is hosted on a third-party site and not on the Data Asset Exchange. Clicking on the "Get this dataset" link above will direct you to physionet.org.

This dataset contains medical information and requires the user to complete a training course before accessing the dataset.

Natural Language Inference (NLI) is one of the critical tasks for understanding natural language. The objective of NLI is to determine if a given hypothesis can be inferred from a given premise. NLI systems have made significant progress over the years, and the task has gained popularity since the release of datasets such as the Stanford Natural Language Inference (SNLI) corpus (Bowman et al. 2015) and Multi-NLI (Nangia et al. 2017).

We introduce MedNLI, a dataset annotated by doctors performing a natural language inference (NLI) task, grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain (e.g. SNLI) and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.

Dataset Metadata

Format	License	Domain	Number of Records	Size
JSON Lines	Special Access	Medical	Training (11,232 pairs), Development (1,395 pairs), Test (1,422 pairs)	14 MB
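
Because the dataset is distributed as JSON Lines (one JSON object per line), each split can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the file name in the usage comment is hypothetical, and the `gold_label` field name is an assumption based on the SNLI-style layout, so check both against the files you actually download.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def label_counts(records, label_field="gold_label"):
    """Count records per label. The default field name is an assumption."""
    return Counter(record.get(label_field) for record in records)


# Hypothetical file name; substitute the path of the split you downloaded.
# train = load_jsonl("mednli_train.jsonl")
# print(len(train), label_counts(train))
```

If the loader is pointed at the training split, the number of records returned should match the pair count listed in the table above.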

Related Links

MedNLI Website: provides more information about MedNLI
