Overview
The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia.
Dataset Metadata
Field | Value |
---|---|
Format | Text |
License | CC BY-SA 3.0 |
Domain | Natural Language Processing |
Number of Records | 101,880,768 tokens |
Data Split | 101,425,671 training tokens; 213,886 validation tokens; 241,211 test tokens |
Size | 181 MB |
Author | Salesforce |
Origin | Raw text from Wikipedia collected by Salesforce Research |
Dataset Version Update | Version 1 – March 17, 2020 |
Data Coverage | The dataset contains tokens from 28,588 Wikipedia articles verified as ‘Good’ or ‘Featured’ |
Business Use Case | Document Analysis: use the heading and subheading markup to train a model that recognizes document structure and organizes freeform text (see the parsing sketch after this table). |
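Headings in the token files are marked with spaced `=` delimiters (e.g. ` = = Gameplay = = `), where the number of `=` tokens encodes the heading depth. Below is a minimal sketch of recovering that structure; the regular expression and the `parse_headings` helper are illustrative names, not part of the dataset, and the file path assumes the archive layout described in the next section.

```python
import re

# Heading lines look like " = Title = " or " = = Section = = ";
# the count of "=" tokens on each side gives the heading depth.
HEADING_RE = re.compile(r"^\s*((?:=\s)*=)\s(.+?)\s\1\s*$")

def parse_headings(path):
    """Yield (depth, title) for every heading line in a token file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = HEADING_RE.match(line)
            if m:
                yield m.group(1).count("="), m.group(2)

# Example: print the section outline of the validation split.
for depth, title in parse_headings("wikitext-103/wiki.valid.tokens"):
    print("  " * (depth - 1) + title)
```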
Dataset Archive Content
File or Folder | Description |
---|---|
wiki.train.tokens | Tokens in the training subset |
wiki.valid.tokens | Tokens in the validation subset |
wiki.test.tokens | Tokens in the testing subset |
LICENSE.txt | Plaintext version of the CC BY-SA 3.0 license |
README.txt | Text file with the file names and descriptions |
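A minimal loading sketch for these files, assuming the archive was extracted to a `wikitext-103/` directory (the directory name is an assumption; adjust it to your extraction path):

```python
from pathlib import Path

data_dir = Path("wikitext-103")  # assumed extraction directory

for split in ("train", "valid", "test"):
    text = (data_dir / f"wiki.{split}.tokens").read_text(encoding="utf-8")
    # The files are pre-tokenized, so whitespace splitting recovers the tokens.
    print(f"{split}: {len(text.split()):,} tokens")
```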
Data Glossary and Preview
Explore the data glossary, sample records, and additional dataset metadata.
Use the Dataset
This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook.
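If you prefer to start outside the notebook, the corpus is also mirrored on the Hugging Face Hub; a short sketch using the `datasets` library (note this downloads the Hub mirror rather than the archive described above):

```python
from datasets import load_dataset  # pip install datasets

# "wikitext-103-v1" is the pre-tokenized variant matching the .tokens files;
# "wikitext-103-raw-v1" keeps the untokenized raw text instead.
wikitext = load_dataset("wikitext", "wikitext-103-v1")

print(wikitext)                       # train / validation / test splits
print(wikitext["train"][10]["text"])  # one line of the training split
```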