The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia.

Dataset Metadata

Field Value
Format Text
License CC BY-SA 3.0
Domain Natural Language Processing
Number of Records 101,880,768 tokens
Data Split 101,425,671 training tokens / 213,886 validation tokens / 241,211 test tokens
Size 181 MB
Author Salesforce
Origin Raw text from Wikipedia collected by Salesforce Research
Dataset Version Update Version 1 – March 17, 2020
Data Coverage The dataset contains tokens from 28,588 Wikipedia articles verified as ‘Good’ or ‘Featured’
Business Use Case Document Analysis: Use heading and subheading labels to train a model capable of document structure recognition to organize freeform text.
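For the document-structure use case above, WikiText preserves Wikipedia's heading markup as lines of balanced equals signs (e.g. "= Article Title =" and "== Section =="). A minimal sketch of labeling lines by heading depth, assuming that markup convention (the regex below is an illustration, not an official parser):

```python
import re

# Headings in WikiText are wrapped in balanced '=' signs; the number of
# signs gives the nesting depth ("= Title =" is level 1, "== Section =="
# is level 2, and so on). Body text has no such wrapping.
HEADING_RE = re.compile(r"^\s*(=+)\s+.*?\s+\1\s*$")

def heading_level(line: str) -> int:
    """Return the heading depth of a WikiText line, or 0 for body text."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else 0
```

Labels produced this way can serve as weak supervision for a document structure recognition model.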

Dataset Archive Content

File or Folder Description
wiki.train.tokens Tokens in the training subset
wiki.valid.tokens Tokens in the validation subset
wiki.test.tokens Tokens in the testing subset
LICENSE.txt Plaintext version of the CDLA-Permissive license
README.txt Text file with the file names and description
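If you prefer to work with the raw archive directly, each wiki.*.tokens file is plain UTF-8 text with whitespace-separated tokens, so a simple frequency tally is a one-pass stream. A minimal sketch (the file path in the usage comment is hypothetical; adjust it to where you extracted the archive):

```python
from collections import Counter

def count_tokens(lines):
    """Tally token frequencies over an iterable of whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Example usage against an extracted archive:
# with open("wikitext-103/wiki.valid.tokens", encoding="utf-8") as fh:
#     counts = count_tokens(fh)
# print(counts.most_common(5))
```

Summing the counts for each of the three files should reproduce the split sizes listed in the metadata table above.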

Data Glossary and Preview

The dataset page provides a data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started: try the completed notebook.

Quick access in Python (requires the pardata PyPI package):

$ pip install pardata

import pardata

# Downloads the dataset on first use, then loads it into memory
data = pardata.load_dataset('wikitext103')