
Groningen Meaning Bank – Modified


The Groningen Meaning Bank (GMB) is a dataset of multi-sentence texts, together with annotations for parts-of-speech, named entities, lexical categories and other natural language structural phenomena.

Dataset Metadata

Format: IOB format
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: 1,314,115 terms
Size: 10 MB
Origin: University of Groningen
Dataset Version Update: Version 2 – May 14, 2020; Version 1 – December 19, 2019
Data Coverage: The dataset contains only documents from Voice of America (VOA), the MASC dataset, and the CIA World Factbook.
Business Use Case: Linguistics. Can be used to train a model to perform named entity recognition or part-of-speech tagging, as well as to generate new text features.
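
Because the dataset uses IOB tagging, a named entity is marked by a B- tag on its first token and I- tags on any tokens that continue it, while O marks tokens outside any entity. The following is a minimal sketch of grouping such tags back into entity spans. The function name `iob_to_spans` and the `geo` label in the example are illustrative assumptions, not taken from the dataset documentation.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A B- tag always opens a new entity. A stray I- tag whose type
            # does not match the open entity is treated the same way.
            flush()
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or anything unrecognised) closes any open entity.
            flush()
    flush()
    return spans


# Hypothetical example sentence; the "geo" label is an assumed tag name.
example_tokens = ["Protesters", "marched", "from", "London", "to", "New", "York"]
example_tags = ["O", "O", "O", "B-geo", "O", "B-geo", "I-geo"]
example_spans = iob_to_spans(example_tokens, example_tags)
```

Treating a mismatched I- tag as the start of a new entity is one common way to tolerate inconsistent annotations; a stricter reader could raise an error instead.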

Dataset Archive Contents

gmb_subset_full.txt: The full raw dataset. Used to train the MAX Named Entity Tagger model.
LICENSE.txt: Terms of use
README.txt: Dataset information
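
As a rough sketch of reading the raw file, the snippet below assumes that each non-blank line of gmb_subset_full.txt holds three whitespace-separated columns (term, part-of-speech tag, entity tag) and that blank lines separate sentences. That layout is an assumption and should be checked against README.txt before relying on it. The function name `read_iob_sentences` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (term, part-of-speech tag, entity tag)


def read_iob_sentences(lines: Iterable[str]) -> List[List[Row]]:
    """Split IOB-formatted lines into sentences of (term, pos, tag) rows."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in lines:
        parts = line.split()
        if not parts:
            # Blank line: close the sentence being collected, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) != 3:
            # Skip lines that do not match the assumed three-column layout.
            continue
        current.append((parts[0], parts[1], parts[2]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout, not copied from the dataset.
sample = """\
London NNP B-geo
is VBZ O
large JJ O

Iran NNP B-geo
agreed VBD O
"""

sentences = read_iob_sentences(sample.splitlines())
tag_counts = Counter(tag for sentence in sentences for _, _, tag in sentence)
```

For the real file, the same function can be given an open file handle, for example one returned by `open("gmb_subset_full.txt", encoding="utf-8")`, since it only iterates over lines.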

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by data exploration, data visualization, and modeling Python notebooks to help you get started.

Quick access in Python (requires the pardata PyPI package):

$ pip install pardata

import pardata
data = pardata.load_dataset('gmb')


Citation

Bos, Johan, Valerio Basile, Kilian Evang, Noortje Venhuizen, and Johannes Bjerva (2017). The Groningen Meaning Bank. In Nancy Ide and James Pustejovsky (eds.), Handbook of Linguistic Annotation, vol. 2, pp. 463-496. Springer.