
Groningen Meaning Bank – Modified


The Groningen Meaning Bank (GMB) is a dataset of multi-sentence texts annotated with parts of speech, named entities, lexical categories, and other structural phenomena of natural language. The dataset was developed at the University of Groningen and comprises documents from five sources, predominantly news articles from the Voice of America (VOA) website. This subset of the GMB consists of documents that a computer script verified to be in the public domain. It contains only documents authored by VOA, together with documents from the MASC dataset and the CIA World Factbook.

Dataset Metadata

Format: IOB format
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: 1,314,115 (sentences)
Size: 10MB

Example Records

Masked O
assailants O
with O
grenades O
and O
automatic O
weapons O
attacked O
a O
wedding O
party O
in O
southeastern O
Turkey B-GEO
, O
killing O
45 O
people O
and O
wounding O
at O
least O
six O
others O
. O

Turkish B-GPE
officials O
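
The records above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes each non-empty line holds a token and its IOB tag separated by whitespace, and that a blank line separates sentences, as in the sample shown. The function names `read_iob_sentences` and `extract_entities` are hypothetical, and the actual file layout may differ (for example, tab-separated fields or extra columns).

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, IOB tag), e.g. ("Turkey", "B-GEO")


def read_iob_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group 'word tag' lines into sentences; blank lines end a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2:
            continue  # assumption: skip lines that lack a tag column
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from B-/I- tagged tokens."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-"):
            words.append(word)  # continuation of the current entity
        else:
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


sample = """\
in O
southeastern O
Turkey B-GEO
, O

Turkish B-GPE
officials O
"""

for sentence in read_iob_sentences(sample.splitlines()):
    print(extract_entities(sentence))
```

An `I-` tag that does not continue an entity of the same type is treated as the start of a new entity, so the sketch tolerates slightly inconsistent tagging rather than raising an error.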


Citation

   title     = {The Groningen Meaning Bank},
   author    = {Bos, Johan and Basile, Valerio and Evang, Kilian and Venhuizen, Noortje and Bjerva, Johannes},
   booktitle = {Handbook of Linguistic Annotation},
   editor    = {Ide, Nancy and Pustejovsky, James},
   publisher = {Springer},
   volume    = {2},
   pages     = {463--496},
   year      = {2017}