Groningen Meaning Bank – Modified


The Groningen Meaning Bank (GMB) is a dataset of multi-sentence texts annotated with parts of speech, named entities, lexical categories, and other structural phenomena of natural language. The dataset was developed at the University of Groningen and comprises documents taken from five sources, predominantly news articles from the Voice of America (VOA) website. This subset of the GMB contains only documents that an automated script verified to be in the public domain: documents authored by VOA, together with documents from the MASC dataset and the CIA World Factbook.

Dataset Metadata

Format: IOB format
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: 1,314,115 (sentences)
Size: 10 MB

Example Records

Masked O
assailants O
with O
grenades O
and O
automatic O
weapons O
attacked O
a O
wedding O
party O
in O
southeastern O
Turkey B-GEO
, O
killing O
45 O
people O
and O
wounding O
at O
least O
six O
others O
. O

Turkish B-GPE
officials O
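
The records above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes each non-empty line holds one token and one tag separated by whitespace, and that a blank line ends a sentence, as in the excerpt. The function names `parse_iob` and `extract_entities` are hypothetical, and the handling of `I-` continuation tags follows the usual IOB convention, which the excerpt itself does not show.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def parse_iob(text: str) -> List[Sentence]:
    """Split IOB-formatted text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # Assumption: a blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) != 2:
            # Skip lines that do not look like "token tag".
            continue
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from one tagged sentence."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            # A B- tag closes any open entity and starts a new one.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # Assumed convention: I- continues the entity of the same label.
            tokens.append(token)
        else:
            # "O" (or a stray I- tag with no open entity) closes the entity.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


sample = """southeastern O
Turkey B-GEO
, O

Turkish B-GPE
officials O
"""

sentences = parse_iob(sample)
for sentence in sentences:
    print(extract_entities(sentence))
```

Keeping sentence splitting and entity extraction as separate steps lets the same parser feed other uses, such as counting tags or building training examples for a tagger.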


