PubLayNet

Overview

PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, along with annotations for the layout elements on each page, such as "text", "list", and "figure". The dataset was obtained by automatically matching the XML representations with the content of over one million PDF articles that are publicly available on PubMed Central.

Dataset Metadata

| Field | Value |
| --- | --- |
| Format | JPG, JSON |
| License | CDLA-Permissive |
| Domain | Computer Vision |
| Number of Records | 358,353 images |
| Data Split | 335,703 training images; 11,245 validation images; 11,405 test images |
| Size | 102 GB |
| Author | Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes |
| Origin | Images of research papers from PubMed Central; annotations generated by IBM Research Australia |
| Dataset Version Update | Version 1 – August 07, 2019 |
| Data Coverage | Images of research papers from the medical domain |
| Business Use Case | Training a model to extract document elements such as tables, figures, and text, which can help businesses that handle large numbers of documents categorize the elements they contain |

Dataset Archive Content

| File or Folder | Description |
| --- | --- |
| train/ | Images in the training subset |
| val/ | Images in the validation subset |
| test/ | Images in the testing subset |
| train.json | Annotations for the training images |
| val.json | Annotations for the validation images |
| LICENSE.txt | Plaintext version of the CDLA-Permissive license |
| README.txt | Text file with the file names and descriptions |
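The annotation files follow the COCO object-detection format, in which `images`, `annotations`, and `categories` records are linked by ids. A minimal sketch of reading them, using a tiny inline stand-in rather than the real train.json or val.json (the real files have the same top-level structure but hundreds of thousands of records):

```python
import json
from collections import Counter

# Tiny inline stand-in for a PubLayNet annotation file (COCO-style).
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "PMC1234_page_2.jpg", "width": 612, "height": 792}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50.0, 60.0, 200.0, 30.0]},
    {"id": 11, "image_id": 1, "category_id": 4, "bbox": [50.0, 120.0, 500.0, 300.0]}
  ],
  "categories": [
    {"id": 1, "name": "text"}, {"id": 2, "name": "title"}, {"id": 3, "name": "list"},
    {"id": 4, "name": "table"}, {"id": 5, "name": "figure"}
  ]
}
""")

# Map category ids to names, then count annotations per layout category.
id_to_name = {c["id"]: c["name"] for c in sample["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in sample["annotations"])
print(counts)
```

The same loop works unchanged on the real train.json or val.json once the archive is extracted; only the `json.loads` call would be replaced by reading the file from disk.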

Data Glossary and Preview

A data glossary, sample records, and additional dataset metadata are available on the dataset page.

Use the Dataset

This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook
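When working with the annotations, note that COCO-style `bbox` fields are given as `[x, y, width, height]`, while many drawing and evaluation APIs expect corner coordinates. A small helper (a hypothetical convenience function, not part of the dataset tooling) for the conversion:

```python
def bbox_to_corners(bbox):
    """Convert a COCO-style [x, y, width, height] box to
    (x_min, y_min, x_max, y_max) corner coordinates."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# Example: a "text" region from an annotation record.
print(bbox_to_corners([50.0, 60.0, 200.0, 30.0]))  # (50.0, 60.0, 250.0, 90.0)
```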

Citation

@article{zhong2019publaynet,
  title={PubLayNet: largest dataset ever for document layout analysis},
  author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
  journal={arXiv preprint arXiv:1908.07836},
  year={2019}
}