Overview
PubLayNet is a dataset for document layout analysis. It contains page images of research papers and articles, together with annotations for the layout elements on each page, such as “text”, “list”, and “figure”. The dataset was obtained by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.
Dataset Metadata
Field | Value |
---|---|
Format | JPG, JSON |
License | CDLA-Permissive |
Domain | Computer Vision |
Number of Records | 358,353 images |
Data Split | 335,703 training images, 11,245 validation images, 11,405 test images |
Size | 102 GB |
Author | Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes |
Origin | Images of research papers from PubMed Central and annotations generated by IBM Research Australia. |
Dataset Version Update | Version 1 – August 07, 2019 |
Data Coverage | The dataset contains images of research papers from the medical domain. |
Business Use Case | The dataset can be used to train a model that extracts elements of a document such as tables, figures, and text. This can help businesses that handle large numbers of documents categorize the elements in those documents. |
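The split counts in the metadata table can be sanity-checked against the stated record total, and turned into proportions, with a few lines of Python. This is a minimal sketch: the numbers are copied from the table above, and the names `SPLITS`, `TOTAL_RECORDS`, and `split_fractions` are illustrative, not part of the dataset.

```python
# Counts copied from the "Data Split" and "Number of Records" rows above.
SPLITS = {"train": 335_703, "val": 11_245, "test": 11_405}
TOTAL_RECORDS = 358_353


def split_fractions(splits):
    """Return the share of the total that each split represents."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    # The three splits should add up to the stated number of records.
    assert sum(SPLITS.values()) == TOTAL_RECORDS
    for name, frac in split_fractions(SPLITS).items():
        print(f"{name}: {frac:.1%}")
```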
Dataset Archive Content
File or Folder | Description |
---|---|
train/ | Images in the training subset |
val/ | Images in the validation subset |
test/ | Images in the testing subset |
train.json | Annotations for training images |
val.json | Annotations for validation images |
LICENSE.txt | Plaintext version of the CDLA-Permissive license |
README.txt | Text file with the file names and description |
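Once the archive is extracted, the annotation files can be inspected with the Python standard library. The sketch below assumes that `train.json` and `val.json` follow a COCO-style layout with top-level `images`, `annotations`, and `categories` keys; this page only states that the format is JSON, so check the keys against README.txt before relying on them. The `publaynet/` directory and the function name `summarize_annotations` are hypothetical. Note that the archive listing above includes no annotation file for `test/`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_annotations(json_path):
    """Return (number of images, Counter of annotated regions per category name).

    Assumes a COCO-style file with "images", "annotations", and "categories".
    """
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Map each category id to its human-readable name, e.g. "text" or "figure".
    id_to_name = {c["id"]: c["name"] for c in data["categories"]}

    # Count one entry per annotated region, grouped by category name.
    counts = Counter(id_to_name[a["category_id"]] for a in data["annotations"])
    return len(data["images"]), counts


if __name__ == "__main__":
    # Hypothetical location of the extracted archive; adjust to your setup.
    path = Path("publaynet") / "val.json"
    if path.exists():
        n_images, counts = summarize_annotations(path)
        print(n_images, "images")
        for name, n in counts.most_common():
            print(name, n)
```

Starting with `val.json` rather than `train.json` keeps the first run quick, since the validation subset is much smaller than the training subset.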
Data Glossary and Preview
Click here to explore the data glossary, sample records, and additional dataset metadata.
Use the Dataset
This dataset is complemented by a data exploration notebook to help you get started: Try the completed notebook
Citation
@article{zhong2019publaynet,
title={PubLayNet: largest dataset ever for document layout analysis},
author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
journal={arXiv preprint arXiv:1908.07836},
year={2019}
}
Related Links
PubTabNet – a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables.
Model Asset eXchange (MAX) – A place for developers to find and use free and open source deep learning models.
Center for Open-Source Data & AI Technologies (CODAIT) – Improving the Enterprise AI Lifecycle in Open Source.