Overview
PubTabNet contains heterogeneous tables in both image and HTML format and can be used to train and evaluate image-based table recognition models. A model must recognize both the structure and the content of a table, and reconstruct the table's HTML representation relying solely on the table image. The HTML representation encodes both the structure of the table and the content of each table cell. Positions (bounding boxes) of table cells are also provided to support more diverse model designs. The tables come from the PubMed Central Open Access Subset (commercial use collection) and were automatically extracted, in both image and HTML format, by matching the PDF and XML versions of the articles in that subset.
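The reconstruction target described above can be illustrated with a short sketch. The record below mimics the annotation layout described in the paper, where structure tokens (`<tr>`, `<td>`, …) and per-cell content tokens are stored separately; the exact field names (`structure.tokens`, `cells`, `bbox`) are assumptions here and should be verified against the actual JSONL file. Interleaving the cell content into the structure token stream recovers the full HTML.

```python
# Illustrative record mimicking the PubTabNet annotation schema.
# Field names are assumed from the paper's description, not verified
# against the released PubTabNet_2.0.0.jsonl.
record = {
    "filename": "PMC0000000_001.png",  # hypothetical filename
    "split": "train",
    "html": {
        "structure": {
            "tokens": [
                "<thead>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</thead>",
                "<tbody>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>",
            ]
        },
        # One entry per cell, in reading order; content is tokenized
        # character by character here for simplicity.
        "cells": [
            {"tokens": list("Name"),  "bbox": [10, 5, 50, 20]},
            {"tokens": list("Value"), "bbox": [60, 5, 100, 20]},
            {"tokens": list("a"),     "bbox": [10, 25, 50, 40]},
            {"tokens": list("1"),     "bbox": [60, 25, 100, 40]},
        ],
    },
}

def to_html(ann):
    """Interleave cell content tokens into the structure token stream:
    each cell's tokens are emitted just before its closing </td>."""
    cells = iter(ann["cells"])
    parts = []
    for tok in ann["structure"]["tokens"]:
        if tok == "</td>":
            parts.extend(next(cells)["tokens"])
        parts.append(tok)
    return "<table>" + "".join(parts) + "</table>"

html = to_html(record["html"])
print(html)
```

Note that real records may also contain `<td` tokens carrying `colspan`/`rowspan` attributes and empty cells; a production loader would need to handle those cases as well.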
Dataset Metadata
Field | Value
---|---
Format | PNG, JSON
License | CDLA-Permissive
Domain | Computer Vision
Number of Records | 516k+ images
Size | 30 GB
Authors | Xu Zhong, Elaheh ShafieiBavani, Antonio Jimeno Yepes
Dataset Origin | Images of research papers from PubMed and annotations from IBM Research Australia.
Dataset Version Update | Version 2 – July 20, 2020; Version 1 – November 11, 2019
Data Coverage | The dataset contains images of research papers from the medical domain.
Business Use Case | Document Understanding: the dataset can be used to train a model to extract elements of a document such as tables, figures, and text. This helps businesses that handle large volumes of documents categorize the elements within them.
Dataset Archive Contents
File or Folder | Description
---|---
train folder | Training data folder
test folder | Test data folder
val folder | Validation data folder
PubTabNet_2.0.0.jsonl | Annotations for the images in the three folders above, one JSON record per line.
Data Glossary and Preview
Click here to explore the data glossary, sample records, and additional dataset metadata.
Use the Dataset
This dataset is complemented by data exploration, data analysis, and modeling Python notebooks to help you get started.
Citation
@article{zhong2019pubtabnet,
  title={Image-based table recognition: data, model, and evaluation},
  author={Xu Zhong and Elaheh ShafieiBavani and Antonio Jimeno Yepes},
  journal={arXiv preprint arXiv:1911.10683},
  year={2019}
}
Related Links
- PubLayNet – the largest dataset to date for document layout analysis. PubLayNet is a large dataset of document images from the PubMed Central Open Access Subset, with each document's layout annotated with both bounding boxes and polygonal segmentations. While PubTabNet provides labels for tabular elements, PubLayNet provides labels for the general semantic layout of a paper.