PubTabNet contains heterogeneous tables in both image and HTML format and can be used to train and evaluate image-based table recognition models. A model must recognize both the structure and the content of a table, and reconstruct the table's HTML representation relying solely on the table image. The HTML representation encodes both the structure of the table and the content of each cell. The tables are sourced from the PubMed Central Open Access Subset (commercial use collection) and are automatically extracted, in both image and HTML format, by matching the PDF and XML versions of the articles.
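To make the HTML ground truth concrete, below is a minimal sketch of how a table's HTML can be rebuilt from an annotation that stores the structure and the cell contents separately. The field names (`html`, `structure`, `tokens`, `cells`) and the convention of splicing each cell's content in before its closing `</td>` are assumptions about the annotation layout, not a documented PubTabNet API; the sample annotation is hand-made for illustration.

```python
def rebuild_html(sample):
    """Interleave structure tokens with per-cell content tokens.

    Assumes (hypothetically) that each annotation carries
    `html.structure.tokens` (a stream of table tags) and `html.cells`
    (one content-token list per cell, in reading order); each cell's
    content is spliced in just before its closing `</td>` tag.
    """
    structure = sample["html"]["structure"]["tokens"]
    cells = iter(sample["html"]["cells"])
    out = []
    for tok in structure:
        if tok == "</td>":
            # Consume the next cell and emit its content before closing.
            out.extend(next(cells)["tokens"])
        out.append(tok)
    return "<table>" + "".join(out) + "</table>"

# Tiny hand-made annotation in the assumed layout:
sample = {
    "html": {
        "structure": {"tokens": [
            "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
        ]},
        "cells": [
            {"tokens": ["Name"]},
            {"tokens": ["Score"]},
        ],
    }
}
print(rebuild_html(sample))
# → <table><tr><td>Name</td><td>Score</td></tr></table>
```

A model trained on PubTabNet has to produce exactly this kind of structure-plus-content output from the table image alone.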
| Format | License | Domain | Number of Records | Size | Originally Published |
|--------|---------|--------|-------------------|------|----------------------|
|        | CDLA-Permissive | Computer Vision | | 30 GB | November 01, 2019 |
- PubLayNet – the largest dataset to date for document layout analysis. PubLayNet is a large dataset of document images from the PubMed Central Open Access Subset, in which each document's layout is annotated with both bounding boxes and polygonal segmentations. While PubTabNet contains labels for tabular elements, PubLayNet contains labels for general semantic understanding of a paper.
- Data Asset eXchange (DAX) Explore useful and relevant data sets for enterprise data science.
- Model Asset eXchange (MAX) A place for developers to find and use free and open source deep learning models.
- Center for Open-Source Data & AI Technologies (CODAIT) Improving the Enterprise AI Lifecycle in Open Source.