This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to help train and test table structure recognition models.
To generate the cell structure labels, we use token matching between the PDF and HTML versions of each document from public records and filings. Compared with tables in scientific and government documents, financial tables are more diverse in style, with fewer graphical lines, larger gaps within each table, and more colour variation.
The dataset comprises 89,646 pages containing 112,887 tables with cell structure annotations.
Data Split
Tables are split as Train: 91,596, Test: 10,656, and Val: 10,635.
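As a quick sanity check, the three split sizes can be added up and compared with the 112,887 tables reported for the dataset. The snippet below is only an illustrative sketch; the variable names (`splits`, `total`, `fractions`) are not part of the dataset or its tooling.

```python
# Split sizes as stated in the Data Split section.
splits = {"train": 91596, "test": 10656, "val": 10635}

# The three splits should account for every annotated table.
total = sum(splits.values())
assert total == 112887, "split sizes do not add up to the reported table count"

# Share of the dataset held by each split.
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} tables ({fractions[name]:.1%})")
```

The split works out to roughly 81% train, 9% test, and 9% validation.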
Size
16GB
Author
Nancy Wang, Peter Zhong
Dataset Origin
Publicly available earnings reports from S&P 500 companies and annotations from IBM Research
Dataset Version Update
Nov 2020 - Version 1.0.0
Data Coverage
The dataset consists of annotated earnings reports from S&P 500 companies.
Business Use Case
The dataset can be used to train a model to extract data from complex tables. This can help businesses that handle large numbers of tabular documents extract information and load it into their databases.
Dataset Archive Contents
File or Folder | Description
pdf/ | Folder containing all the PDFs sorted by company stock ticker and year (subfolder)
FinTabNet_1.0.0_cell_train.jsonl | Contains the full table bounding box and structure annotation for each table in the PDFs. Similar file for val and test with corresponding names.
FinTabNet_1.0.0_table_train.jsonl | Subsets of the annotation files that only contain cell and table annotations for pages where all the tables are annotated.
LICENSE.txt | Terms of Use
Use the Dataset
This dataset is complemented by a Python notebook for data exploration and analysis to help you get started.
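For a first look at the annotation files without the notebook, a minimal sketch along the following lines may help. It assumes only that each `.jsonl` file holds one JSON object per line, as the extension suggests, and that the archive has been extracted into the working directory. It deliberately assumes nothing about the field names inside each record. The helper name `summarize_jsonl` is hypothetical and not part of the dataset tooling.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, limit=None):
    """Count records and top-level keys in a JSON Lines file.

    Reads line by line so large files are never loaded whole.
    If `limit` is given, stops after that many records.
    """
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
            if limit is not None and n_records >= limit:
                break
    return n_records, dict(key_counts)


if __name__ == "__main__":
    # Assumed location: the archive extracted into the current directory.
    annotation_file = Path("FinTabNet_1.0.0_cell_train.jsonl")
    if annotation_file.exists():
        count, keys = summarize_jsonl(annotation_file, limit=1000)
        print(f"Read {count} records; top-level keys: {sorted(keys)}")
    else:
        print(f"{annotation_file} not found; adjust the path to your copy.")
```

Printing the top-level keys of the first records is a cheap way to learn the record layout before writing any parsing code against it. The same function can be pointed at the val and test files.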
@article{zheng2020global,
title={Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context},
author={Zheng, Xinyi and Burdick, Doug and Popa, Lucian and Zhong, Peter and Wang, Nancy Xin Ru},
journal={Winter Conference on Applications of Computer Vision (WACV)},
year={2021}
}