Overview
This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations for training and testing table structure recognition models. The cell structure labels are generated by token matching between the PDF and HTML versions of each document, both taken from public records and filings. Compared with tables in scientific and government documents, financial tables are more diverse in style, with fewer graphical lines, larger gaps within each table, and more colour variation.
Dataset Metadata
Field | Value |
---|---|
Format | PDF, JSON |
License | CDLA-Permissive |
Domain | Computer Vision |
Number of Records | 89,646 pages comprising 112,887 tables with cell structure |
Data Split | Tables are split as Train: 91,596; Test: 10,656; Val: 10,635 |
Size | 16GB |
Author | Nancy Wang, Peter Zhong |
Dataset Origin | Publicly available earnings reports from S&P 500 companies and annotations from IBM Research |
Dataset Version Update | Nov 2020 – Version 1.0.0 |
Data Coverage | The dataset consists of annotated earnings reports from S&P 500 companies |
Business Use Case | The dataset can be used to train a model to extract data from complex tables. This can aid businesses dealing with a large number of tabular documents to easily extract information and load it into their databases. |
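As a quick sanity check on the split sizes listed in the table above, the following minimal sketch confirms that the three splits add up to the reported total of 112,887 tables and prints each split's share. The variable and function names are illustrative only and are not part of the dataset.

```python
# Table counts per split, copied from the "Data Split" row above.
SPLITS = {"train": 91_596, "test": 10_656, "val": 10_635}

# Total table count, copied from the "Number of Records" row above.
TOTAL_TABLES = 112_887


def split_fractions(splits):
    """Return each split's share of all tables as a fraction in [0, 1]."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


fractions = split_fractions(SPLITS)

# The three splits should account for every annotated table.
assert sum(SPLITS.values()) == TOTAL_TABLES

for name, fraction in fractions.items():
    print(f"{name}: {SPLITS[name]:,} tables ({fraction:.1%})")
```

The training split holds roughly 81% of the tables, with the remainder divided almost evenly between test and validation.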
Dataset Archive Contents
File or Folder | Description |
---|---|
pdf/ | Folder containing all the PDFs sorted by company stock ticker and year (subfolder) |
FinTabNet_1.0.0_cell_train.jsonl | Contains the full table bounding box and structure annotation for each table in the PDFs. Similar file for val and test with corresponding names. |
FinTabNet_1.0.0_table_train.jsonl | Subsets of the annotation files that only contain cell and table annotations for pages where all the tables are annotated. |
LICENSE.txt | Terms of Use |
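The annotation files listed above use the `.jsonl` extension, which conventionally means one JSON object per line. The sketch below shows one way to stream such a file and list the top-level keys it contains, without assuming any particular field names, since the record schema is not described here. The helper names `iter_jsonl` and `summarize_keys` and the file location are assumptions made for illustration.

```python
import json
from collections import Counter
from itertools import islice
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_keys(path, limit=1000):
    """Count top-level keys across at most `limit` records.

    Returns (number_of_records_read, Counter mapping key -> occurrences).
    Records that are not JSON objects are counted but contribute no keys.
    """
    counts = Counter()
    seen = 0
    for record in islice(iter_jsonl(path), limit):
        seen += 1
        if isinstance(record, dict):
            counts.update(record.keys())
    return seen, counts


# Assumed location: the archive extracted into the current directory.
annotation_path = Path("FinTabNet_1.0.0_cell_train.jsonl")

if annotation_path.exists():
    n_records, key_counts = summarize_keys(annotation_path)
    print(f"Read {n_records} records")
    for key, count in key_counts.most_common():
        print(f"  {key}: present in {count} records")
else:
    print(f"{annotation_path} not found; adjust the path to your copy.")
```

Streaming line by line avoids loading a whole annotation file into memory, which can matter given the 16GB archive size. The same helpers should work for the val and test files once the path is changed.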
Use the Dataset
A Python notebook for data exploration and data analysis accompanies this dataset to help you get started.
Citation
@article{zheng2020global,
title={Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context},
author={Zheng, Xinyi and Burdick, Doug and Popa, Lucian and Zhong, Peter and Wang, Nancy Xin Ru},
journal={Winter Conference on Applications of Computer Vision (WACV)},
year={2021}
}