CDLA – Permissive | PDF, JSON

FinTabNet

A dataset for Financial Report Tables with corresponding ground truth location and structure.

By

Nancy Wang,

Peter Zhong

Overview

This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to help train and test table structure recognition models. To generate the cell structure labels, we use token matching between the PDF and HTML versions of each article from public records and filings. Compared with tables in scientific and government documents, financial tables are often more diverse in style, with fewer graphical lines, larger gaps within each table, and more colour variation.
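The overview only states that labels come from token matching between the PDF and HTML versions; it does not describe the procedure. The sketch below is one plausible illustration, not the authors' method: it aligns the HTML cell tokens with the PDF words using Python's `difflib.SequenceMatcher` and merges the boxes of the matched words for each cell. The input shapes (`cells` as lists of tokens, `words` as `(text, box)` pairs in reading order) and the function names are assumptions made for this example.

```python
from difflib import SequenceMatcher


def union_box(a, b):
    """Smallest box (x0, y0, x1, y1) that covers both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def match_cells_to_words(cells, words):
    """Assign each HTML cell a bounding box taken from matching PDF words.

    cells: list of token lists, one per HTML cell (assumed reading order).
    words: list of (text, (x0, y0, x1, y1)) pairs from the PDF (assumed shape).
    Returns one box per cell, or None where no token could be matched.
    """
    # Flatten the cell tokens, remembering which cell each token came from.
    flat = [(tok, idx) for idx, cell in enumerate(cells) for tok in cell]
    matcher = SequenceMatcher(
        a=[tok for tok, _ in flat],
        b=[text for text, _ in words],
        autojunk=False,
    )
    boxes = [None] * len(cells)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            cell_idx = flat[block.a + k][1]
            box = words[block.b + k][1]
            if boxes[cell_idx] is None:
                boxes[cell_idx] = box
            else:
                boxes[cell_idx] = union_box(boxes[cell_idx], box)
    return boxes


# Hypothetical two-row table used only to exercise the function.
example_cells = [["Revenue"], ["1,200"], ["Net", "income"], ["300"]]
example_words = [
    ("Revenue", (10, 10, 60, 20)),
    ("1,200", (100, 10, 130, 20)),
    ("Net", (10, 30, 30, 40)),
    ("income", (32, 30, 70, 40)),
    ("300", (100, 30, 120, 40)),
]
example_boxes = match_cells_to_words(example_cells, example_words)
```

A multi-token cell such as "Net income" ends up with a single box spanning both words. Real filings would also need text normalization and handling of repeated values, which this sketch leaves out.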

Dataset Metadata

Field | Value
Format | PDF, JSON
License | CDLA-Permissive
Domain | Computer Vision
Number of Records | 89,646 pages comprising 112,887 tables with cell structure
Data Split | Tables are split as Train: 91,596, Test: 10,656, and Val: 10,635
Size | 16 GB
Author | Nancy Wang, Peter Zhong
Dataset Origin | Publicly available earnings reports from S&P 500 companies and annotations from IBM Research
Dataset Version Update | Nov 2020 - Version 1.0.0
Data Coverage | The dataset consists of annotated earnings reports from S&P 500 companies
Business Use Case | The dataset can be used to train a model to extract data from complex tables. This can help businesses that handle large numbers of tabular documents extract information and load it into their databases.
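As a quick consistency check on the metadata, the three split sizes can be summed and compared with the stated table count. The snippet below uses only the numbers listed above; the variable names are illustrative.

```python
# Split sizes as listed in the dataset metadata.
splits = {"train": 91596, "test": 10656, "val": 10635}

total = sum(splits.values())
# The splits should account for every table in the dataset.
assert total == 112887

# Share of the tables that falls into each split.
fractions = {name: count / total for name, count in splits.items()}
for name, share in fractions.items():
    print(f"{name}: {splits[name]} tables ({share:.1%})")
```

The splits add up to the 112,887 tables reported, with roughly 81% in train and a little over 9% each in test and val.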

Dataset Archive Contents

File or Folder | Description
pdf/ | Folder containing all the PDFs, sorted by company stock ticker and year (subfolder)
FinTabNet_1.0.0_cell_train.jsonl | Contains the full table bounding box and structure annotation for each table in the PDFs. Similar files exist for val and test, with corresponding names.
FinTabNet_1.0.0_table_train.jsonl | Subsets of the annotation files that only contain cell and table annotations for pages where all the tables are annotated.
LICENSE.txt | Terms of Use
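The annotation files use the `.jsonl` extension, which normally means one JSON object per line. The sketch below streams such a file and counts the top-level keys it finds, so the record layout can be inspected without assuming any field names. The helper names are hypothetical, and the one-object-per-line layout is an assumption based on the file extension.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_keys(path, limit=1000):
    """Count top-level keys across the first `limit` records.

    Assumes each record is a JSON object (a dict after parsing).
    """
    counts = Counter()
    for i, record in enumerate(iter_jsonl(path)):
        if i >= limit:
            break
        counts.update(record.keys())
    return counts
```

For example, `summarize_keys("FinTabNet_1.0.0_cell_train.jsonl")` would report which fields appear in the first 1,000 records. Streaming line by line avoids loading the whole file into memory, which matters for an archive of this size.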

Data Glossary and Preview

A data glossary, sample records, and additional dataset metadata are available for this dataset.

Use the Dataset

This dataset is complemented by a Python notebook for data exploration and analysis to help you get started.
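A first step when working with the archive is to index the PDFs. The sketch below groups PDF files by ticker and year, assuming the `pdf/` folder is laid out as `pdf/<ticker>/<year>/<file>.pdf`. That layout is inferred from the archive description and should be checked against the extracted archive. The function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def index_pdfs(pdf_root):
    """Group PDF paths by (ticker, year).

    Assumes a <ticker>/<year>/<file>.pdf layout under pdf_root; files that
    sit at a shallower depth are skipped.
    """
    root = Path(pdf_root)
    index = defaultdict(list)
    for pdf_path in sorted(root.rglob("*.pdf")):
        parts = pdf_path.relative_to(root).parts
        if len(parts) < 3:
            continue  # not inside a ticker/year subfolder
        ticker, year = parts[0], parts[1]
        index[(ticker, year)].append(pdf_path)
    return dict(index)
```

Calling `index_pdfs("pdf")` returns a dictionary keyed by `(ticker, year)` tuples, which makes it straightforward to sample pages from a single company or reporting year.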

Citation

@article{zheng2020global,
  title={Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context},
  author={Zheng, Xinyi and Burdick, Doug and Popa, Lucian and Zhong, Peter and Wang, Nancy Xin Ru},
  journal={Winter Conference for Applications in Computer Vision (WACV)},
  year={2021}
}