FinTabNet – IBM Developer

FinTabNet

Overview

This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to help train and test structure recognition models. To generate the cell structure labels, we use token matching between the PDF and HTML versions of each article from public records and filings. Compared with tables in scientific and government documents, financial tables tend to be more diverse in style, with fewer graphical lines, larger gaps within each table, and more colour variation.

Dataset Metadata

Field                   Value
Format                  PDF, JSON
License                 CDLA-Permissive
Domain                  Computer Vision
Number of Records       89,646 pages comprising 112,887 tables with cell structure
Data Split              Tables are split as Train – 91,596, Test – 10,656 and Val – 10,635
Size                    16 GB
Author                  Nancy Wang, Peter Zhong
Dataset Origin          Publicly available earnings reports from S&P 500 companies and annotations from IBM Research
Dataset Version Update  Nov 2020 – Version 1.0.0
Data Coverage           The dataset consists of annotated earnings reports from S&P 500 companies
Business Use Case       The dataset can be used to train a model to extract data from complex tables. This can help businesses that handle large numbers of tabular documents extract information and load it into their databases.
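
As a quick sanity check on the figures above, the following Python sketch confirms that the three split sizes add up to the stated number of tables and reports each split's approximate share. The counts are copied from the Data Split row; the variable and function names are illustrative only and are not part of the dataset.

```python
# Table counts per split, copied from the "Data Split" row of the metadata.
splits = {"train": 91_596, "test": 10_656, "val": 10_635}

# Total number of tables, copied from the "Number of Records" row.
stated_total = 112_887


def split_shares(counts):
    """Return each split's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(splits.values())
print(f"sum of splits: {total:,} (stated total: {stated_total:,})")

shares = split_shares(splits)
for name, pct in shares.items():
    print(f"{name}: {splits[name]:,} tables (about {pct}%)")
```

The splits sum to the stated total, with roughly 81% of the tables in the training set and a little over 9% each in the test and validation sets.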

Dataset Archive Contents

File or Folder                       Description
pdf/                                 Folder containing all the PDFs, organized into subfolders by company stock ticker and year
FinTabNet_1.0.0_cell_train.jsonl     Contains the full table bounding box and structure annotation for each table in the PDFs. Corresponding files exist for the val and test splits.
FinTabNet_1.0.0_table_train.jsonl    Subset of the annotation files that contains cell and table annotations only for pages where all the tables are annotated.
LICENSE.txt                          Terms of use
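
The annotation files use the .jsonl extension, which conventionally means one JSON object per line. Under that assumption, a minimal Python sketch for a first look at a file could stream it line by line and count which top-level keys appear. The record schema is not described on this page, so the code deliberately assumes no field names; the helper names `iter_jsonl` and `summarize_keys` and the file location are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_keys(path, limit=1000):
    """Count top-level keys over at most `limit` records.

    Returns (records_seen, Counter of key -> occurrences). Records that are
    not JSON objects are counted as seen but contribute no keys.
    """
    counts = Counter()
    seen = 0
    for record in iter_jsonl(path):
        if seen >= limit:
            break
        if isinstance(record, dict):
            counts.update(record.keys())
        seen += 1
    return seen, counts


if __name__ == "__main__":
    # Assumed location: the archive extracted into the current directory.
    annotation_file = Path("FinTabNet_1.0.0_cell_train.jsonl")
    if annotation_file.exists():
        seen, counts = summarize_keys(annotation_file)
        print(f"inspected {seen} records")
        for key, n in counts.most_common():
            print(f"  {key}: present in {n} records")
```

Streaming the file rather than loading it whole keeps memory use low, which matters for an archive of this size.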

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a Python notebook for data exploration and analysis to help you get started.

Citation

@article{zheng2020global,
  title={Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context},
  author={Zheng, Xinyi and Burdick, Doug and Popa, Lucian and Zhong, Peter and Wang, Nancy Xin Ru},
  journal={Winter Conference for Applications in Computer Vision (WACV)},
  year={2021}
}