Finance Proposition Bank


Overview

FinProp 1.0 was developed by researchers at IBM Almaden Research Center, San Jose, CA, USA. It consists of proposition bank-style annotations for approximately 1,000 English finance sentences obtained from IBM’s public annual financial reports. The sentences come from sections such as “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and “Quantitative and Qualitative Disclosures About Market Risk”.

To suit the needs of the finance domain, a list of about 40 domain-specific predicates was chosen. These are predicates that occur commonly in finance sentences with a context or verb sense specific to finance usage, and at the same time do not occur very commonly in the general domain. A list of the 150 most frequent predicates from the finance sentences was sent to two SMEs (Subject Matter Experts), and the overlap between their curated lists was used to select the 40 domain-specific predicates. The work in [1] served as inspiration for the candidate predicate selection process.
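The overlap step can be sketched as a simple set intersection over the two curated lists. This is an illustrative reconstruction, not the authors' code; the function name and the example predicates are hypothetical.

```python
def select_predicates(candidates, sme_a, sme_b):
    """Keep only candidates retained by both SME lists, preserving rank order."""
    keep = set(sme_a) & set(sme_b)
    return [p for p in candidates if p in keep]

# Hypothetical frequency-ranked candidates and two SME-curated subsets.
candidates = ["reflect", "hedge", "average", "run", "accrue", "make"]
sme_a = {"reflect", "hedge", "average", "accrue"}
sme_b = {"hedge", "average", "accrue", "amortize"}

print(select_predicates(candidates, sme_a, sme_b))
# → ['hedge', 'average', 'accrue']
```

In the real pipeline the candidate list held 150 frequent predicates and the intersection yielded about 40.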

Around 1,000 sentences containing these predicates were chosen from all the sentences extracted from the financial reports. The selected sentences were then parsed with the Fuji parser [3] and labeled with a semantic role labeling classifier [2] trained on general-domain and medical-domain sentences. These initial semantic roles were then corrected, and new roles added, by two finance-domain SMEs, using the parse tree structure and the initial semantic roles as a base.

For predicates, a token is labeled as a predicate and its sense is chosen by the SMEs. For arguments, an entire span is chosen by the SMEs, and the span’s head as per the parse tree is chosen as the argument head for the semantic role label. In cases where the labels given by the two SMEs do not match, they are reconciled by a third expert. The resulting labels constitute gold-standard SRL data for the domain.
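Deriving the argument head from the parse tree can be sketched as follows: within the annotated span, the head is the token whose own dependency head lies outside the span. This is a minimal illustration under that assumption, not the authors' implementation.

```python
def argument_head(span_ids, heads):
    """heads maps token id -> head id (0 = root); span_ids is the argument span."""
    span = set(span_ids)
    for tok in span_ids:
        if heads[tok] not in span:
            return tok
    return None

# Heads taken from the example record below: in the span "those amounts"
# (ids 3-4), 'those' (3) attaches to 'amounts' (4), and 'amounts' (4)
# attaches to 'Reflected' (1), which is outside the span.
heads = {3: 4, 4: 1}
print(argument_head([3, 4], heads))
# → 4  ('amounts' is the argument head)
```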

Hence the overall semantic role labeling process was semi-automatic and consisted of:
(1) Selection of predicates significant to the domain by domain experts and selecting sentences containing those predicates
(2) Automatic semantic role labeling of these sentences using a pre-trained classifier [2]
(3) Verifying, adding, removing, or editing these semantic role labels by two SMEs
(4) Reconciliation by a third expert in cases where the entries made by the two SMEs do not match.

Dataset Metadata

Format: CoNLL-U
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: Approx. 1,000 annotated sentences (about 50,000 words)
Size: 2.9 MB

Example Records

#Reflected in those amounts were residential real estate loans held for sale , which averaged $ 415 million in 2015 and $ 403 million in 2014 .
1        Reflected        reflect        _        VBN        _        0        root        Y        reflect.01        _        _        _        _                
2        in        in        _        IN        _        4        case        _        _        _        _        _        _                
3        those        that        _        DT        _        4        det        _        _        _        _        _        _                
4        amounts        amount        _        NNS        _        1        nmod        _        _        _        _        _        _                
5        were        be        _        VBD        _        6        cop        Y        be.01        _        _        _        _                
6        residential        residential        _        JJ        _        9        amod        _        _        _        _        _        _                
7        real        real        _        JJ        _        9        amod        _        _        _        _        _        _                
8        estate        estate        _        NN        _        9        compound        _        _        _        _        _        _                
9        loans        loan        _        NNS        _        4        acl:relcl        _        _        _        _        A1        _                
10        held        hold        _        VBN        _        9        acl        Y        hold.01        _        A2        _        _                
11        for        for        _        IN        _        12        case        _        _        _        _        _        _                
12        sale        sale        _        NN        _        10        nmod        _        _        _        _        AM-TMP        A1                
13        ,        ,        _        ,        _        12        punct        _        _        _        _        _        _                
14        which        which        _        WDT        _        15        nsubj        _        _        _        _        _        R-A1                
15        averaged        average        _        VBD        _        12        acl:relcl        Y        average.01        _        _        _        _                
16        $        $        _        $        _        15        dobj        _        _        _        _        _        A2                
17        415        415        _        CD        _        18        compound        _        _        _        _        _        _                
18        million        million        _        QT        _        16        nummod        _        _        _        _        _        _                
19        in        in        _        IN        _        20        case        _        _        _        _        _        _                
20        2015        2015        _        CD        _        15        nmod        _        _        _        _        _        AM-TMP                
21        and        and        _        CC        _        15        cc        _        _        _        _        _        _                
22        $        $        _        $        _        15        conj        _        _        _        _        _        _                
23        403        403        _        CD        _        24        compound        _        _        _        _        _        _                
24        million        million        _        QT        _        22        nummod        _        _        _        _        _        _                
25        in        in        _        IN        _        26        case        _        _        _        _        _        _                
26        2014        2014        _        CD        _        22        nmod        _        _        _        _        _        _                
27        .        .        _        .        _        1        punct        _        _        _        _        _        _        
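The record above can be read with a small tab-separated parser. The column positions below (id, form, lemma, placeholder, XPOS tag, placeholder, head, deprel, predicate flag, predicate sense, then one argument column per predicate) are inferred from the example shown and are an assumption, not an official schema.

```python
def read_sentence(lines):
    """Parse tab-separated token rows into dicts; skip comments and blanks."""
    tokens = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "xpos": cols[4],
            "head": int(cols[6]),
            "deprel": cols[7],
            "is_pred": cols[8] == "Y",            # 'Y' marks a predicate token
            "sense": cols[9] if cols[9] != "_" else None,
            "args": cols[10:],                     # per-predicate argument labels
        })
    return tokens

# First two rows of the example record, tab-separated.
sample = [
    "1\tReflected\treflect\t_\tVBN\t_\t0\troot\tY\treflect.01\t_\t_\t_\t_",
    "2\tin\tin\t_\tIN\t_\t4\tcase\t_\t_\t_\t_\t_\t_",
]
toks = read_sentence(sample)
print([(t["form"], t["sense"]) for t in toks if t["is_pred"]])
# → [('Reflected', 'reflect.01')]
```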

Citation

[1] Wen-Chi Chou, Richard Tzong-Han Tsai, Ying-Shan Su, Wei Ku, Ting-Yi Sung, and Wen-Lian Hsu. 2006. A semi-automatic method for annotating a biomedical proposition bank. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006. Association for Computational Linguistics, pages 5–12.
[2] Alan Akbik and Yunyao Li. 2016. K-SRL: Instance-based learning for semantic role labeling. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 599–608.
[3] Yuta Tsuboi, Hiroshi Kanayama, Katsumasa Yoshikawa, Tetsuya Nasukawa, Akihiro Nakayama, Kei Sugano, and John Richardson. 2014. Transfer of dependency parser from rule-based system to learning-based system. In Proceedings of the 20th Annual Meeting of the Association for Natural Language Processing (in Japanese).