Finance Proposition Bank
Text from approximately 1000 English sentences obtained from IBM's public annual financial reports, annotated with a layer of 'universal' semantic role labels.
FinProp 1.0 was developed by researchers at IBM Almaden Research Center, San Jose, CA, USA. It consists of proposition bank-style annotations for approximately 1000 English finance sentences obtained from IBM’s public annual financial reports. The sentences come from sections such as “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and “Quantitative and Qualitative Disclosures About Market Risk”.
To suit the needs of the finance domain, a list of about 40 domain-specific predicates was chosen. These predicates occur commonly in finance sentences with a context or verb sense specific to finance usage, and at the same time do not occur very commonly in the general domain. A list of 150 frequent predicates from finance sentences was sent to two SMEs (Subject Matter Experts), and the overlap between their curated lists was used to select these 40 domain-specific predicates. The work of Chou et al. (2006) served as an inspiration for the candidate predicate selection process.
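The overlap step described above can be sketched as a set intersection. This is only an illustration: the predicate names and the helper `select_predicates` are hypothetical, and the actual SME lists are not reproduced here.

```python
# Hypothetical curated lists; the real SME selections are not part of this description.
sme_a = {"accrue", "amortize", "hedge", "reflect", "offset"}
sme_b = {"amortize", "hedge", "impair", "reflect"}


def select_predicates(list_a, list_b):
    """Keep only the predicates that both experts marked as domain specific."""
    return sorted(set(list_a) & set(list_b))


selected = select_predicates(sme_a, sme_b)
print(selected)
```

Using the intersection rather than the union keeps only predicates that both experts independently judged to be finance specific.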
Around 1000 sentences containing these predicates were chosen from all the sentences extracted from the financial reports. The selected sentences were then parsed using the Fuji parser (Tsuboi et al., 2014) and labeled using a semantic role labeling classifier (Akbik and Li, 2016) trained on general-domain and medical-domain sentences. Two SMEs in the finance domain then corrected these semantic roles or added new ones, using the parse tree structure and the initial semantic roles as a base.
For predicates, a token is labeled as a predicate and its sense is chosen by the SMEs. For arguments, the SMEs choose an entire span, and the span’s head according to the parse tree is taken as the argument head for the semantic role label. Where the labels given by the two SMEs do not match, they are reconciled by a third expert. The resulting labels constitute gold-standard SRL data for the domain.
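One way to read "the span’s head as per the parse tree" is: the single token in the span whose dependency head lies outside the span. The sketch below assumes that reading; `span_head` is a hypothetical helper, not the tool the annotators used. The head indices are taken from the sample sentence in this document (tokens 6 to 9, "residential real estate loans").

```python
def span_head(heads, start, end):
    """Return the id of the token in [start, end] whose head is outside the span.

    `heads` maps a 1-based token id to the id of its dependency head (0 = root).
    """
    outside = [i for i in range(start, end + 1)
               if not (start <= heads[i] <= end)]
    if len(outside) != 1:
        # A span with zero or several such tokens is not a single subtree.
        raise ValueError("span does not have a unique head")
    return outside[0]


# Dependency heads for tokens 6-9 of the sample sentence.
heads = {6: 9, 7: 9, 8: 9, 9: 4}
head = span_head(heads, 6, 9)
print(head)
```

Under this assumption the span resolves to token 9 ("loans"), which is the token carrying the A1 label for hold.01 in the sample.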
Hence the overall semantic role labeling process was semi-automatic and consisted of:
- Selection of predicates significant to the domain by domain experts, and selection of sentences containing those predicates
- Automatic semantic role labeling of these sentences using a pre-trained classifier
- Verification, addition, removal, and editing of these semantic role labels by two SMEs
- Reconciliation by a third expert in cases where the entries made by the two SMEs do not match
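The last two steps can be sketched as a comparison of the two annotators' labels, where only the disagreements are passed on to the third expert. The data layout (a dictionary keyed by predicate token id and argument token id) and the example labels are assumptions made for illustration only.

```python
def find_disagreements(labels_a, labels_b):
    """Split two annotators' labels into agreed entries and disputed entries.

    Each input maps (predicate_id, argument_id) -> role label.
    A label missing on one side counts as a disagreement.
    """
    agreed, disputed = {}, []
    for key in sorted(set(labels_a) | set(labels_b)):
        a, b = labels_a.get(key), labels_b.get(key)
        if a == b:
            agreed[key] = a
        else:
            disputed.append((key, a, b))
    return agreed, disputed


# Hypothetical annotations for one predicate (token 15) by two SMEs.
sme_1 = {(15, 16): "A2", (15, 20): "AM-TMP", (15, 14): "R-A1"}
sme_2 = {(15, 16): "A2", (15, 20): "AM-LOC"}

agreed, disputed = find_disagreements(sme_1, sme_2)
print(agreed)    # entries both SMEs labeled identically
print(disputed)  # entries that would go to the third expert
```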
|Format|License|Domain|Number of Records|Size|
||CDLA-Sharing|Natural Language Processing|Approx. 1,000 annotated sentences corresponding to 50,000 words||
#Reflected in those amounts were residential real estate loans held for sale , which averaged $ 415 million in 2015 and $ 403 million in 2014 .
1 Reflected reflect _ VBN _ 0 root Y reflect.01 _ _ _ _
2 in in _ IN _ 4 case _ _ _ _ _ _
3 those that _ DT _ 4 det _ _ _ _ _ _
4 amounts amount _ NNS _ 1 nmod _ _ _ _ _ _
5 were be _ VBD _ 6 cop Y be.01 _ _ _ _
6 residential residential _ JJ _ 9 amod _ _ _ _ _ _
7 real real _ JJ _ 9 amod _ _ _ _ _ _
8 estate estate _ NN _ 9 compound _ _ _ _ _ _
9 loans loan _ NNS _ 4 acl:relcl _ _ _ _ A1 _
10 held hold _ VBN _ 9 acl Y hold.01 _ A2 _ _
11 for for _ IN _ 12 case _ _ _ _ _ _
12 sale sale _ NN _ 10 nmod _ _ _ _ AM-TMP A1
13 , , _ , _ 12 punct _ _ _ _ _ _
14 which which _ WDT _ 15 nsubj _ _ _ _ _ R-A1
15 averaged average _ VBD _ 12 acl:relcl Y average.01 _ _ _ _
16 $ $ _ $ _ 15 dobj _ _ _ _ _ A2
17 415 415 _ CD _ 18 compound _ _ _ _ _ _
18 million million _ QT _ 16 nummod _ _ _ _ _ _
19 in in _ IN _ 20 case _ _ _ _ _ _
20 2015 2015 _ CD _ 15 nmod _ _ _ _ _ AM-TMP
21 and and _ CC _ 15 cc _ _ _ _ _ _
22 $ $ _ $ _ 15 conj _ _ _ _ _ _
23 403 403 _ CD _ 24 compound _ _ _ _ _ _
24 million million _ QT _ 22 nummod _ _ _ _ _ _
25 in in _ IN _ 26 case _ _ _ _ _ _
26 2014 2014 _ CD _ 22 nmod _ _ _ _ _ _
27 . . _ . _ 1 punct _ _ _ _ _ _
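A minimal sketch of reading such a record follows. It assumes one token per line with whitespace-separated fields, and a column layout inferred from the sample rather than from a published specification: id, form, lemma, an unused column, POS tag, an unused column, head, dependency relation, a predicate flag (`Y`), the predicate sense, and then one argument column per predicate in sentence order (as in CoNLL-2009). The function names `parse_sentence` and `propositions` are hypothetical.

```python
SAMPLE = """\
#Reflected in those amounts were residential real estate loans held for sale , which averaged $ 415 million in 2015 and $ 403 million in 2014 .
1 Reflected reflect _ VBN _ 0 root Y reflect.01 _ _ _ _
2 in in _ IN _ 4 case _ _ _ _ _ _
3 those that _ DT _ 4 det _ _ _ _ _ _
4 amounts amount _ NNS _ 1 nmod _ _ _ _ _ _
5 were be _ VBD _ 6 cop Y be.01 _ _ _ _
6 residential residential _ JJ _ 9 amod _ _ _ _ _ _
7 real real _ JJ _ 9 amod _ _ _ _ _ _
8 estate estate _ NN _ 9 compound _ _ _ _ _ _
9 loans loan _ NNS _ 4 acl:relcl _ _ _ _ A1 _
10 held hold _ VBN _ 9 acl Y hold.01 _ A2 _ _
11 for for _ IN _ 12 case _ _ _ _ _ _
12 sale sale _ NN _ 10 nmod _ _ _ _ AM-TMP A1
13 , , _ , _ 12 punct _ _ _ _ _ _
14 which which _ WDT _ 15 nsubj _ _ _ _ _ R-A1
15 averaged average _ VBD _ 12 acl:relcl Y average.01 _ _ _ _
16 $ $ _ $ _ 15 dobj _ _ _ _ _ A2
17 415 415 _ CD _ 18 compound _ _ _ _ _ _
18 million million _ QT _ 16 nummod _ _ _ _ _ _
19 in in _ IN _ 20 case _ _ _ _ _ _
20 2015 2015 _ CD _ 15 nmod _ _ _ _ _ AM-TMP
21 and and _ CC _ 15 cc _ _ _ _ _ _
22 $ $ _ $ _ 15 conj _ _ _ _ _ _
23 403 403 _ CD _ 24 compound _ _ _ _ _ _
24 million million _ QT _ 22 nummod _ _ _ _ _ _
25 in in _ IN _ 26 case _ _ _ _ _ _
26 2014 2014 _ CD _ 22 nmod _ _ _ _ _ _
27 . . _ . _ 1 punct _ _ _ _ _ _
"""


def parse_sentence(block):
    """Parse one record into its raw text and a list of token dictionaries."""
    text, tokens = None, []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            text = line[1:].strip()  # comment line holding the raw sentence
            continue
        cols = line.split()
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "pos": cols[4], "head": int(cols[6]), "deprel": cols[7],
            "is_pred": cols[8] == "Y", "sense": cols[9],
            "args": cols[10:],  # assumed: one column per predicate
        })
    return text, tokens


def propositions(tokens):
    """Pair each predicate sense with the (role, argument head form) labels in its column."""
    preds = [t for t in tokens if t["is_pred"]]
    return [
        (pred["sense"],
         [(t["args"][k], t["form"]) for t in tokens if t["args"][k] != "_"])
        for k, pred in enumerate(preds)
    ]


text, tokens = parse_sentence(SAMPLE)
props = propositions(tokens)
for sense, roles in props:
    print(sense, roles)
```

Only the argument head token carries a role label in each column, which matches the annotation procedure described earlier: the SMEs pick a span, and the label is attached to the span's head.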
Wen-Chi Chou, Richard Tzong-Han Tsai, Ying-Shan Su, Wei Ku, Ting-Yi Sung, and Wen-Lian Hsu. 2006. A semi-automatic method for annotating a biomedical proposition bank. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006. Association for Computational Linguistics, pages 5–12.
Alan Akbik and Yunyao Li. 2016. K-SRL: Instance-based learning for semantic role labeling. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 599–608.
Yuta Tsuboi, Hiroshi Kanayama, Katsumasa Yoshikawa, Tetsuya Nasukawa, Akihiro Nakayama, Kei Sugano, and John Richardson. 2014. Transfer of dependency parser from rule-based system to learning-based system. In Proceedings of the 20th Annual Meeting of the Association of Natural Language Processing (in Japanese).