Overview

ConProp version 1.0 was developed by researchers at the IBM Almaden Research Center, San Jose, CA, USA. It consists of proposition-bank-style annotations for approximately 1,000 English compliance sentences drawn from IBM’s publicly available contracts. These sentences were extracted from contract sections such as Business Partner descriptions, Agreement Terms / Structure, Intellectual Property Protection, Limitation of Liability, Warranty Terms, General Principles of Relationship, Terms of Agreement Termination, Withdrawal of Service, Third Party Claims, Charges, Service Level Agreement, and many more.

To suit the needs of the compliance domain, a list of 60 domain-specific predicates was chosen. These are predicates that occur frequently in compliance sentences with a context / verb sense specific to contract usage, yet are uncommon in the general domain. A list of the 150 most frequent predicates from the compliance sentences was sent to two SMEs (Subject Matter Experts), and the overlap between their curated lists was used to select the 60 domain-specific predicates. The work in [1] served as an inspiration for the candidate-predicate selection process.
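The overlap-based selection described above can be sketched as follows. This is a minimal illustration, not the authors' actual tooling; the function names, the toy predicate counts, and the SME lists are all hypothetical.

```python
# Hypothetical sketch of the predicate-selection step: rank candidates by
# corpus frequency, then keep only those retained by both SMEs.
from collections import Counter

def top_candidates(predicate_counts, k=150):
    """Return the k most frequent predicates from the compliance corpus."""
    return [pred for pred, _ in Counter(predicate_counts).most_common(k)]

def select_domain_predicates(candidates, sme_a, sme_b):
    """Keep only predicates that both SMEs independently retained,
    preserving the original frequency order."""
    overlap = set(sme_a) & set(sme_b)
    return [pred for pred in candidates if pred in overlap]

# Toy counts; the real candidate list had 150 entries.
counts = {"terminate": 42, "indemnify": 30, "warrant": 25, "provide": 90}
candidates = top_candidates(counts, k=4)
chosen = select_domain_predicates(candidates,
                                  {"terminate", "indemnify"},
                                  {"indemnify", "warrant", "terminate"})
print(chosen)  # ['terminate', 'indemnify']
```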

Around 1,000 sentences containing one or more of these predicates were chosen from all the sentences extracted from the contracts. The selected sentences were then parsed with the Fuji parser [3] and labeled with a semantic role labeling classifier [2] trained on general-domain and medical-domain sentences. Two paralegal SMEs in the compliance domain then corrected these semantic roles or added new ones, using the parse tree structure and the initial semantic roles as a base.

For predicates, a token is labeled as a predicate and its sense is chosen by the SMEs. For arguments, an entire span is chosen by the SMEs, and the span’s head according to the parse tree is used as the argument head for the semantic role label. In cases where the labels given by the two SMEs do not match, they are reconciled by a third expert. These labels constitute gold-standard SRL data for the domain.
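Picking an argument span’s head from the parse tree can be sketched as follows: the head is the one token in the span whose dependency head lies outside the span. This is an illustrative helper under that assumption, not the annotation tool itself; the toy parse fragment is made up to mirror the example record below.

```python
# Hypothetical sketch: find the head token of an argument span, given a
# dependency parse as a 1-based map token_id -> head_id (0 = root).
def span_head(heads, span):
    """Return the token in `span` whose dependency head falls outside
    the span, or None if the span is not a single subtree."""
    roots = [t for t in span if heads[t] not in span]
    # A well-formed argument span attaches to the rest of the tree
    # through exactly one token.
    return roots[0] if len(roots) == 1 else None

# Toy fragment mirroring "a license agreement" (token ids 18-20),
# where "agreement" (20) attaches outside the span.
heads = {18: 20, 19: 20, 20: 29}
print(span_head(heads, {18, 19, 20}))  # 20
```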

Hence the overall semantic role labeling process was semi-automatic and consisted of: (a) selection of predicates significant to the domain by domain experts, and selection of sentences containing those predicates; (b) automatic semantic role labeling of these sentences using a pre-trained classifier [2]; (c) verification / addition / removal / editing of these semantic role labels by two SMEs; and (d) reconciliation by a third expert in cases where the entries made by the two SMEs do not match.

Dataset Metadata

Format: CoNLL-U
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: Approx. 1,000 annotated sentences (approx. 50,000 words)
Size: 2.3 MB

Example Records

#promptly provide IBM with documents IBM may require from you or a Customer ( for example , a license agreement signed by the End User ) when applicable ; and
1  promptly  promptly  _ RB _ 2 advmod _ _ AM-MNR _ _  
2 provide provide _ VBP _ 0 root Y provide.01 _ _ _  
3 IBM ibm _ NNP _ 2 dobj _ _ A0 _ _  
4 with with _ IN _ 5 case _ _ _ _ _  
5 documents document _ NNS _ 2 nmod _ _ A1 A1 _  
6 IBM ibm _ NNP _ 8 nsubj _ _ _ A0 _  
7 may may _ MD _ 8 aux _ _ _ _ _  
8 require require _ VB _ 5 acl:relcl Y require.01 _ _ _  
9 from from _ IN _ 10 case _ _ _ _ _  
10 you you _ PRP _ 8 nmod _ _ _ A2 _  
11 or or _ CC _ 8 cc _ _ _ _ _  
12 a a _ DT _ 13 det _ _ _ _ _  
13 Customer customer _ NN _ 8 conj _ _ _ _ _  
14 ( ( _ ( _ 29 punct _ _ _ _ _  
15 for for _ IN _ 16 case _ _ _ _ _  
16 example example _ NN _ 29 nmod _ _ _ _ _  
17 , , _ , _ 29 punct _ _ _ _ _  
18 a a _ DT _ 20 det _ _ _ _ _  
19 license license _ NN _ 20 compound _ _ _ _ _  
20 agreement agreement _ NN _ 29 nsubj _ _ _ _ A1  
21 signed sign _ VBN _ 20 acl Y sign.01 _ _ _  
22 by by _ IN _ 25 case _ _ _ _ _  
23 the the _ DT _ 25 det _ _ _ _ _  
24 End end _ NN _ 25 compound _ _ _ _ _  
25 User user _ NN _ 21 nmod _ _ _ _ A0  
26 ) ) _ ) _ 20 punct _ _ _ _ _  
27 when when _ WRB _ 28 mark _ _ _ _ _  
28 applicable applicable _ JJ _ 29 amod _ _ _ _ _  
29 ; ; _ : _ 2 punct _ _ _ _ _  
30 and and _ CC _ 29 cc _ _ _ _ _    
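The predicate-argument structure in a record like the one above can be read off column by column. The sketch below assumes the column layout visible in the example (id, form, lemma, placeholder, POS, placeholder, head, deprel, predicate flag, predicate sense, then one argument column per predicate); `read_frames` is a hypothetical helper, and the toy record is truncated to a single predicate.

```python
# Hypothetical sketch: extract (argument form, role) pairs per predicate
# from rows in the column layout shown in the example record above.
def read_frames(lines):
    rows = [line.split() for line in lines if line and not line.startswith("#")]
    # Column 8 flags predicate tokens; column 9 holds the sense label.
    predicates = [(row[0], row[9]) for row in rows if row[8] == "Y"]
    frames = {}
    for i, (_, sense) in enumerate(predicates):
        # Column 10 + i holds the arguments of the i-th predicate.
        frames[sense] = [(row[1], row[10 + i])
                         for row in rows if row[10 + i] != "_"]
    return frames

# Toy record with a single predicate (argument columns truncated accordingly).
lines = [
    "1 promptly promptly _ RB _ 2 advmod _ _ AM-MNR",
    "2 provide provide _ VBP _ 0 root Y provide.01 _",
    "3 IBM ibm _ NNP _ 2 dobj _ _ A0",
]
print(read_frames(lines))  # {'provide.01': [('promptly', 'AM-MNR'), ('IBM', 'A0')]}
```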

Citation

[1] Wen-Chi Chou, Richard Tzong-Han Tsai, Ying-Shan Su, Wei Ku, Ting-Yi Sung, and Wen-Lian Hsu. (2006). A semi-automatic method for annotating a biomedical proposition bank. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006. Association for Computational Linguistics, pages 5–12.
[2] Alan Akbik and Yunyao Li. (2016). K-SRL: Instance-based learning for semantic role labeling. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 599–608.
[3] Yuta Tsuboi, Hiroshi Kanayama, Katsumasa Yoshikawa, Tetsuya Nasukawa, Akihiro Nakayama, Kei Sugano, and John Richardson. (2014). Transfer of dependency parser from rule-based system to learning-based system. In Proceedings of the 20th Annual Meeting of the Association for Natural Language Processing (in Japanese).