
IBM Split and Rephrase

Overview

This dataset contains two evaluation datasets, Wiki Benchmark (Wiki-BM) and Contract Benchmark (Cont-BM), for the Split and Rephrase task, which rewrites a complex sentence as a sequence of shorter, simpler sentences. Given the limitations of existing Split and Rephrase benchmarks, an ideal benchmark must not only be challenging, with diverse syntactic patterns, but must also guarantee that the rewrites strictly preserve the meaning of the original sentences. These two datasets can be used as a gold standard for evaluating Split and Rephrase systems. Together they contain around 800 complex sentences and more than 1,300 simplified rewrites (each complex sentence can have multiple rewrites). The dataset is accompanied by a repository.
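
As a minimal illustration of the task, consider the hypothetical sentence pair below; it is invented for demonstration and is not drawn from Wiki-BM or Cont-BM.

# Hypothetical illustration of a Split and Rephrase pair; these sentences are
# invented for demonstration and are not taken from the benchmarks.
complex_sentence = (
    "The contractor shall submit the report, which the project manager must "
    "review, within thirty days of signing the contract."
)
simple_rewrite = (
    "The contractor shall submit the report within thirty days of signing the "
    "contract. The project manager must review the report."
)
print(complex_sentence)
print(simple_rewrite)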

For Wiki-BM, although the simplified rewrites in the WikiSplit dataset are not guaranteed to be meaning-preserving and therefore cannot be used in a benchmark, its original complex sentences are semantically and syntactically diverse, with adequate complexity. From the 5,000 complex sentences in the WikiSplit test set, we randomly select 500 containing only alphanumeric characters, white space, commas, and periods, and manually inspect them to ensure that they are well-formed. For Cont-BM, we collect sentences from publicly available legal procurement contracts online and from contract templates within IBM that contain no confidential information. We randomly sample and inspect 500 sentences in the same manner as above.
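
The character filter described above can be expressed roughly as follows; the regular expression and helper name are illustrative, not the authors' actual selection code.

import re

# Selection criterion described above: keep only sentences made up of
# alphanumeric characters, white space, commas, and periods.
ALLOWED = re.compile(r"[A-Za-z0-9\s,.]+")

def passes_character_filter(sentence: str) -> bool:
    return ALLOWED.fullmatch(sentence) is not None

print(passes_character_filter("He was born in 1952, and later moved to Boston."))  # True
print(passes_character_filter("The prize (shared with two colleagues) was awarded in 1998."))  # False: parentheses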

We ask one set of crowd workers on Amazon Mechanical Turk to split and rephrase the gathered complex sentences, and another set to verify the quality of the rewrites. The crowdsourcing workflow is divided into two phases: rewrite and rate. A detailed discussion of these datasets is available in this paper.

Dataset Metadata

Field Value
Format CSV
License CC-BY-SA-3.0
Domain Natural Language Processing
Number of Records 1,342 text samples
Data Split NA
Size 2.5 MB
Author Li Zhang, Huaiyu Zhu, Siddhartha Brahma, Yunyao Li
Dataset Origin Wikipedia and online publicly available legal procurement contracts.
Dataset Version Update Version 1 – June 1, 2019
Data Coverage Dataset contains around 800 complex sentences and more than 1,300 simplified rewrites in total.

Dataset Archive Contents

File or Folder Description
benchmarks/ Contains the Contract Benchmark and Wikipedia Benchmark datasets. Each has more than 600 rows of sample texts and two columns: complex (the original complex sentence) and simple (the crowdsourced simplified rewrite). A loading sketch follows this table.
judgements/ Contains the raw and aggregated human judgements of model performance collected on Amazon Mechanical Turk.
LICENSE.txt Terms of Use
README.txt Text file describing the files and folders in the archive
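
The benchmark CSVs can be read as sketched below. The file names inside benchmarks/ are assumptions for illustration (check README.txt in the archive for the actual names); the complex and simple column names come from the table above.

import pandas as pd

# Loading sketch; the CSV file names are assumed, not confirmed by the archive.
wiki_bm = pd.read_csv("benchmarks/wiki-bm.csv")
cont_bm = pd.read_csv("benchmarks/cont-bm.csv")

# Each row pairs an original complex sentence with one crowdsourced rewrite.
print(wiki_bm[["complex", "simple"]].head())
print(len(wiki_bm), len(cont_bm))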

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Use the Dataset

This dataset is complemented by a data exploration and data analysis Python notebook to help you get started.
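
In the spirit of that notebook (this is not its code, and the file name is assumed as in the loading sketch above), a few lines like the following give a feel for the data:

import pandas as pd

# Minimal exploration sketch; file name assumed for illustration.
bm = pd.read_csv("benchmarks/wiki-bm.csv")

# Rewrites per complex sentence: each complex sentence can have multiple rewrites.
rewrites_per_sentence = bm.groupby("complex")["simple"].size()
print(rewrites_per_sentence.describe())

# Crude sentence counts (period counts) hint at how many pieces each rewrite splits into.
print(bm["complex"].str.count(r"\.").mean(), bm["simple"].str.count(r"\.").mean())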

Citation

@inproceedings{zhang-etal-2020-small,
    title = "Small but Mighty: New Benchmarks for Split and Rephrase",
    author = "Zhang, Li and
      Zhu, Huaiyu and
      Brahma, Siddhartha and
      Li, Yunyao",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.91",
    pages = "1198--1205",
}