
IBM Debater® Claim Sentences Search

Overview

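The IBM Debater® Claim Sentences Search dataset is a claims dataset retrieved from Wikipedia 2017. A claim is a short phrase that an argument aims to prove. The goal of the Claim Sentence Search task is to detect sentences that contain claims in a large corpus, given a debate topic or motion. The dataset contains the results of the q_mc query, that is, sentences that mention one of the topics described in the paper, for a total of 1.49 million sentences. It also includes a claim sentence test set with our model's 2,500 top-predicted sentences and their labels.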

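The dataset includes:

  • readme_mc_queries.txt – Readme file for the claim sentence search results
  • readme_test_set.txt – Readme file for the test set
  • q_mc_train.csv – Sentences retrieved by the q_mc query for the 70 training topics
  • q_mc_heldout.csv – Sentences retrieved by the q_mc query for the 30 held-out topics
  • q_mc_test.csv – Sentences retrieved by the q_mc query for the 50 test topics
  • test_set.csv – Top predictions from our system, with their labels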

Each sentence in the three CSV files q_mc_train.csv, q_mc_heldout.csv, and q_mc_test.csv has the following columns:

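  1. id – Topic ID (as specified in note (1) of the paper's appendix)
  2. topic – The topic of the motion
  3. mc – The main concept of the topic in Wikipedia
  4. sentence
  5. query_pattern – The query pattern that matched the sentence
  6. score – The DNN score of the sentence (between 0 and 1)
  7. label – The gold label of the sentence (1 for positive, 0 for negative)
  8. url – Link to the source Wikipedia article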
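Since the q_mc files together hold about 1.49 million sentences, it can help to stream them row by row rather than load them whole. The following is a minimal sketch using only the Python standard library. It reads the header from the file itself instead of assuming a fixed layout, because the example record shown for q_mc_heldout.csv on this page lists `suffix` and `prefix` columns. The function name `count_sentences_per_topic` and the sample rows are hypothetical and not part of the dataset.

```python
import csv
import io
from collections import Counter


def count_sentences_per_topic(lines, topic_column="topic"):
    """Count rows per topic from an iterable of CSV lines (e.g. an open file)."""
    reader = csv.DictReader(lines)
    # Check the header that is actually present instead of assuming a layout.
    if topic_column not in (reader.fieldnames or []):
        raise KeyError(f"column {topic_column!r} not found in {reader.fieldnames}")
    counts = Counter()
    for row in reader:
        counts[row[topic_column]] += 1
    return counts


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "id,topic,mc,sentence,url\n"
    '1,Topic A,Concept A,"First sentence, with a comma.",https://example.org/a\n'
    "1,Topic A,Concept A,Second sentence.,https://example.org/a\n"
    "2,Topic B,Concept B,Third sentence.,https://example.org/b\n"
)
topic_counts = count_sentences_per_topic(sample)
print(topic_counts.most_common())
```

For a real file, the same function could be called on a handle opened with `open("q_mc_train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption to verify against the Readme files.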

Each sentence in the CSV file test_set.csv has the following columns:

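  1. id – Topic ID (as specified in note (1) of the paper's appendix)
  2. topic – The topic of the motion
  3. mc – The main concept of the topic in Wikipedia
  4. sentence
  5. query_pattern – The query pattern that matched the sentence
  6. score – The DNN score of the sentence (between 0 and 1)
  7. label – The gold label of the sentence (1 for positive, 0 for negative)
  8. url – Link to the source Wikipedia article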
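Because test_set.csv pairs each DNN score with a gold label, one common way to summarize it is precision among the k highest-scoring sentences. The sketch below is an illustration under stated assumptions, not the authors' evaluation code: it assumes rows are dictionaries such as those produced by `csv.DictReader`, that `score` parses as a float, and that a `label` of 1 marks a positive sentence. The function name `precision_at_k` and the toy rows are hypothetical.

```python
def precision_at_k(rows, k):
    """Fraction of label-1 rows among the k rows with the highest score."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Rank by score, highest first; CSV values arrive as strings.
    ranked = sorted(rows, key=lambda row: float(row["score"]), reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(int(row["label"]) for row in top) / len(top)


# Toy rows for illustration only; they are not taken from test_set.csv.
toy_rows = [
    {"score": "0.95", "label": "1"},
    {"score": "0.90", "label": "0"},
    {"score": "0.40", "label": "1"},
    {"score": "0.10", "label": "0"},
]
print(precision_at_k(toy_rows, 2))
```

To evaluate per topic, the rows could first be grouped by the `topic` column and the function applied to each group.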

Dataset Metadata

  • Format: CSV
  • License: CC-BY-SA 3.0
  • Domain: Natural Language Processing
  • Number of records: 1.49 million
  • Size: 571MB
  • Originally published: 2018-08-20

Example Records

# From the q_mc_heldout.csv file (q_mc_train.csv and q_mc_test.csv have a similar format):
# id,topic,mc,sentence,suffix,prefix,url

86,Randomized controlled trials bring more harm than good,Randomized controlled trial,"(Smith & Iadarola, 2015) Several recent studies on Floortime were cited in the article including the recent randomized clinical trial studies.", studies.,"(Smith & Iadarola, 2015) Several recent studies on Floortime were cited in the article including the recent ",https://en.wikipedia.org/wiki/Floortime

# From the test_set.csv file:
# id,topic,mc,sentence,query_pattern,score,label,url

136,The American Bar Association brings more harm than good,American Bar Association,"In 1989 the ABA's House of Delegates adopted a resolution stating that ""the American Bar Association and each of its entities should use gender-neutral language in all documents establishing policy and procedure.""",q_strict,0.951,0,https://en.wikipedia.org/wiki/American_Bar_Association
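The q_mc_heldout.csv example record above can be parsed with Python's standard `csv` module, as sketched below. In this one record the sentence begins with `prefix` and ends with `suffix`, so the text between them ("randomized clinical trial") appears to be the matched mention of the main concept. That reading is an inference from a single example, not something the page states. The helper names `parse_record` and `matched_span` are hypothetical.

```python
import csv
import io

# Header and record copied from the q_mc_heldout.csv example on this page.
HEADER = "id,topic,mc,sentence,suffix,prefix,url"
RECORD = (
    '86,Randomized controlled trials bring more harm than good,'
    'Randomized controlled trial,'
    '"(Smith & Iadarola, 2015) Several recent studies on Floortime were cited '
    'in the article including the recent randomized clinical trial studies.",'
    ' studies.,'
    '"(Smith & Iadarola, 2015) Several recent studies on Floortime were cited '
    'in the article including the recent ",'
    'https://en.wikipedia.org/wiki/Floortime'
)


def parse_record(header_line, record_line):
    """Parse one CSV record into a dict keyed by the header's column names."""
    reader = csv.DictReader(io.StringIO(header_line + "\n" + record_line))
    return next(reader)


def matched_span(row):
    """Return the text between prefix and suffix, or None if they do not fit.

    Assumption: prefix and suffix surround the matched mention in the sentence.
    """
    sentence, prefix, suffix = row["sentence"], row["prefix"], row["suffix"]
    fits = (
        sentence.startswith(prefix)
        and sentence.endswith(suffix)
        and len(prefix) + len(suffix) <= len(sentence)
    )
    if not fits:
        return None
    return sentence[len(prefix):len(sentence) - len(suffix)]


row = parse_record(HEADER, RECORD)
print(row["topic"])
print(matched_span(row))
```

The quoted fields keep their embedded commas, and the unquoted `suffix` field keeps its leading space because `csv` does not strip whitespace by default.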

Citation

@inproceedings{levy-etal-2018-towards,
title = "Towards an argumentative content search engine using weak supervision",
author = "Levy, Ran and
Bogin, Ben and
Gretz, Shai and
Aharonov, Ranit and
Slonim, Noam",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
month = aug,
year = "2018",
address = "Santa Fe, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/C18-1176",
pages = "2066--2081",
abstract = "Searching for sentences containing claims in a large text corpus is a key component in developing an argumentative content search engine. Previous works focused on detecting claims in a small set of documents or within documents enriched with argumentative content. However, pinpointing relevant claims in massive unstructured corpora, received little attention. A step in this direction was taken in (Levy et al. 2017), where the authors suggested using a weak signal to develop a relatively strict query for claim{--}sentence detection. Here, we leverage this work to define weak signals for training DNNs to obtain significantly greater performance. This approach allows to relax the query and increase the potential coverage. Our results clearly indicate that the system is able to successfully generalize from the weak signal, outperforming previously reported results in terms of both precision and coverage. Finally, we adapt our system to solve a recent argument mining task of identifying argumentative sentences in Web texts retrieved from heterogeneous sources, and obtain F1 scores comparable to the supervised baseline.",
}

Related Links

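  • Project Debater – Project Debater is the first AI system that can debate humans on complex topics. Its goal is to help people build persuasive arguments and make well-informed decisions. This dataset helps train models in Project Debater.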

Original article: IBM Debater® Claim Sentences Search (2019-08-01)