Library and Information Service (图书情报工作) ›› 2022, Vol. 66 ›› Issue (21): 97-106. DOI: 10.13266/j.issn.0252-3116.2022.21.011

• Knowledge Organization •

The Construction of a Move Recognition Corpus for Project Application Abstracts

Zhao Yang1,2, Zhang Zhixiong1,2, Li Jie1,2

  1. National Science Library, Chinese Academy of Sciences, Beijing 100190;
  2. Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  • Received: 2022-03-29  Revised: 2022-08-13  Online: 2022-11-05  Published: 2022-11-25
  • Corresponding author: Zhang Zhixiong, research librarian, PhD, doctoral supervisor, E-mail: zhangzhx@mail.las.ac.cn
  • About the authors: Zhao Yang, doctoral candidate; Li Jie, doctoral candidate.
  • Funding:
    This paper is one of the research outcomes of "Construction of an Artificial Intelligence (AI) Engine Based on Scientific and Technical Literature Knowledge" (project No. E0290906), a sub-project of the Chinese Academy of Sciences special program for literature and information capacity building.

Abstract: [Purpose/Significance] Automatically recognizing the scientific elements in project application abstracts is important for revealing the scientific knowledge contained in science and technology projects. Recognizing these elements relies on structured project abstract texts, but structured project abstract corpora are currently scarce, which seriously restricts further research. This paper therefore constructs a move corpus of project application abstracts to provide data support for related research. [Method/Process] First, the content of project abstracts was grouped into four move types: background and problem, objective and task, methodological content, and value and significance. The characteristic markers that appear in each move were summarized, and a move annotation specification was formulated. Second, rule-based and then deep learning-based methods were used in turn to assist manual annotation of the move structure of project abstracts, and the quality of the corpus was evaluated after each round of annotation. [Result/Conclusion] In total, nearly 25,000 sentences were annotated with the two methods, and the consistency coefficient of the annotation reached 0.9839. This indicates that the corpus can largely distinguish the different move structures within project abstracts and initially meets the basic requirements of corpus construction.

Key words: move recognition, project application abstract, move corpus construction, iterative annotation
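
The rule-assisted annotation and the consistency check described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cue phrases in `MOVE_CUES`, the helper names `label_sentence` and `cohen_kappa`, and the choice of Cohen's kappa as the agreement measure (the abstract reports a consistency coefficient without naming the statistic). It is not the authors' annotation specification or implementation.

```python
from collections import Counter

# Hypothetical cue phrases for the four move types named in the abstract.
# The paper's real annotation specification is not reproduced here.
MOVE_CUES = {
    "background_problem": ("however", "at present", "remains unclear"),
    "objective_task": ("this project aims", "we aim to", "the goal is"),
    "method_content": ("we will use", "based on", "by means of"),
    "value_significance": ("will provide", "is of great significance", "lays a foundation"),
}


def label_sentence(sentence, default="unlabeled"):
    """Return the first move type whose cue phrase occurs in the sentence."""
    text = sentence.lower()
    for move, cues in MOVE_CUES.items():
        if any(cue in text for cue in cues):
            return move
    return default  # left for a human annotator to decide


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Share of sentences on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

In a workflow of this kind, the rule-based labels would serve only as pre-annotations that human annotators confirm or correct, and the agreement score would be recomputed after each annotation round.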

CLC Number: