Knowledge Organization

The Construction of Move Recognition Corpus for Project Application Abstract

  • Zhao Yang,
  • Zhang Zhixiong,
  • Li Jie
  • 1 National Science Library, Chinese Academy of Sciences, Beijing 100190;
    2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
Zhao Yang, PhD candidate; Li Jie, PhD candidate.

Received date: 2022-03-29

  Revised date: 2022-08-13

  Online published: 2022-11-25

Funding

This work is one of the research outcomes of the sub-project "Construction of an Artificial Intelligence (AI) Engine Based on Scientific and Technical Literature Knowledge" (Grant No. E0290906) under the Literature and Information Capacity Building Special Project of the Chinese Academy of Sciences.

Cite this article

Zhao Yang, Zhang Zhixiong, Li Jie. The Construction of Move Recognition Corpus for Project Application Abstract[J]. Library and Information Service, 2022, 66(21): 97-106. DOI: 10.13266/j.issn.0252-3116.2022.21.011

Abstract

[Purpose/Significance] Automatically recognizing the scientific elements in project application abstracts is important for revealing the scientific knowledge contained in science and technology projects. Recognizing these elements relies on structured project abstract texts, but structured corpora of project abstracts are scarce, which seriously restricts further research. This paper therefore constructs a move corpus of project application abstracts to provide data support for related research. [Method/Process] First, the content of project abstracts was categorized into four move types: background and problem, objective and task, methods and content, and value and significance. The characteristic markers that appear in each move were summarized, and a move annotation specification was formulated. Second, rule-based and then deep learning-based methods were used to assist manual annotation of the move structure of project abstracts, and the quality of the corpus was evaluated after each round of annotation. [Result/Conclusion] With the two methods, nearly 25,000 sentences were annotated in total, and the consistency coefficient of the annotation reached 0.9839, indicating that the corpus can basically distinguish the different move structures within project abstracts and initially meets the basic requirements for corpus construction.
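The abstract reports a consistency coefficient for the annotation but does not name the statistic. As a minimal sketch, assuming it is a kappa-style agreement measure such as Cohen's kappa between two annotators, the computation over the four move types might look like the following. The label strings, the function name `cohen_kappa`, and the sample annotations are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical label names for the four move types listed in the abstract.
MOVES = (
    "background_problem",
    "objective_task",
    "method_content",
    "value_significance",
)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of sentences given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations of eight sentences, for illustration only.
B, O, M, V = MOVES
annotator_1 = [B, B, O, O, M, M, V, V]
annotator_2 = [B, B, O, M, M, M, V, V]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

In this toy example the annotators disagree on one of eight sentences, so the observed agreement is 0.875 and the chance agreement is 0.25. Kappa discounts agreement expected by chance, which matters when some move types are much more frequent than others.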
