[Purpose/Significance] Automatically recognizing scientific elements in project application abstracts is important for revealing the scientific knowledge contained in science and technology projects. Recognizing these elements relies on structured project abstract texts, but the current lack of structured corpus resources for project abstracts seriously restricts further research. This paper therefore aims to construct a move corpus of project application abstracts to provide data support for related research. [Method/Process] First, project abstracts were summarized into four types of moves: background and problem, objective and task, methodological content, and value and significance. The characteristic features that appear in each move were then summarized, and a move annotation specification was formulated. Second, rule-based and deep learning-based methods were used in succession to assist manual annotation of the move structure of project abstracts, and the quality of the annotated corpus was evaluated after each round. [Result/Conclusion] With the two methods, nearly 25,000 sentences were annotated, and the consistency coefficient of the corpus annotation reached 0.9839, indicating that the corpus can basically distinguish the different move structures in project abstracts and initially meets the basic requirements for corpus construction.
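The abstract reports a consistency coefficient for the annotation but does not name the statistic. As an illustration only, the sketch below assumes Cohen's kappa computed between two annotators over the four move labels. The function name `cohen_kappa` and the toy label sequences are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# The four move types named in the abstract.
BACKGROUND = "background and problem"
OBJECTIVE = "objective and task"
METHOD = "methodological content"
VALUE = "value and significance"


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of sentence labels.

    Assumption: the paper's "consistency coefficient" may be a different
    statistic; kappa is used here purely as an example.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of sentences given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[m] * freq_b[m] for m in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical data): the annotators disagree on the last sentence.
annotator_1 = [BACKGROUND, BACKGROUND, OBJECTIVE, OBJECTIVE, METHOD, METHOD, VALUE, VALUE]
annotator_2 = [BACKGROUND, BACKGROUND, OBJECTIVE, OBJECTIVE, METHOD, METHOD, VALUE, BACKGROUND]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the observed agreement is 7/8 and the chance agreement is 16/64, so kappa is about 0.83. A value near 1, such as the 0.9839 reported above, would indicate near-complete agreement beyond chance under this assumed statistic.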