[Purpose/significance] Contribution sentences are an important means by which an academic paper conveys its novelty and academic value. Using the full text of academic papers and MeSH terms as the data basis, this study applies natural language processing and deep learning techniques to recognize contribution sentences in academic papers. The work lays a foundation for fine-grained mining of innovative contributions in academic texts and has theoretical and practical significance for evaluating academic papers on the basis of cognitive computing. [Method/process] First, with full-text PubMed papers as the data source, the MeSH terms of each paper were extracted, and element analysis and feature extraction were performed on contribution sentences. Second, the data were annotated in a semi-automatic manner. Finally, contribution sentences were recognized automatically with a model based on the ALBERT deep learning model. [Result/conclusion] A data consistency test supports the reliability of the annotated training data, and the experimental results show that the trained recognition model identifies contribution sentences in academic papers more effectively than the other deep learning models compared.
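The abstract does not say which statistic the data consistency test uses. As an illustration only, the sketch below assumes Cohen's kappa computed over two annotators' binary labels (1 = contribution sentence, 0 = other). The function name `cohen_kappa` and the label lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of sentences given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[k] * freq_b[k] for k in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (assumption): 1 = contribution sentence, 0 = other.
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 1, 0, 0, 0, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A value close to 1 indicates that agreement is well above chance. The threshold for treating annotations as reliable is a judgment call that this sketch does not settle.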