[Purpose/significance] This paper analyzes the academic influence of algorithms from the perspective of full-text content analysis. [Method/process] Taking the use of the top 10 data mining algorithms in natural language processing (NLP) as an example, we analyze the influence of different algorithms in a specific field. From the full text of 10,922 NLP papers published between 1965 and 2006, we extract 6,001 sentences that mention the top 10 data mining algorithms (hereafter "algorithm sentences"). Based on these sentences, we compare the influence of the algorithms on three measures: the number of papers that mention an algorithm, the total number of mentions, and the location of the mentions. [Result/conclusion] The influence of the top 10 data mining algorithms in NLP papers differs markedly depending on which measure is used. Under the criteria based on number of papers, number of mentions, and mention location, SVM shows relatively high influence, while the influence of Apriori is markedly lower than that of the other algorithms. This study offers a new approach to quantifying the influence of algorithms.
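The three measures named in the abstract (number of mentioning papers, total mentions, mention location) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The name patterns in ALGORITHM_PATTERNS, the naive sentence splitter, the use of section names as the "location" of a mention, and the input layout (paper id mapped to section name mapped to text) are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical name variants for three of the ten algorithms; the paper's
# actual matching rules are not stated in the abstract.
ALGORITHM_PATTERNS = {
    "SVM": re.compile(r"\b(?:SVMs?|support vector machines?)\b", re.IGNORECASE),
    "Apriori": re.compile(r"\bApriori\b", re.IGNORECASE),
    "kNN": re.compile(r"\b(?:k-?NN|k-nearest neighbou?rs?)\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mention_stats(papers):
    """Compute the three measures for each algorithm.

    papers: {paper_id: {section_name: section_text}} (assumed layout).
    Returns (paper_counts, mention_counts, section_counts), each keyed by
    algorithm name. Algorithms that are never mentioned are omitted.
    """
    paper_ids = defaultdict(set)           # algorithm -> ids of mentioning papers
    mention_counts = Counter()             # algorithm -> total mentions
    section_counts = defaultdict(Counter)  # algorithm -> mentions per section

    for paper_id, sections in papers.items():
        for section, text in sections.items():
            for sentence in split_sentences(text):
                for name, pattern in ALGORITHM_PATTERNS.items():
                    hits = len(pattern.findall(sentence))
                    if hits:  # this is an "algorithm sentence"
                        paper_ids[name].add(paper_id)
                        mention_counts[name] += hits
                        section_counts[name][section] += hits

    paper_counts = {name: len(ids) for name, ids in paper_ids.items()}
    return (
        paper_counts,
        dict(mention_counts),
        {name: dict(counts) for name, counts in section_counts.items()},
    )
```

Calling mention_stats on a dictionary of parsed papers returns one mapping per measure, so the algorithms can be ranked separately under each criterion and the rankings compared.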