[Purpose/significance] Drawing on two perspectives, authors' descriptive evaluations of their own research and the evaluative citations made by subsequent researchers, this study extracts feature words of breakthrough research from abstract and citation corpora, in order to characterize those corpora and support the identification of breakthrough research. [Method/process] Key papers behind Science's "Breakthrough of the Year" selections and the "key publications" of Nobel Prize winners were taken as the breakthrough research corpus, and the abstracts and citation corpus of these papers were combined for feature word extraction. First, the Stanford CoreNLP tool was used to tokenize the corpus and compute word frequencies, and feature tokens were selected in combination with expert opinion. Next, the feature words were used as seed words and expanded semantically using semantic relationships derived from medical texts. Finally, recall and precision were used to compare the retrieval and identification performance of the abstract and citation feature words before and after expansion. [Result/conclusion] Eight feature tokens were selected from the abstract corpus and eight from the citation corpus. In retrieval-based identification, the expanded feature words of abstracts and citations achieved the highest recall, the citation feature words achieved the highest precision, and the expanded citation feature words gave the better overall balance of recall and precision.
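The comparison of feature words before and after expansion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the seed word, the expanded words, the toy corpus, and the function names `matches` and `precision_recall` are all hypothetical, chosen only to show how recall and precision are computed for keyword-based retrieval.

```python
import re


def matches(text, feature_words):
    """Return True if any feature word occurs in the text as a whole word."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return any(word in tokens for word in feature_words)


def precision_recall(corpus, feature_words):
    """Compute (precision, recall) for a list of (text, is_breakthrough) pairs."""
    # Labels of the documents retrieved by the feature words.
    retrieved = [label for text, label in corpus if matches(text, feature_words)]
    true_positives = sum(retrieved)
    total_relevant = sum(label for _, label in corpus)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical seed word and hypothetical semantic expansion (not from the paper).
seed_words = {"breakthrough"}
expanded_words = seed_words | {"milestone", "landmark"}

# Toy labelled corpus (invented): True marks a breakthrough paper.
toy_corpus = [
    ("This breakthrough enables genome editing.", True),
    ("A landmark study of gravitational waves.", True),
    ("We report a routine measurement.", False),
    ("A milestone in routine calibration.", False),
]

seed_p, seed_r = precision_recall(toy_corpus, seed_words)
exp_p, exp_r = precision_recall(toy_corpus, expanded_words)
print(f"seed:     precision={seed_p:.2f} recall={seed_r:.2f}")
print(f"expanded: precision={exp_p:.2f} recall={exp_r:.2f}")
```

On this toy data, expansion retrieves more of the breakthrough papers but also one non-breakthrough paper, so recall rises while precision falls. This is the kind of trade-off that the recall and precision comparison is meant to expose.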