Identifying Feature Words Based on Abstracts and Citation Text Corpus of Breakthrough Research

  • Yang Xuemei,
  • Wang Xue,
  • Du Jian,
  • Tang Xiaoli
  • 1 Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100005;
    2 National Institute of Health Data Science, Peking University, Beijing 100191

Received date: 2019-10-23

Revised date: 2019-12-28

Online published: 2020-06-05

Abstract

[Purpose/significance] Drawing on authors' own descriptive evaluations of their research and on the critical citations of later researchers, this study extracts feature words from the abstract and citation corpora of breakthrough research. These feature words help characterize how breakthrough research is described in abstracts and in citing text, and can contribute to the identification of breakthrough research. [Method/process] Key documents selected by Science as "Breakthrough of the Year" and the "key publications" of Nobel Prize winners were chosen as the breakthrough research corpus. Feature words were extracted from both the abstracts and the citation text of these papers. The Stanford CoreNLP tool was used to compute word frequency statistics on the corpus, and the candidate feature words were filtered in combination with expert opinion. The retained feature words then served as seed words and were semantically expanded using the semantic relationships of medical texts. Finally, the retrieval and recognition performance of the abstract and citation feature words was compared using recall and precision. [Result/conclusion] From the breakthrough research corpus, 8 feature tokens were selected from the abstract corpus and 8 from the citation corpus. In retrieval and recognition, the expanded feature words of abstracts and citations achieved the highest recall, while the citation feature words achieved the highest precision. Overall, the expanded citation feature words gave the better combined recall and precision.
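
The recall and precision comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four toy documents, the seed word, the expanded word list, and the helper names `retrieve` and `recall_precision` are all invented for illustration. It assumes a document is "retrieved" when it contains at least one feature word.

```python
import re

def retrieve(docs, feature_words):
    """Return ids of documents containing at least one feature word (whole-word match)."""
    patterns = [re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE)
                for w in feature_words]
    return {doc_id for doc_id, text in docs.items()
            if any(p.search(text) for p in patterns)}

def recall_precision(retrieved, relevant):
    """Recall = hits / relevant documents; precision = hits / retrieved documents."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy corpus and labels, invented for illustration only.
docs = {
    "d1": "We report the first demonstration of genome editing in human cells.",
    "d2": "This study confirms earlier findings on protein folding.",
    "d3": "A novel approach that was first shown by the authors.",
    "d4": "Routine survey of hospital records.",
}
breakthrough_ids = {"d1", "d3"}                 # hypothetical gold labels

seed_words = ["novel"]                           # hypothetical seed feature word
expanded_words = seed_words + ["first", "findings"]  # hypothetical expansion

for name, words in [("seed", seed_words), ("expanded", expanded_words)]:
    r, p = recall_precision(retrieve(docs, words), breakthrough_ids)
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
```

On this toy data the expanded list retrieves more of the labeled documents but also one unlabeled document, which shows why both measures are compared rather than either one alone.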

Cite this article

Yang Xuemei, Wang Xue, Du Jian, Tang Xiaoli. Identifying Feature Words Based on Abstracts and Citation Text Corpus of Breakthrough Research[J]. Library and Information Service, 2020, 64(11): 125-132. DOI: 10.13266/j.issn.0252-3116.2020.11.014
