图书情报工作 (Library and Information Service) ›› 2020, Vol. 64 ›› Issue (11): 125-132. DOI: 10.13266/j.issn.0252-3116.2020.11.014

• Knowledge Organization •

基于论文摘要和引文文本语料的突破性研究特征词识别

杨雪梅1, 王雪1, 杜建2, 唐小利1   

  1. 1 中国医学科学院医学信息研究所 北京 100005;
    2 北京大学健康医疗大数据国家研究院 北京 100191
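  • Received: 2019-10-23 Revised: 2019-12-28 Online: 2020-06-05 Published: 2020-06-05
  • Corresponding author: Tang Xiaoli (ORCID: 0000-0001-6946-3482), research librarian, master's degree, E-mail: tang.xiaoli@imicams.ac.cn
  • About the authors: Yang Xuemei (ORCID: 0000-0002-2927-4166), assistant librarian, master's degree; Wang Xue (ORCID: 0000-0001-6852-1791), master's student; Du Jian (ORCID: 0000-0002-7621-9995), associate research fellow, PhD.
  • Funding:
    This work is one of the research outcomes of the National Social Science Fund of China project "Research on Methods and Applications for Identifying Innovation Frontiers Based on a Science-Technology Intersection Model" (Grant No. 18BTQ064) and the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences project "Research on Medical Science and Technology Innovation Evaluation and Health Service System Construction" (Grant No. 2016-I2M-3-018).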

Identifying Feature Words Based on Abstracts and Citation Text Corpus of Breakthrough Research

Yang Xuemei1, Wang Xue1, Du Jian2, Tang Xiaoli1   

  1. 1 Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100005;
    2 National Institute of Health Data Science, Peking University, Beijing 100191
  • Received: 2019-10-23 Revised: 2019-12-28 Online: 2020-06-05 Published: 2020-06-05

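Abstract: [Purpose/Significance] From the perspective of authors' descriptive evaluations of their own research and later researchers' evaluative citations, this study extracts feature words of breakthrough research from abstract and citation corpora, in order to understand the characteristics of the abstract and citation text of breakthrough research and to support its identification. [Method/Process] Key papers selected by Science as "Breakthrough of the Year" and the "key publications" of Nobel Prize laureates were chosen as the breakthrough research corpus, and the abstracts and citation text of these papers were combined for feature word extraction. In the extraction step, the Stanford CoreNLP tool was first used for tokenization and word frequency statistics, and feature tokens were selected in combination with expert opinion. The feature words were then used as seed words and semantically expanded using the semantic relationships in medical texts. Finally, recall and precision were used to compare the retrieval and identification performance of the abstract and citation feature words before and after expansion. [Result/Conclusion] Eight feature tokens were selected from the abstract corpus and eight from the citation corpus. In retrieval-based identification, the expanded feature words of abstracts and citations had the highest recall, the citation feature words had the highest precision, and the expanded citation feature words gave the better overall combination of recall and precision.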

Keywords: breakthrough research, feature words, abstract text, citing sentences

Abstract: [Purpose/significance] Drawing on authors' descriptive evaluations of their own research and the evaluative citations of later researchers, this study extracts feature words from the abstract and citation corpora of breakthrough research. These feature words help characterize the abstract and citation text of breakthrough research and contribute to its identification. [Method/process] Key papers selected by Science as "Breakthrough of the Year" and the "key publications" of Nobel Prize winners were chosen as the breakthrough research corpus. Feature words were extracted from the combined abstracts and citation text of these papers. In the extraction step, the Stanford CoreNLP tool was used to tokenize the corpus and compute word frequencies, and the feature tokens were selected in combination with expert opinion. The feature words were then used as seed words and semantically expanded using the semantic relationships in medical texts. Finally, recall and precision were used to compare the retrieval and identification performance of the abstract and citation feature words before and after expansion. [Result/conclusion] Eight feature tokens were selected from the abstract corpus and eight from the citation corpus. In retrieval-based identification, the expanded feature words of abstracts and citations had the highest recall, and the citation feature words had the highest precision. The expanded citation feature words gave the better overall combination of recall and precision.

Key words: breakthrough research, feature words, abstract text, citing sentence
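
The recall and precision comparison described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy documents, and the feature words "first" and "novel" are hypothetical and are not taken from the paper, which does not list its feature words in the abstract. Matching is simplified to whole-word, case-insensitive lookup.

```python
import re


def matches(text, feature_words):
    """Return True if any feature word occurs in text as a whole word (case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return any(word.lower() in tokens for word in feature_words)


def precision_recall(docs, labels, feature_words):
    """Compute (precision, recall) of feature-word retrieval.

    docs: list of text strings (abstracts or citing sentences).
    labels: list of bools, True if the document belongs to breakthrough research.
    """
    retrieved = [matches(doc, feature_words) for doc in docs]
    true_positives = sum(r and l for r, l in zip(retrieved, labels))
    n_retrieved = sum(retrieved)
    n_relevant = sum(labels)
    precision = true_positives / n_retrieved if n_retrieved else 0.0
    recall = true_positives / n_relevant if n_relevant else 0.0
    return precision, recall


# Toy corpus, invented for illustration only (not data from the paper).
docs = [
    "We report the first direct observation of this effect.",
    "This novel mechanism overturns the prevailing model.",
    "We replicate earlier findings in a larger cohort.",
    "A novel dataset for routine benchmarking is described.",
]
labels = [True, True, False, False]

seed_words = ["first"]               # hypothetical seed feature word
expanded_words = ["first", "novel"]  # hypothetical semantic expansion

print(precision_recall(docs, labels, seed_words))
print(precision_recall(docs, labels, expanded_words))
```

On this toy corpus the expanded word list retrieves every labeled document but also one unlabeled document, so recall rises while precision falls. This shows only the general trade-off being measured and says nothing about the magnitudes reported in the paper.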

CLC number: