Library and Information Service (图书情报工作) ›› 2020, Vol. 64 ›› Issue (6): 108-119. DOI: 10.13266/j.issn.0252-3116.2020.06.013

• Knowledge Organization •

英文科技论文摘要的语义特征词典构建

宋东桓1,2, 李晨英1, 刘子瑜1, 韩明杰1   

  1. 中国农业大学图书馆 北京 100193;
  2. 中国科学院文献情报中心 北京 100190
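  • Received: 2019-07-09 Revised: 2019-09-20 Published: 2020-03-20 Released: 2020-03-20
  • Corresponding author: Li Chenying (ORCID: 0000-0002-1207-4336), research librarian, master's supervisor, E-mail: licy@cau.edu.cn
  • About the authors: Song Donghuan (ORCID: 0000-0001-5671-3796), assistant librarian, master's student; Liu Ziyu (ORCID: 0000-0002-5850-3079), associate research librarian, PhD; Han Mingjie (ORCID: 0000-0003-4611-1569), research librarian, master's supervisor.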

Semantic Feature Dictionary Construction of Abstract in English Scientific Journals

Song Donghuan1,2, Li Chenying1, Liu Ziyu1, Han Mingjie1   

  1. China Agricultural University Library, Beijing 100193;
  2. National Science Library, Chinese Academy of Sciences, Beijing 100190
  • Received: 2019-07-09 Revised: 2019-09-20 Online: 2020-03-20 Published: 2020-03-20

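Abstract: [Purpose/Significance] Paper abstracts are an important indexing object in information organization, and indexing them by a defined structure benefits scientific communication, knowledge discovery, and intelligence analysis. How to index existing unstructured abstracts automatically, accurately, and quickly is a practical problem in urgent need of a solution. [Method/Process] The study assumes that different categories of abstracts are inherently consistent, so that research on structured abstracts can offer methodological and technical reference for automatically indexing unstructured ones. On this basis, drawing on the US National Library of Medicine's structural element label terminology and its label-category mappings, it proposes the BOMRC system of structural elements and a method for identifying structured abstracts and normalizing their indexing. Next, research samples are selected and text mining is applied to the sample corpus to compute quantitative statistics on several indicators, such as word frequency and TF-IDF value, for single words, verbs, three-word chunks, and four-word chunks, yielding a semantic feature dictionary that can identify structural elements. Finally, a test set of unstructured abstracts is used to verify the dictionary's validity. [Results/Conclusions] The results show that the semantic feature dictionary method can effectively identify the various elements of unstructured abstracts and can be used to optimize automatic recognition models centered on machine learning methods.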

Keywords: scientific papers, paper abstracts, structural elements, semantic features, feature dictionary

Abstract: [Purpose/significance] Abstracts of scientific papers are a vital indexing object in information organization, and indexing them according to defined rules benefits scientific communication, knowledge discovery, and intelligence analysis. How to index the millions of existing unstructured abstracts automatically, accurately, and quickly is therefore a crucial problem. [Method/process] This study assumes that different categories of abstracts are inherently consistent, so that research on structured abstracts can provide methods and technical reference for the automatic indexing of unstructured abstracts. On this assumption, and based on the US National Library of Medicine's structural element label terminology, the study maps labels across abstract element classifications and proposes the BOMRC system, a normalized indexing method for structured abstracts. Research samples were then collected, and text mining was used to analyze multiple features of structured abstracts quantitatively, such as word frequency and TF-IDF value, across words, verbs, three-word lexical chunks, and four-word lexical chunks, which made it possible to build a semantic feature dictionary for structural elements. Finally, unstructured abstracts were used to test the validity of the semantic feature dictionary. [Result/conclusion] The results show that the semantic feature dictionary method can effectively identify the various structural elements of scientific paper abstracts, and that it can be used to optimize automatic recognition models based on machine learning methods.
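To make the dictionary idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes sentences from structured abstracts already carry one-letter element labels (the keys `B`, `O`, `M`, `R`, `C` are used here as opaque labels; the abstract does not spell them out), scores single words, three-word chunks, and four-word chunks with a simple TF-IDF variant that treats each label as one "document", and tags a new sentence with the label whose dictionary entries it matches most strongly. The function names, the toy corpus, the scoring formula, and the `top_k` cutoff are all illustrative assumptions; verb extraction is omitted because it would need a part-of-speech tagger.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return contiguous n-word chunks as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(sentence):
    """Single words plus three-word and four-word chunks of a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return tokens + ngrams(tokens, 3) + ngrams(tokens, 4)


def build_dictionary(labeled_sentences, top_k=50):
    """Build {label: {feature: score}} from (label, sentence) pairs.

    Assumed scoring: relative frequency within a label times
    log(number of labels / number of labels containing the feature).
    """
    tf = defaultdict(Counter)
    for label, sentence in labeled_sentences:
        tf[label].update(features(sentence))

    df = Counter()  # in how many labels each feature occurs
    for counts in tf.values():
        df.update(counts.keys())

    n_labels = len(tf)
    dictionary = {}
    for label, counts in tf.items():
        total = sum(counts.values())
        scored = {
            feat: (count / total) * math.log(n_labels / df[feat])
            for feat, count in counts.items()
        }
        # Keep only discriminative features (score > 0), strongest first.
        ranked = sorted(
            ((f, s) for f, s in scored.items() if s > 0),
            key=lambda item: -item[1],
        )
        dictionary[label] = dict(ranked[:top_k])
    return dictionary


def classify(sentence, dictionary):
    """Return the best-matching label, or None if nothing matches."""
    feats = features(sentence)
    totals = {
        label: sum(entries.get(f, 0.0) for f in feats)
        for label, entries in dictionary.items()
    }
    best = max(totals, key=totals.get, default=None)
    return best if best is not None and totals[best] > 0 else None


# Hypothetical toy corpus, invented purely for illustration.
TOY_CORPUS = [
    ("O", "The aim of this study was to evaluate the method."),
    ("M", "Samples were collected and analyzed using text mining."),
    ("R", "The results showed that accuracy improved significantly."),
    ("C", "We conclude that the dictionary is effective."),
]

if __name__ == "__main__":
    toy_dictionary = build_dictionary(TOY_CORPUS)
    print(classify("The aim of this work was to test a model.", toy_dictionary))
```

Treating each label as a single document keeps the sketch small; a real corpus would more plausibly compute document frequency over individual abstracts and combine several indicators, as the abstract describes.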

Key words: scientific paper, paper abstract, structural element, semantic feature, feature dictionary

Chinese Library Classification (CLC) number: