Library and Information Service (图书情报工作) ›› 2018, Vol. 62 ›› Issue (1): 132-139. DOI: 10.13266/j.issn.0252-3116.2018.01.017

• Knowledge Organization •

Empirical Research on TF-IDF Assisted Indexing Algorithm Based on Users' Natural Annotation

Chen Baixue, Song Peiyan

  1. Institute of Scientific and Technical Information of China, Beijing 100038
  • Received: 2017-07-10 Revised: 2017-10-23 Online: 2018-01-05 Published: 2018-01-05
  • About the authors: Chen Baixue (ORCID: 0000-0003-4726-8103), research intern, master's degree, E-mail: chenbx@istic.ac.cn; Song Peiyan (ORCID: 0000-0003-1055-2717), associate research librarian, PhD, master's supervisor
  • Funding:
    This work is one of the research outcomes of the 2016 National Social Science Fund of China project "Research on Discovering Review Experts for Scientific Research Projects Based on Knowledge Organization" (No. 16BTQ079) and the 2017 Institute of Scientific and Technical Information of China Innovation Research Fund general project "Research on Dynamic Construction Methods for Knowledge Graphs Oriented to National Science and Technology Big Data" (No. MS2017-06).

Abstract: [Purpose/significance] This paper studies, from the user's perspective, a TF-IDF assisted indexing algorithm based on users' natural annotation. [Method/process] Author-assigned keywords and classification numbers from core journal papers serve as the source data. Keyword frequencies are counted and the TF-IDF algorithm is applied to build a user annotation vocabulary, which forms the indexing knowledge base. The scientific and technological project data to be indexed are then segmented and filtered for stop words with the IK Analyzer word segmentation software, and feature words are extracted using the TF-IDF algorithm together with a position weighting algorithm. Finally, keywords and classification numbers are assigned to the project data synchronously. [Result/conclusion] In the experiment, 68.1% of the project records had a similarity ratio of 60% or higher between machine-assigned and human-assigned keywords, and for 83.9% of the records the machine-assigned classification number agreed with the human-assigned one in its first three positions. These results indicate that TF-IDF indexing built on users' natural annotation data is feasible for both keyword indexing and classification indexing.

Key words: assisted indexing, user natural annotation, TF-IDF algorithm, information organization
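
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several pieces are assumptions made for illustration only: the input is taken to be already segmented (the paper uses IK Analyzer for that step), position weighting is reduced to a hypothetical `title_weight` multiplier for title terms, the IDF is a common smoothed variant, the knowledge base is a plain mapping from keyword to classification-number weights, and the similarity ratio is assumed to be the share of human-assigned keywords that the machine also produced. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed IDF over a list of tokenized documents (assumed formula)."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def feature_words(title, body, idf, stopwords=frozenset(),
                  title_weight=2.0, top_k=5):
    """Rank terms by position-weighted TF-IDF.

    Title occurrences count `title_weight` times; body occurrences count once.
    The weighting scheme is an assumption, not taken from the paper.
    """
    weighted = Counter()
    for tok in title:
        if tok not in stopwords:
            weighted[tok] += title_weight
    for tok in body:
        if tok not in stopwords:
            weighted[tok] += 1.0
    total = sum(weighted.values()) or 1.0
    # Terms unseen in the reference corpus get the largest known IDF.
    default_idf = max(idf.values(), default=1.0)
    scores = {t: (w / total) * idf.get(t, default_idf)
              for t, w in weighted.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def assign_class(features, knowledge_base):
    """Pick the classification number with the highest summed weight.

    `knowledge_base` maps keyword -> {classification number: weight}.
    """
    votes = Counter()
    for term in features:
        for code, weight in knowledge_base.get(term, {}).items():
            votes[code] += weight
    return max(sorted(votes), key=votes.get) if votes else None


def keyword_similarity(machine, human):
    """Share of human-assigned keywords also produced by the machine."""
    human = set(human)
    return len(human & set(machine)) / len(human) if human else 0.0


def same_class_prefix(machine_code, human_code, n=3):
    """True when both classification numbers share their first n positions."""
    return machine_code[:n] == human_code[:n]
```

In this sketch, `feature_words` supplies the machine-assigned keywords and `assign_class` reuses them to choose a classification number, which mirrors the idea of indexing keywords and classification together. `keyword_similarity` and `same_class_prefix` correspond to the two evaluation measures reported in the abstract, under the assumed definitions stated above.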

CLC number: