Library and Information Service ›› 2018, Vol. 62 ›› Issue (4): 113-120. DOI: 10.13266/j.issn.0252-3116.2018.04.015

• Knowledge Organization •

Research on the Text Clustering Method of Science and Technology Reports Based on the Topic Model

Qu Jingye1,2, Chen Zhen1, Zheng Yanning2

  1. Information Technology and Media College of Beihua University, Jilin 132013;
  2. Institute of Scientific and Technical Information of China, Beijing 100038
  • Received: 2017-08-12 Revised: 2017-11-13 Online: 2018-02-20 Published: 2018-02-20
  • Corresponding author: Chen Zhen (ORCID: 0000-0002-3522-5272), senior experimentalist, PhD, E-mail: 59975235@qq.com
  • About the authors: Qu Jingye (ORCID: 0000-0002-1715-1919), associate professor, PhD; Zheng Yanning (ORCID: 0000-0003-3885-7459), research librarian, doctoral supervisor.
  • Funding:
    This paper is one of the research outcomes of the Jilin Province Education Science "13th Five-Year Plan" project "Research on the Application of Project-Based Teaching in Basic Computer Teaching at Universities" (Project No. GH170061).

Abstract: [Purpose/significance] This paper explores a text clustering method for science and technology reports that incorporates a topic model, extends technology monitoring services based on scientific literature to a new area, and proposes a new method for semantic analysis of science and technology reports. [Method/process] Using reports from the National Science and Technology Report Service System as the data source, the study first mines topics from the preprocessed reports with the LDA topic model, then applies a clustering algorithm combining Ward and K-means to the text vectors of the abstracts, which contain the topic distribution information, in an attempt to arrive at a text mining method suited to clustering science and technology report documents. [Result/conclusion] The experimental results show that the LDA topic model mines the topic information in science and technology reports effectively and accurately, and that the proposed combination of Ward and K-means clusters these reports better than other traditional clustering algorithms.

Key words: science and technology report, topic model, LDA, text clustering
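
The abstract does not spell out how Ward and K-means are combined. One common reading, assumed here, is that Ward's hierarchical clustering supplies the initial partition and its centroids then seed K-means. The sketch below illustrates that reading in plain Python; the function names and the `doc_topics` matrix are hypothetical, with the matrix standing in for the per-report topic distributions an LDA model would produce. It is not the authors' implementation.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(vectors, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Each step merges the pair of clusters whose union increases the total
    within-cluster sum of squares the least.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca = centroid([vectors[i] for i in clusters[a]])
            cb = centroid([vectors[i] for i in clusters[b]])
            na, nb = len(clusters[a]), len(clusters[b])
            cost = na * nb / (na + nb) * sq_dist(ca, cb)  # Ward merge cost
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def kmeans(vectors, centers, max_iter=100):
    """Lloyd's K-means starting from the given initial centers."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = centroid(members)
    return labels, centers


def ward_kmeans(vectors, k):
    """Assumed combination: seed K-means with centroids of a Ward partition."""
    groups = ward_clusters(vectors, k)
    seeds = [centroid([vectors[i] for i in group]) for group in groups]
    return kmeans(vectors, seeds)


# Hypothetical document-topic distributions (each row sums to 1), standing in
# for the per-report output of an LDA model with three topics.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]

labels, centers = ward_kmeans(doc_topics, k=3)
print(labels)  # prints [0, 0, 1, 1, 2, 2]
```

Seeding K-means from a Ward partition removes the dependence on random initial centers, which is one plausible motivation for combining the two algorithms; the number of clusters `k` is still chosen by the caller.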
