Library and Information Service (图书情报工作), 2016, Vol. 60, Issue (9): 131-138, 146. DOI: 10.13266/j.issn.0252-3116.2016.09.018

• Knowledge Organization •

Research on Document Topic Extraction Based on Semantic and Citation Weighting (基于语义和引用加权的文献主题提取研究)

Yang Chunyan (杨春艳)1, Pan Youneng (潘有能)2, Zhao Li (赵莉)1

  1. Library and Information Center, Ningbo University, Ningbo 315211;
  2. College of Public Administration, Zhejiang University, Hangzhou 310028
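  • Received: 2016-01-04; Revised: 2016-04-20; Published: 2016-05-05; Released: 2016-05-05
  • About the authors: Yang Chunyan (ORCID: 0000-0002-1178-3851), assistant librarian, E-mail: yangchunyan@nbu.edu.cn; Pan Youneng, associate professor and master's supervisor; Zhao Li, librarian.
  • Funding:

    This research is one of the outcomes of the National Social Science Fund of China project "Research on Knowledge Organization and Service Standards for Academic Big Data" (project No. 15FTQ002).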

Study on Topic Extraction of Literatures Based on Weighted Semantic and Citation Relation

Yang Chunyan1, Pan Youneng2, Zhao Li1   

  1. Library and Information Center, Ningbo University, Ningbo 315211;
  2. College of Public Administration, Zhejiang University, Hangzhou 310028
  • Received: 2016-01-04; Revised: 2016-04-20; Online: 2016-05-05; Published: 2016-05-05

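Abstract (translated from the Chinese):

[Purpose/significance] Traditional topic extraction methods mainly derive a document's topics from its keywords, abstract, or full text, which leaves the extracted topics incomplete or noisy. Starting from the semantics of the document content and combining it with citation content makes it possible to extract the topics of multiple documents more accurately. [Method/process] This paper proposes a multi-document topic extraction algorithm for scientific literature based on semantic and citation weighting. It uses the citation content and keywords of the documents to build a Labeled-LDA topic model and form document-topic probability vectors, then clusters the documents with the K-means method and extracts the topics of each document cluster. [Result/conclusion] Data from the PubMed biomedical database serve as experimental data for testing the reliability of the method. The results show that the method can extract the topics of multiple documents accurately and comprehensively.

Keywords: Labeled-LDA model, citation content, topic extraction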

Abstract:

[Purpose/significance] Traditional topic extraction methods mainly derive the topics of a document from its keywords, abstract, or full text, so the results are often incomplete or noisy. A method that starts from the semantics of the document content and incorporates citation content can extract topics more accurately. [Method/process] This article proposes a topic extraction algorithm for multiple documents based on weighted semantic and citation relations. It builds a Labeled-LDA topic model from the citation content and keywords of the documents to obtain a document-topic probability distribution, then clusters the documents with the K-means method and extracts the topics of each cluster. [Result/conclusion] The method is tested on data downloaded from the PubMed database. The results show that it can extract the topics of multiple documents accurately and comprehensively.

Key words: Labeled-LDA Model, citation content, topic extraction
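
The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it skips the Labeled-LDA step entirely and assumes the document-topic probability vectors are already available. The vectors in `doc_topic`, the names in `topic_names`, and the helper functions are all hypothetical, and the farthest-point initialization is just one deterministic choice made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=50):
    """Plain K-means; returns (labels, centroids)."""
    # Deterministic farthest-point initialization (illustrative choice).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def dominant_topics(centroids, names, top_n=1):
    """For each cluster, list the topic names with the highest mean probability."""
    result = []
    for c in centroids:
        order = sorted(range(len(c)), key=lambda i: c[i], reverse=True)
        result.append([names[i] for i in order[:top_n]])
    return result


# Hypothetical document-topic vectors (rows: documents, columns: topics),
# standing in for the output of a topic model such as Labeled-LDA.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
topic_names = ["gene expression", "clinical trial", "protein structure"]  # made up

labels, centroids = kmeans(doc_topic, k=3)
topics = dominant_topics(centroids, topic_names)
for cluster_id, names in enumerate(topics):
    docs = [i for i, lab in enumerate(labels) if lab == cluster_id]
    print(cluster_id, docs, names)
```

Reading the dominant topic off each cluster centroid is only one simple way to summarize a cluster; the abstract does not say how the paper itself derives the topics of each document set.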

CLC number (Chinese Library Classification):