Information Research

Hotspot Mining Based on LDA Model and Microblog Heat

  • Tang Xiaobo,
  • Xiang Kun
  • Center for Studies of Information System, Wuhan University, Wuhan 430072
Tang Xiaobo is a professor and doctoral supervisor at the Center for Studies of Information System, Wuhan University; Xiang Kun is a master's student at the same center and the corresponding author, E-mail: 350109583@QQ.com.

Received date: 2014-01-19

Revised date: 2014-02-14

Online published: 2014-03-05

Funding

This paper is one of the research outcomes of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Project No. 71273194).


Cite this article

Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service (图书情报工作), 2014, 58(05): 58-63. DOI: 10.13266/j.issn.0252-3116.2014.05.010

Abstract

This paper analyzes the shortcomings of the traditional LDA (Latent Dirichlet Allocation) model in microblog hotspot mining, namely that the probability results it produces are abstract and difficult to interpret in practical terms. Taking into account the characteristics of microblog data and the notion of information quantity from information theory, it proposes the concept of microblog heat, introduces it into LDA-based hotspot mining, and constructs an LDA model based on microblog heat. Experiments on microblog data collected through an API show that the new method matches the performance of the old one while also producing a more intuitive microblog heat table and supporting more convincing mining conclusions.
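
The abstract does not give the paper's formula for microblog heat, so the following is only a minimal sketch of the general idea under stated assumptions: each post gets a hypothetical heat score (here a log-damped count of reposts and comments, chosen purely for illustration), and that heat is distributed over topics according to document-topic proportions that an LDA model would supply. The function names, the heat definition, and the toy numbers are all assumptions and are not taken from the paper.

```python
import math


def microblog_heat(reposts, comments):
    """Hypothetical heat score (an assumption, not the paper's definition):
    a log-damped count of interactions, so a few viral posts do not
    dominate the table."""
    return math.log2(1 + reposts + comments)


def topic_heat_table(doc_topic, heats):
    """Distribute each post's heat over topics in proportion to its
    document-topic distribution, then sum per topic."""
    n_topics = len(doc_topic[0])
    table = [0.0] * n_topics
    for theta, heat in zip(doc_topic, heats):
        for k, share in enumerate(theta):
            table[k] += share * heat
    return table


# Toy inputs: in practice doc_topic would come from a fitted LDA model.
doc_topic = [
    [0.9, 0.1],  # post 0 is mostly about topic 0
    [0.2, 0.8],  # post 1 is mostly about topic 1
    [0.5, 0.5],  # post 2 is split evenly
]
interactions = [(7, 0), (0, 0), (1, 2)]  # (reposts, comments) per post

heats = [microblog_heat(r, c) for r, c in interactions]
table = topic_heat_table(doc_topic, heats)

# Rank topics from hottest to coolest.
ranking = sorted(range(len(table)), key=lambda k: table[k], reverse=True)

for k in ranking:
    print(f"topic {k}: heat {table[k]:.2f}")
```

A table of this kind reports topics in units tied to observable interactions rather than as bare probabilities, which is the sort of interpretability gain the abstract describes; the actual weighting used in the paper may differ.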

