Hotspot Mining Based on LDA Model and Microblog Heat

  • Tang Xiaobo
  • Xiang Kun
  • Center for Studies of Information System, Wuhan University, Wuhan 430072

Received date: 2014-01-19

Revised date: 2014-02-14

Online published: 2014-03-05

Abstract

This paper analyses the shortcomings of the traditional LDA (Latent Dirichlet Allocation) model in microblog hotspot mining, chiefly that the probability results it produces are abstract and difficult to interpret. Taking into account the characteristics of microblogs and the notion of information quantity from information theory, it proposes the concept of microblog heat, introduces it into LDA-based hotspot mining, and constructs an LDA model based on microblog heat. Experiments on microblog data collected through an API show that the new method performs as well as the traditional model while also producing a more intuitive table of microblog heat and supporting more convincing conclusions.
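The idea of combining topic probabilities with a per-post heat value can be illustrated with a small sketch. The abstract does not give the paper's heat formula, so everything here is an assumption for illustration only: the function names `microblog_heat` and `heat_weighted_topic_scores` are hypothetical, heat is taken to be a log-scaled count of reposts and comments, and the document-topic proportions are toy values standing in for the output of a fitted LDA model.

```python
import math


def microblog_heat(reposts, comments):
    """Hypothetical heat score (an assumption, not the paper's definition):
    a log-scaled count of interactions, so heavily shared posts count more
    without dominating linearly."""
    return math.log2(1 + reposts + comments)


def heat_weighted_topic_scores(doc_topic, heats):
    """Weight each post's topic proportions by its heat, sum per topic,
    and normalise so the scores add up to 1."""
    num_topics = len(doc_topic[0])
    scores = [0.0] * num_topics
    for theta, heat in zip(doc_topic, heats):
        for k, proportion in enumerate(theta):
            scores[k] += heat * proportion
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


# Toy document-topic proportions, as a fitted LDA model might produce them.
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
# (reposts, comments) per post; illustrative values only.
interactions = [(1023, 0), (0, 0), (1, 0)]

heats = [microblog_heat(r, c) for r, c in interactions]
scores = heat_weighted_topic_scores(doc_topic, heats)
hottest_topic = max(range(len(scores)), key=scores.__getitem__)
```

In this toy data two of the three posts are mostly about the second topic, yet the first topic ranks as hottest because the one post about it attracts far more interaction. A ranking of this kind is easier to read than raw topic probabilities, which is the sort of interpretability gain the abstract describes.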

Cite this article

Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service, 2014, 58(05): 58-63. DOI: 10.13266/j.issn.0252-3116.2014.05.010

