Received date: 2014-01-19
Revised date: 2014-02-14
Online published: 2014-03-05
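Funding
This paper is one of the research outcomes of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Grant No. 71273194).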
Hotspot Mining Based on LDA Model and Microblog Heat
Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service, 2014, 58(05): 58-63. DOI: 10.13266/j.issn.0252-3116.2014.05.010
This paper analyses the shortcomings of the traditional LDA (Latent Dirichlet Allocation) model in microblog hotspot mining, chiefly that the probability results it produces are abstract and difficult to interpret. Taking into account the characteristics of microblogs and the notion of information quantity from information theory, it proposes the concept of microblog heat, introduces it into LDA-based hotspot mining, and constructs an LDA model based on microblog heat. Experiments on microblog data collected through an API show that the new method performs as well as the original one and, in addition, produces a more intuitive table of microblog heat and supports more convincing conclusions.
Key words: LDA; microblog heat; topic model; hotspot mining