Knowledge Organization

Mining and Evolution of Content Topics Based on Dynamic LDA

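  • Hu Jiming,
  • Chen Guo
  • Center for Studies of Information Resources, Wuhan University, Wuhan 430072
Hu Jiming is a lecturer at the Center for Studies of Information Resources, Wuhan University, E-mail: whuhujiming@qq.com; Chen Guo is a doctoral candidate at the Center for Studies of Information Resources, Wuhan University.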

Received date: 2013-11-13

  Revised date: 2014-01-04

  Online published: 2014-01-20

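Funding

This article is one of the research outputs of the Ministry of Education Humanities and Social Sciences Youth Fund project "Research on Topic Mining and Semantic Classification of Information Content in the Social Network Environment" (Project No. 13YJC870008) and the National Natural Science Foundation of China Youth Fund project "Research on Information Recommendation Based on User-Resource Association in the Social Network Environment" (Project No. 71303178).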

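Abstract

The mining and evolution of text content topics play an important role in text modeling and in improving classification and recommendation performance. Starting from an analysis of the principles of LDA-based text topic mining, and in view of the characteristics of text content in the current network environment, this article constructs an LDA model suited to mining topics from dynamic text content and improves the accuracy of topic mining through an improved Gibbs sampling estimation. It then studies how content topics evolve over time from two aspects: topic similarity and topic intensity. Experiments show that the proposed method is feasible and effective, and that it has practical significance for subsequent research on text semantic modeling and classification.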
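To make the Gibbs sampling step concrete, the following is a minimal sketch of standard collapsed Gibbs sampling for LDA in plain Python. It is not the improved sampler the article proposes, whose details are not given here; the function name `lda_gibbs`, the hyperparameter values, and the toy corpus are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Standard collapsed Gibbs sampling for LDA (illustrative, not the article's variant).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic mixtures and topic-word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic assignment per token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates from the final counts.
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus: word ids 0-2 and 3-5 occur in separate documents.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=6)
```

In a dynamic setting, a sampler like this would presumably be run per time slice, with the resulting `phi` and `theta` compared across slices.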

Cite this article

Hu Jiming, Chen Guo. Mining and Evolution of Content Topics Based on Dynamic LDA[J]. Library and Information Service (图书情报工作), 2014, 58(02): 138-142. DOI: 10.13266/j.issn.0252-3116.2014.02.023

Abstract

The mining and evolution of text topics are important for text modeling and classification, as well as for recommendation services. Starting from an analysis of the theory of LDA-based text topic modeling, and addressing the dynamic character of text content in the social networking environment, this article constructs a dynamic LDA model for mining text topics. The accuracy of topic mining is then improved by incremental Gibbs sampling and estimation. Furthermore, the evolution of dynamic topics in text content is analyzed in terms of topic similarity and topic intensity. Experiments demonstrate that the proposed methods are feasible and effective, and they provide a foundation for further study of semantic modeling and text classification.
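The two evolution measures named in the abstract can be illustrated with a small sketch. The abstract does not state which formulas the article uses, so the choices below are assumptions: similarity is taken as one minus the Jensen-Shannon divergence between a topic's word distributions in adjacent time slices, and intensity as the mean topic proportion over the documents of a slice. All function names and numbers are hypothetical.

```python
import math

def js_similarity(p, q):
    """Similarity in [0, 1] between two word distributions:
    1 minus the Jensen-Shannon divergence (log base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

def topic_intensity(theta_slice):
    """Mean proportion of each topic over the documents of one time slice."""
    n_docs = len(theta_slice)
    n_topics = len(theta_slice[0])
    return [sum(doc[k] for doc in theta_slice) / n_docs for k in range(n_topics)]

# Hypothetical word distributions of one topic in two adjacent time slices.
phi_t1 = [0.5, 0.3, 0.2, 0.0]
phi_t2 = [0.4, 0.3, 0.2, 0.1]
sim = js_similarity(phi_t1, phi_t2)

# Hypothetical document-topic mixtures for the two slices (two topics each).
theta_t1 = [[0.9, 0.1], [0.7, 0.3]]
theta_t2 = [[0.4, 0.6], [0.2, 0.8]]
intensity_t1 = topic_intensity(theta_t1)
intensity_t2 = topic_intensity(theta_t2)
```

Under these assumptions, a similarity close to 1 suggests the topic's vocabulary is stable between slices, while a change in intensity (here the second topic grows from the first slice to the second) suggests the topic is gaining or losing attention.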

