Information Research

Topic Discovery and Evolution of Scientific Literature Based on a Hierarchical Probabilistic Topic Model

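  • Wang Ping
  • School of Information Management, Wuhan University
Wang Ping is a lecturer and postdoctoral researcher at the School of Information Management, Wuhan University. E-mail: wangping@whu.edu.cn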

Received date: 2014-09-01

Revised date: 2014-11-05

Online published: 2014-11-20

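Funding

This article is one of the research outcomes of the National Natural Science Foundation of China Young Scientists Fund project "A Multi-Factor Fusion Model for Evaluating Microblog Topic Credibility and an Empirical Study" (Grant No. 71303179).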

Topic Extraction and Evolution for Scientific Literature Based on Hierarchical Probabilistic Topic Model

  • Wang Ping
  • School of Information Management, Wuhan University, Wuhan 430072

Received date: 2014-09-01

Revised date: 2014-11-05

Online published: 2014-11-20

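Abstract (Chinese version)

Automatically mining the topics of scientific literature and identifying how those topics change plays an important role in helping researchers obtain the latest research developments in their fields in a timely manner. In view of the diversity and strong dynamics of topics in scientific literature, this paper analyzes concrete methods for topic discovery and evolution. Based on the hierarchical probabilistic topic model hLDA, it uses Gibbs sampling to estimate the model parameters and applies mutual information to filter topic words in order to extract high-quality ones. Finally, it uses pre-/post-discretized analysis to study how topics evolve over time. The experimental results verify the feasibility and effectiveness of the topic discovery and evolution method.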
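The mutual-information filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the paper, so everything here is an assumption: each document is assumed to carry a single dominant topic label, mutual information is computed between "word occurs in the document" and "document belongs to the topic" from a 2x2 document-count table, and only words that are over-represented inside the topic are kept. The function names `mutual_information` and `top_words_by_mi` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between word presence and topic membership.

    n11: documents in the topic that contain the word
    n10: documents outside the topic that contain the word
    n01: documents in the topic without the word
    n00: documents outside the topic without the word
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # word present, in topic
        (n10, n11 + n10, n10 + n00),  # word present, outside topic
        (n01, n01 + n00, n11 + n01),  # word absent, in topic
        (n00, n01 + n00, n10 + n00),  # word absent, outside topic
    )
    mi = 0.0
    for joint, word_marginal, topic_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += (joint / n) * math.log2(n * joint / (word_marginal * topic_marginal))
    return mi


def top_words_by_mi(docs, labels, topic, k=3):
    """Rank candidate topic words by mutual information with the topic.

    Assumption: docs are token lists and labels give one dominant topic per
    document. Words that are not over-represented in the topic are skipped,
    because mutual information alone is symmetric and would also reward
    words that are characteristically absent from the topic.
    """
    n_docs = len(docs)
    in_topic = sum(1 for lab in labels if lab == topic)
    out_topic = n_docs - in_topic
    df_all = Counter(w for d in docs for w in set(d))
    df_topic = Counter(
        w for d, lab in zip(docs, labels) if lab == topic for w in set(d)
    )
    scores = {}
    for word, df in df_all.items():
        n11 = df_topic[word]
        n10 = df - n11
        if n11 * out_topic <= n10 * in_topic:
            continue  # not positively associated with the topic
        scores[word] = mutual_information(n11, n10, in_topic - n11, out_topic - n10)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus, invented purely for illustration.
docs = [
    ["topic", "model", "lda"],
    ["topic", "model", "gibbs"],
    ["citation", "network", "model"],
    ["citation", "index", "model"],
]
labels = ["A", "A", "B", "B"]
best_for_a = top_words_by_mi(docs, labels, "A", k=1)
```

In this toy corpus "model" appears in every document, so it carries no information about topic A and is filtered out, while "topic" separates the two groups perfectly.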

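Cite this article

Wang Ping. Topic Extraction and Evolution for Scientific Literature Based on Hierarchical Probabilistic Topic Model[J]. Library and Information Service (图书情报工作), 2014, 58(22): 70-77. DOI: 10.13266/j.issn.0252-3116.2014.22.012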

Abstract

Automatically mining the topics of scientific literature and detecting how they change helps researchers keep up with the latest research frontiers in their fields. Considering the diversity and dynamics of scientific papers, this paper analyzes approaches to topic extraction and evolution. The approach is based on the hierarchical probabilistic topic model hLDA: Gibbs sampling is used to estimate the model parameters, and mutual information is used to select high-quality topic words. Finally, pre-/post-discretized analysis is used to study topic evolution over time. The experimental results show that the proposed topic extraction and evolution method is feasible and effective.
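As a rough illustration of the post-discretized side of the evolution analysis: in post-discretized analysis the topic model is fitted once on the whole collection without using time, and the documents are only afterwards grouped into time slices to see how topic strength changes. (Pre-discretized analysis instead splits the collection into time slices first and models each slice.) The sketch below assumes that per-document topic proportions are already available from a fitted model and that topic strength in a year is the mean proportion over that year's documents; the function name `topic_strength_by_year` and the sample numbers are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def topic_strength_by_year(doc_topics, years):
    """Average document-topic proportions within each publication year.

    doc_topics: one list of topic proportions per document (assumed to come
                from a topic model fitted on the whole collection)
    years:      publication year of each document, in the same order
    Returns {year: [mean strength of topic 0, topic 1, ...]}, sorted by year.
    """
    sums = {}
    counts = defaultdict(int)
    for theta, year in zip(doc_topics, years):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, proportion in enumerate(theta):
            sums[year][k] += proportion
        counts[year] += 1
    return {
        year: [total / counts[year] for total in sums[year]]
        for year in sorted(sums)
    }


# Invented proportions for three documents and two topics.
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
years = [2012, 2012, 2013]
trend = topic_strength_by_year(doc_topics, years)
```

Plotting each topic's value across the sorted years gives a simple rising or falling trend line per topic.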

