Information Research

Microblog Query Likelihood Model Based on Multi-Source Information Fusion

  • Wu Shufang (吴树芳),
  • Zhang Xiongtao (张雄涛),
  • Zhu Jie (朱杰)
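  • 1 School of Management, Hebei University, Baoding 071000;
    2 Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083;
    3 Department of Information Management, The Central Institute for Correctional Police, Baoding 071000
Wu Shufang (ORCID: 0000-0001-6587-812X), professor, PhD, doctoral supervisor; Zhu Jie (ORCID: 0000-0002-5698-135X), associate professor, PhD.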

Received date: 2019-12-16

  Revised date: 2020-05-04

  Online published: 2020-09-05

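Funding

This work is one of the research outcomes of the National Social Science Fund of China project "Research on Identifying Untrustworthy Social Network Users from the Perspective of Network Information Governance" (Grant No. 17BTQ068).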


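Abstract

[Purpose/Significance] The query likelihood model suffers from the zero-probability problem. Extending the model by fusing multi-source information not only solves this problem but also allows global information to be weighted differentially, which reduces noise. [Method/Process] LDA topic mining and interest mining over historical microblogs are used to obtain, respectively, the topic-related and interest-related information of an initial microblog. Both are fused with global information to improve the estimate of the initial microblog's language model, yielding an extended microblog query likelihood model. Data were crawled from Sina Weibo with a web crawler, and an empirical study verifies the effectiveness of the extended model. [Result/Conclusion] Experimental results show that, compared with existing extensions of the query likelihood model, the new model achieves better retrieval performance.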
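The zero-probability problem and the idea of fusing several information sources can be illustrated with a minimal sketch. It assumes a plain linear interpolation of four unigram language models (the microblog itself, its topic-related text, its interest-related text, and the global collection). The function names, the toy tokens, and the mixture weights are all hypothetical, and the interpolation is a generic stand-in rather than the estimation formula used in the article.

```python
import math
from collections import Counter


def unigram_lm(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def fused_log_likelihood(query, doc_lm, topic_lm, interest_lm, global_lm,
                         weights=(0.5, 0.2, 0.2, 0.1)):
    """Return log P(query | microblog) under a linear mixture of four models.

    The weights are illustrative assumptions and should sum to 1.
    """
    sources = (doc_lm, topic_lm, interest_lm, global_lm)
    score = 0.0
    for word in query:
        p = sum(lam * lm.get(word, 0.0) for lam, lm in zip(weights, sources))
        if p == 0.0:
            # Zero-probability problem: one unseen word rules out the microblog.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data (hypothetical): one microblog plus its three auxiliary sources.
doc = ["rain", "in", "baoding", "today"]
topic = ["rain", "weather", "storm", "forecast"]
interest = ["weather", "travel", "baoding"]
collection = doc + topic + interest + ["news", "sports"]
query = ["weather", "baoding"]

# Using the microblog alone: "weather" never occurs in it.
unsmoothed = fused_log_likelihood(
    query, unigram_lm(doc), {}, {}, {}, weights=(1.0, 0.0, 0.0, 0.0))

# Fusing topic, interest, and global information fills the gap.
fused = fused_log_likelihood(
    query, unigram_lm(doc), unigram_lm(topic),
    unigram_lm(interest), unigram_lm(collection))

print(unsmoothed, fused)
```

With the microblog model alone the query word "weather" has probability zero, so the score collapses to negative infinity; once the other sources contribute probability mass the score is finite and microblogs can be ranked against each other. Giving the global collection a smaller weight than the topic and interest sources is one simple way to treat the sources differently.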

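Cite this article

Wu Shufang, Zhang Xiongtao, Zhu Jie. Microblog Query Likelihood Model Based on Multi-Source Information Fusion[J]. Library and Information Service, 2020, 64(17): 114-122. DOI: 10.13266/j.issn.0252-3116.2020.17.012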

