[目的/意义] 为帮助读者从热点事件产生的海量微博报道中快速了解事件的来龙去脉,提高微博事件摘要的准确性和可读性,提出一种基于事件要素的多模型微博热点事件时间轴摘要提取方法。[方法/过程] 针对微博文本特征,结合主题模型(LDA)与互信息最大熵模型(MaxEnt-MI)的特点提取事件摘要关键词,以微博传播价值和主题相关性为标准筛选微博,以时间-摘要关键词-摘要微博的形式生成时间轴摘要。[结果/结论] 利用人工标注的测试集,与传统的TextRank方法进行对比,F值提高8%-13%,内部测试表明摘要可读性提高明显。实验文本和测试集的数量及事件丰富度需要进一步扩展,应考虑更多的加权策略模型以提高摘要的准确性。实验结果及测试反馈表明,本文的方法能很好满足用户对热点事件摘要信息需求,提高微博摘要提取的准确率。
[Purpose/significance] To help readers quickly grasp how a hot event unfolded from the mass of micro-blog posts it generates, and to improve the accuracy and readability of micro-blog event summaries, we propose a multi-model method, based on event elements, for extracting timeline summaries of hot micro-blog events. [Method/process] Taking the characteristics of micro-blog text into account, we combine the characteristics of the topic model (LDA) and the mutual information maximum entropy model (MaxEnt-MI) to extract summary keywords for an event, select micro-blog posts according to their communication value and topic relevance, and generate a timeline summary in the form time, summary keywords, summary micro-blog. [Result/conclusion] On a manually annotated test set, the F value is 8% to 13% higher than that of the traditional TextRank method, and internal tests show a marked improvement in summary readability. The number of experimental texts and test sets and the variety of events still need to be expanded, and more weighting strategies should be considered to further improve summary accuracy. The experimental results and test feedback indicate that the proposed method is feasible and effective: it meets users' needs for summary information on hot events and improves the accuracy of micro-blog summary extraction.
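The time, summary keywords, summary micro-blog form described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Post` fields, the `relevance` measure (share of the day's keywords found in a post) and the score (reposts plus comments, weighted by relevance) are hypothetical stand-ins for the communication value and topic relevance criteria, and the per-day keywords are assumed to have been extracted beforehand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Post:
    date: str       # posting date, e.g. "2016-03-01"
    text: str       # micro-blog text
    reposts: int    # assumed proxy for communication value
    comments: int   # assumed proxy for communication value


def relevance(text, keywords):
    """Share of the day's keywords that occur in the post (assumed measure)."""
    if not keywords:
        return 0.0
    return sum(1 for k in keywords if k in text) / len(keywords)


def build_timeline(posts, keywords_by_date):
    """Return a list of (date, keywords, summary post text), ordered by date."""
    by_date = defaultdict(list)
    for post in posts:
        by_date[post.date].append(post)

    timeline = []
    for date in sorted(by_date):
        keywords = keywords_by_date.get(date, [])
        # Hypothetical score: spread (reposts + comments) weighted by relevance.
        best = max(
            by_date[date],
            key=lambda p: (p.reposts + p.comments) * relevance(p.text, keywords),
        )
        timeline.append((date, keywords, best.text))
    return timeline
```

Under this scoring, a widely reposted but off-topic post receives a score of zero whenever another post on the same day matches at least one keyword, so spread alone never decides which post represents a day.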