[Purpose/significance] To help readers understand the context of news events on micro-blog platforms, and to improve the readability and accuracy of micro-blog event summaries, we propose a method that extracts event summaries organized along a time axis and based on event elements. [Method/process] Based on the characteristics of micro-blog text, we weigh the advantages and disadvantages of LDA and the mutual information maximum entropy model (MaxEnt-MI) and combine the two to extract event summary keywords. We then screen micro-blog posts by communication value and theme relevance, and generate the event summary in the form time-keywords-micro-blog. [Result/conclusion] Compared with the traditional TextRank method on a manually labeled test set, the F value increases by 8% to 13%, and internal tests show that the readability of the summaries is significantly improved. The experimental and test results show that the proposed method is feasible and effective: it meets users' needs for summary information on hot events and improves the accuracy of micro-blog summary extraction. The number of experimental texts and test sets and the richness of the events still need to be expanded, and more weighting strategies should be considered to further improve summary accuracy.
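The F value reported above combines precision and recall against the manually labeled test set. The abstract does not state the matching unit used in the evaluation, so the following is only a minimal sketch that assumes post-level matching; the function name `f_measure` and the post IDs are hypothetical and not taken from the paper.

```python
def f_measure(selected, reference, beta=1.0):
    """Return (precision, recall, F) for selected post IDs against a labeled reference set.

    Assumption: a selected post counts as correct only if its ID also
    appears in the manually labeled reference summary.
    """
    selected, reference = set(selected), set(reference)
    if not selected or not reference:
        return 0.0, 0.0, 0.0
    hits = len(selected & reference)
    precision = hits / len(selected)
    recall = hits / len(reference)
    if hits == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical post IDs, for illustration only.
system_summary = ["p1", "p3", "p4", "p7"]
labeled_summary = ["p1", "p2", "p3", "p4", "p5"]
precision, recall, f1 = f_measure(system_summary, labeled_summary)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With `beta=1.0` this is the balanced F1 score; whether the paper weights precision and recall equally is likewise an assumption.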
Li Gang, Xu Wei, Wang Xinping. Hot Event Summary on Micro-blog Generated by Multi Model Based on Event Elements[J]. Library and Information Service, 2018, 62(1): 96-105.
DOI: 10.13266/j.issn.0252-3116.2018.01.013