Received date: 2013-04-07
Revised date: 2013-05-17
Online published: 2013-06-05
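Funding
This paper is one of the research outcomes of the National Natural Science Foundation of China Youth Project "Research on Key Technologies for Early Detection and Effective Control of False Information on Microblogs" (Grant No. 61202271) and the National Natural Science Foundation of China project "Research on Learning Algorithms for Imbalanced Data and Their Applications" (Grant No. 61070061).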
A Literature Review on Pre-processing and Learning of Microtext
Wang Lianxi. A Literature Review on Pre-processing and Learning of Microtext[J]. Library and Information Service (图书情报工作), 2013, 57(11): 125-131. DOI: 10.7536/j.issn.0252-3116.2013.11.023
Because the features of microtext are sparse and highly redundant, pre-processing and learning methods are key problems in data mining for microblogs and have a wide range of applications. This paper analyzes the characteristics of microtext and summarizes pre-processing and learning methods and their applications, including short text representation models, short text feature expansion and selection, short text classification and clustering, hot event detection, and automatic summarization. Finally, the paper discusses the limitations of recent studies and points out directions for future research.