A Compound Word Based Algorithm for Hot Event Detection and Description on the Web

  • Li Xia,
  • Wang Lianxi,
  • Lu Meixiu,
  • Liu Hanfeng,
  • Liu Junyan
  • 1 Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510006;
    2 School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006;
    3 Guangdong University of Foreign Studies Library, Guangzhou 510006

Received date: 2016-05-13

  Revised date: 2016-11-15

  Online published: 2016-12-05

Abstract

[Purpose/significance] Automatically detecting hot events on the Web (in news and microblogs) and extracting descriptive words for them is important for monitoring online public opinion. [Method/process] Current methods for extracting descriptive words rely mainly on association rules or on combinations of multiple n-grams, which often produce noise words with imprecise meaning and can cause meaning drift. This paper proposes a compound-word-based feature extraction method and uses it to represent news texts. The texts are then clustered under a vector space model to detect hot events on the Web. [Result/conclusion] Experimental results on Tencent Internet News show that the proposed method achieves higher clustering precision and recall and produces better descriptive words.
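
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of clustering texts under a vector space model and reading descriptive words off each cluster. It assumes compound words have already been extracted and are passed in as tokens; the TF-IDF weighting, the single-pass clustering strategy, the 0.3 threshold, and all function names (tfidf_vectors, cosine, single_pass_cluster, describe) are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into sparse {term: weight} vectors (assumed TF-IDF weighting)."""
    n = len(docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # The "+ 1" keeps terms that occur in every document at a positive weight.
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(vectors, threshold=0.3):
    """Assign each vector to its most similar cluster, or open a new cluster.

    The threshold value is a hypothetical choice for this toy example.
    """
    clusters = []  # each cluster: {"centroid": summed vector, "members": doc indices}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            best["members"].append(i)
            # Summing is enough: cosine similarity ignores vector length.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
    return clusters


def describe(cluster, k=3):
    """Return the k highest-weighted terms of a cluster as its descriptive words."""
    ranked = sorted(cluster["centroid"].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy input: each document is a list of already-extracted compound words (made up).
docs = [
    ["earthquake_relief", "rescue_team", "sichuan"],
    ["earthquake_relief", "sichuan", "donation"],
    ["stock_market", "interest_rate"],
    ["stock_market", "interest_rate", "central_bank"],
]
clusters = single_pass_cluster(tfidf_vectors(docs))
# Larger clusters stand in for "hotter" events in this sketch.
for cluster in sorted(clusters, key=lambda c: -len(c["members"])):
    print(cluster["members"], describe(cluster, k=2))
```

In this sketch, cluster size is used as a stand-in for event hotness, and descriptive words are simply the heaviest terms in the cluster's summed vector. A real system would likely also weigh time and source information.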

Cite this article

Li Xia, Wang Lianxi, Lu Meixiu, Liu Hanfeng, Liu Junyan. A Compound Word Based Algorithm for Hot Event Detection and Description on the Web[J]. Library and Information Service, 2016, 60(23): 128-134. DOI: 10.13266/j.issn.0252-3116.2016.23.016

