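Received: 2014-07-10
Revised: 2014-09-05
Published online: 2014-10-30
Funding
This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models" (Grant No. 71173164) and the National Social Science Foundation of China Youth Project "Research on the Dynamic Regulation Mechanism of Emergency Management for Online Public Opinion Events Based on Scenario Analysis" (Grant No. 13CGL132).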
Automatic Identification of News Intent Based on Analyzing Query Features
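Sample queries are selected from the Sogou query log and labeled manually. Based on an analysis of the labeled news queries, new features for identifying news intent are proposed: query expression features, features of the query distribution over time, and clicked-result features. Using these three features, a decision tree classifier is applied to identify news intent in queries automatically. The results show that: (1) the goals of news queries concentrate on information about a specific topic and on entertainment information, and their topics are mostly entertainment, politics, sports, and economy; (2) compared with non-news queries, news queries are more likely to contain entities, fluctuate more in their distribution over time, and have higher similarity among their clicked results; (3) the method identifies news intent in queries reasonably well, with macro-averaged precision, recall, and F-score of 0.76, 0.73, and 0.74, respectively.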
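Zhang Xiaojuan (张晓娟), Lu Wei (陆伟), Lei Shengwei (雷声伟). Automatic Identification of News Intent Based on Analyzing Query Features[J]. Library and Information Service (图书情报工作), 2014, 58(20): 82-90. DOI: 10.13266/j.issn.0252-3116.2014.20.013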
This paper selects sample queries from the Sogou query log and has them labeled by human annotators. Based on an analysis of the labeled news queries, we propose three novel features for news intent prediction: query expression, query distribution over time, and clicked results. We then apply a decision tree classifier to identify news queries automatically. Experimental results show that: (1) the goals of news queries are mainly to obtain information on a particular topic or entertainment information, and the search topics of news queries tend to be entertainment, economy, politics, and sports; (2) compared with non-news queries, news queries are more likely to contain named entities, show larger fluctuation in their distribution over time, and have a higher degree of similarity among clicked results; (3) encouraging results are achieved for news intent identification, with precision, recall, and F-score for the query classification of 0.76, 0.73, and 0.74, respectively.
Key words: query intent; news queries; news intent; query classification
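As an illustration of how the reported precision, recall, and F-score could be macro-averaged over the news and non-news classes, here is a minimal Python sketch. The function name `macro_scores`, the label strings, and the toy data are all hypothetical. The abstract does not say how the macro F-score was computed, so averaging the per-class F1 values is an assumption.

```python
from collections import Counter


def macro_scores(gold, pred):
    """Return macro-averaged (precision, recall, F1).

    Assumption: macro F1 is the unweighted mean of per-class F1 values;
    the paper does not specify which macro-F variant it used.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with hypothetical labels (not data from the paper).
gold = ["news", "other", "other", "other"]
pred = ["news", "news", "other", "other"]
macro_p, macro_r, macro_f = macro_scores(gold, pred)
print(round(macro_p, 2), round(macro_r, 2), round(macro_f, 2))
```

Macro-averaging gives each class equal weight regardless of its size, which matters when news queries are a minority of the log.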