[Purpose/significance] Entities in a given document differ in importance. Existing research related to entity ranking in news documents is mostly document-centered or entity-centered, such as text categorization and entity linking, and few studies address how important an entity is within a document. This study investigates importance-based entity ranking for news documents. [Method/process] Given a document, which consists of words and entities, the method estimates the importance of each entity relative to that document and ranks the entities accordingly. Experiments are conducted on the Sogou full-web news dataset, and the rankings are evaluated with two metrics, NDCG and the inverse pair ratio. [Result/conclusion] The experiments show that the methods based on entity frequency, TF*IDF, information entropy and TextRank, as well as the ensemble method, all perform well, while the method based on the clustering coefficient performs only moderately. The TF*IDF method reaches an NDCG of 95.86%, the best result on that metric; the ensemble method reaches an inverse pair ratio of 84.46%, the best result on that metric.
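The ranking-and-evaluation pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the relevance labels, the smoothed IDF formula, the linear NDCG gain, and the helper names (`tfidf_scores`, `ndcg`, `concordant_pair_ratio`) are all assumptions made for illustration. In particular, the abstract does not define the inverse pair ratio, so the sketch uses one plausible reading in which the score is the share of entity pairs whose predicted order agrees with the labeled order (higher is better).

```python
import math
from collections import Counter


def tfidf_scores(doc_entities, corpus_entities):
    """Score each entity of one document by TF*IDF over a small corpus."""
    n_docs = len(corpus_entities)
    # Document frequency: number of corpus documents that mention each entity.
    df = Counter()
    for doc in corpus_entities:
        df.update(set(doc))
    tf = Counter(doc_entities)
    total = len(doc_entities)
    # Smoothed IDF (an assumption) so entities found in every document keep a nonzero score.
    return {
        entity: (count / total) * (math.log((1 + n_docs) / (1 + df[entity])) + 1)
        for entity, count in tf.items()
    }


def ndcg(ranked, relevance):
    """NDCG of a ranked entity list, using linear gain (an assumption)."""
    gains = [relevance.get(entity, 0) for entity in ranked]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def concordant_pair_ratio(ranked, relevance):
    """Share of entity pairs ranked in the same order as their labels (ties skipped)."""
    agree = total = 0
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            a = relevance.get(ranked[i], 0)
            b = relevance.get(ranked[j], 0)
            if a == b:
                continue
            total += 1
            if a > b:
                agree += 1
    return agree / total if total else 1.0


# Hypothetical toy corpus: each document is the list of entity mentions found in it.
corpus = [
    ["Sogou", "Beijing", "Sogou", "Tencent"],
    ["Beijing", "Shanghai"],
    ["Beijing", "Tencent", "Shanghai"],
]
# Hypothetical human importance labels for the entities of the first document.
labels = {"Sogou": 2, "Beijing": 1, "Tencent": 0}

scores = tfidf_scores(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
ndcg_value = ndcg(ranked, labels)
pair_ratio = concordant_pair_ratio(ranked, labels)
print(ranked, round(ndcg_value, 4), round(pair_ratio, 4))
```

In this toy case TF*IDF places "Tencent" above "Beijing" because "Beijing" occurs in every document, which disagrees with the hypothetical labels; both metrics therefore fall below 1, showing how each one penalizes a misordered pair.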
Author contributions: Lu Na (陆娜): research design, experiments, drafting and revising the paper; Zhou Pengcheng (周鹏程): research design and its revision, revising the paper; Wu Chuan (武川): assisted with the research design and took part in revising the paper.