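Knowledge Organization

Importance Based Entity Ranking for News Documents

  • Lu Na,
  • Zhou Pengcheng,
  • Wu Chuan
  • 1. School of Information Science and Technology, Hainan Normal University, Haikou 571158;
    2. School of Information Management, Wuhan University, Wuhan 430072
Lu Na (ORCID: 0000-0002-5262-5286), associate professor, master's degree, E-mail: 40303405@qq.com; Zhou Pengcheng (ORCID: 0000-0002-5954-6863), master's student; Wu Chuan (ORCID: 0000-0002-8784-0808), doctoral student.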

Received date: 2018-01-02

  Revised date: 2018-03-02

  Online published: 2018-06-05

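Funding

This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models" (Grant No. 71173164) and the National Natural Science Foundation of China Regional Science Fund project "Research on Automatic Aggregation Methods for Negotiated Tourism Demand Based on Demand Communities" (Grant No. 71762010).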


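Abstract

[Purpose/significance] Most existing research on entities in news documents is document-centered or entity-centered, as in text categorization and entity linking, and few studies address how important an entity is within a text. This study investigates importance-based entity ranking for news documents. [Method/process] Given a document, the method judges the importance of each entity relative to that document and ranks the entities accordingly. Experiments are conducted on the Sogou full-web news dataset, and the ranking results are evaluated with two metrics, NDCG and the inverse pair rate. [Result/conclusion] The experimental results show that the methods based on entity frequency, TF*IDF, information entropy, and TextRank, as well as the ensemble method, all perform well, while the method based on the clustering coefficient performs only moderately. The TF*IDF-based method reaches an NDCG of 95.86%, the best result on that metric; the ensemble method reaches an inverse pair rate of 84.46%, the best result on that metric.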
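As a rough illustration of ranking entities by TF*IDF, the following Python sketch scores each entity in one document against a small corpus. The abstract does not give the paper's exact weighting formula, so the smoothed IDF used here, the function name `rank_entities_tfidf`, and the toy corpus are all assumptions made for the example, not the authors' implementation.

```python
import math
from collections import Counter


def rank_entities_tfidf(doc_entities, corpus_entities):
    """Rank the entities of one document by a TF*IDF score.

    doc_entities: list of entity mentions in the target document.
    corpus_entities: list of entity-mention lists, one per corpus document.
    Returns (entity, score) pairs sorted from most to least important.
    """
    n_docs = len(corpus_entities)

    # Document frequency: how many corpus documents mention each entity.
    df = Counter()
    for entities in corpus_entities:
        df.update(set(entities))

    tf = Counter(doc_entities)
    total_mentions = len(doc_entities)

    scores = {}
    for entity, count in tf.items():
        # Smoothed IDF (an assumed variant): stays positive and finite
        # even for entities that occur in every document.
        idf = math.log((1 + n_docs) / (1 + df[entity])) + 1.0
        scores[entity] = (count / total_mentions) * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus: each inner list holds the entities of one document.
corpus = [
    ["Beijing", "Sogou", "Beijing", "Sogou"],
    ["Beijing", "Wuhan University"],
    ["Beijing", "Haikou"],
]
ranking = rank_entities_tfidf(corpus[0], corpus)
for entity, score in ranking:
    print(f"{entity}\t{score:.3f}")
```

In this toy corpus both entities are mentioned equally often in the first document, so the rarer entity ("Sogou") ranks above the one that appears in every document ("Beijing").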

Cite this article

Lu Na, Zhou Pengcheng, Wu Chuan. Importance Based Entity Ranking for News Documents[J]. Library and Information Service, 2018, 62(11): 97-102. DOI: 10.13266/j.issn.0252-3116.2018.11.011

Abstract

[Purpose/significance] We propose an importance-based method for entity ranking. The entities in a given document differ in importance. Many studies focus on documents or on entities, as in text categorization and entity linking, while few address the importance of entities within documents. This research has both theoretical and practical value. [Method/process] Given a document consisting of words and entities, our method computes the importance of each entity relative to the document and then ranks the entities by that importance. We perform experiments on the Sogou News dataset and evaluate the results with two metrics, NDCG and the inverse pair rate. [Result/conclusion] Experimental results show that the methods based on entity frequency, TF*IDF, distribution entropy, and TextRank perform well, while the method based on the clustering coefficient performs less well. The TF*IDF method achieves the best NDCG, 95.86%, and the ensemble method achieves the best inverse pair rate, 84.46%.
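To make the two evaluation metrics concrete, here is a minimal Python sketch. NDCG is computed in one common form (linear gain with a log2 position discount). The abstract does not define the inverse pair rate, so `ordered_pair_rate` below is only an assumed reading: the fraction of entity pairs with different gold grades that the predicted ranking orders correctly, so that higher is better. The function names and gold grades are hypothetical.

```python
import math
from itertools import combinations


def dcg(relevances):
    """Discounted cumulative gain; 0-based position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked_entities, gold):
    """NDCG of a predicted ranking, given gold grades (dict: entity -> grade)."""
    gains = [gold.get(entity, 0) for entity in ranked_entities]
    ideal = sorted(gold.values(), reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def ordered_pair_rate(ranked_entities, gold):
    """Share of pairs with different gold grades that the ranking orders correctly.

    This is an assumed definition; the paper's exact formula is not given
    in the abstract.
    """
    correct = 0
    total = 0
    # combinations() preserves list order, so `a` is always ranked above `b`.
    for a, b in combinations(ranked_entities, 2):
        grade_a, grade_b = gold.get(a, 0), gold.get(b, 0)
        if grade_a == grade_b:
            continue  # ties in the gold standard carry no ordering information
        total += 1
        if grade_a > grade_b:
            correct += 1
    return correct / total if total else 1.0


# Hypothetical gold importance grades for three entities in one document.
gold = {"A": 3, "B": 2, "C": 1}
predicted = ["B", "A", "C"]
print(f"NDCG: {ndcg(predicted, gold):.4f}")
print(f"Correctly ordered pairs: {ordered_pair_rate(predicted, gold):.4f}")
```

Under this reading, a perfect ranking scores 1.0 on both metrics, and swapping the top two entities leaves two of the three pairs correctly ordered.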

Author contributions: Lu Na: research design, experiments, and drafting and revising the paper; Zhou Pengcheng: design and revision of the research plan, and revising the paper; Wu Chuan: assisted with the research design and took part in revising the paper.