Knowledge Organization

Research on the Relatedness of Domain Concepts Based on Chinese Wikipedia (基于中文维基百科的领域概念相关性研究)

  • Wang Juan (王娟),
  • Cao Shujin (曹树金),
  • Jiang Lingmin (姜灵敏),
  • Hu Qing (胡青)
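  • 1. School of Information Management, Sun Yat-Sen University;
    2. Cisco School of Informatics, Guangdong University of Foreign Studies;
    3. College of Information Science and Technology, Dalian Maritime University
Wang Juan, associate professor, Cisco School of Informatics, Guangdong University of Foreign Studies, E-mail: misisipiwj@126.com; Cao Shujin, dean of the School of Information Management, Sun Yat-Sen University, doctoral supervisor; Jiang Lingmin, dean of the Cisco School of Informatics, Guangdong University of Foreign Studies, master's supervisor; Hu Qing, lecturer, College of Information Science and Technology, Dalian Maritime University.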

Received date: 2014-10-10

  Revised date: 2014-11-20

  Online published: 2014-12-05

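Funding

This paper is one of the research outcomes of the National Social Science Fund of China major project "Research on Knowledge Organization and Navigation Mechanisms for Domain-specific Web Resources" (project No. 12&ZD222) and the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Research on Sentiment Orientation Analysis of Hot Topics in Light Blogs" (project No. 12YJC870023).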

Research on Semantic Relatedness of Domain-specific Concepts Based on Chinese Wikipedia

  • Wang Juan ,
  • Cao Shujin ,
  • Jiang Lingmin ,
  • Hu Qing
  • 1. School of Information Management, Sun Yat-Sen University, Guangzhou 510275;
    2. Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510420;
    3. Computer Science Fundamentals Lab of Information Science and Technology College, Dalian Maritime University, Dalian 116026

Received date: 2014-10-10

  Revised date: 2014-11-20

  Online published: 2014-12-05

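Abstract

Aiming to improve the accuracy of relatedness judgments for domain concepts, this paper proposes a method for computing semantic relatedness between concepts that combines the category hierarchy of Chinese Wikipedia with the explanatory text of concept articles. Concepts from the library and information science domain under the Chinese Wikipedia category system are taken as experimental objects. A weighted algorithm based on both category information and text information is compared with the semantic distance algorithm and the information content algorithm, which use category information alone, and with the text overlap algorithm, which uses text information alone. The experimental results show that the weighted algorithm achieves better results and can provide important technical support for applications such as domain-oriented information retrieval and domain ontology construction.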

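Cite this article

Wang Juan, Cao Shujin, Jiang Lingmin, Hu Qing. Research on Semantic Relatedness of Domain-specific Concepts Based on Chinese Wikipedia[J]. Library and Information Service, 2014, 58(23): 136-142. DOI: 10.13266/j.issn.0252-3116.2014.23.021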

Abstract

In order to improve the accuracy of computing the relatedness of domain-specific concepts, this paper proposes a new semantic relatedness algorithm that uses the Chinese Wikipedia category architecture and concept interpretation content. Concepts in library and information science within the concept hierarchy of Chinese Wikipedia are taken as experimental objects, and a weighted algorithm based on category and text information is compared with algorithms based only on Chinese Wikipedia categories, such as Relwup and Relseco, or only on Chinese Wikipedia articles, such as Relstr. The experimental results show that the weighted algorithm performs better than the others and can provide important technical support for applications such as domain-oriented information retrieval and the construction of domain ontologies.
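The abstract does not give the formulas or weights used, so the following is only a minimal sketch of the general idea: a category-based score and a text-based score combined by a weight. It assumes a Wu-Palmer style depth measure over a small hypothetical category tree and a simple token-overlap (Dice) score as a stand-in for text overlap. The tree, the function names, and the weight `alpha` are illustrative assumptions, not the authors' implementation or data.

```python
# Hypothetical toy category tree (child -> parent); not data from the paper.
PARENT = {
    "library_and_information_science": None,
    "information_retrieval": "library_and_information_science",
    "knowledge_organization": "library_and_information_science",
    "search_engine": "information_retrieval",
    "ontology": "knowledge_organization",
    "thesaurus": "knowledge_organization",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root (node first)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest category that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError("concepts do not share a root")


def rel_category(a, b):
    """Wu-Palmer style score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def rel_text(text_a, text_b):
    """Dice coefficient over token sets, a simplified text-overlap score.

    Assumes the texts are already segmented into space-separated tokens.
    """
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def rel_weighted(a, b, text_a, text_b, alpha=0.5):
    """Linear combination; alpha is an assumed weight, not the paper's value."""
    return alpha * rel_category(a, b) + (1 - alpha) * rel_text(text_a, text_b)
```

With this toy tree, `rel_category("ontology", "thesaurus")` is higher than `rel_category("ontology", "search_engine")` because the first pair shares a deeper common category. In practice `alpha` would be tuned against human relatedness judgments.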

References

[1] Jiang J J, Conrath D W. Semantic similarity based on corpus statistics and lexical taxonomy[C]//Proceedings of International Conference Research on Computational Linguistics. Taipei: Association for Computational Linguistics,1997:13-33.
[2] Church K, Hanks P. Word association norms, mutual information, and lexicography[J]. Computational Linguistics, 1990,16(1):22-29.
[3] Cilibrasi R L, Vitanyi P M B. The Google similarity distance[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(3):370-383.
[4] Landauer T K, Foltz P W, Laham D. An introduction to latent semantic analysis[J]. Discourse Processes, 1998, 25(2/3): 259-284.
[5] Fellbaum C. WordNet: An electronic lexical database[M]. Cambridge: MIT Press, 1998:18-19.
[6] Jarmasz M, Szpakowicz S. Roget's thesaurus and semantic similarity[C]//Proceedings of RANLP. Borovets, Bulgaria:Association for Computational Linguistics, 2003:212-219.
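[7] Liu Qun, Li Sujian. Word semantic similarity computation based on HowNet[J]. Computational Linguistics and Chinese Language Processing, 2002,7(2):59-76.
[8] Tian Jiule, Zhao Wei. A word similarity computation method based on Tongyici Cilin[J]. Journal of Jilin University (Information Science Edition), 2010,28(6):602-608.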
[9] Strube M, Ponzetto S P. WikiRelate! Computing semantic relatedness using Wikipedia[C]//Proceedings of AAAI. Boston: American Association for Artificial Intelligence, 2006: 1419-1424.
[10] Gabrilovich E, Markovitch S. Computing semantic relatedness using Wikipedia-based explicit semantic analysis[C]// Proceedings of IJCAI. Hyderabad, India:American Association for Artificial Intelligence, 2007:1606-1611.
[11] Zesch T, Gurevych I. Analysis of the Wikipedia category graph for NLP applications[C]//Proceedings of TextGraphs-2 Workshop NAACL-HLT. Rochester:Association for Computational Linguistics, 2007:1-8.
[12] Milne D, Witten I H. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links[C]//Proceedings of AAAI Workshop on Wikipedia and Artificial Intelligence. Chicago:American Association for Artificial Intelligence, 2008: 25-3.
[13] Halavais A, Lackaff D. An analysis of topical coverage of Wikipedia[J]. Journal of Computer-Mediated Communication, 2008,13(2): 429-440.
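[14] Wikimedia Foundation. Special page: Statistics[EB/OL]. [2014-04-09]. http://zh.wikipedia.org/wiki/Wikipedia.
[15] Li Yun. Research on semantic knowledge mining based on Chinese Wikipedia[D]. Beijing: Beijing University of Posts and Telecommunications, 2009.
[16] Wang Xiang. Research and implementation of semantic relatedness computation based on Chinese Wikipedia[D]. Changsha: National University of Defense Technology, 2011.
[17] Tu Xinhui, Zhang Hongchun, Zhou Kunfeng, et al. Structured information extraction from Chinese Wikipedia and a word relatedness computation method[J]. Journal of Chinese Information Processing, 2012, 26(3):109-115.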
[18] Ponzetto S P, Strube M. WikiTaxonomy: A large scale knowledge resource[C]//Proceedings of ECAI. Patras:European Coordinating Committee for AI, 2008:751-752.
[19] Rada R, Mili H, Bicknell E, et al. Development and application of a metric to semantic nets[J]. IEEE Transactions on Systems, Man and Cybernetics, 1989,19(1):17-30.
[20] Wu Zhibiao, Palmer M. Verb semantics and lexical selection[C]//Proceedings of ACL. Las Cruces:Association for Computational Linguistics, 1994:133-138.
[21] Resnik P. Using information content to evaluate semantic similarity[C]//Proceedings of the IJCAI. Montreal:American Association for Artificial Intelligence, 1995: 448-453.
[22] Seco N, Veale T, Hayes J. An intrinsic information content metric for semantic similarity in WordNet[C]//Proceedings of ECAI. Valencia:European Coordinating Committee for AI, 2004:1089-1090.
[23] Lesk M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone[C]//Proceedings of 5th Annual International Conference on Systems Documentation. Toronto:Association of Computing Machinery, 1986:24-26.
[24] Banerjee S, Pedersen T. Extended gloss overlap as a measure of semantic relatedness[C]//Proceedings of IJCAI. Acapulco:American Association for Artificial Intelligence, 2003:805-810.
[25] Wikipedia. Category: Page categories[EB/OL]. [2014-04-09]. http://zh.wikipedia.org/wiki/Category:%E9%A0%81%E9%9D%A2%E5%88%86%E9%A1%9E.
[26] Zhang Huaping. ICTCLAS Chinese word segmentation system[EB/OL]. [2014-04-09]. http://ictclas.nlpir.org.
[27] Budanitsky A, Hirst G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures[C]//Proceedings of NAACL Workshop on WordNet and Other Lexical Resources. Pittsburgh:Association for Computational Linguistics, 2001:29-34.
[28] Spearman C. "General Intelligence" objectively determined and measured[J]. The American Journal of Psychology, 1904,15(2):201-293.
