图书情报工作 (Library and Information Service), 2013, Vol. 57, Issue 11: 120-124. DOI: 10.7536/j.issn.0252-3116.2013.11.022

• Knowledge Organization •

A Chinese Short-Text Classification Algorithm Based on Wikipedia

赵辉, 刘怀亮   

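  1. School of Economics and Management, Xidian University
  • Received: 2013-04-03 Revised: 2013-05-19 Published: 2013-06-05 Released: 2013-06-05
  • About the authors: Zhao Hui, master's student, School of Economics and Management, Xidian University, E-mail: 15877344436@139.com; Liu Huailiang, professor and master's supervisor, School of Economics and Management, Xidian University.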

Classification Algorithm of Chinese Short Texts Based on Wikipedia

Zhao Hui, Liu Huailiang   

  1. Department of Economic Management, Xidian University, Xi’an 710071
  • Received: 2013-04-03 Revised: 2013-05-19 Online: 2013-06-05 Published: 2013-06-05

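Abstract:

Short texts contain few feature words and carry only a weak concept signal. To address this, the paper uses Wikipedia for feature expansion in support of Chinese short-text classification. Wikipedia concepts, links, and related information are used to extract the set of concepts related to each word and to compute the relatedness between concepts. Disambiguation pages, combined with the context of the short text, are used to resolve polysemy. Feature expansion is then carried out on the basis of the semantic relatedness between words, supplementing the semantic information of the text features. Finally, a Wikipedia-based classification algorithm for Chinese short texts is presented and verified experimentally. The results show that the algorithm effectively improves the classification of Chinese short texts.

Keywords: short text classification, Wikipedia, word sense disambiguation, feature expansion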

Abstract:

To address the scarcity of keywords and the weak concept signals in short texts, this paper proposes a Wikipedia-based feature extension method for classifying Chinese short texts. It extracts sets of related concepts and computes concept relatedness from Wikipedia concepts and interlinks, and it resolves polysemy by combining disambiguation pages with the context extracted from the short text. It then performs feature extension based on the semantic relatedness between words, supplementing the semantic feature information of the texts. Finally, the paper puts forward a Wikipedia-based classification algorithm for Chinese short texts and verifies it experimentally. The results show that the algorithm improves the classification of Chinese short texts.

Key words: short text classification, Wikipedia, word sense disambiguation, feature extension
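
The pipeline outlined in the abstract (concept relatedness from links, disambiguation against context, then feature extension) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy link graph `LINKS`, the sense table `SENSES`, the function names, and the use of Jaccard overlap of link sets as the relatedness measure. The abstract does not state the paper's actual relatedness formula, thresholds, or data structures, so this should not be read as the authors' implementation.

```python
from typing import Dict, List, Set

# Hypothetical toy link graph: concept -> concepts its page links to.
LINKS: Dict[str, Set[str]] = {
    "Apple Inc.": {"iPhone", "Steve Jobs", "Computer"},
    "Apple (fruit)": {"Fruit", "Orchard", "Tree"},
    "iPhone": {"Apple Inc.", "Smartphone", "Computer"},
    "Smartphone": {"iPhone", "Computer"},
    "Fruit": {"Apple (fruit)", "Tree"},
}

# Hypothetical stand-in for a disambiguation page: word -> candidate concepts.
SENSES: Dict[str, List[str]] = {"apple": ["Apple Inc.", "Apple (fruit)"]}


def relatedness(a: str, b: str) -> float:
    """Jaccard overlap of the two link sets (each set includes the concept itself).

    Assumed measure for illustration only, not the paper's formula.
    """
    links_a = LINKS.get(a, set()) | {a}
    links_b = LINKS.get(b, set()) | {b}
    return len(links_a & links_b) / len(links_a | links_b)


def disambiguate(word: str, context: List[str]) -> str:
    """Pick the candidate sense with the highest total relatedness to the context."""
    candidates = SENSES.get(word, [word])
    return max(candidates, key=lambda c: sum(relatedness(c, ctx) for ctx in context))


def expand(features: List[str], top_k: int = 2, threshold: float = 0.3) -> List[str]:
    """Append up to top_k related concepts per feature that clear the threshold."""
    expanded = list(features)
    for feature in features:
        candidates = [c for c in LINKS if c not in expanded]
        ranked = sorted(candidates, key=lambda c: relatedness(feature, c), reverse=True)
        for concept in ranked[:top_k]:
            if relatedness(feature, concept) >= threshold:
                expanded.append(concept)
    return expanded


if __name__ == "__main__":
    sense = disambiguate("apple", ["iPhone", "Smartphone"])
    print(sense)
    print(expand([sense]))
```

In this sketch the expanded feature list would then be passed to an ordinary text classifier; the choice of classifier is left open here because the abstract does not name one.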

CLC number: