Library and Information Service ›› 2013, Vol. 57 ›› Issue (15): 132-136, 82. DOI: 10.7536/j.issn.0252-3116.2013.15.021

• Knowledge Organization •

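Text Classification Algorithm Based on Concept Clusters

Ma Jialin, Liu Jinling, Jin Chunxia

  1. School of Computer Engineering, Huaiyin Institute of Technology, Huaian 223003
  • Received: 2013-06-05 Revised: 2013-07-16 Online: 2013-08-05 Published: 2013-08-05
  • About the authors: Ma Jialin, laboratory technician, School of Computer Engineering, Huaiyin Institute of Technology, master's degree, E-mail: majialin@126.com; Liu Jinling, professor and master's supervisor, School of Computer Engineering, Huaiyin Institute of Technology; Jin Chunxia, associate professor, School of Computer Engineering, Huaiyin Institute of Technology.
  • Supported by:
    This paper is one of the research outcomes of the Jiangsu Provincial Department of Education University Philosophy and Social Science Project "Research on Personalized Information Service Models for Network Resources" (No. 2012SJD870001).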

Abstract: Under the vector space model, traditional text classification algorithms suffer from high-dimensional, sparse vectors and ignore the semantic correlation between features, which leads to low classification efficiency and poor accuracy. Taking HowNet as the knowledge base, this paper builds a Semantic Concept Vector Model (SCVM) to represent text, merging synonyms and disambiguating polysemous words according to concept semantics and context. It then proposes TCABCC (Text Classification Algorithm Based on the Concept of Clusters), which improves on traditional KNN by representing the training samples of each category as concept clusters, so that similarity is computed between a text's concept vector and the category concept clusters. Experimental results show that the classifier built with this algorithm is considerably better than traditional KNN in both efficiency and performance.

Key words: text classification, semantic concept vector, concept cluster, KNN, HowNet

CLC Number:
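The abstract describes the approach only at a high level, so the following Python sketch illustrates the general idea rather than the authors' TCABCC implementation. It assumes a tiny hand-written word-to-concept table (`WORD_TO_CONCEPT`, hypothetical) in place of a real HowNet lookup, represents each category by the summed concept counts of its training texts as a stand-in for a "concept cluster", and uses cosine similarity, which the abstract does not specify. Disambiguation of polysemous words is omitted.

```python
import math
from collections import Counter

# Hypothetical toy mapping standing in for a HowNet concept lookup.
WORD_TO_CONCEPT = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medicine", "physician": "medicine", "hospital": "medicine",
}

def concept_vector(tokens):
    """Count concepts instead of words, so synonyms share one dimension."""
    return Counter(WORD_TO_CONCEPT.get(tok, tok) for tok in tokens)

def build_clusters(training):
    """Aggregate the concept vectors of each category into one cluster.

    `training` is an iterable of (tokens, label) pairs.
    """
    clusters = {}
    for tokens, label in training:
        clusters.setdefault(label, Counter()).update(concept_vector(tokens))
    return clusters

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(tokens, clusters):
    """Return the category whose cluster is most similar to the text."""
    vec = concept_vector(tokens)
    return max(clusters, key=lambda label: cosine(vec, clusters[label]))

training = [
    (["car", "truck"], "auto"),
    (["doctor", "hospital"], "health"),
]
clusters = build_clusters(training)
print(classify(["automobile"], clusters))
```

Because a new text is compared with one aggregate per category instead of with every training sample, the number of similarity computations drops from the size of the training set to the number of categories, which is consistent with the efficiency gain over traditional KNN reported in the abstract.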