Text Classification Algorithm Based on Concept Clusters

  • Ma Jialin,
  • Liu Jinling,
  • Jin Chunxia
  • School of Computer Engineering, Huaiyin Institute of Technology, Huaian 223003

Received date: 2013-06-05

  Revised date: 2013-07-16

  Online published: 2013-08-05

Abstract

Traditional text classification algorithms built on the vector space model suffer from high dimensionality, sparsity, and neglect of the semantic correlation between keywords, which easily leads to low efficiency and poor classification quality. Using HowNet as the knowledge repository, this paper develops a semantic concept vector model to represent text, merging synonyms and disambiguating polysemous words according to their semantic concepts and context. It then proposes TCABCC, a text classification algorithm based on concept clusters that improves on KNN: concept clusters represent the training samples of each category, and similarity is calculated between the text concept vector and the category concept clusters. Experimental results show that a classifier built with this algorithm is substantially more efficient and performs better than traditional KNN.
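
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' TCABCC implementation. The tiny CONCEPT_OF dictionary is a hypothetical stand-in for a HowNet lookup, each category's "concept cluster" is assumed to be the summed concept counts of its training texts, and cosine similarity is an assumed similarity measure. All function and variable names are illustrative.

```python
import math
from collections import Counter

# Hypothetical word -> concept map; a stand-in for a HowNet lookup (assumption).
CONCEPT_OF = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medic", "physician": "medic", "nurse": "medic",
}


def concept_vector(tokens):
    """Map tokens to concepts, so synonyms merge into one dimension, and count them."""
    return Counter(CONCEPT_OF.get(token, token) for token in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_clusters(training):
    """Collapse each category's training texts into one concept vector.

    Assumption: one summed vector per category plays the role of the
    category's concept cluster.
    """
    clusters = {}
    for tokens, label in training:
        clusters.setdefault(label, Counter()).update(concept_vector(tokens))
    return clusters


def classify(tokens, clusters):
    """Return the category whose cluster is most similar to the text."""
    vector = concept_vector(tokens)
    return max(clusters, key=lambda label: cosine(vector, clusters[label]))


# Toy training data, invented purely for illustration.
TRAINING = [
    (["car", "engine", "road"], "transport"),
    (["truck", "road", "fuel"], "transport"),
    (["doctor", "hospital", "patient"], "health"),
    (["nurse", "patient", "clinic"], "health"),
]
CLUSTERS = build_clusters(TRAINING)

print(classify(["automobile", "fuel"], CLUSTERS))
print(classify(["physician", "clinic"], CLUSTERS))
```

In this sketch a new text is compared against one vector per category instead of against every training sample, as plain KNN would do. That reduction in comparisons is one plausible source of the efficiency gain the abstract reports.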

Cite this article

Ma Jialin, Liu Jinling, Jin Chunxia. Text Classification Algorithm Based on Concept Clusters[J]. Library and Information Service, 2013, 57(15): 132-136, 82. DOI: 10.7536/j.issn.0252-3116.2013.15.021
