Improvement of Text Feature Weighting Method Based on TF-IDF Algorithm

  • Lu Yonghe,
  • Li Yanfeng
  • School of Information Management, Sun Yat-sen University, Guangzhou 510006

Received date: 2012-10-12

Revised date: 2012-12-24

Online published: 2013-02-05

Abstract

Considering both the importance of a feature and its ability to distinguish between categories, this paper first analyzes the shortcomings of the traditional TF-IDF algorithm and its improved variants, studies how feature weights are calculated in text categorization, and proposes a new function, TW, to correct a feature's weight. Second, comparative experiments on the CHI and TW values of terms show that TW increases the weight of features that are characteristic of a particular class and decreases the weight of features that are common but unimportant. Finally, the paper develops a new feature weighting algorithm that combines TW with TF-IDF and compares it with other methods in classification experiments on a Chinese text classification corpus to verify the validity of the new algorithm.

Cite this article

Lu Yonghe, Li Yanfeng. Improvement of Text Feature Weighting Method Based on TF-IDF Algorithm[J]. Library and Information Service, 2013, 57(03): 90-95. DOI: 10.7536/j.issn.0252-3116.2013.03.017

