Library and Information Service ›› 2013, Vol. 57 ›› Issue (03): 90-95. DOI: 10.7536/j.issn.0252-3116.2013.03.017

• Knowledge Organization •

A Text Feature Term Weight Calculation Method Based on an Improved TF-IDF Algorithm

Lu Yonghe, Li Yanfeng

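  1. School of Information Management, Sun Yat-sen University
  • Received: 2012-10-12  Revised: 2012-12-24  Publication date: 2013-02-05  Release date: 2013-02-05
  • About the authors: Lu Yonghe, associate professor, School of Information Management, Sun Yat-sen University, E-mail: zsuluyonghe@163.com; Li Yanfeng, master's student, School of Information Management, Sun Yat-sen University.
  • Supported by:

    This paper is one of the research outcomes of the National High Technology Research and Development Program of China (863 Program) project "Multi-source Information Sensing Technology and Product Development for the Whole Agricultural Product Supply Chain" (grant No. 2012AA101701) and the Guangdong Province Philosophy and Social Sciences Twelfth Five-Year Plan project "An Empirical Study of the Information Needs of Chinese Farmers and Their Acquisition Channels" (grant No. GD11CTS04).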

Improvement of Text Feature Weighting Method Based on TF-IDF Algorithm

Lu Yonghe, Li Yanfeng   

  1. School of Information Management, Sun Yat-sen University, Guangzhou 510006
  • Received: 2012-10-12  Revised: 2012-12-24  Online: 2013-02-05  Published: 2013-02-05

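Abstract:

First, from the perspective of feature term importance and category discrimination ability, this paper analyzes the traditional weighting function TF-IDF (term frequency-inverse document frequency) and its related improved algorithms, studies feature weight calculation during vectorization in text classification, and constructs a weight correction function, TW. Second, comparative experiments on the chi-square distribution of feature words and on TW verify that TW raises the weights of vocabulary specific to a category and lowers the weights of features that are common but unimportant for classification. Finally, TW is combined with TF-IDF as a new feature weighting algorithm, and its effectiveness is verified through classification experiments on a Chinese classification corpus, in comparison with other weighting algorithms.

Keywords: text classification, TF-IDF, feature weight, category discrimination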

Abstract:

Firstly, based on the importance of features and their ability to distinguish categories, this paper analyzes the disadvantages of traditional TF-IDF and its related improved algorithms, studies how to calculate feature weights in text categorization, and develops a new function, TW, to correct feature weights. Secondly, comparative experiments on the terms' CHI and TW values show that TW can increase the weight of class-specific features and decrease the weight of common but unimportant features. Finally, this paper develops a new feature weighting algorithm that combines TW with TF-IDF and compares it with other methods in classification experiments on a Chinese classification corpus, in order to verify the validity of the new algorithm.

Key words: text categorization, term frequency-inverse document frequency (TF-IDF), feature weighting, category distinguishing
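
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of traditional TF-IDF weighting. It assumes the common textbook form (term frequency normalized by document length, IDF as log(N/df)); the paper's own formula may differ in detail. The function name `tf_idf` and the toy documents are hypothetical. The correction function TW is not defined on this page, so it is not reproduced here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook form: tf = count / document length,
    idf = log(N / df). Not necessarily the paper's exact variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["stock", "market", "news"],
    ["football", "match", "news"],
    ["stock", "fund", "news"],
]
w = tf_idf(docs)
```

In this toy corpus, "news" occurs in every document and so receives zero weight, while "football", which occurs in only one document, receives the largest weight. Note that this baseline never looks at category labels; it only counts documents.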

CLC Number: