Received: 2013-06-07
Revised: 2013-07-16
Published online: 2013-08-05
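Funding
This paper is one of the research outcomes of the National Natural Science Foundation of China project "Intuitionistic Fuzzy Clustering Theory and Its Applications" (Grant No. 71071161) and the Natural Science Foundation of Jiangsu Province project "Research on Fuzzy Language Models" (Grant No. BK2012511).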
An Improved TF-IDF Method of Text Feature Selection Based on Category and Frequency
Liu Haifeng, Yu Lijun, Liu Shousheng. A Text Feature Selection Model Based on Category Distribution Information[J]. Library and Information Service, 2013, 57(15): 137-141. DOI: 10.7536/j.issn.0252-3116.2013.15.022
TF-IDF is a widely used text feature selection method. Building on the ideas behind that model, and taking the distribution of a feature within a class and across classes as the foundation, we propose a text feature selection model based on category distribution information by introducing within-class and between-class distribution weighting factors. In the new model, the TF part incorporates within-class document frequency information, while the IDF part incorporates between-class frequency information. In subsequent text classification experiments, average recall and precision increased by 6.4% and 7.8% respectively, and the F1 value increased by about 7%, which demonstrates the effectiveness of the proposed text feature selection model.
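The idea of adding class distribution information to TF-IDF can be illustrated with a minimal sketch. The abstract does not give the paper's actual weighting factors, so everything below is an assumption made for illustration: the function name `class_aware_tfidf`, the within-class factor (the share of documents in the document's own class that contain the term), the between-class factor (a smoothed logarithm of the number of classes divided by the number of classes containing the term), and the smoothed IDF are all hypothetical choices, not the authors' formulas.

```python
import math
from collections import Counter


def class_aware_tfidf(docs, labels):
    """Weight terms by TF-IDF scaled with class distribution factors.

    docs:   list of token lists, one per document
    labels: class label of each document, in the same order
    Returns one {term: weight} dict per document.
    NOTE: both class factors are illustrative assumptions, not the
    formulas from the paper.
    """
    n_docs = len(docs)
    class_size = Counter(labels)
    n_classes = len(class_size)

    # Document frequency overall and inside each class.
    df = Counter()
    df_by_class = {c: Counter() for c in class_size}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[label][term] += 1

    # Number of classes in which each term occurs at least once.
    classes_with_term = Counter()
    for counts in df_by_class.values():
        for term in counts:
            classes_with_term[term] += 1

    weighted = []
    for tokens, label in zip(docs, labels):
        tf = Counter(tokens)
        doc_weights = {}
        for term, count in tf.items():
            # TF part, scaled by within-class document frequency (assumed form).
            within = df_by_class[label][term] / class_size[label]
            tf_part = (count / len(tokens)) * within
            # IDF part (smoothed), scaled by a between-class factor (assumed form).
            between = math.log(1 + n_classes / classes_with_term[term])
            idf_part = math.log(1 + n_docs / df[term]) * between
            doc_weights[term] = tf_part * idf_part
        weighted.append(doc_weights)
    return weighted


# Toy corpus: "goal" is concentrated in one class, "the" occurs everywhere.
docs = [
    ["goal", "the", "match"],
    ["goal", "the", "team"],
    ["stock", "the", "market"],
    ["stock", "the", "bank"],
]
labels = ["sports", "sports", "finance", "finance"]
weights = class_aware_tfidf(docs, labels)
```

On this toy corpus, a term that appears in every document of one class and in no other class ("goal") receives a higher weight than a term spread evenly over all classes ("the"), which is the behavior the abstract attributes to the within-class and between-class factors.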