Data Sparseness Analysis and its Avoidance Strategies in the VSM Information Retrieval

Liang Shijin

doi:10.7536/j.issn.0252-3116.2013.01.025

Library and Information Service, 2013, Vol. 57, Issue 01: 142-146

Library and Information Center, City College of Dongguan University of Technology, Dongguan 523106

Received date: 2012-09-13

Revised date: 2012-11-25

Online published: 2013-01-05

Abstract

Taking matrix theory as its starting point, this paper recasts the vectors and sets of the vector space model (VSM) in matrix form and shows that similarity calculation based on the inner product of vectors is equivalent to the corresponding matrix multiplication. Drawing on the definitions of sparse matrix and data sparseness, it analyzes the causes of data sparseness in VSM information retrieval. It then discusses the consequence that data sparseness has in common under three circumstances: part of the time complexity of similarity calculation is meaningless. Finally, it proposes avoidance strategies at three levels: the text level, the text-set level, and the matrix level.

Keywords: vector space model; information retrieval; data sparseness; avoidance strategy
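
The equivalence stated in the abstract can be illustrated with a small sketch. It assumes documents and queries are stored as rows of term-weight matrices, so that the pairwise inner products equal the product of the document matrix and the transposed query matrix. The weights, the names DOCS and QUERIES, and the helper functions are invented for illustration and do not come from the paper. The sketch also counts how many scalar multiplications in the dense product involve at least one zero factor, which is one possible reading of the "meaningless" part of the time complexity.

```python
# Illustrative sketch only: all weights below are invented toy data,
# not values from the paper.

def inner_product(u, v):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain dense matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def count_multiplications(docs, queries):
    """Return (total, useful) scalar multiplications in the dense product.

    A multiplication is counted as useful only when both factors are nonzero.
    """
    total = useful = 0
    for d in docs:
        for q in queries:
            for a, b in zip(d, q):
                total += 1
                if a != 0 and b != 0:
                    useful += 1
    return total, useful


# Rows are documents, columns are index terms (hypothetical weights).
DOCS = [
    [1, 0, 0, 2, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 4],
]
# Rows are queries over the same six terms (hypothetical weights).
QUERIES = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
]

# Similarity computed pair by pair with inner products ...
PAIRWISE = [[inner_product(d, q) for q in QUERIES] for d in DOCS]
# ... and computed in one step as a matrix product D * Q^T.
SIM = matmul(DOCS, transpose(QUERIES))

TOTAL, USEFUL = count_multiplications(DOCS, QUERIES)
print(PAIRWISE == SIM)
print(TOTAL, USEFUL)
```

In this toy data most term weights are zero, so only a small share of the multiplications performed by the dense product can affect any similarity score.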

Cite this article

Liang Shijin. Data Sparseness Analysis and its Avoidance Strategies in the VSM Information Retrieval[J]. Library and Information Service, 2013, 57(01): 142-146. DOI: 10.7536/j.issn.0252-3116.2013.01.025
