图书情报工作 (Library and Information Service) ›› 2013, Vol. 57 ›› Issue (01): 142-146. DOI: 10.7536/j.issn.0252-3116.2013.01.025

• Knowledge Organization •

Data Sparseness Analysis and its Avoidance Strategies in the VSM Information Retrieval
(Original title: VSM信息检索中的数据稀疏问题分析与规避策略)

Liang Shijin (梁士金)

  1. Library and Information Center, City College of Dongguan University of Technology, Dongguan 523106
  • Received: 2012-09-13 Revised: 2012-11-25 Online: 2013-01-05 Published: 2013-01-05
  • About the author: Liang Shijin, assistant librarian at the Library and Information Center, City College of Dongguan University of Technology; holds a master's degree. E-mail: huashi007@126.com

Abstract:

Taking matrix theory as its starting point, this paper recasts the vectors and sets commonly used in the classical vector space model (VSM) in matrix form, and argues that similarity calculation based on the vector inner product is equivalent to multiplication of the corresponding matrices. Drawing on the definitions of sparse matrix and data sparseness, it analyzes the causes of data sparseness in VSM information retrieval. It also discusses the common effect that data sparseness has on similarity calculation in three cases: a portion of the time complexity is entirely meaningless. Finally, it gives a three-layer strategy for avoiding data sparseness: a text-level strategy, a text-set-level strategy, and a matrix-level strategy.

Key words: vector space model, information retrieval, data sparseness, avoidance strategy
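The equivalence stated in the abstract, and the cost that sparseness adds, can be illustrated with a minimal sketch. It is not the paper's own formulation, which this page does not reproduce. The function names, the toy weights in `DOCS` and `QUERY`, and the dictionary-based sparse representation are all assumptions made for illustration. The sketch scores documents by inner product, reproduces the same scores with a plain matrix product, counts how many dense multiplications involve a zero factor, and then recomputes the scores from nonzero entries only.

```python
def inner_product_scores(docs, query):
    """Score each document by the inner product of its weight vector with the query."""
    return [sum(d * q for d, q in zip(doc, query)) for doc in docs]


def matmul(a, b):
    """Dense product of a (m x n) and b (n x p), both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def count_products(docs, query):
    """Return (total, useful) multiplications in the dense computation.

    A product counts as useful only when both factors are nonzero.
    """
    total = sum(len(doc) for doc in docs)
    useful = sum(1 for doc in docs
                 for d, q in zip(doc, query) if d != 0 and q != 0)
    return total, useful


def to_sparse(vec):
    """Keep only the nonzero weights, keyed by term index."""
    return {i: w for i, w in enumerate(vec) if w != 0}


def sparse_scores(docs, query):
    """Same scores as inner_product_scores, touching nonzero entries only."""
    q = to_sparse(query)
    scores = []
    for doc in docs:
        d = to_sparse(doc)
        # Iterate over the shorter map and look up matches in the longer one.
        small, large = (d, q) if len(d) <= len(q) else (q, d)
        scores.append(sum(w * large[i] for i, w in small.items() if i in large))
    return scores


# Hypothetical toy data: 3 documents over a 6-term vocabulary.
DOCS = [
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 4],
]
QUERY = [1, 0, 0, 2, 0, 0]
QUERY_COLUMN = [[w] for w in QUERY]  # the query as a 6 x 1 matrix

if __name__ == "__main__":
    print(inner_product_scores(DOCS, QUERY))                 # [4, 0, 0]
    print([row[0] for row in matmul(DOCS, QUERY_COLUMN)])    # [4, 0, 0]
    print(count_products(DOCS, QUERY))                       # (18, 2)
    print(sparse_scores(DOCS, QUERY))                        # [4, 0, 0]
```

In this toy case only 2 of the 18 dense multiplications have two nonzero factors, which is the kind of wasted work the abstract refers to. Storing only nonzero entries is one common way to avoid it, and it is shown here as a generic technique, not as one of the paper's three strategies.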

CLC number: