Knowledge Organization

Link Filtering Algorithm of Domain Name in View of the Crawler

  • Wen Yang ,
  • Chen Wenyu ,
  • Yuan Ye ,
  • Zhu Jian
  • 1. Library of University of Electronic Science & Technology of China, Chengdu 611731;
    2. School of Computer Science and Engineering, University of Electronic Science & Technology of China, Chengdu 611731
Wen Yang is a librarian at the Library of University of Electronic Science & Technology of China and holds a master's degree; Yuan Ye is a master's student at the School of Computer Science and Engineering, University of Electronic Science & Technology of China; Zhu Jian is a master's student at the same school.

Received date: 2014-07-21

  Revised date: 2014-09-01

  Online published: 2014-10-30

Cite this article

Wen Yang, Chen Wenyu, Yuan Ye, Zhu Jian. Link Filtering Algorithm of Domain Name in View of the Crawler[J]. 图书情报工作 (Library and Information Service), 2014, 58(20): 125-130. DOI: 10.13266/j.issn.0252-3116.2014.20.019

Abstract

Traditional topic-based link filtering algorithms are widely used in topical crawlers for specific fields, but they consider only the relevance between a fetched web page and the topic and ignore the structural characteristics of a website's own links. This paper proposes a domain-name-based link filtering algorithm. The method compares the structural characteristics of the domain names in web page links and uses topic-based link filtering as an auxiliary step to identify useless junk links. Compared with topic-based link filtering used alone, the domain-name-based algorithm judges links more comprehensively and filters them more efficiently, which in turn improves both the crawling efficiency of web crawlers and the efficiency of information retrieval. Finally, simulation experiments demonstrate the effectiveness of the algorithm.
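The abstract does not spell out how domain-name structure is compared, so the following is only a minimal Python sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that "same site" can be approximated by counting shared trailing host labels, and that the auxiliary topic check is a simple keyword match on the anchor text. The function names, the `min_shared_labels` and `min_topic_score` thresholds, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def host_labels(url):
    """Return the lowercase host of `url` split into labels, e.g. ['www', 'example', 'com']."""
    host = urlsplit(url).hostname or ""
    return host.lower().split(".") if host else []


def shared_suffix_len(a, b):
    """Count how many trailing host labels two label lists have in common."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def topic_score(anchor_text, keywords):
    """Fraction of topic keywords found in the anchor text (a crude stand-in for a real topic model)."""
    if not keywords:
        return 0.0
    text = anchor_text.lower()
    return sum(1 for k in keywords if k.lower() in text) / len(keywords)


def keep_link(seed_url, link_url, anchor_text, keywords,
              min_shared_labels=2, min_topic_score=0.5):
    """Hypothetical filter: domain-name check first, topic check as the auxiliary step."""
    seed, link = host_labels(seed_url), host_labels(link_url)
    if not link:
        return False  # no host at all (e.g. mailto: or javascript: links): treat as junk
    if shared_suffix_len(seed, link) >= min_shared_labels:
        return True  # assumed to belong to the seed site by the naive suffix rule
    # Off-site link: fall back to the (assumed) keyword-based topic check.
    return topic_score(anchor_text, keywords) >= min_topic_score


if __name__ == "__main__":
    seed = "http://www.example.com/"
    topic = ["crawler", "filter"]
    print(keep_link(seed, "http://news.example.com/a", "", topic))
    print(keep_link(seed, "http://ads.other.net/x", "cheap tickets", topic))
    print(keep_link(seed, "http://other.org/p", "Web crawler link filter survey", topic))
```

The naive two-label suffix rule is a deliberate simplification: it would treat unrelated hosts under a shared public suffix such as `com.cn` as the same site, so a real crawler would likely need a public-suffix-aware comparison.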

