Received date: 2014-08-04
Revised date: 2014-09-04
Online published: 2014-10-30
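Funding
This paper is one of the research outcomes of the Shanghai Science and Technology Development Fund soft science research project "Research on Domain-Ontology-Based Intelligence Processing and Analysis Methods in a Big Data Environment: A Case Study of the Steel Industry" (Project No. 14692107100).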
Design and Implementation of the Topic Information Crawler for Intelligence Acquisition
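Gu Jun, Weng Jia, Xu Xin. Design and Implementation of the Topic Information Crawler for Intelligence Acquisition[J]. Library and Information Service, 2014, 58(20): 91-99. DOI: 10.13266/j.issn.0252-3116.2014.20.014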
Collecting topic information from the Internet is an important means of acquiring intelligence. To cope with the explosive growth of Internet information resources, a topic information crawler is designed and implemented. The crawler works in four stages: acquisition preparation, URL analysis and extraction, template learning, and text extraction. In the URL analysis and extraction stage, a URL filtering method based on link types selects the URLs of Web pages that contain body text. In the template learning and text extraction stages, a node comparison method based on the DOM tree is used to construct templates and extract text. Test results show that the crawler gathers information with high accuracy and can therefore meet current needs for information acquisition.
Key words: Web crawler; topic information acquisition; link filtering; DOM tree
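As a rough illustration of the DOM-tree node comparison idea named in the abstract, the sketch below treats any node whose text is identical on two sample pages of the same site as template material and keeps the remaining nodes as body text. It is not the authors' implementation: the class `PathTextParser`, the functions `path_texts`, `learn_template`, and `extract_text`, and the sample pages are all hypothetical, and the code assumes well-formed HTML with properly nested tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped when building paths.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collects the text found under each DOM path, e.g. 'html[0]/body[0]/div[1]'.

    Assumption: the input is well-formed (every non-void tag is closed in order).
    """

    def __init__(self):
        super().__init__()
        self.stack = []           # path segments of the currently open elements
        self.child_counts = [{}]  # per-depth counters of child tags seen so far
        self.texts = {}           # path -> text directly inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    """Parse a page and return a mapping from DOM path to text."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def learn_template(page_a, page_b):
    """Paths whose text is identical on both sample pages are treated as template."""
    a, b = path_texts(page_a), path_texts(page_b)
    return {path for path, text in a.items() if b.get(path) == text}


def extract_text(page, template_paths):
    """Return the text of every node that is not part of the learned template."""
    texts = path_texts(page)
    return [text for path, text in texts.items() if path not in template_paths]


# Hypothetical sample pages sharing a navigation bar and a footer.
PAGE_A = ("<html><body><div>Site Nav</div>"
          "<div><p>Steel prices rose.</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Site Nav</div>"
          "<div><p>Ore imports fell.</p></div>"
          "<div>Copyright</div></body></html>")

if __name__ == "__main__":
    template = learn_template(PAGE_A, PAGE_B)
    print(extract_text(PAGE_B, template))
```

Comparing only two pages keeps the sketch short; a practical template learner would presumably compare more sample pages and tolerate small structural differences between them.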