A Network Counter-terrorism Information Crawler Based on the Regression Analysis

Huang Wei; Zhang Zhancheng; Zhu Bing; Li Yuefeng; Lu Wei

doi:10.13266/j.issn.0252-3116.2018.04.016

Library and Information Service (图书情报工作), 2018, Vol. 62, Issue 4: 121-129

Knowledge Organization

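1. School of Economics and Management, Hubei University of Technology, Wuhan 430068;
2. Wuhan East Lake High-tech Development Zone Power Company, State Grid Corporation of China, Wuhan 430073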

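Huang Wei (ORCID: 0000-0002-5804-9371), professor, PhD, master's supervisor, E-mail: tonnyhw@163.com; Zhang Zhancheng (ORCID: 0000-0002-7533-4764), master's student; Zhu Bing (ORCID: 0000-0002-1073-0379), undergraduate student; Li Yuefeng (ORCID: 0000-0001-5173-9575), professor, PhD; Lu Wei (ORCID: 0000-0001-6270-8846), assistant engineer.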

Received date: 2017-08-21

Revised date: 2017-11-16

Online published: 2018-02-20

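Funding

This paper is one of the research outcomes of two National Natural Science Foundation of China projects: "Multi-kernel Methods for Real-time Active Perception of Online Public Opinion Events in the Microblog Environment" (Grant No. 71303075) and "Unsupervised Text Classification Based on Feature Ontology Learning in the Big Data Environment" (Grant No. 71571064).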

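Abstract

[Purpose/significance] Collecting terrorism-related information from open-source web content is currently difficult and inefficient. This paper proposes a regression-analysis method that combines two factors, semantic relevance and web page importance, to improve the efficiency of collecting such information. [Method/process] By analyzing and comparing the characteristics of focused (topic) crawlers in light of the characteristics of online terrorist information, the paper identifies the strengths of the PageRank and TF-IDF algorithms that suit terrorist-information collection. Regression analysis is then used to predict relevance for the collection strategy, and the predictions are fed back to adjust the collection process. [Result/conclusion] Collecting online terrorist information requires attention to both the quantity and the quality of what is collected. Building on traditional focused-crawler algorithms, the paper proposes an optimized crawling algorithm for collecting terrorist information from open-source web content, which can improve collection efficiency.

Keywords: focused crawler; regression analysis; online counter-terrorism; semantic similarity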
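
The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a plain linear regression over two features (a TF-IDF cosine similarity for semantic relevance and a PageRank score for page importance) and a priority queue as the crawl frontier. All function names, the feature choice, the regression form, and the toy training data are hypothetical and are not taken from the paper.

```python
import heapq
import math
from collections import Counter


def tfidf_cosine(doc_tokens, topic_tokens, idf):
    """Cosine similarity between the TF-IDF vectors of a page and a topic description."""
    def vec(tokens):
        tf = Counter(tokens)
        return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}
    a, b = vec(doc_tokens), vec(topic_tokens)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank


def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coef, similarity, importance):
    """Predicted relevance = b0 + b1 * similarity + b2 * importance (assumed form)."""
    return coef[0] + coef[1] * similarity + coef[2] * importance


class Frontier:
    """Crawl frontier: the URL with the highest predicted relevance is popped first."""
    def __init__(self):
        self._heap, self._seen = [], set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate for max-first order

    def pop(self):
        return heapq.heappop(self._heap)[1]


# Toy usage with made-up data (placeholders only, not results from the paper).
links = {"seed": ["p1", "p2"], "p1": ["p2"], "p2": ["seed"]}
ranks = pagerank(links)
idf = {"alpha": 1.2, "beta": 1.5, "misc": 0.3}
topic = ["alpha", "beta"]
pages = {"p1": ["beta", "alpha", "misc"], "p2": ["misc", "misc"]}
# Hypothetical labelled samples: (similarity, importance) -> judged relevance.
coef = fit_linear([(0.9, 0.4), (0.1, 0.3), (0.5, 0.1), (0.2, 0.6)], [1.0, 0.0, 0.6, 0.3])
frontier = Frontier()
for url, tokens in pages.items():
    frontier.push(url, predict(coef, tfidf_cosine(tokens, topic, idf), ranks[url]))
next_url = frontier.pop()
```

In this sketch the feedback step amounts to re-scoring newly discovered links with `predict` before they enter the frontier, so pages predicted to be more relevant are fetched earlier. A real crawler would also need fetching, parsing, word segmentation, and periodic refitting of the coefficients, none of which is shown here.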

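Cite this article

Huang Wei, Zhang Zhancheng, Zhu Bing, Li Yuefeng, Lu Wei. A Network Counter-terrorism Information Crawler Based on the Regression Analysis[J]. Library and Information Service, 2018, 62(4): 121-129. DOI: 10.13266/j.issn.0252-3116.2018.04.016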
