Library and Information Service (图书情报工作), 2018, Vol. 62, Issue (4): 121-129. DOI: 10.13266/j.issn.0252-3116.2018.04.016

• Knowledge Organization •

A Network Counter-terrorism Information Crawler Based on the Regression Analysis

Huang Wei1, Zhang Zhancheng1, Zhu Bing1, Li Yuefeng1, Lu Wei2

  1. School of Economics and Management, Hubei University of Technology, Wuhan 430068;
  2. Wuhan East Lake High-tech Development Zone Power Company, State Grid Corporation of China, Wuhan 430073
  • Received: 2017-08-21 Revised: 2017-11-16 Online: 2018-02-20 Published: 2018-02-20
  • About the authors: Huang Wei (ORCID: 0000-0002-5804-9371), professor, PhD, master's supervisor, E-mail: tonnyhw@163.com; Zhang Zhancheng (ORCID: 0000-0002-7533-4764), master's student; Zhu Bing (ORCID: 0000-0002-1073-0379), undergraduate student; Li Yuefeng (ORCID: 0000-0001-5173-9575), professor, PhD; Lu Wei (ORCID: 0000-0001-6270-8846), assistant engineer.
  • Funding:
    This paper is one of the research outcomes of the National Natural Science Foundation of China projects "Research on Multi-kernel Methods for Real-time Active Sensing of Online Public Opinion Events in the Microblog Environment" (Grant No. 71303075) and "Research on Unsupervised Text Classification Methods Based on Feature Ontology Learning in the Big Data Environment" (Grant No. 71571064).

Abstract: [Purpose/significance] Collecting terrorism-related information from open-source web content is difficult and inefficient. To address this, the paper proposes a regression analysis method that combines two factors, semantic relevance and web page importance, to improve the efficiency of collecting online terrorist information. [Method/process] By analyzing and comparing the characteristics of theme (focused) crawlers and relating them to the characteristics of online terrorist information, the authors identify the strengths of the PageRank and TF-IDF algorithms that suit terrorist information collection. Regression analysis is then used to predict relevance within the collection strategy, and the predictions are fed back to adjust the collection process. [Result/conclusion] Collection of online terrorist information must balance quantity and quality. Building on traditional theme crawler algorithms, the paper proposes an optimized crawler algorithm for collecting terrorist information from open-source web content, which can improve collection efficiency.

Key words: theme crawler, regression analysis, network anti-terrorism, semantic similarity
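
The abstract does not give the regression model itself, so the following is only a minimal sketch of the general idea: fit a linear regression on two features (a TF-IDF-style semantic similarity and a PageRank-style importance score), use the predicted relevance to order the crawl frontier, and refit as newly crawled pages are judged. The feature values, the toy labels, the example URLs, and the names `fit_linear`, `predict`, `history`, and `candidates` are all assumptions made for illustration and are not taken from the paper.

```python
import heapq


def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Assumes the feature columns are not collinear (X^T X is invertible).
    """
    xs = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(xs[0])
    # Augmented matrix [X^T X | X^T y]
    a = [
        [sum(x[i] * x[j] for x in xs) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(xs, targets))]
        for i in range(k)
    ]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(coeffs, features):
    """Predicted relevance = intercept + weighted sum of the features."""
    return coeffs[0] + sum(c * f for c, f in zip(coeffs[1:], features))


# Hypothetical history of crawled pages: (semantic similarity, importance).
history = [(0.9, 0.2), (0.1, 0.8), (0.5, 0.5), (0.7, 0.9), (0.2, 0.1)]
# Toy relevance labels generated from a known linear rule, so the fit can be checked.
labels = [0.1 + 0.6 * s + 0.3 * p for s, p in history]
coeffs = fit_linear(history, labels)

# Hypothetical uncrawled links with precomputed features.
candidates = {
    "http://example.org/a": (0.8, 0.3),
    "http://example.org/b": (0.2, 0.9),
    "http://example.org/c": (0.6, 0.6),
}

# Order the frontier by predicted relevance (max-heap via negated scores).
frontier = [(-predict(coeffs, f), url) for url, f in candidates.items()]
heapq.heapify(frontier)
crawl_order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]

# Feedback step: once the first page is crawled and judged, add it and refit.
history.append(candidates[crawl_order[0]])
labels.append(0.67)  # assumed observed relevance of the crawled page
coeffs_refit = fit_linear(history, labels)
```

In this sketch the similarity and importance scores are taken as given; in a real crawler they would have to be computed from page text and the link graph, and the labels would come from judged pages rather than a synthetic rule.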
