Library and Information Service (图书情报工作) ›› 2019, Vol. 63 ›› Issue (4): 101-111. DOI: 10.13266/j.issn.0252-3116.2019.04.013

• Information Research •

Research on Scale Adaptation of Text Sentiment Analysis Algorithm in Big Data Environment: Using Twitter as Data Source

Yu Chuanming1, Yuan Sai2, Wang Feng1, An Lu3

  1. School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073;
  2. School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073;
  3. School of Information Management, Wuhan University, Wuhan 430072
  • Received: 2018-05-09 Revised: 2018-09-21 Online: 2019-02-20 Published: 2019-02-20
  • Corresponding author: An Lu (ORCID: 0000-0002-5408-7135), professor, doctoral supervisor, E-mail: anlu97@163.com
  • About the authors: Yu Chuanming (ORCID: 0000-0001-7099-0853), professor; Yuan Sai (ORCID: 0000-0002-5822-2496), master's student; Wang Feng (ORCID: 0000-0003-1602-7235), master's student.
  • Funding:
    This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Opinion Retrieval Based on Domain Knowledge Acquisition and Alignment in a Big Data Environment" (Grant No. 71373286) and the Ministry of Education Major Project in Philosophy and Social Sciences Research "Research on Countermeasures for Improving Counter-Terrorism Intelligence and Information Work Capability" (Grant No. 17JZD034).

Abstract: [Purpose/significance] Taking text sentiment analysis in a big data environment as a specific task, this paper studies the scale adaptation problem. It offers a reference for information science researchers who need to balance efficiency against cost when analyzing data in a big data environment. [Method/process] Using the Stanford University Sentiment140 dataset and building on an analysis of traditional sentiment analysis algorithms, we propose five big-data-oriented text sentiment analysis algorithms, test how well each algorithm adapts to different environments and data sizes, and compare them empirically in terms of accuracy, scalability, and efficiency. [Result/conclusion] The experimental results show that the cluster built in this paper has good operating efficiency, correctness, and scalability. The Spark cluster has an efficiency advantage when processing massive text sentiment analysis data, and the advantage grows as the data size increases. In terms of resource utilization, the overall operating efficiency of the cluster changes significantly as the number of nodes and cores increases. Configuring five slave nodes, each with 4 cores and 4 GB of memory, completes the classification task efficiently while saving resource costs.

Key words: scale adaptation, big data, massive text, sentiment analysis, machine learning algorithm
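
A scalability comparison of the kind summarized in the abstract (running time as nodes and cores are added) is commonly expressed as speedup and parallel efficiency. The Python sketch below is illustrative only and is not taken from the paper: the `Run` class, the function names, and all timings are hypothetical placeholders, not reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One cluster configuration and its measured wall-clock time."""
    nodes: int            # number of worker (slave) nodes
    cores_per_node: int   # cores on each node
    seconds: float        # time to finish the classification job

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node


def speedup(baseline: Run, run: Run) -> float:
    """How many times faster `run` is than `baseline`."""
    return baseline.seconds / run.seconds


def efficiency(baseline: Run, run: Run) -> float:
    """Speedup divided by the factor by which the core count grew.

    A value of 1.0 means perfectly linear scaling; lower values mean
    the added cores are only partly converted into shorter run time.
    """
    core_ratio = run.total_cores / baseline.total_cores
    return speedup(baseline, run) / core_ratio


# Hypothetical timings, for illustration only (NOT results from the paper).
runs = [
    Run(nodes=1, cores_per_node=4, seconds=1000.0),
    Run(nodes=2, cores_per_node=4, seconds=520.0),
    Run(nodes=5, cores_per_node=4, seconds=250.0),
]

baseline = runs[0]
for r in runs:
    print(
        f"{r.nodes} node(s) x {r.cores_per_node} cores: "
        f"speedup={speedup(baseline, r):.2f}, "
        f"efficiency={efficiency(baseline, r):.2f}"
    )
```

Comparing efficiency across configurations is one way to reason about the cost trade-off the abstract mentions: a configuration is attractive when adding resources still shortens the run time without efficiency dropping sharply.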

CLC number: