KNOWLEDGE ORGANIZATION

A Comparative Study on the Integration of Text Enhanced and Pre-trained Language Models in the Classification of Internet Political Messages

  • Shi Guoliang
  • Chen Yuqi
  • Business School, Hohai University, Nanjing 211100

Received date: 2020-10-07

Revised date: 2021-03-19

Online published: 2021-07-10

Abstract

[Purpose/significance] Government online political-inquiry platforms are one of the important channels through which the government learns about public opinion. To improve the accuracy of classifying political-inquiry messages and to address problems such as the poor quality and small quantity of message data, this study explores the classification performance of several improved BERT models combined with text augmentation techniques, and the reasons for the differences among them. [Method/process] An integrated comparison model for classifying online political-inquiry messages was designed. For text augmentation, EDA (Easy Data Augmentation) and SimBERT-based augmentation were compared experimentally; for the classification model, several pre-trained language models derived from BERT (such as ALBERT and RoBERTa) were compared. [Result/conclusion] The experiments show that the classifier combining RoBERTa with SimBERT text augmentation performs best, reaching an F1 score of 92.05% on the test set, 2.89 percentage points higher than the BERT-Base model without text augmentation; SimBERT augmentation alone raised the F1 score by 0.61 percentage points over the unaugmented data. The experiments demonstrate that the combination of RoBERTa and SimBERT text augmentation can effectively improve multi-class text classification, and it offers a strong reference for solving similar problems.
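As a rough illustration of the EDA side of the comparison, the sketch below (Python) implements two of EDA's four operations, random swap and random deletion, on a pre-segmented message. The function names, parameter defaults, and toy input are assumptions for illustration rather than the authors' code; the remaining two operations (synonym replacement and random insertion) would additionally require a Chinese synonym dictionary.

    import random

    def random_swap(tokens, n=1):
        # EDA's random-swap (RS) operation: swap two random positions n times.
        tokens = tokens[:]
        for _ in range(n):
            if len(tokens) < 2:
                break
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    def random_deletion(tokens, p=0.1):
        # EDA's random-deletion (RD) operation: drop each token with
        # probability p, always keeping at least one token.
        kept = [t for t in tokens if random.random() > p]
        return kept if kept else [random.choice(tokens)]

    # Toy example: augment a word-segmented political-inquiry message.
    message = ["道路", "施工", "噪音", "扰民", "希望", "处理"]
    augmented = [random_swap(message), random_deletion(message)]

Each original message typically yields several augmented variants, enlarging the small training set before fine-tuning.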
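On the classification side, a minimal inference sketch using the Hugging Face transformers library is given below. The checkpoint name hfl/chinese-roberta-wwm-ext, the label count, and the sample message are assumptions for illustration; the abstract does not specify the paper's exact weights or category scheme, and the classification head here is freshly initialized, so it would still need fine-tuning on the (augmented) message data.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    NAME = "hfl/chinese-roberta-wwm-ext"  # assumed Chinese RoBERTa checkpoint
    NUM_LABELS = 7                        # assumed number of message categories

    tokenizer = AutoTokenizer.from_pretrained(NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        NAME, num_labels=NUM_LABELS)      # attaches a new classification head

    batch = tokenizer(["道路施工噪音扰民，希望有关部门处理"],
                      padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits    # shape: (1, NUM_LABELS)
    predicted_category = logits.argmax(dim=-1).item()

The same evaluation loop, with only the checkpoint name swapped, is how ALBERT, BERT-Base, and the other compared models would be tested under identical settings.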

Cite this article

Shi Guoliang, Chen Yuqi. A Comparative Study on the Integration of Text Enhanced and Pre-trained Language Models in the Classification of Internet Political Messages[J]. Library and Information Service, 2021, 65(13): 96-107. DOI: 10.13266/j.issn.0252-3116.2021.13.010
