Knowledge Organization

A Comparative Study on the Integration of Text Enhancement and Pre-trained Language Models in the Classification of Internet Political Inquiry Messages

  • Shi Guoliang,
  • Chen Yuqi
  • Business School, Hohai University, Nanjing 211100
Shi Guoliang (ORCID: 0000-0001-7585-640X), associate professor, PhD, E-mail: shigl@hhu.edu.cn; Chen Yuqi (ORCID: 0000-0001-5755-5208), master's degree candidate.

Received date: 2020-10-07

  Revised date: 2021-03-19

  Online published: 2021-07-10

Funding

This work is one of the outcomes of the Fundamental Research Funds for the Central Universities project "Research on Key Technologies of a Water Conservancy Knowledge Graph Based on Graph Databases" (Project No. B200207036).

Abstract

[Purpose/significance] Government online political inquiry platforms are one of the important channels through which government departments learn about public opinion. To improve the accuracy of inquiry message classification and to address problems such as the poor quality and small quantity of message data, this study compares the classification performance of several improved BERT-based models combined with text augmentation techniques and explores the reasons for their differences. [Method/process] An integrated comparison model for online political inquiry message classification was designed. For text augmentation, EDA (Easy Data Augmentation) and SimBERT-based augmentation were compared experimentally; for the classification model, several pre-trained language models improved upon BERT (such as ALBERT and RoBERTa) were compared. [Result/conclusion] The experiments show that the classification model combining RoBERTa with SimBERT text augmentation performs best, reaching an F1 score of 92.05% on the test set, 2.89% higher than the BERT-base model without text augmentation. In addition, SimBERT augmentation raises the F1 score by 0.61% on average compared with no augmentation. The results demonstrate that the model based on RoBERTa and SimBERT text augmentation can effectively improve multi-class text classification and offers strong reference value for similar problems.
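One side of the comparison described above contrasts rule-based EDA augmentation with SimBERT-generated paraphrases. As an illustration only (the paper's actual implementation, synonym resources, and augmentation rates are not given here), the sketch below shows the four standard EDA operations — synonym replacement, random insertion, random swap, and random deletion — on a whitespace-tokenized sentence; the tiny English synonym table is a stand-in for a real Chinese synonym dictionary.

```python
import random

# Toy synonym table; a real setup for Chinese messages would use a proper
# synonym resource (this table and all rates below are illustrative, not the paper's settings).
SYNONYMS = {"road": ["street"], "broken": ["damaged"], "repair": ["fix"]}

def synonym_replacement(tokens, n=1):
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    random.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        candidates = [t for t in out if t in SYNONYMS]
        if not candidates:
            break
        synonym = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), synonym)
    return out

def random_swap(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    out = [t for t in tokens if random.random() > p]
    return out or [random.choice(tokens)]  # never return an empty sentence

def eda(sentence, n_aug=4):
    """Generate n_aug augmented variants of one training sentence."""
    tokens = sentence.split()
    ops = [synonym_replacement, random_insertion, random_swap, random_deletion]
    return [" ".join(random.choice(ops)(tokens)) for _ in range(n_aug)]

print(eda("the road near the school is broken please repair it"))
```

SimBERT augmentation, by contrast, generates semantically similar paraphrases with a pretrained retrieval-and-generation model rather than by editing tokens, which is the property the abstract credits for the additional 0.61% average F1 gain.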

Cite this article

Shi Guoliang, Chen Yuqi. A comparative study on the integration of text enhancement and pre-trained language models in the classification of internet political inquiry messages[J]. 图书情报工作 (Library and Information Service), 2021, 65(13): 96-107. DOI: 10.13266/j.issn.0252-3116.2021.13.010

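For the classification side of the comparison, the abstract reports F1 scores on a held-out test set for several BERT-family models. The sketch below shows a minimal fine-tuning and evaluation loop along those lines using the Hugging Face transformers API; the checkpoint name, label count, sequence length, and other hyperparameters are placeholders rather than the authors' actual configuration, and macro-averaged F1 is used here as one common choice of metric.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and label count (not necessarily those used in the paper).
CHECKPOINT = "hfl/chinese-roberta-wwm-ext"
NUM_LABELS = 7

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=NUM_LABELS)

def make_loader(texts, labels, batch_size=16, shuffle=True):
    # Tokenize the messages once and wrap them as a simple tensor dataset.
    enc = tokenizer(texts, truncation=True, max_length=128,
                    padding="max_length", return_tensors="pt")
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle)

def train_epoch(loader, optimizer):
    model.train()
    for input_ids, attention_mask, labels in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def macro_f1(loader):
    model.eval()
    preds, gold = [], []
    with torch.no_grad():
        for input_ids, attention_mask, labels in loader:
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            preds.extend(logits.argmax(dim=-1).tolist())
            gold.extend(labels.tolist())
    return f1_score(gold, preds, average="macro")

# Usage (with real message texts and integer category labels):
# optimizer = AdamW(model.parameters(), lr=2e-5)
# train_epoch(make_loader(train_texts, train_labels), optimizer)
# print(macro_f1(make_loader(test_texts, test_labels, shuffle=False)))
```

Swapping CHECKPOINT for an ALBERT or BERT-base checkpoint, with and without augmented training data, reproduces the shape of the comparison the abstract describes, though not its exact settings or results.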
