[Purpose/significance] To address the problem that information recognition in a target domain is difficult to improve because labeled samples are scarce, this study transfers the results of unsupervised learning on large-scale Internet data to the target domain. [Method/process] A RoBERTa model pre-trained on Chinese Wikipedia and other corpora is used for transfer learning; after the learned representations are mapped to the target domain, a DPCNN aggregates and condenses them, and the model is then fine-tuned with a portion of labeled data to achieve accurate recognition of domain information. [Result/conclusion] Compared with the same model without transfer learning and with the classic TextCNN model on 10 domains, the proposed model outperforms both comparison models by a clear margin: averaged over the domains, the absolute gains are 4.15% and 3.43% in precision, 4.55% and 3.44% in recall, and 4.52% and 3.44% in F1 score, showing that transfer learning from Internet-scale data can significantly improve information recognition in the target domain.
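To make the pipeline described in the abstract concrete, the following PyTorch sketch shows one plausible assembly of the components it names: a pre-trained Chinese RoBERTa encoder supplies the transferred representations, a DPCNN-style stack of convolution and pooling blocks aggregates and condenses them, and a linear layer produces domain labels for fine-tuning on the partially labeled data. This is a minimal illustration rather than the authors' released code; the checkpoint name hfl/chinese-roberta-wwm-ext, the channel width, the number of blocks, and the class count are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer  # Chinese RoBERTa-wwm reuses the BERT architecture

class RobertaDPCNN(nn.Module):
    """Pre-trained RoBERTa encoder with a DPCNN head for domain text classification (sketch)."""
    def __init__(self, pretrained="hfl/chinese-roberta-wwm-ext",  # assumed checkpoint name
                 num_classes=10, channels=250, num_blocks=3):     # illustrative hyperparameters
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained)      # transferred general-domain knowledge
        hidden = self.encoder.config.hidden_size
        self.region = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)  # region embedding
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.ReLU(), nn.Conv1d(channels, channels, 3, padding=1),
                          nn.ReLU(), nn.Conv1d(channels, channels, 3, padding=1))
            for _ in range(num_blocks)])
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, input_ids, attention_mask):
        x = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        x = self.region(x.transpose(1, 2))                 # (batch, channels, seq_len)
        for block in self.blocks:                          # "pyramid": halve the length, add a residual
            x = F.max_pool1d(x, kernel_size=3, stride=2, padding=1)
            x = x + block(x)
        x = x.max(dim=-1).values                           # condense each text into one vector
        return self.fc(x)                                  # logits for the domain classes

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
batch = tokenizer(["待识别的领域文本示例"], return_tensors="pt", padding=True, truncation=True)
logits = RobertaDPCNN()(batch["input_ids"], batch["attention_mask"])

In a fine-tuning run, these logits would be trained with a standard cross-entropy loss on the target-domain labels, updating the encoder and the DPCNN head jointly so that the transferred representations adapt to the target domain.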