[Purpose/significance] Building deep learning models for automatic word segmentation on the New Era People's Daily (NEPD) word segmentation corpus both provides experience for constructing high-performance segmentation models and tests the performance of the corresponding deep learning models on a concrete natural language processing task. [Method/process] After introducing the bidirectional long short-term memory model (Bi-LSTM) and the combined bidirectional long short-term memory and conditional random field model (Bi-LSTM-CRF), this paper describes the preprocessing of the Chinese word segmentation corpus, the evaluation metrics, and the parameter settings and hardware platform. It then builds Bi-LSTM and Bi-LSTM-CRF models for automatic Chinese word segmentation and analyzes their overall performance. [Result/conclusion] Measured by precision, recall, and F-measure (the harmonic mean of precision and recall), the overall performance of both models is reasonable. In direct comparison, the Bi-LSTM model outperforms the Bi-LSTM-CRF model, but only by a very small margin.
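The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common word-level convention in which a predicted word counts as correct only when both of its character boundaries match the gold segmentation; the abstract does not state the exact scoring procedure, and the function names `to_spans` and `segmentation_prf` are hypothetical.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-measure for one sentence.

    Assumption: a predicted word is correct only if its exact span
    also appears in the gold segmentation.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)

    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Illustrative input only; not taken from the NEPD corpus.
gold = ["新时代", "人民日报", "分词", "语料库"]
pred = ["新", "时代", "人民日报", "分词", "语料库"]
p, r, f = segmentation_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this made-up example, three of the five predicted words match the four gold words, giving a precision of 0.600, a recall of 0.750, and an F-measure of about 0.667. Corpus-level scores would normally sum the counts over all sentences before dividing, rather than averaging per-sentence values.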