Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅱ): Constructing Automatic Word Segmentation Model of Deep Learning

Huang Shuiqing; Wang Dongbo

doi:10.13266/j.issn.0252-3116.2019.23.001

2019, Vol. 63, Issue 23: 5-12


1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095

Received date: 2019-11-15

Revised date: 2019-12-02

Online published: 2019-12-05


Abstract

[Purpose/significance] Building deep learning models for automatic word segmentation on the New Era People's Daily (NEPD) segmented corpus provides experience for constructing high-performance word segmentation models, and it also tests the performance of the corresponding deep learning models on a concrete natural language processing task. [Method/process] After introducing the bi-directional long short-term memory model (Bi-LSTM) and the Bi-LSTM model with a conditional random field layer (Bi-LSTM-CRF), this paper describes the process, types and circumstances of preprocessing for Chinese word segmentation, the evaluation indexes, the model parameters and the hardware platform. It then constructs a Bi-LSTM model and a Bi-LSTM-CRF model for automatic Chinese word segmentation and analyzes their overall performance. [Result/conclusion] Measured by the three indexes of precision, recall and F value, the overall performance of both the Bi-LSTM and the Bi-LSTM-CRF word segmentation models is relatively reasonable. In terms of specific performance, the Bi-LSTM model is superior to the Bi-LSTM-CRF model, but the difference is very small.

Key words: new era People's Daily segmented corpus; segmented corpus; automatic word segmentation; deep learning; Bi-LSTM; Bi-LSTM-CRF
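
The abstract reports precision, recall and F value but does not show how they are computed. The sketch below illustrates one common word-level scoring scheme for segmentation, in which a predicted word counts as correct only if both of its boundaries match the gold standard. It is an assumption that the paper scores its models this way; the function names `to_spans` and `segmentation_prf` and the example sentence are hypothetical and chosen only for illustration.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F value (assumed scoring scheme).

    A predicted word is correct only when its start and end offsets
    both coincide with a word in the gold segmentation.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical example: the prediction wrongly splits the first word.
gold = ["新时代", "人民日报", "分词", "语料库"]
pred = ["新", "时代", "人民日报", "分词", "语料库"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

Here three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4. Comparing character offsets rather than word strings keeps repeated words at different positions from being confused with one another.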

Cite this article

Huang Shuiqing, Wang Dongbo. Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅱ): Constructing Automatic Word Segmentation Model of Deep Learning[J]. Library and Information Service, 2019, 63(23): 5-12. DOI: 10.13266/j.issn.0252-3116.2019.23.001

