Library and Information Service (图书情报工作), 2019, Vol. 63, Issue (23): 5-12. DOI: 10.13266/j.issn.0252-3116.2019.23.001

• Special Contribution •

Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅱ): Constructing Automatic Word Segmentation Model of Deep Learning

Huang Shuiqing (1, 2), Wang Dongbo (1, 2)

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
  2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095
  • Received: 2019-11-15 Revised: 2019-12-02 Online: 2019-12-05 Published: 2019-12-05
  • About the authors: Huang Shuiqing (ORCID: 0000-0002-1646-9300), professor, doctoral supervisor, E-mail: sqhuang@njau.edu.cn; Wang Dongbo (ORCID: 0000-0002-9894-9550), professor, doctoral supervisor.

Abstract: [Purpose/significance] Building deep learning word segmentation models on the New Era People's Daily (NEPD) segmented corpus provides experience for constructing high-performance segmentation models, and it also tests the performance of the corresponding deep learning models on a concrete natural language processing task. [Method/process] After introducing the bidirectional long short-term memory model (Bi-LSTM) and the model that combines Bi-LSTM with a conditional random field (Bi-LSTM-CRF), the paper describes the preprocessing of the Chinese word segmentation corpus, the evaluation metrics, and the parameter settings and hardware platform. It then builds Bi-LSTM and Bi-LSTM-CRF Chinese word segmentation models and analyzes their overall performance. [Result/conclusion] Measured by precision, recall and F value (the harmonic mean of precision and recall), the overall performance of both models is relatively reasonable. In terms of specific performance, the Bi-LSTM model outperforms the Bi-LSTM-CRF model, but only by a very small margin.

Key words: New Era People's Daily segmented corpus, segmented corpus, automatic word segmentation, deep learning, Bi-LSTM, Bi-LSTM-CRF
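
The three metrics named in the abstract are usually computed at the word level for segmentation tasks. The sketch below is illustrative only and is not the authors' evaluation code. It assumes that a predicted word counts as correct only when its character span matches a gold-standard span exactly; the function names `to_spans` and `segmentation_prf` and the toy sentence are hypothetical.

```python
def to_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return word-level (precision, recall, F) for one segmented sentence.

    Assumption: a predicted word is correct only if both of its
    boundaries match a word in the gold segmentation.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F is the harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical toy example, not taken from the corpus.
gold = ["南京", "农业", "大学"]
pred = ["南京", "农业大学"]
p, r, f = segmentation_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this toy case only the first word matches, so one of two predicted words and one of three gold words are correct. Corpus-level figures would normally sum the correct, predicted and gold counts over all sentences before dividing.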
