Library and Information Service ›› 2019, Vol. 63 ›› Issue (22): 5-12. DOI: 10.13266/j.issn.0252-3116.2019.22.001

• Special Contribution •

Construction, Performance and Application of New Era People's Daily Segmented Corpus (I): Construction and Evaluation of the Corpus

Huang Shuiqing1,2, Wang Dongbo1,2

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
  2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095
  • Received: 2019-10-08 Revised: 2019-10-17 Online: 2019-11-20 Published: 2019-11-20
  • About the authors: Huang Shuiqing (ORCID: 0000-0002-1646-9300), professor, doctoral supervisor, E-mail: sqhuang@njau.edu.cn; Wang Dongbo (ORCID: 0000-0002-9894-9550), professor, doctoral supervisor.

Abstract: [Purpose/significance] This study constructs a segmented corpus of People's Daily suited to the new era, providing up-to-date, finely annotated data for Chinese information processing and a new language resource for analyzing modern Chinese from a diachronic perspective. [Method/process] Building on an analysis of existing Chinese word segmentation corpora, the paper describes the data source, annotation specification, and construction workflow of the New Era People's Daily Segmented Corpus (NEPD). The performance of the corpus is evaluated by building an automatic word segmentation model and comparing the results with those obtained on an existing corpus. [Result/conclusion] NEPD follows the basic processing standards for modern Chinese corpora, is large in scale, and covers a long time span. The January 2018 portion of NEPD was used to build a segmentation model based on conditional random fields (CRF), and its performance was evaluated and compared with that of the January 1998 People's Daily corpus. The evaluation indexes show that the overall performance of the new-era corpus is outstanding and that the 1998 corpus cannot substitute for it, so constructing NEPD at this time is very necessary.

Key words: new era, People's Daily, automatic word segmentation, conditional random field model, segmented corpus, NEPD
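
The abstract does not state how text is encoded for the conditional random field model or which evaluation indexes are used. As a minimal sketch, assuming the common four-tag B/M/E/S character labeling and word-level precision, recall, and F1 (both are assumptions, not details taken from the paper), the data preparation and scoring steps might look like the following. The function names `words_to_bmes`, `word_spans`, and `segmentation_prf` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Set, Tuple


def words_to_bmes(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs.

    Assumed tag set: B = word begin, M = word middle, E = word end,
    S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        if not word:
            continue
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sentence (made up for this sketch, not taken from NEPD).
gold = ["人民日报", "语料库", "很", "大"]
pred = ["人民", "日报", "语料库", "很", "大"]
precision, recall, f1 = segmentation_prf(gold, pred)
```

In this toy case three of the five predicted words match the four gold words, giving a precision of 0.6 and a recall of 0.75. In a cross-period comparison of the kind the abstract describes, a model trained on one corpus would presumably be scored this way on held-out text from each period, although the paper's exact protocol is not given here.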
