[Purpose/significance] This paper constructs a segmented corpus of People's Daily suited to the new era, providing up-to-date, finely annotated data for Chinese information processing and a new language resource for analyzing modern Chinese from a diachronic perspective. [Method/process] Building on an analysis of existing Chinese word segmentation corpora, the paper describes the data source, annotation specification, and annotation workflow of the new corpus, then evaluates its quality by training an automatic word segmentation model and comparing the results against an existing corpus. [Result/conclusion] The New Era People's Daily Segmented Corpus (NEPD) follows the basic processing specification for modern Chinese corpora and is large in scale with a long time span. The January 2018 portion of NEPD was selected to train a segmentation model based on conditional random fields, and its performance was evaluated against a model trained on the January 1998 People's Daily corpus. The resulting evaluation metrics show that the new era corpus performs well overall and that the 1998 corpus cannot substitute for it, confirming that constructing NEPD is necessary.
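CRF-based word segmentation of the kind described above is conventionally framed as character-level sequence labeling under a BMES tag set (B = word-initial, M = word-internal, E = word-final, S = single-character word). As a minimal sketch of that framing (the tag-set choice is an assumption; the abstract does not state which scheme the authors used), the following shows how a gold segmentation is encoded into training tags and how a predicted tag sequence is decoded back into words; the CRF model itself would be trained on such tag sequences with a toolkit such as CRF++ or sklearn-crfsuite.

```python
def words_to_tags(words):
    """Encode a segmented sentence (a list of words) as BMES character tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode a BMES tag sequence over a character string back into words."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # a word ends at E or S
            words.append(buf)
            buf = ""
    if buf:                     # tolerate a truncated/ill-formed tag sequence
        words.append(buf)
    return words

# Example with a hypothetical segmented fragment:
words = ["人民日报", "语料库"]
tags = words_to_tags(words)
print(tags)                                  # ['B', 'M', 'M', 'E', 'B', 'M', 'E']
print(tags_to_words("".join(words), tags))   # ['人民日报', '语料库']
```

Segmentation precision, recall, and F1 (the kind of metrics the evaluation reports) are then computed by comparing the decoded word spans against the gold-standard spans.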