Library and Information Service (图书情报工作), 2019, Vol. 63, Issue 24: 5-15. DOI: 10.13266/j.issn.0252-3116.2019.24.001

• Special Article •

Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅲ): Analysis and Comparison of Sentence Length and Word

Huang Shuiqing1,2, Wang Dongbo1,2

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
  2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095
  • Received: 2019-11-20 Online: 2019-12-20 Published: 2019-12-20
  • About the authors: Huang Shuiqing (ORCID: 0000-0002-1646-9300), professor, doctoral supervisor, E-mail: sqhuang@njau.edu.cn; Wang Dongbo (ORCID: 0000-0002-9894-9550), professor, doctoral supervisor.

Abstract: [Purpose/significance] Statistical analysis of sentence length and vocabulary distribution along several dimensions, based on the New Era People's Daily (NEPD) segmented corpus, helps characterize the linguistic features of contemporary Chinese text and supports subsequent work in natural language processing and text mining. [Method/process] Using the segmented People's Daily data for January 2018 together with the segmented People's Daily data for January 1998, six sentence categories were defined for the statistics. Sentence length distributions were then counted and analyzed in units of characters and of words, and the static distribution of vocabulary was examined on the basis of Zipf's law. [Result/conclusion] Judging from the sentence length distributions at the character and word levels and from the Zipf distribution of vocabulary, both sentence length and vocabulary distribution changed between the 1998 and 2018 corpora, yet the changes are continuous and related.

Key words: New Era People's Daily segmented corpus, corpus, sentence length, vocabulary distribution, Zipf's law
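
The two measurements named in the abstract, sentence length in characters and in words and a Zipf rank-frequency fit, can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the corpus is already available as a flat list of segmented tokens, treats only 。！？ as sentence terminators instead of the paper's six sentence categories, counts punctuation tokens toward length, and uses hypothetical function names (`split_sentences`, `length_distributions`, `zipf_slope`).

```python
import math
from collections import Counter

# Assumed terminators for illustration only; the paper's six sentence
# categories are not reproduced here.
SENTENCE_END = {"。", "！", "？"}


def split_sentences(tokens):
    """Group a flat token list into sentences, closing each at a terminator."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep a trailing fragment that has no terminator
        sentences.append(current)
    return sentences


def length_distributions(sentences):
    """Return (by_word, by_char): Counters mapping length -> sentence count.

    Punctuation tokens are counted toward both lengths in this sketch.
    """
    by_word = Counter(len(s) for s in sentences)
    by_char = Counter(sum(len(w) for w in s) for s in sentences)
    return by_word, by_char


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 corresponds to the classic form of Zipf's law.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input, not data from either corpus.
toy_tokens = "新 时代 的 语料库 。 我们 统计 句子 长度 。".split()
toy_sentences = split_sentences(toy_tokens)
word_dist, char_dist = length_distributions(toy_sentences)
```

Running the same functions over the 1998 and 2018 token lists separately would give two pairs of length distributions and two slopes to compare, which is the kind of contrast the abstract describes.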

CLC number: