Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅲ)——Analysis and Comparison of Sentence Length and Word

  • Huang Shuiqing,
  • Wang Dongbo
  • 1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
    2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095

Received date: 2019-11-20

Published online: 2019-12-20

Abstract

[Purpose/significance] Statistics and analysis of sentence length across different dimensions and of vocabulary distribution, based on the New Era People's Daily (NEPD) word segmentation corpus, support a relatively comprehensive and systematic understanding of the linguistic characteristics of contemporary Chinese text and benefit subsequent natural language processing and text mining on that text. [Method/process] Based on the word segmentation data of People's Daily for January 2018 and for January 1998, six sentence categories were determined for the statistics, the sentence length distribution was counted and analyzed in both character and word units, and the static distribution of words was examined using Zipf's law. [Result/conclusion] Judging from the sentence length distribution in the word dimension and the Zipf distribution of vocabulary, both sentence length and vocabulary distribution changed between the 1998 and 2018 corpora, but the change is continuous, and the two corpora remain related.
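The two measurements named in the abstract (sentence length counted in character and word units, and a Zipf rank-frequency view of the vocabulary) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes a segmented text whose tokens are separated by spaces, and it closes a sentence at 。, ！ or ？, which is a stand-in for the six sentence categories the paper defines. The function names are hypothetical. Zipf's law states that word frequency is roughly proportional to 1/rank^α, so the slope of log frequency against log rank estimates -α.

```python
from collections import Counter
import math

# Assumed sentence-final marks; the paper's six sentence categories are not reproduced here.
SENTENCE_END = {"。", "！", "？"}

def split_sentences(tokens):
    """Group a flat list of segmented tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep a trailing fragment that has no end mark
        sentences.append(current)
    return sentences

def length_distributions(sentences):
    """Return two Counters: sentence length in words and in characters.

    Punctuation tokens are counted as part of the length (an assumption).
    """
    by_word = Counter(len(s) for s in sentences)
    by_char = Counter(sum(len(w) for w in s) for s in sentences)
    return by_word, by_char

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Usage on a toy segmented line (illustrative data, not from the corpus).
tokens = "我 爱 北京 。 你 好 ！".split()
by_word, by_char = length_distributions(split_sentences(tokens))
```

Running the same functions on the 1998 and 2018 data separately and comparing the resulting Counters and slopes would give a rough version of the comparison the abstract describes. A slope near -1 corresponds to the classical form of Zipf's law.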

Cite this article

Huang Shuiqing, Wang Dongbo. Construction, Performance and Application of New Era People's Daily Segmented Corpus (Ⅲ)——Analysis and Comparison of Sentence Length and Word[J]. Library and Information Service, 2019, 63(24): 5-15. DOI: 10.13266/j.issn.0252-3116.2019.24.001
