KNOWLEDGE ORGANIZATION

A Joint Model of Automatic Sentence Segmentation and Punctuation for Ancient Classical Texts Based on Deep Learning

  • Yuan Yiguo,
  • Li Bin,
  • Feng Minxuan,
  • He Sheng,
  • Wang Dongbo
  • 1 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097;
    2 Institute of Digit and Humanities, Nanjing Normal University, Nanjing 210023;
    3 College of Information Management, Nanjing Agricultural University, Nanjing 210095

Received date: 2022-06-08

Revised date: 2022-10-16

Online published: 2022-12-02

Abstract

[Purpose/Significance] China has a vast body of ancient classical books. Automatic sentence segmentation and punctuation of these texts by computer helps speed up their processing and use. Existing research leaves two problems unsolved. First, earlier work treats sentence segmentation and punctuation of ancient books as two serial tasks, which lets errors accumulate. Second, the punctuation marks tagged automatically are inconsistent, and little research addresses the tagging of long-distance, nested paired quotation marks. [Method/Process] Based on punctuation frequency statistics from a large-scale corpus of ancient books and on punctuation usage standards, the paper defines the punctuation system to be used for automatic punctuation of ancient books. Because sentence segmentation can be inferred from stop punctuation marks, the paper proposes an integrated solution in which punctuation is predicted directly on unsegmented ancient text, with no separate segmentation step. Long-distance nested paired quotation marks are handled by designing a multiple-tag set and inserting placeholders at the beginning of paragraphs. Within a sequence labeling framework, a SikuRoBERTa-BiLSTM-CRF model is trained on a corpus of ancient books in traditional Chinese containing more than 100 million characters. [Result/Conclusion] In an open test on Zuo Zhuan, the F1 score is 77.09% for stop punctuation tagging and 91.72% for sentence segmentation; it is 89.28% for tagging single quotation marks and 83.88% for tagging paired quotation marks. The results show that the method improves automatic sentence segmentation and punctuation of ancient books and effectively solves the problem of automatically tagging quotation marks.
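
The joint formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' tag set or code: the mark inventory, the combined-label scheme, the placeholder symbol `¤`, and the function names `to_labels` and `segment` are all assumptions made for illustration, and "stop punctuation" is assumed to mean the pause marks (comma, period, and so on) as opposed to quotation marks. The sketch turns punctuated text into one label per character, with a placeholder at the start so that an opening quotation mark at the beginning of a paragraph has a position to attach to, and then recovers sentence segmentation from the stop-mark labels alone.

```python
# Hypothetical mark inventory; the paper's actual punctuation system may differ.
STOP_MARKS = set("，。、；：？！")   # assumed "stop punctuation" (pause marks)
QUOTE_MARKS = set("“”‘’")           # quotation marks, tagged alongside stop marks
PLACEHOLDER = "¤"                    # assumed paragraph-initial placeholder symbol


def to_labels(punctuated):
    """Strip punctuation and label each character with the marks that follow it.

    The placeholder is prepended so that marks occurring before the first
    real character (for example an opening quotation mark) still get a label.
    """
    chars = [PLACEHOLDER]
    labels = [""]
    for ch in punctuated:
        if ch in STOP_MARKS or ch in QUOTE_MARKS:
            labels[-1] += ch          # attach the mark to the preceding position
        else:
            chars.append(ch)
            labels.append("")
    return chars, [lab or "O" for lab in labels]   # "O" means no mark follows


def segment(chars, labels):
    """Infer segmentation: break after every character whose label has a stop mark."""
    segments, current = [], []
    for ch, lab in zip(chars, labels):
        if ch != PLACEHOLDER:
            current.append(ch)
        if current and any(mark in STOP_MARKS for mark in lab):
            segments.append("".join(current))
            current = []
    if current:                        # trailing text without a final stop mark
        segments.append("".join(current))
    return segments


chars, labels = to_labels("“學而時習之，不亦說乎？”")
print(list(zip(chars, labels)))
print(segment(chars, labels))
```

In a sequence labeling setup, a tagger would predict such labels for unpunctuated input; because the segmentation is read off the same labels, no separate segmentation model is needed, which is the source of the error-accumulation saving the abstract describes.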

Cite this article

Yuan Yiguo, Li Bin, Feng Minxuan, He Sheng, Wang Dongbo. A Joint Model of Automatic Sentence Segmentation and Punctuation for Ancient Classical Texts Based on Deep Learning[J]. Library and Information Service, 2022, 66(22): 134-141. DOI: 10.13266/j.issn.0252-3116.2022.22.012
