[目的/意义] 中国拥有海量的古代典籍,利用计算机对古籍文本进行自动断句与标点有助于加快古籍资源的转化利用。现有研究主要存在两个亟待解决的问题。首先,将古籍断句与标点分为两个串行任务,会引起错误传递。其次,自动标注的标点也较为混乱,对长距离可嵌套的成对引号标注研究较少。[方法/过程] 通过对大规模古籍语料库的标点符号频率统计,结合现有标点符号用法标准,明确古文自动标点的符号体系。根据点号含有断句信息,提出断句标点一体化处理方案,直接在没有断句的古籍文本上进行自动标点。并通过设计多元引号标记集和段首填充占位符,解决长距离可嵌套成对引号的自动标注难题。算法上根据序列标注方法,采用SikuRoBERTa-BiLSTM-CRF在1亿多字的繁体古籍文本语料上完成模型训练。[结果/结论] 在开放测试集《左传》上,点号标注的F1值为77.09%,断句达到91.72%;对单个引号的标注F1值达到89.28%,成对引号为83.88%。结果表明本文的方法有效地提升了古籍文本的自动断句与自动标点效果,有效地解决了引号的自动标注问题。
[Purpose/Significance] China possesses a vast number of ancient classical texts, and automatic sentence segmentation and punctuation of these texts by computer helps accelerate the transformation and utilization of ancient book resources. Existing research faces two urgent problems. First, previous work divides sentence segmentation and punctuation of ancient texts into two serial tasks, which causes error propagation. Second, automatically tagged punctuation marks are rather inconsistent, and little research addresses the tagging of long-distance, nestable paired quotation marks. [Method/Process] Based on punctuation-frequency statistics over a large-scale corpus of ancient texts, combined with existing punctuation usage standards, this paper clarifies the punctuation system for automatic punctuation of classical Chinese. Since stop punctuation marks already encode sentence boundaries, an integrated scheme for sentence segmentation and punctuation is proposed, in which punctuation is predicted directly on unsegmented ancient texts. By designing a multi-tag set for quotation marks and filling placeholders at the beginning of paragraphs, the problem of automatically tagging long-distance nestable paired quotation marks is solved. Within a sequence-labeling framework, a SikuRoBERTa-BiLSTM-CRF model is trained on a traditional Chinese ancient-text corpus of more than 100 million characters. [Result/Conclusion] On the open test set Zuo Zhuan (《左传》), the F1 score is 77.09% for stop punctuation tagging and 91.72% for sentence segmentation; the F1 score reaches 89.28% for tagging individual quotation marks and 83.88% for paired quotation marks. The results show that the proposed method effectively improves automatic sentence segmentation and punctuation of ancient texts and solves the problem of automatically tagging quotation marks.
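The integrated scheme can be illustrated with a minimal sketch: each character of an unpunctuated text receives one tag describing the punctuation (including quote marks) to attach to it, and sentence boundaries fall out wherever a stop punctuation mark (点号, e.g. 。？！) is predicted, so no separate segmentation pass is needed. The tag representation below (a pair of before/after mark strings per character) and the toy example are illustrative assumptions for exposition, not the paper's actual tag inventory or decoder.

```python
# Illustrative sketch (not the paper's actual tag set): each character is
# paired with a tag (pre, post) giving the marks to insert before and after
# it. Nested quote levels would simply use distinct open/close marks.
STOPS = set("。？！")  # stop punctuation marks that end a sentence


def decode(chars, tags):
    """Rebuild punctuated text from per-character tags and, in the same
    pass, split sentences at stop punctuation: segmentation comes for
    free from the punctuation labels."""
    pieces, sentences, buf = [], [], ""
    for ch, (pre, post) in zip(chars, tags):
        seg = pre + ch + post
        pieces.append(seg)
        buf += seg
        # A predicted stop punctuation mark doubles as a sentence boundary.
        if STOPS & set(post):
            sentences.append(buf)
            buf = ""
    if buf:  # trailing material without a stop mark
        sentences.append(buf)
    return "".join(pieces), sentences
```

Running the decoder on a toy traditional-Chinese input shows both outputs at once: the quote marks 「」 are emitted as part of the character tags, and the single stop mark ？ yields exactly one sentence.

```python
chars = "子曰學而時習之不亦說乎"
tags = [("", ""), ("", "：「"), ("", ""), ("", ""), ("", ""), ("", ""),
        ("", "，"), ("", ""), ("", ""), ("", ""), ("", "？」")]
text, sents = decode(chars, tags)
# text  == "子曰：「學而時習之，不亦說乎？」"
# sents == ["子曰：「學而時習之，不亦說乎？」"]
```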
[1] 陈力.数字人文视域下的古籍数字化与古典知识库建设问题[J].中国图书馆学报,2022,48(2):36-46.
[2] 中共中央办公厅,国务院办公厅.关于推进新时代古籍工作的意见[N].新华社,2022-04-12(01).
[3] 吕叔湘.整理古籍的第一关[J].中国出版,1983,71(4):44-50.
[4] CY/T 124-2015,学术出版规范古籍整理[S].北京:中国书籍出版社,2015.
[5] 黄建年,侯汉清.农业古籍断句标点模式研究[J].中文信息学报,2008,22(4):31-38.
[6] 陈天莹,陈蓉,潘璐璐,等.基于前后文n-gram模型的古汉语句子切分[J].计算机工程,2007,33(3):192-193,196.
[7] 张开旭,夏云庆,宇航.基于条件随机场的古汉语自动断句与标点方法[J].清华大学学报(自然科学版),2009,49(10):1733-1736.
[8] 张合,王晓东,杨建宇,等.一种基于层叠CRF的古文断句与句读标记方法[J].计算机应用研究,2009,26(9):3326-3329.
[9] HUANG H H, SUN C T, CHEN H H. Classical Chinese sentence segmentation[C]//SUN L, CHEN K J. CIPS-SIGHAN joint conference on Chinese language processing. Beijing:Chinese Information Processing Society of China, 2010:15-22.
[10] 王博立,史晓东,苏劲松.一种基于循环神经网络的古文断句方法[J].北京大学学报(自然科学版),2017,53(2):255-261.
[11] HAN X, WANG H, ZHANG S, et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding[J]. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2):1-8.
[12] WANG H, WEI H, GUO J, et al. Ancient Chinese sentence segmentation based on bidirectional LSTM+CRF model[J]. Journal of advanced computational intelligence and intelligent informatics, 2019, 23(4):719-725.
[13] 俞敬松,魏一,张永伟.基于BERT的古文断句研究与应用[J].中文信息学报,2019,33(11):57-63.
[14] 程宁.基于深度学习的古籍文本断句与词法分析一体化处理技术研究[D].南京:南京师范大学,2020.
[15] 胡韧奋,李绅,诸雨辰.基于深层语言模型的古汉语知识表示及自动断句研究[J].中文信息学报,2021,35(4):8-15.
[16] 王倩,王东波,李斌,等.面向海量典籍文本的深度学习自动断句与标点平台构建研究[J].数据分析与知识发现,2021,5(3):25-34.
[17] 释贤超,方恺齐,释贤迥,等.一种自动标点的方法与实现[J].数位典藏与数位人文,2019,3(1):1-19.
[18] 洪涛,程瑞雪,刘思汐,等.一种基于Transformer模型的古籍自动标点技术[J].数字人文,2021,6(2):111-122.
[19] GB/T 15834-2011,标点符号用法[S].北京:中国标准出版社,2012.
[20] ZYYXH/T363-2012,标点规范[S].北京:中国中医药出版社,2012.
[21] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL].[2022-11-04]. https://arxiv.org/pdf/1907.11692.pdf.
[22] 王东波,刘畅,朱子赫,等.SikuBERT与SikuRoBERTa:面向数字人文的《四库全书》预训练模型构建及应用研究[J].图书馆论坛,2022,42(6):31-43.