Library and Information Service (图书情报工作) ›› 2022, Vol. 66 ›› Issue (22): 134-141. DOI: 10.13266/j.issn.0252-3116.2022.22.012

• Knowledge Organization •

基于深度学习的古籍文本自动断句与标点一体化研究

袁义国1, 李斌1,2, 冯敏萱1, 贺胜1, 王东波3   

    1 南京师范大学文学院 南京 210097;
    2 南京师范大学数字与人文研究中心 南京 210023;
    3 南京农业大学信息管理学院 南京 210095
  • 收稿日期:2022-06-08 修回日期:2022-10-16 出版日期:2022-11-20 发布日期:2022-12-02
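  • Corresponding author: Li Bin, associate professor, PhD, E-mail: libin.njnu@gmail.com
  • About the authors: Yuan Yiguo, master's student; Feng Minxuan, associate professor, PhD; He Sheng, associate professor, PhD; Wang Dongbo, professor, PhD.
  • Funding:
    This work is one of the research outcomes of the Jiangsu Provincial Social Science Foundation project "Research on Artificial Intelligence-Assisted Traditional Culture Education for Adolescents" (Grant No. 20JYB004) and the Major Project of the National Social Science Fund of China "Construction and Application of a Cross-Lingual Knowledge Base of Ancient Chinese Classics" (Grant No. 21ZD&331).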

A Joint Model of Automatic Sentence Segmentation and Punctuation for Ancient Classical Texts Based on Deep Learning

Yuan Yiguo1, Li Bin1,2, Feng Minxuan1, He Sheng1, Wang Dongbo3   

    1 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097;
    2 Institute of Digit and Humanities, Nanjing Normal University, Nanjing 210023;
    3 College of Information Management, Nanjing Agricultural University, Nanjing 210095
  • Received:2022-06-08 Revised:2022-10-16 Online:2022-11-20 Published:2022-12-02

摘要: [目的/意义] 中国拥有海量的古代典籍,利用计算机对古籍文本进行自动断句与标点有助于加快古籍资源的转化利用。现有研究主要存在两个亟待解决的问题。首先,将古籍断句与标点分为两个串行任务,会引起错误传递。其次,自动标注的标点也较为混乱,对长距离可嵌套的成对引号标注研究较少。[方法/过程] 通过对大规模古籍语料库的标点符号频率统计,结合现有标点符号用法标准,明确古文自动标点的符号体系。根据点号含有断句信息,提出断句标点一体化处理方案,直接在没有断句的古籍文本上进行自动标点。并通过设计多元引号标记集和段首填充占位符,解决长距离可嵌套成对引号的自动标注难题。算法上根据序列标注方法,采用SikuRoBERTa-BiLSTM-CRF在1亿多字的繁体古籍文本语料上完成模型训练。[结果/结论] 在开放测试集《左传》上,点号标注的F1值为77.09%,断句达到91.72%;对单个引号的标注F1值达到89.28%,成对引号为83.88%。结果表明本文的方法有效地提升了古籍文本的自动断句与自动标点效果,有效地解决了引号的自动标注问题。

关键词: 自动断句, 自动标点, 古籍, 深度学习, 数字人文

Abstract: [Purpose/Significance] China has a vast body of ancient books, and automatic sentence segmentation and punctuation of their texts can speed up the transformation and use of these resources. Existing research leaves two problems unsolved. First, it treats sentence segmentation and punctuation as two serial tasks, which causes error propagation. Second, the automatically tagged punctuation marks are inconsistent, and long-distance, nestable paired quotation marks have received little attention. [Method/Process] Based on punctuation frequency statistics from a large-scale corpus of ancient books and on existing punctuation usage standards, the paper defined the set of punctuation marks used for automatic punctuation of ancient texts. Because stop punctuation marks carry sentence segmentation information, an integrated scheme for segmentation and punctuation was proposed, in which punctuation is predicted directly on unsegmented ancient texts. A multi-tag set for quotation marks and placeholders padded at the beginning of paragraphs were designed to handle the automatic tagging of long-distance, nestable paired quotation marks. Treating the task as sequence labeling, the paper trained a SikuRoBERTa-BiLSTM-CRF model on a corpus of ancient texts in traditional Chinese characters containing more than 100 million characters. [Result/Conclusion] On the open test set, Zuo Zhuan, the F1 score is 77.09% for stop punctuation tagging and 91.72% for sentence segmentation; for quotation marks, it is 89.28% for individual marks and 83.88% for paired marks. The results show that the proposed method effectively improves automatic sentence segmentation and punctuation of ancient texts and effectively addresses the automatic tagging of quotation marks.

Key words: automatic sentence segmentation, automatic punctuation, ancient books, deep learning, digital humanities
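
The integrated scheme described in the abstract, in which stop punctuation marks double as sentence segmentation information, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-mark set `STOPS`, the `O` label, and the function names `to_labels`, `segment`, and `punctuate` are assumptions made for illustration. The sketch covers only the labeling idea; quotation-mark tags, paragraph-initial placeholders, and the SikuRoBERTa-BiLSTM-CRF model are left out, and the input is assumed to contain stop marks as its only punctuation.

```python
# Hypothetical stop-mark set; the abstract does not list the paper's actual symbol system.
STOPS = set(",。、;:?!")
NO_PUNCT = "O"  # assumed label for a character that is not followed by a stop mark


def to_labels(punctuated: str) -> list[tuple[str, str]]:
    """Turn punctuated text into (character, label) pairs.

    Each character is labeled with the stop mark that follows it,
    or NO_PUNCT when no stop mark follows.
    """
    pairs: list[tuple[str, str]] = []
    for ch in punctuated:
        if ch in STOPS:
            if pairs:  # attach the mark to the preceding character
                pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, NO_PUNCT))
    return pairs


def segment(pairs: list[tuple[str, str]]) -> list[str]:
    """Derive segmentation from the labels: break after every stop-marked character."""
    segments: list[str] = []
    current: list[str] = []
    for ch, label in pairs:
        current.append(ch)
        if label != NO_PUNCT:
            segments.append("".join(current))
            current = []
    if current:  # trailing text without a closing stop mark
        segments.append("".join(current))
    return segments


def punctuate(pairs: list[tuple[str, str]]) -> str:
    """Rebuild punctuated text from (character, label) pairs."""
    return "".join(ch + (label if label != NO_PUNCT else "") for ch, label in pairs)


pairs = to_labels("學而時習之,不亦說乎?")
print(segment(pairs))
print(punctuate(pairs))
```

Because every segment boundary coincides with a stop-mark label, a single tagger that predicts these labels yields both outputs at once, which is why no separate segmentation stage (and no error passed between stages) is needed in this formulation.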
