Library and Information Service ›› 2021, Vol. 65 ›› Issue (9): 97-104. DOI: 10.13266/j.issn.0252-3116.2021.09.011

• Knowledge Organization •

古籍同事异文的自动发掘研究

Research on Automatic Mining of Variants Expressing the Same Event in the Ancient Books

Liang Yuan1,2, Wang Dongbo1,2, Huang Shuiqing1,2

  1. College of Information Management, Nanjing Agricultural University, Nanjing 210095;
  2. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095
  • Received: 2020-11-06 Revised: 2021-02-03 Online: 2021-05-05 Published: 2021-06-02
  • Corresponding author: Huang Shuiqing (ORCID: 0000-0002-1646-9300), professor, PhD, doctoral supervisor, E-mail: sqhuang@njau.edu.cn
  • About the authors: Liang Yuan (ORCID: 0000-0002-4922-6938), PhD candidate; Wang Dongbo (ORCID: 0000-0002-9894-9550), professor, PhD, doctoral supervisor.
  • Funding:
    This work is one of the research outcomes of the Major Program of the National Social Science Fund of China, "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series" (Grant No. 15ZDB127), and the General Program of the National Natural Science Foundation of China, "Construction of a Syntax-Level Chinese-English Parallel Corpus and Humanities Computing Research Based on Indexes to the Classics" (Grant No. 71673143).

Abstract: [Purpose/significance] Textual variants are common in ancient books and an important object of study. Traditional collation requires scholars to search large numbers of ancient texts by hand for collation materials, including variants. This is time-consuming and laborious, and the material found may be neither accurate nor complete. Mining variants automatically by computer makes it possible to draw useful information from a much larger corpus. Collation supported by automatic variant mining can also perform exhaustive retrieval, which matters greatly for collating a text against other works and offers new ideas and methods for the collation of ancient books. [Method/process] Using the Spring and Autumn Annals and its three commentaries (rendered by the authors as Three Biographies of the Spring and Autumn Period) as the experimental corpus, this study borrows the parallel-corpus idea common in machine translation and combines it with deep learning. It compares LSTM and BERT models with the classic SVM model, and further explores variants expressing the same event, that is, passages in two ancient books that describe one event in different wording. [Result/conclusion] The experiments yield a deep learning model for automatically mining variants expressing the same event in the three commentaries. The results demonstrate that emerging techniques such as deep learning can feasibly be integrated into research such as building knowledge bases of ancient books. They also show that combining deep learning with the parallel-corpus idea can play a substantial role in the study of variants, and they provide practical support for applying digital humanities to research on Chinese language and literature.

Key words: Three Biographies of the Spring and Autumn Period, variants, BERT, automatic mining, digital humanities
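
The parallel-corpus framing in the abstract can be illustrated with a minimal Python sketch: every sentence of one text is paired with every sentence of another, and each pair is scored as a candidate variant of the same event. The scorer here is a simple character-bigram Jaccard overlap, used only as a stand-in; it is not the SVM, LSTM, or BERT model compared in the paper. The function names, the 0.3 threshold, and the toy sentences are all assumptions made for illustration and do not come from the paper's dataset.

```python
from itertools import product


def char_bigrams(sentence):
    """Return the set of character bigrams, skipping punctuation and spaces."""
    chars = [c for c in sentence if c.isalnum()]
    return {a + b for a, b in zip(chars, chars[1:])}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_candidates(text_a, text_b, threshold=0.3):
    """Pair every sentence of text_a with every sentence of text_b and keep
    the pairs whose bigram overlap reaches the (hypothetical) threshold."""
    pairs = []
    for sent_a, sent_b in product(text_a, text_b):
        score = jaccard(char_bigrams(sent_a), char_bigrams(sent_b))
        if score >= threshold:
            pairs.append((sent_a, sent_b, score))
    # Highest-scoring candidates first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy sentences for illustration only (not the paper's data).
annals = ["夏五月，郑伯克段于鄢。", "秋七月，天王使宰咺来归惠公仲子之赗。"]
commentary = ["书曰：郑伯克段于鄢。", "冬十月庚申，改葬惠公。"]

for sent_a, sent_b, score in mine_candidates(annals, commentary):
    print(f"{score:.2f}\t{sent_a}\t{sent_b}")
```

A lexical-overlap baseline like this only catches variants that still share wording. Pairs that describe the same event in entirely different words are the cases where a trained classifier over sentence pairs, as the abstract describes, would be needed.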

CLC Number: