Library and Information Service (图书情报工作), 2023, Vol. 67, Issue 3: 119-130. DOI: 10.13266/j.issn.0252-3116.2023.03.011

• Knowledge Organization •

Research on the Construction of an Event Recognition Model for Historical Antique Books Based on Text Generation Technology

Wang Yanying1,2, Wang Hao1,2, Zhu Hui1,2, Li Xiaomin1,2

  1. School of Information Management, Nanjing University, Nanjing 210023;
  2. Jiangsu Province Key Laboratory of Data Engineering and Knowledge Services (Nanjing University), Nanjing 210093
  • Received: 2022-09-02 Revised: 2022-12-06 Online: 2023-02-24 Published: 2023-02-24
  • Corresponding author: Wang Hao, professor, PhD, doctoral supervisor, E-mail: ywhaowang@nju.edu.cn
  • About the authors: Wang Yanying, master's student; Zhu Hui, associate professor, PhD, master's supervisor; Li Xiaomin, doctoral student.
  • Funding:
    This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Semantic Parsing and Humanities Computing of Chinese Intangible Cultural Heritage Texts Driven by Linked Data" (Grant No. 72074108) and the Fundamental Research Funds for the Central Universities project "Research on Semantic Analysis and Knowledge Graphs of Local Chronicle Texts for Humanities Computing" (Grant No. 010814370113).

Abstract: [Purpose/Significance] To address event recognition in historical antique books, this paper compares the sequence labeling method with the text generation method, examines how both perform on ancient Chinese, and builds a model that automates event recognition, with the aim of improving the efficiency of knowledge graph construction for historical antique books. [Method/Process] Records of the Three Kingdoms was selected as the original corpus. In the sequence labeling experiment, the Records of the Three Kingdoms event dataset was annotated with BMES tags and the BBCN-SG model was constructed; in the text generation experiment, the T5-SG model was constructed, and the performance of the two methods was compared. The RoBERTa-SG and NEZHA-SG models were then constructed for comparison experiments among text generation models. Finally, the three text generation models were combined using the idea of Stacking ensemble learning to construct the Stacking-TRN-SG model. [Result/Conclusion] For event recognition in historical antique books, the text generation method performs significantly better than the sequence labeling method. Among the three text generation models, RoBERTa-SG achieves the best overall recognition performance. Stacking ensemble learning greatly improves the recognition performance of the generation models, and the resulting Stacking-TRN-SG model achieves a recall of 70.35%, providing a preliminary realization of automatic event recognition for historical antique books.

Key words: historical antique books, event recognition, text generation, sequence labeling, ensemble learning

CLC number: