Knowledge Organization

Research on the Construction of an Event Recognition Model for Historical Antique Books Based on Text Generation Technology

  • Wang Yanying,
  • Wang Hao,
  • Zhu Hui,
  • Li Xiaomin

  • 1 School of Information Management, Nanjing University, Nanjing 210023;
    2 Jiangsu Province Key Laboratory of Data Engineering and Knowledge Services (Nanjing University), Nanjing 210093

Wang Yanying, master's student; Zhu Hui, associate professor, PhD, master's supervisor; Li Xiaomin, doctoral student.

Received date: 2022-09-02

  Revised date: 2022-12-06

  Online published: 2023-02-24

Funding

This work is supported by the General Program of the National Natural Science Foundation of China, "Semantic Parsing and Humanities Computing of China's Intangible Cultural Heritage Texts Driven by Linked Data" (grant No. 72074108), and by the Fundamental Research Funds for the Central Universities project "Semantic Analysis and Knowledge Graph Research on Local Chronicle Texts for Humanities Computing" (grant No. 010814370113).

Cite this article

Wang Yanying, Wang Hao, Zhu Hui, Li Xiaomin. Research on the construction of an event recognition model for historical antique books based on text generation technology[J]. Library and Information Service, 2023, 67(3): 119-130. DOI: 10.13266/j.issn.0252-3116.2023.03.011.

Abstract

[Purpose/Significance] To address the problem of event recognition for historical antique books, this paper compares the sequence labeling method with the text generation method, explores how the two methods perform on ancient Chinese, and constructs a model that automates event recognition for historical antique books, so as to improve the efficiency of building knowledge graphs from such texts. [Method/Process] Records of the Three Kingdoms was selected as the original corpus. In the sequence labeling experiment, the Records of the Three Kingdoms event dataset was annotated with the BMES scheme and a BBCN-SG model was constructed; in the text generation experiment, a T5-SG model was constructed, and the performance of the two methods was compared. RoBERTa-SG and NEZHA-SG models were then constructed for comparison experiments among the generation models. Finally, the three text generation models were combined, incorporating the idea of Stacking ensemble learning, to construct a Stacking-TRN-SG model. [Result/Conclusion] On the problem of modeling event recognition for historical antique books, the text generation method performs significantly better than the sequence labeling method. Among the three text generation models, RoBERTa-SG achieves the best overall recognition performance. Stacking ensemble learning greatly improves the recognition performance of the generation models: the Stacking-TRN-SG model achieves a recall of 70.35%, preliminarily realizing automatic event recognition for historical antique books.
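
The abstract contrasts two formulations. Under sequence labeling, each character of the ancient-Chinese sentence receives a BMES tag (B = begin, M = middle, E = end, S = single-character span); under text generation, a sequence-to-sequence model reads the sentence and decodes the event mention directly as text. The sketch below is a minimal illustration of the generation formulation only, not the paper's T5-SG model: it assumes the Hugging Face transformers library and the public google/mt5-small checkpoint, and the prompt string, example sentence, and training-pair format are invented for illustration; a real system would first be fine-tuned on (sentence, event) pairs from the annotated Records of the Three Kingdoms corpus.

```python
# Minimal sketch: event recognition cast as text generation (seq2seq).
# Assumptions (not from the paper): the transformers library, the public
# google/mt5-small checkpoint, and an invented "识别事件:" prompt format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# One clause of classical Chinese; the model is asked to generate the
# event mention it contains. Fine-tuning pairs would look like:
#   "识别事件: <sentence>"  ->  "<event mention>"
sentence = "布与麾下登白门楼。"
inputs = tokenizer("识别事件: " + sentence, return_tensors="pt")

# Beam search decodes the event mention token by token; before
# fine-tuning, the output is of course meaningless.
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

By contrast, a BMES-style recognizer emits one tag per character (e.g. 登/B 白/M 门/M 楼/E), so its recall depends on every boundary tag being exactly right, which is one plausible reason the generation models fare better on terse classical prose.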

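The Stacking step in the abstract combines the three generation models (T5-SG, RoBERTa-SG, NEZHA-SG) under one meta-learner. The paper's exact meta-model is not given here, so the sketch below only illustrates the generic Stacking pattern with scikit-learn stand-ins: out-of-fold predictions from several base learners become the input features of a final estimator.

```python
# Minimal sketch of Stacking ensemble learning (not Stacking-TRN-SG itself):
# the base estimators stand in for the three text generation models, and
# cross-validated (out-of-fold) predictions feed the meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("base1", RandomForestClassifier(random_state=0)),
        ("base2", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # out-of-fold predictions keep the meta-learner from overfitting
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))
```

In the paper's setting, the meta-level would presumably consume the candidate event strings (or their scores) produced by the three generators rather than tabular features, with recall reported at 70.35% for the combined model.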