Knowledge Organization

Construction and Application of Entity Recognition Model Based on Deep Learning of Classics in Digital Humanities

  • Du Yue,
  • Wang Dongbo,
  • Jiang Chuan,
  • Xu Runhua,
  • Li Bin,
  • Xu Chao,
  • Xu Chenfei
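  • 1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
    2. College of Humanities, Jinling University of Science and Technology, Nanjing 210001;
    3. College of Literature, Nanjing Normal University, Nanjing 210097;
    4. School of Economics and Management, Nantong University, Nantong 226019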
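Du Yue (ORCID: 0000-0001-7131-2325), undergraduate student; Jiang Chuan (ORCID: 0000-0003-2436-9411), master's student; Xu Runhua (ORCID: 0000-0003-0889-1808), lecturer, PhD; Li Bin (ORCID: 0000-0002-7328-9947), associate professor, PhD; Xu Chao (ORCID: 0000-0003-1051-5633), engineer, PhD; Xu Chenfei (ORCID: 0000-0002-9894-9550), lecturer, PhD candidate.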

Received: 2018-11-25

  Revised: 2020-10-30

  Published online: 2021-02-05

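Funding

This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "Construction of a Syntax-Level Chinese-English Parallel Corpus Based on Indexes of Classics and Research on Humanities Computing" (Grant No. 71673143) and the National Social Science Fund of China Major Program project "Construction of a Knowledge Base of Classics Based on the Sinological Index Series and Research on Humanities Computing" (Grant No. 15ZDB127).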




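Cite this article

Du Yue, Wang Dongbo, Jiang Chuan, Xu Runhua, Li Bin, Xu Chao, Xu Chenfei. Construction and Application of Entity Recognition Model Based on Deep Learning of Classics in Digital Humanities[J]. Library and Information Service (图书情报工作), 2021, 65(3): 100-108. DOI: 10.13266/j.issn.0252-3116.2021.03.013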

Abstract

[Purpose/significance] Classics are carriers of traditional Chinese culture, thought and wisdom. Automatically recognizing entities in classics with the data acquisition, annotation and analysis methods of digital humanities is of great significance for subsequent application research. [Method/process] A corpus of ancient texts was constructed from 25 pre-Qin classics that had been automatically segmented and manually annotated. Using corpora of different sizes and seven deep learning models (Bi-LSTM, Bi-LSTM-Attention, Bi-LSTM-CRF, Bi-LSTM-CRF-Attention, Bi-RNN, Bi-RNN-CRF and BERT), the entities that constitute historical events were extracted and the models' performance was compared. [Result/conclusion] Trained on the full corpus, the Bi-LSTM-Attention and Bi-RNN-CRF models reached accuracies of 89.79% and 89.33%, respectively, confirming the feasibility of applying deep learning to large-scale text datasets.
