[Purpose/significance] Medical entity extraction is a key step in information organization and knowledge mining in the medical and health domain. Chinese medical entities are highly specialized, follow complex naming rules, and are difficult to extract; this paper therefore explores how several deep learning methods can be combined to improve the accuracy of Chinese medical entity extraction. [Method/process] First, building on the deep learning model BiLSTM-CRF, the study introduces the language model BERT and the iterated dilated convolutional neural network (IDCNN) to strengthen the semantic representation of text and the capture of local features. Next, BERT pre-training is used to transfer knowledge from external medical corpora, achieving the fusion of multiple semantic features. A self-attention mechanism is then introduced to capture important global contextual information, and Highway networks are added to optimize the training of deep networks and counter the loss of accuracy caused by increasing network depth. The result is the MF-HDL (Multi Feature-Hybrid Deep Learning) model. [Result/conclusion] MF-HDL performs markedly well on a Chinese diabetes dataset: its F1 value is 18.42% and 17.18% higher than those of the baseline models IDCNN-CRF and BiLSTM-CRF, respectively, which indicates that the method performs well on the Chinese medical entity extraction task.
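The role of the Highway component can be illustrated with a minimal sketch. It uses the standard Highway formulation of Srivastava et al., y = T(x) * H(x) + (1 - T(x)) * x, and is not the MF-HDL implementation: the function names, the tanh transform, the vector size, and the random weights are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b for a square weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Elementwise y = T(x) * H(x) + (1 - T(x)) * x."""
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]  # candidate transform H(x)
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]    # transform gate T(x) in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def make_params(dim, gate_bias, seed=0):
    """Hypothetical helper: small random weights and a constant gate bias."""
    rng = random.Random(seed)
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    w_t = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return w_h, [0.0] * dim, w_t, [gate_bias] * dim


x = [0.2, -1.0, 0.7]
# Strongly negative gate bias: T(x) is close to 0, so x is carried through.
carried = highway_layer(x, *make_params(3, gate_bias=-20.0))
# Strongly positive gate bias: T(x) is close to 1, so the output is roughly H(x).
transformed = highway_layer(x, *make_params(3, gate_bias=20.0))
```

When the gate stays near zero the layer behaves almost like an identity mapping, which is the property that lets extra depth be added without forcing every layer to transform its input.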
Author contributions: Han Pu (韩普) proposed the research idea, provided guidance on the research methods, and wrote and revised the paper; Gu Liang (顾亮) collected the data, wrote the code, and wrote and revised the paper.