Special Topic: A New Paradigm for Data-Driven Research

Word Segmentation and Entity Recognition in Chinese Electronic Medical Records

  • Wang Ruojia ,
  • Zhao Changyu ,
  • Wang Jimin
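  • 1. Department of Information Management, Peking University, Beijing 100871;
    2. Institute of Ocean Research, Peking University, Beijing 100871
Wang Ruojia (ORCID: 0000-0003-1806-0688), PhD candidate; Zhao Changyu (ORCID: 0000-0001-6780-1070), master's degree.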

Received date: 2018-07-16

  Revised date: 2018-09-15

  Online published: 2019-01-20

Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record

  • Wang Ruojia ,
  • Zhao Changyu ,
  • Wang Jimin
  • 1. Department of Information Management, Peking University, Beijing 100871;
    2. Institute of Ocean Research, Peking University, Beijing 100871

Received date: 2018-07-16

  Revised date: 2018-09-15

  Online published: 2019-01-20

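Abstract (Chinese)

[Purpose/significance] Healthcare big data is an important basic strategic resource for China. This study's discussion and empirical evaluation of word segmentation and entity recognition for Chinese electronic medical records (EMRs) accomplishes the information extraction task for medical data reasonably well, and it matters for the future development of semantic-level applications of medical big data. [Method/process] The study first builds a medical vocabulary on the order of 100,000 terms by merging authoritative term lists, official standards, health website data, and other supplementary medical lexicons. It then segments the fields of the EMRs and compares the segmentation performance of four models: the jieba tool, jieba with an imported dictionary, unsupervised learning, and the AC automaton. Finally, using the automatic segmentation and manual annotation results as the corpus, it performs EMR entity recognition based on conditional random fields (CRF), compares recognition performance across entity categories and across text features, and selects the best template. [Result/conclusion] The segmentation results show that the AC automaton performs best, with an F-measure of up to 82%. The entity recognition results show that "test" and "disease" entities are recognized best, while recognition of "symptom" entities is less satisfactory.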
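The dictionary-driven AC automaton segmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `ACAutomaton`, the helper `segment`, the four-word toy lexicon, and the greedy longest-match rule for resolving overlapping dictionary hits are all assumptions made for illustration, and the abstract does not say how overlaps were resolved in the study.

```python
from collections import deque


class ACAutomaton:
    """Aho-Corasick automaton over a word list (hypothetical helper class)."""

    def __init__(self, words):
        self.children = [{}]  # children[node] maps a character to a child node
        self.fail = [0]       # failure link of each node
        self.out = [[]]       # lengths of dictionary words ending at each node
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.out.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.out[node].append(len(word))

    def _build_failure_links(self):
        # Breadth-first traversal: depth-1 nodes keep failure link 0 (the root).
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end) spans of every dictionary word found in text."""
        spans, node = [], 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for length in self.out[node]:
                spans.append((i - length + 1, i + 1))
        return spans


def segment(text, automaton):
    """Greedy left-to-right segmentation; unmatched characters become single tokens.

    The longest-match rule here is an assumption, not taken from the paper.
    """
    longest = {}
    for start, end in automaton.find_all(text):
        if end > longest.get(start, start):
            longest[start] = end
    tokens, i = [], 0
    while i < len(text):
        end = longest.get(i, i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


# Toy lexicon standing in for the roughly 100,000-term medical vocabulary.
lexicon = ["高血压", "血压", "糖尿病", "病史"]
tokens = segment("患者有高血压糖尿病史", ACAutomaton(lexicon))
print(tokens)
```

The automaton finds every dictionary hit in a single pass over the text, which is why this family of methods scales to large lexicons; the quality of the resulting segmentation then depends mostly on lexicon coverage and on the overlap rule.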

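Cite this article

Wang Ruojia, Zhao Changyu, Wang Jimin. Word segmentation and entity recognition in Chinese electronic medical records[J]. Library and Information Service (图书情报工作), 2019, 63(2): 34-42. DOI: 10.13266/j.issn.0252-3116.2019.02.004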

Abstract

[Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition for Chinese electronic medical records (EMRs) help extract important information from large volumes of unstructured text. [Method/process] This study first builds a Chinese medical thesaurus from authoritative medical subject headings, official standards, and health website data. It then compares the performance of four segmentation methods on a corpus of automatic segmentation and manual annotation. Finally, a conditional random field (CRF) model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] The results show that (i) the AC automaton model achieves the best F-measure in EMR word segmentation, at 82%; (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, the "Test" and "Disease" entities have higher F-measures, while the F-measure for the "Symptom" entity is less satisfactory.
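For readers unfamiliar with the F-measure reported above, the following sketch shows one common way to score a segmentation against a gold standard. It assumes span-level exact matching (a predicted word counts as correct only if both of its boundaries agree with the gold segmentation); the abstract does not state which scoring convention the study used, and the function names `to_spans` and `prf` and the example sentence are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def prf(gold_tokens, pred_tokens):
    """Precision, recall, and F-measure under exact span matching (assumed convention)."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Illustrative data only, not taken from the study's corpus.
gold = ["患者", "有", "高血压", "病史"]
pred = ["患", "者", "有", "高血压", "病史"]
precision, recall, f_measure = prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

In this toy case three of the five predicted words match gold spans, so precision is 3/5, recall is 3/4, and the F-measure is their harmonic mean, 2/3.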
