Library and Information Service ›› 2019, Vol. 63 ›› Issue (2): 34-42. DOI: 10.13266/j.issn.0252-3116.2019.02.004

• Special topic: New Paradigms of Data-Driven Research •

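Word Segmentation and Entity Recognition in Chinese Electronic Medical Records

Wang Ruojia1,2, Zhao Changyu1, Wang Jimin1

  1. Department of Information Management, Peking University, Beijing 100871;
  2. Institute of Ocean Research, Peking University, Beijing 100871
  • Received: 2018-07-16 Revised: 2018-09-15 Published: 2019-01-20 Released: 2019-01-20
  • Corresponding author: Wang Jimin (ORCID: 0000-0002-3573-7788), professor, doctoral supervisor, E-mail: wjm@pku.edu.cn
  • About the authors: Wang Ruojia (ORCID: 0000-0003-1806-0688), doctoral candidate; Zhao Changyu (ORCID: 0000-0001-6780-1070), master's degree.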

Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record

Wang Ruojia1,2, Zhao Changyu1, Wang Jimin1

  1. Department of Information Management, Peking University, Beijing 100871;
  2. Institute of Ocean Research, Peking University, Beijing 100871
  • Received: 2018-07-16 Revised: 2018-09-15 Online: 2019-01-20 Published: 2019-01-20

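Abstract: [Purpose/significance] Healthcare big data is an important basic strategic resource for China. Through discussion and empirical study of word segmentation and entity recognition in Chinese electronic medical records, this study completes the information extraction task on medical data reasonably well, which is significant for the future application of healthcare big data at the semantic level. [Method/process] The study first builds a medical vocabulary on the order of 100,000 terms by merging authoritative thesauri, official standards, health website data, and other supplementary medical lexicons. It then segments the fields of the electronic medical records and compares the segmentation performance of four models: the jieba tool, jieba with an imported dictionary, unsupervised learning, and the AC automaton. Finally, using the automatic segmentation and manual annotation results as the corpus, it implements entity recognition for electronic medical records based on conditional random fields, compares recognition performance across entity categories and across text features, and selects the best feature template. [Result/conclusion] The segmentation results show that the AC automaton performs best, with an F-measure of up to 82%. The entity recognition results show that "test" and "disease" entities are recognized best, while recognition of "symptom" entities is less satisfactory.

Keywords: electronic medical record, Chinese word segmentation, entity recognition, healthcare big data, AC automaton, conditional random field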
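To make the dictionary-driven approach concrete, the following is a minimal sketch of segmentation with an AC (Aho-Corasick) automaton. It is not the authors' implementation: the four-word vocabulary, the class name `ACAutomaton`, the function `segment`, and the longest-match rule for resolving overlapping matches are all assumptions made for illustration. The abstract only states that an AC automaton was run against a medical vocabulary of roughly 100,000 terms.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over characters (illustrative only)."""

    def __init__(self, words):
        # Parallel lists, one entry per trie node; node 0 is the root.
        self.children = [{}]   # char -> child node index
        self.fail = [0]        # failure link of each node
        self.out = [[]]        # lengths of dictionary words ending at each node
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.out.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.out[node].append(len(word))

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end) spans of every dictionary word found in text."""
        spans, node = [], 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for length in self.out[node]:
                spans.append((i - length + 1, i + 1))
        return spans


def segment(text, automaton):
    """Left-to-right segmentation, preferring the longest dictionary match.

    Characters not covered by a chosen match become single-character tokens.
    This overlap rule is an assumption, not taken from the paper.
    """
    longest = {}
    for start, end in automaton.find_all(text):
        longest[start] = max(longest.get(start, start + 1), end)
    tokens, i = [], 0
    while i < len(text):
        end = longest.get(i, i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical toy vocabulary standing in for the medical word list.
vocab = ["高血压", "血压", "糖尿病", "头痛"]
ac = ACAutomaton(vocab)
print(segment("患者高血压伴头痛", ac))
# → ['患', '者', '高血压', '伴', '头痛']
```

The automaton scans the text once regardless of vocabulary size, which is what makes dictionary matching practical with a word list of this scale. Out-of-vocabulary words such as "患者" fall apart into single characters here, which shows why the coverage of the vocabulary matters for this kind of method.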

Abstract: [Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition of Chinese electronic medical records (EMR) help extract important information from large volumes of unstructured text. [Method/process] In this study, a Chinese medical thesaurus is first built from authoritative medical subject headings, official standards, and health website data; then, the performance of four segmentation methods is compared; finally, using the automatically segmented and manually annotated results as the corpus, a CRF model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] Results show that (i) the AC automaton model has the best F-measure in EMR word segmentation, at 82%; (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, "Test" and "Disease" entities achieve better F-measures, while the F-measure for "Symptom" entities is less satisfactory.

Key words: healthcare data mining, electronic medical record, Chinese word segmentation, named entity recognition, AC automaton, conditional random field
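For readers unfamiliar with how an F-measure is obtained for word segmentation, here is a minimal sketch. It assumes the common word-level definition, in which a predicted word counts as correct only if both its boundaries match the gold standard, and the balanced form F = 2PR / (P + R). The abstract does not state which variant the authors used, and the helper names `to_spans` and `prf` as well as the example sentence are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def prf(gold_tokens, pred_tokens):
    """Word-level precision, recall, and balanced F-measure (assumed F1)."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)  # words whose boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold segmentation versus a system output for the same sentence.
gold = ["患者", "高血压", "伴", "头痛"]
pred = ["患", "者", "高血压", "伴", "头痛"]
p, r, f = prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
# → P=0.60 R=0.75 F=0.67
```

In this toy case three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4. The same span-matching idea extends to entity recognition if each span also carries its entity label.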

CLC number: