图书情报工作 (Library and Information Service) ›› 2018, Vol. 62 ›› Issue (11): 103-111. DOI: 10.13266/j.issn.0252-3116.2018.11.012

• Knowledge Organization •

序列标注模型中的字粒度特征提取方案研究——以CCKS2017:Task2临床病历命名实体识别任务为例

孙安1,2, 于英香1, 罗永刚1,3, 王祺4   

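  1. Information and Archival Department, Shanghai University, Shanghai 200444;
  2. Library, Henan University of Science and Technology, Luoyang 471023;
  3. College of Medical Instrument, Shanghai University of Medicine & Health Sciences, Shanghai 201318;
  4. Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237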
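  • Received: 2017-10-24  Revised: 2018-03-12  Online: 2018-06-05  Published: 2018-06-05
  • About the authors: Sun An (ORCID: 0000-0002-2292-1308), librarian, PhD candidate, E-mail: 52127688@qq.com; Yu Yingxiang (ORCID: 0000-0002-1822-6302), professor, doctoral supervisor; Luo Yonggang (ORCID: 0000-0002-8572-335X), lecturer, PhD candidate; Wang Qi (ORCID: 0000-0002-6792-887X), master's student.
  • Supported by:
    This paper is one of the research outcomes of the National Social Science Fund of China general project "Construction and Empirical Study of a 'Region-Nation' Integrated Model for Electronic Records Management" (Grant No. 11BTQ039).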

Research on Feature Extraction Scheme of Chinese-character Granularity in Sequence Labeling Model——A Case Study About Clinical Named Entity Recognition of CCKS2017: Task2

Sun An1,2, Yu Yingxiang1, Luo Yonggang1,3, Wang Qi4   

  1. Information and Archival Department, Shanghai University, Shanghai 200444;
  2. Library, Henan University of Science and Technology, Luoyang 471023;
  3. College of Medical Instrument, Shanghai University of Medicine & Health Sciences, Shanghai 201318;
  4. Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237
  • Received: 2017-10-24  Revised: 2018-03-12  Online: 2018-06-05  Published: 2018-06-05

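Abstract: [Purpose/significance] Reflecting the characteristics of Chinese-language expression, this paper proposes a character-granularity method for extracting word features that includes word segmentation tags. The method effectively improves the F1 score on the Chinese clinical record named entity recognition task and can also serve as a reference for other Chinese sequence labeling models. [Method/process] Three features of Chinese words are selected (part-of-speech tags, keyword weights, and dependency parsing) to build the clinical record training text for a character-granularity sequence labeling model; the corpus comes from CCKS2017: Task2. Under different feature combinations, a conditional random field algorithm is used to evaluate two character-granularity word feature extraction schemes, Method1 and Method2. [Result/conclusion] Across four different word feature combinations, Method2 outperforms Method1 on the clinical record named entity recognition task, with the F1 score rising by 0.23% on average in four-fold cross-validation. The experiments indicate that, as Chinese word segmentation technology matures, Method2 yields better word feature representations than Method1 and improves the performance of Chinese character-granularity sequence labeling models.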

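Keywords: named entity recognition, character granularity, feature extraction, sequence labeling model, conditional random field, clinical records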

Abstract: [Purpose/significance] Based on the characteristics of Chinese-language expression, this paper proposes a character-granularity word feature extraction method that incorporates word segmentation tags. The method improves the F1 score of Chinese clinical named entity recognition (CNER) and can be adopted by other Chinese sequence labeling models. [Method/process] Three kinds of Chinese word features (part-of-speech tags, keyword weights, and dependency parsing) are used to construct the clinical training text for a character-granularity sequence labeling model; the corpus is taken from CCKS2017: Task2. Under different feature combinations, the CRF algorithm is used to compare Method 1 and Method 2, two word feature extraction schemes at character granularity. [Result/conclusion] For all four combinations of word features, Method 2 performs better than Method 1 on the CNER task, and the F1 score increases by 0.23% on average in 4-fold cross-validation. The experiments indicate that, given increasingly mature Chinese word segmentation technology, Method 2 obtains better word feature representations than Method 1 and improves the performance of Chinese character-granularity sequence labeling models.

Key words: named entity recognition, character granularity, feature extraction, sequence labeling model, conditional random field, clinical cases

CLC number:
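
The abstract says only that the proposed scheme attaches word segmentation tags to word features at character granularity; it does not define Method1 and Method2 in detail. The following is a minimal sketch under stated assumptions: the baseline is assumed to let each character inherit its word's feature unchanged, and the tagged variant is assumed to prefix that feature with a BMES position tag. The function names, the BMES scheme, the `tag-feature` format, and the sample part-of-speech tags are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

WordFeature = Tuple[str, str]   # (word, word-level feature such as a POS tag)
CharFeature = Tuple[str, str]   # (character, character-level feature)


def seg_tags(word: str) -> List[str]:
    """Return one BMES position tag per character of a segmented word."""
    if len(word) == 1:
        return ["S"]                      # single-character word
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def expand_plain(words: List[WordFeature]) -> List[CharFeature]:
    """Assumed baseline: every character inherits the word's feature as-is."""
    return [(ch, feat) for word, feat in words for ch in word]


def expand_with_seg_tags(words: List[WordFeature]) -> List[CharFeature]:
    """Assumed tagged variant: the word's feature is prefixed with a BMES tag."""
    rows: List[CharFeature] = []
    for word, feat in words:
        for ch, tag in zip(word, seg_tags(word)):
            rows.append((ch, f"{tag}-{feat}"))
    return rows


# Hypothetical segmented fragment with illustrative POS tags.
sample: List[WordFeature] = [("患者", "n"), ("右", "f"), ("上腹", "n"), ("疼痛", "v")]

if __name__ == "__main__":
    for (ch, plain), (_, tagged) in zip(expand_plain(sample),
                                        expand_with_seg_tags(sample)):
        # One row per character, in the column layout that CRF toolkits
        # commonly accept as training input.
        print(f"{ch}\t{plain}\t{tagged}")
```

In this sketch the tagged column lets a sequence model distinguish a character that begins a noun from one that ends it, which is one plausible reading of why segmentation tags might help a character-level model.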