Knowledge Organization

Research on Feature Extraction Schemes at Chinese-Character Granularity in Sequence Labeling Models: A Case Study of Clinical Named Entity Recognition in CCKS2017: Task2

  • Sun An,
  • Yu Yingxiang,
  • Luo Yonggang,
  • Wang Qi
  • 1. Information and Archival Department, Shanghai University, Shanghai 200444;
    2. Library, Henan University of Science and Technology, Luoyang 471023;
    3. College of Medical Instrument, Shanghai University of Medicine & Health Sciences, Shanghai 201318;
    4. Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237
Sun An (ORCID: 0000-0002-2292-1308), librarian, PhD candidate, E-mail: 52127688@qq.com; Yu Yingxiang (ORCID: 0000-0002-1822-6302), professor, doctoral supervisor; Luo Yonggang (ORCID: 0000-0002-8572-335X), lecturer, PhD candidate; Wang Qi (ORCID: 0000-0002-6792-887X), master's student.

Received date: 2017-10-24

  Revised date: 2018-03-12

  Online published: 2018-06-05

Funding

This paper is one of the research outcomes of the National Social Science Fund of China General Project "Construction and Empirical Study of a 'Region-Nation' Integrated Model for Electronic Records Management" (project No. 11BTQ039).

Cite this article

Sun An, Yu Yingxiang, Luo Yonggang, Wang Qi. Research on Feature Extraction Schemes at Chinese-Character Granularity in Sequence Labeling Models: A Case Study of Clinical Named Entity Recognition in CCKS2017: Task2[J]. Library and Information Service (图书情报工作), 2018, 62(11): 103-111. DOI: 10.13266/j.issn.0252-3116.2018.11.012

Abstract

[Purpose/significance] Motivated by the characteristics of Chinese language expression, this paper proposes a character-granularity word feature extraction method that includes word segmentation tags. The method effectively improves the F1 value on the Chinese clinical named entity recognition (CNER) task and can serve as a reference for other Chinese sequence labeling models. [Method/process] Three features of Chinese words (part-of-speech tagging, keyword weight, and dependency parsing) are selected to construct the clinical-record training text for a character-granularity sequence labeling model; the corpus comes from CCKS2017: Task2. Under different feature combinations, the conditional random field (CRF) algorithm is used to evaluate two character-granularity word feature extraction schemes, Method 1 and Method 2. [Result/conclusion] Under all four word feature combinations, Method 2 outperforms Method 1 on the clinical named entity recognition task, with the F1 value increasing by 0.23% on average in four-fold cross-validation. The experiments show that, as Chinese word segmentation technology matures, Method 2 obtains better word feature representations than Method 1 and improves the performance of Chinese character-granularity sequence labeling models.
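
The abstract does not spell out how Method 1 and Method 2 differ, so the following is only a minimal sketch of the general idea of projecting a word-level feature onto characters, with and without a word segmentation tag. The function names, the B/M/E/S tag set, the `tag-feature` string format, and the sample sentence with its part-of-speech labels are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# A segmented sentence: (word, word-level feature). The POS labels here are
# hypothetical sample values, not taken from the CCKS2017 corpus.
Word = Tuple[str, str]


def segment_tags(word: str) -> List[str]:
    """Position tag for each character: S (single), or B / M / E."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def char_features_plain(words: List[Word]) -> List[Tuple[str, str]]:
    """Assumed variant without segmentation tags:
    every character simply inherits the feature of its word."""
    return [(ch, feat) for word, feat in words for ch in word]


def char_features_tagged(words: List[Word]) -> List[Tuple[str, str]]:
    """Assumed variant with segmentation tags:
    the inherited feature is prefixed with the character's position tag."""
    rows = []
    for word, feat in words:
        for ch, tag in zip(word, segment_tags(word)):
            rows.append((ch, f"{tag}-{feat}"))
    return rows


sentence: List[Word] = [("患者", "n"), ("腹部", "n"), ("痛", "v")]

for ch, feat in char_features_tagged(sentence):
    print(ch, feat)
```

In the plain variant the two adjacent nouns yield four characters that all carry the same `n` feature, so the word boundary between them is lost; the tagged variant keeps it (`B-n`, `E-n`, `B-n`, `E-n`). Rows of this kind could be written one character per line as feature columns for a CRF toolkit.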

Author contributions: Sun An proposed the research idea, designed the experimental scheme, and wrote the first draft of the paper; Yu Yingxiang designed the framework of the paper and suggested revisions; Luo Yonggang provided material and guidance for the choice of research topic; Wang Qi provided technical guidance.