[Purpose/significance] Based on the characteristics of Chinese-language expression, this paper proposes a method for extracting word features at character granularity by attaching word segmentation tags. The method effectively improves the F1 score of Chinese clinical named entity recognition (CNER) and can be applied to other Chinese sequence labeling models. [Method/process] Three kinds of Chinese word features, namely part-of-speech tags, keyword weights, and dependency parsing relations, were selected to construct clinical-case training text for a character-granularity sequence labeling model; the corpus comes from CCKS2017: Task 2. Under different feature combinations, the CRF algorithm was then used to compare Method 1 and Method 2, two methods for extracting word features at character granularity. [Result/conclusion] For all four combinations of word features, Method 2 outperformed Method 1 on the CNER task, with the F1 score increasing by an average of 0.23% in 4-fold cross-validation. The experiments show that, given mature Chinese word segmentation technology, Method 2 obtains better word feature representations than Method 1 and improves the performance of character-granularity sequence labeling models.
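As a rough illustration of what carrying a word feature at character granularity with a word segmentation tag could look like, the sketch below projects a part-of-speech tag from each segmented word onto its characters, either as-is or prefixed with the character's position in the word. The BMES tag set, the `tag-feature` joining format, the function names, and the example sentence with its POS labels are all assumptions made for illustration; the abstract does not specify the exact definitions of Method 1 and Method 2, so this should not be read as the authors' implementation.

```python
from typing import List, Tuple


def segmentation_tags(word: str) -> List[str]:
    """Return one BMES position tag per character of a segmented word.

    The BMES scheme (Begin, Middle, End, Single) is an assumed choice.
    """
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def char_rows(words: List[Tuple[str, str]], with_seg_tag: bool) -> List[Tuple[str, str]]:
    """Project a word-level feature onto every character of the word.

    with_seg_tag=False copies the word feature unchanged to each character.
    with_seg_tag=True prefixes it with the character's segmentation tag.
    """
    rows = []
    for word, feature in words:
        for ch, tag in zip(word, segmentation_tags(word)):
            value = f"{tag}-{feature}" if with_seg_tag else feature
            rows.append((ch, value))
    return rows


# Hypothetical segmented sentence with illustrative POS labels.
sentence = [("患者", "n"), ("腹痛", "n"), ("3", "m"), ("天", "q")]

plain = char_rows(sentence, with_seg_tag=False)
tagged = char_rows(sentence, with_seg_tag=True)

for (ch, p), (_, t) in zip(plain, tagged):
    print(ch, p, t)
```

In the plain projection, the two adjacent nouns yield four characters that all carry the same feature value, so the word boundary between them is lost. The tagged variant keeps that boundary visible to a character-level CRF, which is one plausible reading of why a segmentation tag could help.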
Author contributions: Sun An (孙安) proposed the research idea, designed the experimental scheme, and wrote the first draft of the paper; Yu Yingxiang (于英香) designed the framework of the paper and suggested revisions; Luo Yonggang (罗永刚) provided material and guidance for the choice of research topic; Wang Qi (王祺) provided technical guidance.