Knowledge Organization

Chinese Medical New Word Discovery Fusing the Multi-Semantics of Chinese Characters with Statistical Text Features

  • Wang Weijie,
  • Ren Huiling,
  • Li Xiaoying,
  • Wang Xu,
  • Zhang Ying
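  • Institute of Medical Information/Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020
Wang Weijie, master's student; Li Xiaoying, associate research librarian, PhD; Wang Xu, master's student; Zhang Ying, master's student.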

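Received date: 2023-08-09

  Revised date: 2023-11-24

  Online published: 2024-03-28

Funding

This paper is one of the research outputs of "Construction of a Chinese Medical Terminology System" (Grant No. 2020AAA0104901), part of the project "Research on the Construction and Application of Knowledge System for Medical Artificial Intelligence Services" under the Scientific and Technological Innovation 2030 "New Generation Artificial Intelligence" Major Program.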

Chinese Medical New Word Detection by Chinese Character’s Multi-Semantic Word Vector and Statistical Text Features

  • Wang Weijie ,
  • Ren Huiling ,
  • Li Xiaoying ,
  • Wang Xu ,
  • Zhang Ying
  • Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020

Received date: 2023-08-09

  Revised date: 2023-11-24

  Online published: 2024-03-28

Supported by

This work is supported by the Major Program of National Key Research and Development Program of China, Scientific and Technological Innovation 2030 – “New Generation Artificial Intelligence” titled “Research on the Construction and Application of Knowledge System for Medical Artificial Intelligence Services” (Grant No. 2020AAA0104901).

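Abstract (Chinese version)

[Purpose/Significance] To improve machine understanding of medical text, raise the performance of downstream tasks such as medical natural language processing, and keep medical knowledge content up to date and complete in coverage, a medical new word discovery method is proposed that fuses the multi-semantic information of Chinese characters with statistical text features. [Method/Process] Abstracts of medical literature, which use standardized wording, serve as the source for new word discovery. N-gram strings are obtained with an N-gram model and stored in a trie. From three perspectives (the internal cohesion of a word, the freedom of a word, and the semantic similarity of a word), the association confidence, left and right adjacency entropy, and multi-semantic similarity (covering fine-grained character-level semantic information of Chinese characters and BERT word vector information) of each N-gram string are computed together, and the thresholds of these indicators are traversed to evaluate how likely each N-gram string is to be a new medical word. [Results/Conclusions] From the latest 1 000 abstracts indexed by the Chinese Medical Association as of October 20, 2022, 3 263 new medical words were discovered; after duplicates were removed, 764 new medical words remained. The proposed method shows some improvement over existing methods, and in application it effectively improves medical word segmentation, so that noun categories after segmentation are clearer, concepts more explicit, and connotations richer. Combining the internal multi-semantic information of Chinese characters with the external statistical features of characters and words improves not only a computer's ability to discover new words but also its natural language processing of specialized and complex medical text, which helps considerably in keeping domain knowledge up to date.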

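Cite this article

Wang Weijie, Ren Huiling, Li Xiaoying, Wang Xu, Zhang Ying. Chinese Medical New Word Detection by Chinese Character's Multi-Semantic Word Vector and Statistical Text Features[J]. Library and Information Service (图书情报工作), 2024, 68(6): 119-128. DOI: 10.13266/j.issn.0252-3116.2024.06.011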

Abstract

[Purpose/Significance] To improve machine understanding of medical texts and the performance of downstream tasks such as medical natural language processing, and to keep medical knowledge content timely and complete in coverage, this paper proposes a medical new word detection algorithm that integrates the multi-semantic information of Chinese characters with statistical text features. [Method/Process] Using abstracts of medical literature written in standardized terminology as the source for new word detection, the paper obtained N-gram strings with an N-gram model and stored them in a trie (dictionary tree). From three perspectives, namely a word's internal cohesion, its degree of freedom, and its semantic similarity, it calculated the correlation confidence, left-right adjacency entropy, and multi-semantic similarity (covering fine-grained character-level semantic information of Chinese characters and BERT word vector information) of each N-gram string, and traversed the thresholds of these indicators to evaluate how likely each string was to be a new medical word. [Result/Conclusion] From the latest 1 000 abstracts collected by the Chinese Medical Association as of October 20, 2022, the method identified 3 263 new words, of which 764 were retained after removing duplicates. The method improves somewhat on existing methods and effectively improves medical word segmentation: after segmentation, noun categories are clearer, concepts are more explicit, and connotations are richer. The algorithm improves not only the computer's ability to detect new words but also its natural language processing of specialized and complex medical texts, which helps keep domain knowledge up to date.
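
The left-right adjacency entropy named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: a plain dictionary of counters stands in for the trie, the function names (`ngram_neighbors`, `entropy`, `adjacency_entropy`) and the toy sentences are hypothetical, and the correlation confidence, multi-semantic similarity, and threshold traversal steps are left out.

```python
import math
from collections import Counter, defaultdict


def ngram_neighbors(sentences, max_n=4):
    """Count each character n-gram (2..max_n) and the characters adjacent to it."""
    freq = Counter()
    left = defaultdict(Counter)   # gram -> counts of the character just before it
    right = defaultdict(Counter)  # gram -> counts of the character just after it
    for s in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(s) - n + 1):
                gram = s[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram][s[i - 1]] += 1
                if i + n < len(s):
                    right[gram][s[i + n]] += 1
    return freq, left, right


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def adjacency_entropy(gram, left, right):
    """Return (left entropy, right entropy) for one candidate string."""
    return entropy(left.get(gram, Counter())), entropy(right.get(gram, Counter()))


# Toy input (assumption): three fragments that share one candidate string.
sentences = ["患新冠肺炎者", "对新冠肺炎的", "在新冠肺炎后"]
freq, left, right = ngram_neighbors(sentences)
full = adjacency_entropy("新冠肺炎", left, right)
partial = adjacency_entropy("新冠肺", left, right)
```

In this toy input the full string "新冠肺炎" is preceded and followed by three different characters, so both entropies are high, while the fragment "新冠肺" is always followed by the same character, so its right entropy is zero. A high entropy on both sides suggests the string is used freely as a unit, which is the intuition behind the "degree of freedom" perspective in the abstract.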
