Library and Information Service
The Study on Consumer Health New Term Identification Based on Hybrid Method
Received date: 2015-11-11
Revised date: 2015-11-20
Online published: 2015-12-05
[Purpose/significance] This study identifies health terms as consumers understand them from Web query data, in order to provide a basic term set for mapping consumer-friendly terms to the professional terms of the medical domain. [Method/process] A consumer health term identification model is built by combining N-Gram statistics with rules, and Web query data are collected from consumers. Experiments are run with these data as samples, and the soundness of the model is verified by expert review. [Result/conclusion] The new-term identification method proposed in this paper extracts rules from consumers' question data in the Web query dataset and combines them with statistical methods. The model shows fairly good identification capability and can supply a useful dataset for mapping lay terms to professional terms in the consumer health domain. The experimental results show that rule-based processing of public Web data yields preprocessed text for follow-up experiments, that the model combining N-Gram and rules can identify new health terms in short text, and that the model is reasonable and scientifically sound.
Key words: Web query data; consumer health term; N-Gram; entity identification
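As a rough illustration of the hybrid idea described in the abstract, the sketch below counts word n-grams over short queries (the statistical step) and then filters them with two hand-written rules. The stop-word list, the known-term list, the thresholds, and all function names are assumptions made for this example. The abstract does not give the paper's actual rules or parameters, and the original data are Chinese queries, for which tokens would be characters or segmented words rather than space-separated English words.

```python
from collections import Counter

# Hypothetical rule resources; the paper's real rules are not stated in the abstract.
STOP_WORDS = {"how", "to", "what", "is", "at", "can", "i", "the", "a", "of"}
KNOWN_TERMS = {"blood pressure"}  # stand-in for an existing professional vocabulary

# Made-up sample queries, used only to exercise the sketch.
SAMPLE_QUERIES = [
    "how to treat heart burn",
    "what causes heart burn at night",
    "is heart burn dangerous",
    "how to lower blood pressure",
    "what is normal blood pressure",
]


def ngrams(tokens, n):
    """Return all contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_new_terms(queries, min_n=2, max_n=3, min_freq=2):
    """Count n-grams, then keep those that pass the rule filters."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))

    kept = {}
    for gram, freq in counts.items():
        # Statistical filter: drop n-grams seen too rarely.
        if freq < min_freq:
            continue
        # Rule 1: a term should not begin or end with a stop word.
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        # Rule 2: terms already in the vocabulary are not "new".
        term = " ".join(gram)
        if term in KNOWN_TERMS:
            continue
        kept[term] = freq
    return kept
```

Calling `candidate_new_terms(SAMPLE_QUERIES)` returns the frequent n-grams that survive both rules. In a fuller pipeline the surviving candidates would still go to expert review, as the abstract describes.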
Hou Li, Li Jiao, Hou Zhen, Chen Songjing. The Study on Consumer Health New Term Identification Based on Hybrid Method[J]. Library and Information Service, 2015, 59(23): 115-123. DOI: 10.13266/j.issn.0252-3116.2015.23.017