A Statistical Analysis of Literature on Term Extraction Based on Machine Learning (机器学习在术语抽取研究中的文献计量分析)

Library and Information Service (图书情报工作), 2020, Vol. 64, Issue 14: 94-103. DOI: 10.13266/j.issn.0252-3116.2020.14.010

Qiu Keda (邱科达), Ma Jianling (马建玲)

Lanzhou Library, Chinese Academy of Sciences, Lanzhou 730000; Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100049

Received: 2019-08-23 Revised: 2020-04-16 Online: 2020-07-20 Published: 2020-07-20
Corresponding author: Ma Jianling (ORCID: 0000-0003-4933-5904), deputy director of the Information Systems Department, research librarian, master's supervisor. E-mail: majl@lzb.ac.cn
About the author: Qiu Keda (ORCID: 0000-0002-2826-8899), master's student.
Funding:
This paper is one of the research outputs of the National Natural Science Foundation of China General Program project "Research on the Integrated Research Paradigm for Climate Change Science Results and Its Implementation Platform" (Grant No. 41671535) and the Chinese Academy of Sciences Special Project for Literature and Information Capacity Building "Open Academic Resource System" (Grant No. Y7ZG081001).

Abstract

Abstract: [Purpose/significance] This paper reviews and summarizes research on automatic term extraction based on machine learning, to provide a reference for those working in the field. [Method/process] Using the analysis tools of CNKI and EndNote, the paper first applies bibliometrics to analyze the topic's annual trends and core institutions at the macro level, then analyzes the content of the literature from three aspects: extraction techniques, datasets and evaluation, and applications. [Result/conclusion] In recent years, term extraction research has made great progress, and it is foundational work for knowledge systems, natural language processing, and information analysis. With the rapid development of natural language processing, extraction techniques have begun to move toward deep learning, but the basic theoretical framework of term extraction still needs improvement, for example in evaluation metrics, corpus selection, and methods for evaluating extraction performance.

Key words: term extraction, machine learning, knowledge organization, bibliometrics

Chinese Library Classification (CLC) number: G250

Qiu Keda, Ma Jianling. A Statistical Analysis of Literature on Term Extraction Based on Machine Learning[J]. Library and Information Service, 2020, 64(14): 94-103.

