Information Research

A Statistical Analysis of Literature on Term Extraction Based on Machine Learning

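  • Qiu Keda,
  • Ma Jianling
  • Lanzhou Library, Chinese Academy of Sciences, Lanzhou 730000; Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100049
Qiu Keda (ORCID: 0000-0002-2826-8899), master's student.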

Received: 2019-08-23

Revised: 2020-04-16

Published online: 2020-07-20

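Funding

This paper is one of the research outputs of the National Natural Science Foundation of China General Program project "Research on the Integrated Research Paradigm for Climate Change Science Outcomes and Its Implementation Platform" (Grant No. 41671535) and the Chinese Academy of Sciences Literature and Information Capacity Building Special Project "Open Academic Resource System" (Grant No. Y7ZG081001).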


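Abstract

[Purpose/Significance] This paper reviews and summarizes research on automatic term extraction based on machine learning, providing a reference for practitioners in the field. [Method/Process] Using the analysis tools of CNKI and EndNote, bibliometric methods are applied to a macro-level analysis of the topic's annual trends and core institutions; the content is then analyzed from three aspects: extraction techniques, datasets and evaluation, and applications. [Result/Conclusion] In recent years, term extraction research has made considerable progress and has become foundational work in fields such as knowledge systems, natural language processing, and information analysis. With the rapid development of natural language processing, extraction techniques have begun to move toward deep learning, but the basic theoretical framework of term extraction still needs improvement, for example in evaluation metrics, corpus selection, and methods for evaluating extraction performance.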

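Cite this article

Qiu Keda, Ma Jianling. A Statistical Analysis of Literature on Term Extraction Based on Machine Learning[J]. Library and Information Service (图书情报工作), 2020, 64(14): 94-103. DOI: 10.13266/j.issn.0252-3116.2020.14.010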

