Knowledge Organization

Patent Term Extraction by Integrating Keyword Knowledge From Paper

  • Yu Yan,
  • Chen Lei,
  • Jiang Jinde,
  • Zhao Naixuan
  • 1 Information Service Department, Nanjing Tech University, Nanjing 210009;
    2 Computer Science Department, Southeast University Chengxian College, Nanjing 211816;
    3 School of Business, Nanjing Xiaozhuang University, Nanjing 211171
Yu Yan (ORCID: 0000-0002-9654-8614), professor, PhD, E-mail: yuyanyuyan2004@126.com; Chen Lei, master's student; Jiang Jinde (ORCID: 0000-0002-5504-7493), professor, PhD; Zhao Naixuan (ORCID: 0000-0001-9072-7315), professor, PhD.

Received date: 2018-12-23

Revised date: 2019-10-05

Online published: 2020-07-20

Funding

This article is one of the research outcomes of the National Social Science Fund of China general project "Research on multi-dimensional, multi-level patent text mining to support innovative design in the big data era" (project No. 17BTQ059).

Cite this article

Yu Yan, Chen Lei, Jiang Jinde, Zhao Naixuan. Patent Term Extraction by Integrating Keyword Knowledge From Paper[J]. 图书情报工作 (Library and Information Service), 2020, 64(14): 104-111. DOI: 10.13266/j.issn.0252-3116.2020.14.011

Abstract

[Purpose/significance] The effectiveness of patent term extraction is constrained by the limited information available in the patent text collection itself. To compensate for this shortcoming, this paper proposes using the rich keyword knowledge of papers to obtain effective features from outside the patent text, and thereby improve the accuracy of patent term extraction. [Method/process] Based on the keyword knowledge of related papers, two features are proposed, the degree of domain relevance and the degree of head & tail, to measure how likely a candidate term is to be a genuine term. These features are then incorporated into traditional patent term extraction methods. [Result/conclusion] Experimental results show that, by using the degree of domain relevance and the degree of head & tail of candidate terms derived from paper keywords, the method that integrates paper keyword knowledge achieves markedly higher accuracy than traditional term extraction methods.
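The abstract names the two keyword-derived features but does not give their formulas. The Python sketch below is therefore only one plausible reading, not the authors' definitions. It assumes that terms and keywords are already segmented into space-separated words, that domain relevance is the share of a candidate's words found in any paper keyword, and that the degree of head & tail is the average frequency with which the candidate's first and last words open and close paper keywords. The function names, the weights `alpha` and `beta`, and the way the features rescale a base score (for example a C-value score) are all hypothetical.

```python
from collections import Counter


def build_keyword_stats(keywords):
    """Count, per word, how many keywords contain it, start with it, and end with it."""
    any_counts, head_counts, tail_counts = Counter(), Counter(), Counter()
    for keyword in keywords:
        words = keyword.split()
        if not words:
            continue
        any_counts.update(set(words))   # one count per keyword containing the word
        head_counts[words[0]] += 1      # word opens the keyword
        tail_counts[words[-1]] += 1     # word closes the keyword
    return any_counts, head_counts, tail_counts


def domain_relevance(candidate, any_counts):
    """Assumed definition: share of candidate words that occur in some paper keyword."""
    words = candidate.split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in any_counts) / len(words)


def head_tail_degree(candidate, head_counts, tail_counts, n_keywords):
    """Assumed definition: mean of head-word and tail-word frequencies among keywords."""
    words = candidate.split()
    if not words or n_keywords == 0:
        return 0.0
    head = head_counts[words[0]] / n_keywords
    tail = tail_counts[words[-1]] / n_keywords
    return (head + tail) / 2


def combined_score(base_score, candidate, stats, n_keywords, alpha=0.5, beta=0.5):
    """Hypothetical fusion: boost a traditional score by the two keyword features."""
    any_counts, head_counts, tail_counts = stats
    relevance = domain_relevance(candidate, any_counts)
    head_tail = head_tail_degree(candidate, head_counts, tail_counts, n_keywords)
    return base_score * (1 + alpha * relevance + beta * head_tail)
```

With the keywords "neural network", "convolutional neural network", "term extraction" and "patent term extraction", the candidate "patent extraction" has a domain relevance of 1.0 under these assumptions, because both of its words occur in the keywords. A candidate such as "network of" has a head & tail degree of 0.0, because "network" never opens a keyword and "of" never closes one.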
