Knowledge Organization

Research on the Evaluation Method of Patent Keyword Extraction Algorithm Based on Information Gain and Similarity

  • Yu Yan
  • Ju Peng
  • Shang Mingjie

  • 1. Institute of Information Management and Technology, Nanjing Technology University, Nanjing 210009;
    2. School of Electronics and Computer, Chengxian College, Southeast University, Nanjing 211816

Yu Yan, professor, PhD, E-mail: yuyanyuyan2004@126.com; Ju Peng, master's student; Shang Mingjie, master's student.

Received date: 2021-07-08

Revised date: 2021-10-30

Online published: 2022-03-30

Funding

This paper is one of the research outcomes of the National Social Science Fund of China project "Multi-dimensional and Multi-level Patent Text Mining to Support Innovative Design in the Big Data Era" (Grant No. 17BTQ059).

Cite this article

Yu Yan, Ju Peng, Shang Mingjie. Research on the Evaluation Method of Patent Keyword Extraction Algorithm Based on Information Gain and Similarity[J]. Library and Information Service (图书情报工作), 2022, 66(6): 108-117. DOI: 10.13266/j.issn.0252-3116.2022.06.012

Abstract

[Purpose/significance] Existing evaluations of patent keyword extraction algorithms rely mainly on matching the extracted keywords against keywords manually labeled by experts. To address the problems of this approach, this paper proposes an evaluation model for patent keyword extraction algorithms based on information gain and similarity. [Method/process] The proposed model evaluates the accuracy of a patent keyword extraction algorithm at two levels, intrinsic and extrinsic. The intrinsic evaluation model measures the information gain of each keyword extracted by the algorithm under evaluation, in order to assess the novelty and inventiveness of the extracted keywords. The extrinsic evaluation model represents each patent by the keyword set that the algorithm extracts, computes the similarity between related patents, and thereby measures how effectively the extracted keywords describe the patent's topic. [Result/conclusion] A validation experiment on the evaluation model and an empirical study applying it show that the evaluation model based on information gain and similarity is feasible and effective.
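To make the two evaluation levels concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not give the exact formulas, so several choices here are assumptions: information gain is computed in its standard form against a hypothetical class label per patent, similarity between keyword sets is Jaccard similarity, and the list of related patent pairs is assumed to be supplied externally. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(keyword, keyword_sets, labels):
    """Intrinsic score (assumed form): H(class) - H(class | keyword present/absent)."""
    present = [lab for kws, lab in zip(keyword_sets, labels) if keyword in kws]
    absent = [lab for kws, lab in zip(keyword_sets, labels) if keyword not in kws]
    total = len(labels)
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


def jaccard(a, b):
    """Jaccard similarity between two keyword sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_related_similarity(keyword_sets, related_pairs):
    """Extrinsic score (assumed form): mean similarity over related patent pairs."""
    sims = [jaccard(keyword_sets[i], keyword_sets[j]) for i, j in related_pairs]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy data: one keyword set per patent, as produced by the
# extraction algorithm under evaluation, plus an assumed class label per patent.
keyword_sets = [
    {"lithium", "electrode", "device"},
    {"lithium", "electrolyte", "device"},
    {"antenna", "signal", "device"},
    {"antenna", "receiver", "device"},
]
labels = ["battery", "battery", "radio", "radio"]
related_pairs = [(0, 1), (2, 3)]  # assumed to be known related patents

ig_lithium = information_gain("lithium", keyword_sets, labels)
ig_device = information_gain("device", keyword_sets, labels)
extrinsic_score = mean_related_similarity(keyword_sets, related_pairs)
```

In this toy data, a keyword such as "lithium" separates the two classes and receives a high information gain, while a generic term such as "device" appears in every patent and receives none. An extraction algorithm whose keyword sets make related patents look more alike would score higher on the extrinsic measure.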
