Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning

Lyu Lucheng; Han Tao; Zhou Jian; Zhao Yajuan

doi:10.13266/j.issn.0252-3116.2020.10.009

Library and Information Service (图书情报工作), 2020, Vol. 64, Issue 10: 75-85

Information Research

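1 National Science Library, Chinese Academy of Sciences, Beijing 100190;
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190;
3 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

Lyu Lucheng (ORCID: 0000-0002-2318-1073), assistant research fellow, PhD candidate, E-mail: lucheng918@126.com; Han Tao (ORCID: 0000-0001-5955-7813), research fellow, PhD, master's supervisor; Zhou Jian (ORCID: 0000-0001-8674-6062), PhD candidate; Zhao Yajuan (ORCID: 0000-0003-3501-8131), research fellow, PhD, doctoral supervisor.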

Received: 2019-11-11

Revised: 2019-12-27

Published online: 2020-05-20

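Funding

This paper is one of the research outcomes of the Chinese Academy of Sciences Young Talent Project "Deep Learning-Based Classification of Patents by Industry" (project No. G180161001).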

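Abstract

[Purpose/significance] Patent examination and patent intelligence analysis in China both need to classify massive numbers of patents. To meet this need, this paper designs seven deep learning-based methods for automatic patent classification and compares their classification performance, with the aim of improving the efficiency and quality of patent classification. [Method/process] To address the shortcomings of traditional machine learning methods, seven deep learning models are designed, including Word2Vec+TextCNN, Word2Vec+GRU, Word2Vec+BiGRU, and Word2Vec+BiGRU+TextCNN. The models build on deep learning techniques such as Word2Vec, CNN, RNN, and the attention mechanism, and take into account the word-order features, contextual features, and key classification features of patent text. Using Chinese patents as the example and the "Section" of the main IPC classification number as the class label, the seven models are compared with three traditional classification models on the Chinese patent classification task. [Result/conclusion] The empirical results show that deep learning methods that account for word-order features and contextual features and that reinforce key features achieve better classification performance on Chinese patents.

Keywords: automatic patent classification; deep learning; word embedding; patent text mining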
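The labeling choice described in the abstract, taking the "Section" of the main IPC classification number as the class, can be illustrated with a small sketch. This is not the authors' implementation: the function names `ipc_section` and `section_label` and the sample symbols are hypothetical. The sketch assumes only that an IPC symbol begins with its section letter (A to H), which makes the task an eight-class problem.

```python
# Minimal sketch (assumption: not the paper's code) of deriving the class
# label from the "Section" of a patent's main IPC symbol.

# The eight IPC sections and their titles.
IPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


def ipc_section(main_ipc: str) -> str:
    """Return the section letter, i.e. the first character of an IPC symbol."""
    symbol = main_ipc.strip().upper()
    if not symbol or symbol[0] not in IPC_SECTIONS:
        raise ValueError(f"not a valid IPC symbol: {main_ipc!r}")
    return symbol[0]


def section_label(main_ipc: str) -> int:
    """Map the section letter to an integer class index (A -> 0, ..., H -> 7)."""
    return sorted(IPC_SECTIONS).index(ipc_section(main_ipc))


if __name__ == "__main__":
    # Hypothetical sample symbols, used only for illustration.
    for symbol in ["G06F 17/30", "A61K 31/00", "H04L 29/06"]:
        letter = ipc_section(symbol)
        print(symbol, letter, section_label(symbol), IPC_SECTIONS[letter])
```

An integer index like this is what a softmax output layer of any of the compared models would typically be trained against; how the paper actually encodes the labels is not stated in the abstract.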

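Cite this article

Lyu Lucheng, Han Tao, Zhou Jian, Zhao Yajuan. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. Library and Information Service, 2020, 64(10): 75-85. DOI: 10.13266/j.issn.0252-3116.2020.10.009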
