Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning
(基于深度学习的中文专利自动分类方法研究)

图书情报工作 (Library and Information Service), 2020, Vol. 64, Issue (10): 75-85. DOI: 10.13266/j.issn.0252-3116.2020.10.009

Lyu Lucheng (吕璐成)^1,2, Han Tao (韩涛)^1,2, Zhou Jian (周健)^3, Zhao Yajuan (赵亚娟)^1,2

1 National Science Library, Chinese Academy of Sciences, Beijing 100190;
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190;
3 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

Received: 2019-11-11 Revised: 2019-12-27 Online: 2020-05-20 Published: 2020-05-20

About the authors: Lyu Lucheng (ORCID: 0000-0002-2318-1073), assistant researcher, doctoral candidate, E-mail: lucheng918@126.com; Han Tao (ORCID: 0000-0001-5955-7813), researcher, PhD, master's supervisor; Zhou Jian (ORCID: 0000-0001-8674-6062), doctoral candidate; Zhao Yajuan (ORCID: 0000-0003-3501-8131), researcher, PhD, doctoral supervisor.

Funding: This paper is one of the research outcomes of the Chinese Academy of Sciences Youth Talent Project "Industry Classification of Patents Based on Deep Learning" (Grant No. G180161001).

Abstract

Abstract: [Purpose/significance] To meet the need for classifying massive numbers of patents in current domestic patent examination and patent intelligence analysis, this paper designs seven deep-learning-based methods for automatic patent classification and compares their classification performance, with the aim of improving the efficiency and effectiveness of patent classification. [Method/process] To address the shortcomings of traditional machine learning methods, seven deep learning models were designed, including Word2Vec+TextCNN, Word2Vec+GRU, Word2Vec+BiGRU, and Word2Vec+BiGRU+TextCNN. The models build on deep learning techniques such as Word2Vec, CNN, RNN, and the attention mechanism, and take into account the word-order features, contextual features, and key classification features of patent text. Taking Chinese patents as an example and using the "Section" of the main International Patent Classification (IPC) code as the class label, the study compares these seven models with three traditional classification models on the Chinese patent classification task. [Result/conclusion] The empirical results show that deep learning methods that consider word-order features, contextual features, and reinforced key features achieve better classification performance on Chinese patents.

Key words: patent automatic classification, deep learning, word embedding, patent text mining
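
The abstract states that the "Section" of the main IPC code serves as the class label. As a minimal sketch of that labeling step (not the authors' code), the snippet below derives the section letter from a main IPC symbol and maps it to an integer class index. The function names `ipc_section` and `label_index`, and the assumed input format (symbols such as `G06F 17/30`), are illustrative assumptions; the eight section titles follow the published IPC scheme.

```python
import re

# The eight IPC sections (top level of the IPC hierarchy).
IPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}

# Section letter, two-digit class, subclass letter, e.g. "G06F".
_IPC_PREFIX = re.compile(r"^\s*([A-H])\d{2}[A-Z]")


def ipc_section(main_ipc: str) -> str:
    """Return the section letter (A-H) of a main IPC symbol.

    Assumes the symbol starts with section + class + subclass,
    as in "G06F 17/30"; raises ValueError otherwise.
    """
    match = _IPC_PREFIX.match(main_ipc.upper())
    if match is None:
        raise ValueError(f"not a recognizable IPC symbol: {main_ipc!r}")
    return match.group(1)


def label_index(main_ipc: str) -> int:
    """Map a main IPC symbol to an integer class index (A -> 0, ..., H -> 7)."""
    return sorted(IPC_SECTIONS).index(ipc_section(main_ipc))
```

With this mapping the task becomes an eight-way single-label classification problem, which is the setting in which the abstract compares the seven deep learning models and three traditional models.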

CLC number: G254.11

Cite this article: Lyu Lucheng, Han Tao, Zhou Jian, Zhao Yajuan. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. LIS, 2020, 64(10): 75-85.
