Library and Information Service ›› 2020, Vol. 64 ›› Issue (10): 75-85. DOI: 10.13266/j.issn.0252-3116.2020.10.009

• Information Research •

Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning

Lyu Lucheng1,2, Han Tao1,2, Zhou Jian3, Zhao Yajuan1,2

  1. National Science Library, Chinese Academy of Sciences, Beijing 100190;
  2. Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190;
  3. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  • Received: 2019-11-11  Revised: 2019-12-27  Online: 2020-05-20  Published: 2020-05-20
  • About the authors: Lyu Lucheng (ORCID: 0000-0002-2318-1073), assistant research fellow, doctoral candidate, E-mail: lucheng918@126.com; Han Tao (ORCID: 0000-0001-5955-7813), research fellow, PhD, master's supervisor; Zhou Jian (ORCID: 0000-0001-8674-6062), doctoral candidate; Zhao Yajuan (ORCID: 0000-0003-3501-8131), research fellow, PhD, doctoral supervisor.
  • Funding:
    This paper is one of the research outcomes of the Chinese Academy of Sciences Young Talent Project "Deep Learning-Based Classification of Patents by Industry" (Grant No. G180161001).

Abstract: [Purpose/significance] To meet the practical need to classify massive numbers of patents in patent examination and patent intelligence analysis in China, this paper designs seven deep-learning-based methods for automatic patent classification and compares their classification performance, with the aim of improving both the efficiency and the quality of patent classification. [Method/process] To address the shortcomings of traditional machine learning methods, seven deep learning models were designed, including Word2Vec+TextCNN, Word2Vec+GRU, Word2Vec+BiGRU, and Word2Vec+BiGRU+TextCNN. The models build on deep learning techniques such as Word2Vec, CNN, RNN, and the attention mechanism, and they take into account the word-order features, context features, and key classification features of patent text. Taking Chinese patents as an example and using the Section of the main International Patent Classification (IPC) code as the class label, the study compares these seven models with three traditional classification models on the Chinese patent classification task. [Result/conclusion] The empirical results show that deep learning methods that consider word-order features and context features and that reinforce key features achieve better classification performance on Chinese patents.

Key words: automatic patent classification, deep learning, word embedding, patent text mining
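
The abstract states that the Section of the main IPC code serves as the class label. As a minimal sketch of what that labelling step could look like, the snippet below maps a main IPC code to one of the eight IPC sections (A to H) and to an integer class index. The function names `ipc_section` and `section_label`, the input string format, and the alphabetical index ordering are assumptions made for illustration; the paper's own preprocessing is not described in this section.

```python
# Illustrative sketch only: helper names and index ordering are assumptions,
# not the authors' preprocessing code.

# The eight sections of the International Patent Classification (IPC).
IPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}

# Stable mapping from section letter to an integer class index (A -> 0, ..., H -> 7).
SECTION_TO_INDEX = {letter: i for i, letter in enumerate(sorted(IPC_SECTIONS))}


def ipc_section(main_ipc: str) -> str:
    """Return the section letter (first character) of a main IPC code."""
    code = main_ipc.strip().upper()
    if not code or code[0] not in IPC_SECTIONS:
        raise ValueError(f"not a valid IPC code: {main_ipc!r}")
    return code[0]


def section_label(main_ipc: str) -> int:
    """Return the integer class index for the section of a main IPC code."""
    return SECTION_TO_INDEX[ipc_section(main_ipc)]


if __name__ == "__main__":
    for code in ["G06F 17/30", "A61K 31/00", "H04L 29/06"]:
        letter = ipc_section(code)
        print(code, "->", letter, IPC_SECTIONS[letter], section_label(code))
```

Labelling at the section level yields an eight-class problem, which is the granularity at which the abstract compares the seven deep learning models with the three traditional models.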

CLC Number: