[Purpose/significance] Automatic selection of domain-specific stopwords for topic modeling of patent text has not yet been studied sufficiently. This paper proposes a new method for automatically selecting domain-specific stopwords for patent topic model analysis, in order to improve the discriminability and modeling quality of patent topic models. [Method/process] Domain-specific stopwords are essentially words that carry little information and discriminate poorly between patent texts of different categories. The method therefore introduces an auxiliary multi-category patent text set, uses category entropy to measure how each word is distributed across categories, ranks the words by their category entropy, and selects the words with the highest category entropy as domain-specific stopwords. [Result/conclusion] Experiments on patent text data verify the feasibility and effectiveness of the proposed method and show that it can effectively improve the discriminability and quality of topic models for patent text analysis.
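The selection procedure summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sketch assumes that the category entropy of a word is the Shannon entropy of its raw occurrence counts over the categories of the auxiliary set; the function names `category_entropy` and `select_stopwords`, the `top_k` parameter, and the toy data are hypothetical and are not taken from the paper. Raw counts are not normalized by category size here, which a real auxiliary set with unbalanced categories would probably require.

```python
import math
from collections import Counter, defaultdict


def category_entropy(docs_by_category):
    """Map each word to the entropy of its distribution over categories.

    Assumption: p(c | w) is estimated from raw occurrence counts.
    """
    # counts[word][category] = occurrences of the word in that category
    counts = defaultdict(Counter)
    for category, docs in docs_by_category.items():
        for tokens in docs:
            for word in tokens:
                counts[word][category] += 1

    entropy = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        entropy[word] = -sum(
            (n / total) * math.log2(n / total) for n in per_category.values()
        )
    return entropy


def select_stopwords(docs_by_category, top_k):
    """Return the top_k words with the highest category entropy."""
    entropy = category_entropy(docs_by_category)
    # Highest entropy first; ties are broken alphabetically for determinism.
    ranked = sorted(entropy, key=lambda w: (-entropy[w], w))
    return ranked[:top_k]


# Toy auxiliary set: three categories of already tokenized patent texts.
aux = {
    "battery": [["method", "electrode", "device"], ["electrode", "method"]],
    "imaging": [["method", "lens", "device"], ["lens", "method"]],
    "network": [["method", "router", "device"], ["router", "method"]],
}
stopwords = select_stopwords(aux, top_k=2)
print(stopwords)
```

In this toy set, "method" and "device" occur evenly in all three categories and therefore reach the maximum entropy, while "electrode", "lens", and "router" each occur in a single category and have zero entropy. The selected words would then be removed from the target patent texts before the topic model is trained.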
Author contributions: Yu Yan (俞琰) proposed the research idea, designed the research plan, conducted the experiments, and wrote the paper; Zhao Naixuan (赵乃瑄) collected, cleaned, and analyzed the data and revised the paper.