Knowledge Organization

Research on SAO Short Text Classification in LIS Based on Semantic Association and BERT

  • Zhang Yujie,
  • Bai Rujiang,
  • Liu Mingyue,
  • Yu Chunliang
  • 1 Institute of Information Management, Shandong University of Technology, Zibo 255049;
    2 Yantai University Library, Yantai 264005
Zhang Yujie (ORCID: 0000-0002-6819-031X), master's student; Liu Mingyue (ORCID: 0000-0002-4335-9369), master's student; Yu Chunliang (ORCID: 0000-0002-3013-8022), associate research librarian.

Received date: 2021-01-27

  Revised date: 2021-05-12

  Online published: 2021-08-20

Funding

This paper is one of the research outputs of the Shandong Province Higher Education Youth Innovation Science and Technology Support Program project "Smart Decision Support Innovation Team Driven by Science and Technology Big Data: Research on Identifying Emerging Scientific Research Fronts for the Conversion of Old and New Growth Drivers" (Grant No. 2019RWG033) and the Shandong Provincial Social Science Planning Office project "Research on a Content Annotation Model for Scientific Papers in the Digital Environment" (Grant No. 20CSDJ65).

Cite this article

Zhang Yujie, Bai Rujiang, Liu Mingyue, Yu Chunliang. Research on SAO Short Text Classification in LIS Based on Semantic Association and BERT[J]. Library and Information Service (图书情报工作), 2021, 65(16): 118-129. DOI: 10.13266/j.issn.0252-3116.2021.16.013

Abstract

[Purpose/significance] SAO (subject-action-object) short texts offer a classifier few semantic features and little domain knowledge. To address this, this paper proposes an SAO classification method that combines semantic association with BERT, with the aim of improving short text classification. [Method/process] Using SAO short texts from the library and information science (LIS) field as the data source, a semantic association scheme with three links, "Expansion-Reconstruction-Noise Reduction", is first designed: semantic expansion and SAO reconstruction extend the semantic information of each SAO, and semantic noise reduction suppresses the noise that expansion introduces. The BERT model is then trained on the SAO short texts after semantic association, and the classification part finally performs automatic classification. [Result/conclusion] After comparing different association values, learning rates, and classifiers, the experiments show that SAO short text classification performs best with an association value of 10 and a learning rate of 4e-5, reaching an average F1 of 0.8522. Compared with SVM, LSTM, and BERT alone, F1 increases by 0.1031, 0.1538, and 0.1405, respectively.
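
The three links of the semantic association scheme can be illustrated with a toy sketch. The abstract does not say how each link is implemented, so everything below is an assumption made for illustration: cosine similarity over a tiny hand-written vector table stands in for whatever semantic resource the authors used, the "association value" is read as the number of associated terms `k` retrieved per SAO slot, and the names `expand`, `reconstruct`, `denoise`, and `associate` are hypothetical.

```python
import math

# Toy word vectors (hypothetical values). A real system would load vectors
# trained on a domain corpus instead of this hand-written table.
VECTORS = {
    "library": [0.9, 0.1, 0.0],
    "archive": [0.8, 0.2, 0.1],
    "repository": [0.7, 0.3, 0.0],
    "provide": [0.1, 0.9, 0.1],
    "offer": [0.2, 0.8, 0.1],
    "service": [0.1, 0.2, 0.9],
    "lending": [0.2, 0.1, 0.8],
    "banana": [-0.9, -0.8, -0.9],  # deliberately unrelated term
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(term, k):
    """Expansion: return the k vocabulary terms most similar to `term`."""
    scored = [
        (cosine(VECTORS[term], vec), other)
        for other, vec in VECTORS.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def reconstruct(sao, k):
    """Reconstruction: build new SAO triples by swapping one slot at a time."""
    candidates = []
    for slot, term in enumerate(sao):
        for alternative in expand(term, k):
            triple = list(sao)
            triple[slot] = alternative
            candidates.append(tuple(triple))
    return candidates


def sao_vector(sao):
    """Represent a triple as the mean of its term vectors."""
    dims = len(VECTORS[sao[0]])
    return [sum(VECTORS[t][d] for t in sao) / len(sao) for d in range(dims)]


def denoise(sao, candidates, threshold):
    """Noise reduction: keep candidates that stay close to the original SAO."""
    base = sao_vector(sao)
    return [c for c in candidates if cosine(base, sao_vector(c)) >= threshold]


def associate(sao, k=2, threshold=0.9):
    """Run expansion, reconstruction, and noise reduction for one SAO."""
    return [sao] + denoise(sao, reconstruct(sao, k), threshold)


if __name__ == "__main__":
    for triple in associate(("library", "provide", "service")):
        print(" ".join(triple))
```

In a pipeline of this shape, the original triple and its retained variants would be joined into one longer text before being passed to a BERT classifier. The threshold of 0.9 is an arbitrary choice for the toy data, not a value from the paper.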
