[Purpose/significance] For academic literature in a specific domain, this study builds a dual-label classification model that assigns each document a "research content" label and a "research method" label from its bibliographic record, offering a methodological reference for fine-grained classification of academic literature. [Method/process] A convolutional neural network serves as the base model. Bibliographic fields, including title, abstract, keywords, journal name, authors, and institutions, are divided into explicit features and implicit features. Explicit feature extraction and implicit feature mapping produce an array of feature words, from which a word-vector matrix is generated; the matrix then passes through a convolutional layer, a pooling layer, and a Softmax layer to complete the classification. [Result/conclusion] In an experiment on e-commerce literature, the model achieves macro-F1 scores of 0.74 for "research content" and 0.81 for "research method". These scores are markedly higher than those of traditional machine learning methods and also higher than those of a deep learning classifier that uses explicit features only.
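The pipeline described in the abstract (feature word array, word-vector matrix, convolution, pooling, Softmax) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `IMPLICIT_MAP` table and its keys, the choice of journal name as an implicit field, the random stand-in word vectors, the kernel heights 2, 3 and 4, the class counts, and the use of one Softmax head per label set. The weights are untrained, so the printed probabilities carry no meaning beyond showing the data flow.

```python
import math
import random

# Hypothetical mapping from implicit-field values to feature words.
IMPLICIT_MAP = {
    "journal:J. of E-Commerce Studies": ["e-commerce", "platform"],
}

def build_feature_words(explicit_words, implicit_keys, mapping, length):
    """Append mapped implicit words to the explicit words, then truncate or pad."""
    words = list(explicit_words)
    for key in implicit_keys:
        words.extend(mapping.get(key, []))
    words = words[:length]
    return words + ["<pad>"] * (length - len(words))

def embed(words, dim):
    """Build the word-vector matrix; random vectors stand in for trained ones."""
    matrix = []
    for word in words:
        if word == "<pad>":
            matrix.append([0.0] * dim)
        else:
            rng = random.Random(word)  # same word always gets the same vector
            matrix.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return matrix

def conv_max_pool(matrix, kernel):
    """Slide one kernel over the rows, apply ReLU, keep the maximum activation."""
    height = len(kernel)
    activations = []
    for start in range(len(matrix) - height + 1):
        total = 0.0
        for i in range(height):
            total += sum(k * x for k, x in zip(kernel[i], matrix[start + i]))
        activations.append(max(0.0, total))
    return max(activations)

def softmax(scores):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(matrix, kernels, weights):
    """Pool one value per kernel, then apply a linear layer and Softmax."""
    pooled = [conv_max_pool(matrix, kernel) for kernel in kernels]
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    return softmax(scores)

DIM, LENGTH = 8, 10
rng = random.Random(42)

words = build_feature_words(
    ["online", "retail", "trust", "regression"],   # explicit feature words
    ["journal:J. of E-Commerce Studies"],           # implicit field value
    IMPLICIT_MAP,
    LENGTH,
)
matrix = embed(words, DIM)

# One kernel per height; untrained random weights for illustration only.
kernels = [
    [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(height)]
    for height in (2, 3, 4)
]
content_weights = [[rng.uniform(-1.0, 1.0) for _ in kernels] for _ in range(3)]
method_weights = [[rng.uniform(-1.0, 1.0) for _ in kernels] for _ in range(2)]

content_probs = classify(matrix, kernels, content_weights)
method_probs = classify(matrix, kernels, method_weights)
print("research content probabilities:", content_probs)
print("research method probabilities:", method_probs)
```

Mapping implicit fields to words before embedding lets both feature types share a single word-vector matrix, so the convolutional layers need no separate input branch; whether the paper shares layers between the two label sets in this way is not stated in the abstract.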