Knowledge Organization

Research on Fine-Grain Classification Method of Academic Literature Based on Bibliographies

  • Lei Bing,
  • Liu Xiao,
  • Zhong Zhen
  • 1 School of Management, Henan University of Technology, Zhengzhou 450001;
    2 Business Intelligence and Knowledge Engineering Laboratory, Henan University of Technology, Zhengzhou 450001
Lei Bing (ORCID: 0000-0002-1073-4724), professor, PhD; Liu Xiao (ORCID: 0000-0002-6770-7583), master's student

Received date: 2020-11-05

  Revised date: 2021-02-25

  Online published: 2021-07-21

Funding

This article is one of the research outcomes of the National Natural Science Foundation of China project "Scientometric Research on Erroneous Citations by Authors, Journals, and Databases: Identification Methods, Generation Mechanisms, and Control Countermeasures" (Grant No. 71603073) and the Henan Province Universities Philosophy and Social Science Innovation Team Support Program project "Big Data and Management Decision-Making" (Grant No. 2019-CXTD-04).

Cite this article

Lei Bing, Liu Xiao, Zhong Zhen. Research on Fine-Grain Classification Method of Academic Literature Based on Bibliographies[J]. Library and Information Service (图书情报工作), 2021, 65(14): 128-137. DOI: 10.13266/j.issn.0252-3116.2021.14.015

Abstract

[Purpose/significance] For academic literature in a specific field, this paper builds a dual-label classification model, based on bibliographic information, that classifies by "research content" and "research method", offering a methodological reference for the fine-grained classification of academic literature. [Method/process] With the convolutional neural network from deep learning as the base model, bibliographic information such as title, abstract, keywords, journal name, authors, and institutions is divided into explicit features and implicit features. Explicit feature extraction, implicit feature mapping, and related steps produce an array of feature words, from which a word vector matrix is generated; the matrix is then processed by the convolutional layer, pooling layer, and Softmax layer to complete the classification task. [Result/conclusion] In an experiment on literature from the e-commerce field, the model's macro F1 values for the "research content" and "research method" labels are 0.74 and 0.81, respectively. These results are clearly better than those of traditional machine learning methods and also higher than those of a deep learning classifier that uses only explicit features.
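The pipeline summarized in the abstract (feature words, word vector matrix, convolution, pooling, Softmax) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the record fields, the `implicit_map` lookup table used for implicit feature mapping, the random word vectors, the filter sizes, and the three output classes are all hypothetical assumptions chosen only to show the data flow.

```python
import math
import random

def build_feature_words(record, implicit_map):
    # Explicit features: words read directly from the title and keywords.
    words = record["title"].lower().split() + [k.lower() for k in record["keywords"]]
    # Implicit features: journal and institution replaced by mapped words
    # (this lookup table is a hypothetical stand-in for the paper's mapping).
    for field in ("journal", "institution"):
        words += implicit_map.get(record[field], [])
    return words

def embed(words, table, dim, rng):
    # Random vectors stand in for trained word vectors (assumption).
    for w in words:
        if w not in table:
            table[w] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [table[w] for w in words]

def conv_max_pool(matrix, filt):
    # Slide one filter over consecutive word windows, apply ReLU,
    # and keep the maximum activation (max-over-time pooling).
    height = len(filt)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(matrix) - height + 1):
        act = sum(filt[r][c] * matrix[start + r][c]
                  for r in range(height) for c in range(len(filt[r])))
        best = max(best, act)
    return best

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(record, implicit_map, table, filters, weights, dim, rng):
    matrix = embed(build_feature_words(record, implicit_map), table, dim, rng)
    pooled = [conv_max_pool(matrix, f) for f in filters]
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    return softmax(scores)

# Toy, made-up inputs purely for illustration; weights are untrained.
rng = random.Random(0)
DIM = 4
record = {"title": "Consumer trust in online shopping platforms",
          "keywords": ["e-commerce", "trust"],
          "journal": "Journal A", "institution": "University B"}
implicit_map = {"Journal A": ["management"], "University B": ["business"]}
filters = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(h)]
           for h in (2, 3, 4)]
weights = [[rng.uniform(-1, 1) for _ in filters] for _ in range(3)]  # 3 hypothetical classes
probs = classify(record, implicit_map, {}, filters, weights, DIM, rng)
print(probs)
```

Because the weights here are random rather than trained, the printed probabilities carry no meaning beyond showing the shape of the output. Under the same assumptions, a dual-label setup could reuse the pooled features with a second weight matrix, one for "research content" and one for "research method".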
