Library and Information Service ›› 2021, Vol. 65 ›› Issue (12): 112-121. DOI: 10.13266/j.issn.0252-3116.2021.12.011

• Research Article •

Predicting Download Behavior of Academic Literature Based on Multi-dimensional Features

Xie Hao, Wu Xuehua, Chen Xi, Tang Jing, Bai Yun, Mao Jin

  1. Center for Studies of Information Resources, Wuhan University, Wuhan 430072
  • Received: 2020-11-26 Revised: 2021-03-03 Online: 2021-06-20 Published: 2021-07-03
  • Corresponding author: Mao Jin (ORCID: 0000-0001-9572-6709), associate professor, PhD, E-mail: danveno@163.com
  • About the authors: Xie Hao (ORCID: 0000-0002-1788-0468), master's student; Wu Xuehua (ORCID: 0000-0003-0231-2975), master's student; Chen Xi (ORCID: 0000-0003-1640-7270), master's student; Tang Jing (ORCID: 0000-0002-1211-5812), master's student; Bai Yun (ORCID: 0000-0002-7590-1263), master's student.
  • Funding:
    This paper is one of the research outcomes of the National Natural Science Foundation of China Innovative Research Group project "Information Resource Management" (Grant No. 71921002) and the National Natural Science Foundation of China Young Scientists project "Knowledge Community Discovery Based on Academic Heterogeneous Network Representation Learning" (Grant No. 71804135).

Abstract: [Purpose/significance] Downloading academic literature is an essential step in researchers' literature retrieval. Predicting this behavior helps to understand researchers' retrieval behavior in depth and gives academic resource retrieval platforms a basis for optimizing retrieval results and re-ranking them, thereby improving the service quality of the retrieval system. [Method/process] This paper constructs a multi-dimensional feature system for users' academic literature download behavior, builds two sub-classifiers with machine learning algorithms, one based on query relevance and one based on user behavior, and adopts a weighting strategy to combine them into a hybrid model for predicting academic literature downloads. [Result/conclusion] The experimental results show that the Random Forest algorithm achieves the best performance for both sub-classifiers. Compared with the model trained only on query relevance features, the hybrid model improves accuracy by 2.3% and the F1 value by 1.3%. In the hybrid model, the sub-classifier based on user behavior receives the higher weight. The features "downloads", "whether professional/advanced search is used", and "published time" contribute most to the download prediction task.

Key words: literature download prediction, multi-dimensional features, machine learning, hybrid model
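
The weighting strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the combination formula, so this example assumes a linear weighted average of the download probabilities produced by the two sub-classifiers and picks the weight by accuracy on held-out data. The function names, the toy probabilities, and the grid search are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def hybrid_probability(p_query: float, p_behavior: float, w_behavior: float) -> float:
    """Weighted average of the two sub-classifier download probabilities.

    Assumption: a linear combination; the paper's exact formula is not stated.
    """
    if not 0.0 <= w_behavior <= 1.0:
        raise ValueError("w_behavior must lie in [0, 1]")
    return (1.0 - w_behavior) * p_query + w_behavior * p_behavior


def hybrid_predict(p_query: Sequence[float], p_behavior: Sequence[float],
                   w_behavior: float, threshold: float = 0.5) -> List[int]:
    """Label each (query, document) pair as downloaded (1) or not (0)."""
    return [int(hybrid_probability(q, b, w_behavior) >= threshold)
            for q, b in zip(p_query, p_behavior)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def search_weight(p_query: Sequence[float], p_behavior: Sequence[float],
                  y_true: Sequence[int], steps: int = 10) -> Tuple[float, float]:
    """Grid-search the behavior weight that maximizes validation accuracy."""
    best_w, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        acc = accuracy(y_true, hybrid_predict(p_query, p_behavior, w))
        if acc > best_acc:  # keep the first weight reaching the best accuracy
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy, made-up probabilities standing in for the outputs of two trained
# sub-classifiers (e.g. two random forests) on four validation samples.
P_QUERY = [0.9, 0.3, 0.7, 0.2]
P_BEHAVIOR = [0.8, 0.9, 0.1, 0.3]
Y_TRUE = [1, 1, 0, 0]

best_w, best_acc = search_weight(P_QUERY, P_BEHAVIOR, Y_TRUE)
```

In practice the two probability lists would come from classifiers trained on the query relevance features and the user behavior features respectively, and the selection metric could just as well be the F1 value.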
