[Purpose/significance] Downloading is an essential step in academic retrieval. Predicting whether researchers will download a piece of academic literature deepens the understanding of their retrieval behavior and provides a basis for optimizing and re-ranking the results returned by academic resource retrieval platforms, thereby improving the retrieval function and service quality of the retrieval system.
[Method/process] This paper constructs a multi-dimensional feature system for researchers' academic literature download behavior and proposes two machine-learning sub-classifiers, one based on query relevance and one based on user behavior. A weighted strategy combines the two into a hybrid model for predicting academic literature downloads.
[Result/conclusion] The experimental results show that the Random Forest algorithm achieves the best performance in both sub-classifiers. Compared with the model trained only on query relevance features, the hybrid model improves accuracy by 2.3% and the F1 value by 1.3%. The sub-classifier based on user behavior receives the higher weight in the hybrid model. The features "downloads", "whether professional/advanced search is used", and "published time" contribute significantly to the academic literature download prediction task.
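The weighted strategy can be illustrated with a minimal sketch. The abstract does not state how the weights are chosen, so the grid search over validation accuracy below is an assumption, as are the function names `blend`, `accuracy`, and `pick_weight` and the toy probabilities. The two input lists stand in for the download probabilities that the query-relevance and user-behavior sub-classifiers (for example, Random Forest models) would output.

```python
from typing import List, Sequence, Tuple


def blend(p_relevance: Sequence[float],
          p_behavior: Sequence[float],
          w_behavior: float) -> List[float]:
    """Weighted average of the two sub-classifiers' download probabilities.

    w_behavior is the weight of the user-behavior sub-classifier;
    the query-relevance sub-classifier receives 1 - w_behavior.
    """
    return [(1.0 - w_behavior) * pr + w_behavior * pb
            for pr, pb in zip(p_relevance, p_behavior)]


def accuracy(probs: Sequence[float],
             labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Share of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def pick_weight(p_relevance: Sequence[float],
                p_behavior: Sequence[float],
                labels: Sequence[int],
                steps: int = 10) -> Tuple[float, float]:
    """Grid-search w_behavior in [0, 1] (assumed selection rule, not the paper's)."""
    best_w, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        acc = accuracy(blend(p_relevance, p_behavior, w), labels)
        if acc > best_acc:  # keep the smallest weight that reaches the best accuracy
            best_w, best_acc = w, acc
    return best_w, best_acc


# Hypothetical validation data: 1 = downloaded, 0 = not downloaded.
p_rel = [0.6, 0.4, 0.7, 0.3]   # query-relevance sub-classifier outputs
p_beh = [0.9, 0.2, 0.1, 0.9]   # user-behavior sub-classifier outputs
y_val = [1, 0, 0, 1]

weight, acc = pick_weight(p_rel, p_beh, y_val)
print(f"user-behavior weight = {weight:.1f}, validation accuracy = {acc:.2f}")
```

In this toy data the user-behavior probabilities separate the classes better, so the search settles on a non-zero weight for that sub-classifier. This only mirrors the direction of the reported finding; it does not reproduce it.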