Library and Information Service, 2021, Vol. 65, Issue (3): 93-99. DOI: 10.13266/j.issn.0252-3116.2021.03.012

• Knowledge Organization •

Constructing a Classifier of Academic Query Intent Based on Deep Learning Algorithms

Wang Ruixue1, Fang Jing1, Gui Sisi2, Lu Wei1,3, Zhang Xian4

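  1. School of Information Management, Wuhan University, Wuhan 430072;
    2. Department of Information Management, Nanjing Agricultural University, Nanjing 210095;
    3. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072;
    4. Baidu Times Network Technology (Beijing) Co., Ltd., Beijing 100085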
  • Received: 2020-06-17  Revised: 2020-10-14  Online: 2021-02-05  Published: 2021-02-05
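  • About the authors: Wang Ruixue (ORCID: 0000-0001-5932-9036), doctoral candidate, E-mail: ruixue_wang@whu.edu.cn; Fang Jing (ORCID: 0000-0002-9538-7812), master's student; Gui Sisi (ORCID: 0000-0001-7562-7447), lecturer, PhD; Lu Wei (ORCID: 0000-0002-0929-7416), dean, professor, PhD; Zhang Xian (ORCID: 0000-0002-8274-9523), master's student.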
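  • Funding:
    This article is one of the research outputs of the National Social Science Fund of China Youth Project "Research on Query Intent for Academic Search" (Grant No. 19CTQ023).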

Based on Deep Learning Algorithm to Construct the Classifier of Academic Query Intent

Wang Ruixue1, Fang Jing1, Gui Sisi2, Lu Wei1,3, Zhang Xian4   

  1. School of Information Management, Wuhan University, Wuhan 430072;
    2. College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095;
    3. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072;
    4. Baidu Times Network Technology (Beijing) Co., Ltd., Beijing 100085
  • Received:2020-06-17 Revised:2020-10-14 Online:2021-02-05 Published:2021-02-05

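Abstract: [Purpose/Significance] To automatically identify academic query intent and improve the efficiency of academic search engines. [Method/Process] Drawing on existing query intent features and the characteristics of academic search, features were constructed for query expressions at four levels: basic information, specific keywords, entities, and occurrence frequency. Four classification algorithms (Naive Bayes, Logistic regression, SVM, and Random Forest) were used in a preliminary experiment on automatic query intent identification, and the precision, recall, and F-score of each method were calculated. A method is proposed that extends the identification results predicted by the Logistic regression algorithm to a large-scale dataset and extracts "keyword-class" features, in order to build a deep learning two-layer classifier for academic query intent identification. [Result/Conclusion] The two-layer classifier achieves a macro-average F1 of 0.651, outperforming the other algorithms, and effectively balances per-class precision and recall across the different academic query intents. It performs best on the academic exploration class, with an F1 of 0.783.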
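The four feature levels named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' feature set: the cue words, the entity lexicon, the feature names, and the `build_features` helper are all hypothetical placeholders chosen only to show one feature per level.

```python
import re
from collections import Counter

# Hypothetical cue words and entity lexicon (assumptions for illustration;
# the paper's actual keyword and entity lists are not given in the abstract).
CUE_WORDS = {"review", "survey", "pdf", "download"}
ENTITY_LEXICON = {"svm": "method", "bert": "method", "nature": "venue"}


def build_features(query, query_log_counts):
    """Build one illustrative feature per level for a single query string.

    query_log_counts maps a lowercased query to how often it occurs in a
    (hypothetical) search log.
    """
    # Simple word tokenization; Chinese queries would need a real segmenter.
    tokens = re.findall(r"\w+", query.lower())
    return {
        # Level 1: basic information about the query expression
        "n_chars": len(query),
        "n_tokens": len(tokens),
        # Level 2: presence of specific keywords
        "has_cue_word": int(any(t in CUE_WORDS for t in tokens)),
        # Level 3: number of tokens matching known entities
        "n_entities": sum(t in ENTITY_LEXICON for t in tokens),
        # Level 4: how often the same query occurs in the log
        "query_frequency": query_log_counts.get(query.lower(), 0),
    }


if __name__ == "__main__":
    log = Counter({"svm survey pdf": 3})
    print(build_features("SVM survey pdf", log))
```

A dictionary of this kind could be vectorized and passed to any of the four classifiers mentioned in the abstract; how the authors encoded their features is not stated here.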

 

Keywords: academic query intent, automatic identification, two-layer classifier

Abstract: [Purpose/significance] To automatically identify academic query intent and thereby improve the efficiency of academic search engines. [Method/process] Combining existing query intent features with the characteristics of academic search, we constructed features for query expressions from four aspects: basic descriptive information, specific keywords, entity information, and frequency. In a preliminary experiment, we applied four classifiers (Naive Bayes, Logistic regression, SVM, and Random Forest) and calculated their precision, recall, and F-measure. We then propose a method that extends the query intent labels predicted by the Logistic regression algorithm to a large-scale dataset and extracts "keyword type" features from it, in order to construct a two-layer classifier based on deep learning for academic query intent recognition. [Result/conclusion] The macro-average F1 of the two-layer classifier is 0.651, higher than that of the other algorithms, and the method effectively balances precision and recall across the different academic query intent classes. The two-layer classifier performs best on the academic exploration class, with an F1 of 0.783.
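For readers unfamiliar with the macro-average F1 reported in the abstract, the following Python sketch shows the standard definition: F1 is computed per class from precision and recall, then averaged with equal weight per class. The labels "a" and "b" are toy placeholders, not the paper's intent categories, and the function names are illustrative.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count equally."""
    scores = per_class_prf(y_true, y_pred)
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    y_true = ["a", "a", "b", "b"]
    y_pred = ["a", "b", "b", "b"]
    print(per_class_prf(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

Because each class contributes equally, a high macro-average requires reasonable precision and recall on every class, which is consistent with the abstract's remark about balancing the two across intent classes.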

Key words: academic query intent, automatic identification, two-layer classification

CLC Number: