Library and Information Service (图书情报工作), 2023, Vol. 67, Issue 9: 132-140. DOI: 10.13266/j.issn.0252-3116.2023.09.014

• Knowledge Organization •

A Study on the Method of Identifying Research Question Sentences in Scientific Articles

Li Xuesi1,2, Zhang Zhixiong1,2, Liu Yi1,2, Wang Yufei1,2

  1. National Science Library, Chinese Academy of Sciences, Beijing 100190;
  2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  • Received: 2022-09-05 Revised: 2022-12-07 Online: 2023-05-11 Published: 2023-05-11
  • About the authors: Li Xuesi, PhD candidate, E-mail: lixuesi@mail.las.ac.cn; Zhang Zhixiong, research librarian, PhD; Liu Yi, librarian, PhD; Wang Yufei, PhD candidate.
  • Funding:
    This work is one of the research outcomes of the National Science and Technology Library (NSTL) project "Optimization, Integration, and System Development of Key Technologies for the Next-Generation Open Knowledge Service Platform: Optimization and Integration of an Intelligent Research Review Generation Tool" (Grant No. 2022XM28).

Abstract: [Purpose/Significance] Scientific articles are an important medium for recording how scientific problems are formulated and solved. The research question sentences they contain play an important role in revealing the specific content of scientific problems and in grasping the research themes of articles. Automatically identifying research question sentences in scientific articles is therefore an important task in scientific text mining. [Method/Process] To identify research question sentences automatically, an iterative semi-automatic annotation strategy was first proposed, in which manual proofreading is guided by the confidence of the model's predictions, and this strategy was used to annotate research question sentence data. On this basis, a sentence classification model was designed using a BERT-CNN architecture, in which BERT generates text vectors and a CNN extracts text features. The effectiveness of the model was then verified through experimental comparison with baseline models. [Result/Conclusion] A large-scale, standardized, and usable dataset was constructed with the proposed annotation strategy, and its accuracy reached 95% under manual inspection. The recognition model based on the BERT-CNN architecture reached an F1 value of 94.8% on the research question sentence recognition task. This paper provides high-quality data support and an effective modeling method for the mining and analysis of research questions in scientific articles.

Key words: research question sentences, automatic recognition, pre-trained language models, deep learning, text mining
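
The abstract describes manual proofreading guided by the confidence of model predictions but does not give the procedure. The following is a minimal sketch of one way such a round could be organized, not the authors' implementation. The names `Prediction`, `triage`, and `annotate_round`, as well as the 0.9 threshold, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    sentence: str
    label: int         # 1 = research question sentence, 0 = other
    confidence: float  # model probability for the predicted label


def triage(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into auto-accepted and needs-review groups.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return accepted, review


def annotate_round(
    predictions: List[Prediction],
    reviewer: Callable[[Prediction], int],
    threshold: float = 0.9,
) -> List[Tuple[str, int]]:
    """Run one annotation round.

    High-confidence labels are kept as predicted; low-confidence ones are
    passed to a human reviewer (modeled here as a callable).
    """
    accepted, review = triage(predictions, threshold)
    labeled = [(p.sentence, p.label) for p in accepted]
    labeled += [(p.sentence, reviewer(p)) for p in review]
    return labeled


if __name__ == "__main__":
    preds = [
        Prediction("We ask whether X improves Y.", 1, 0.97),
        Prediction("Data were collected in 2020.", 1, 0.55),
    ]
    # A stand-in reviewer that corrects every uncertain prediction to 0.
    print(annotate_round(preds, reviewer=lambda p: 0))
```

In an iterative setup, the labeled output of each round would be added to the training set and the classifier retrained before the next round; that retraining step is omitted here because the abstract does not specify it.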

CLC Number: