Research on Intelligent Generation of Structured Review for Retrieval Result Set

Meng Xuyang; Chen Yang; Bai Haiyan

doi:10.13266/j.issn.0252-3116.2024.06.012

Library and Information Service (图书情报工作), 2024, Vol. 68, Issue 6: 129-141

Knowledge Organization

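Institute of Scientific and Technical Information of China, Beijing 100038

Meng Xuyang, assistant researcher, master's degree, E-mail: mengxy@istic.ac.cn; Chen Yang, assistant engineer, master's degree; Bai Haiyan, research librarian, master's degree, master's supervisor.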

Received date: 2023-08-04

Revised date: 2023-11-17

Online published: 2024-03-28

Supported by

This work is supported by the National Key R&D Program of China titled “Key Technologies and Software for Deep Mining and Intelligent Analysis of Scientific and Technological Literature Content” (Grant No. 2022YFF0711900) and Innovation Research Fund Youth Project of the Institute of Scientific and Technical Information of China titled “Research on Structured Review for Retrieval Result Set” (Grant No. QN2023-11).

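Cite this article

Meng Xuyang, Chen Yang, Bai Haiyan. Research on Intelligent Generation of Structured Review for Retrieval Result Set[J]. Library and Information Service, 2024, 68(6): 129-141. DOI: 10.13266/j.issn.0252-3116.2024.06.012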

Abstract

[Purpose/Significance] In academic literature retrieval and reading, the volume of academic information now far exceeds users' capacity to process it, causing information to pile up. To improve users' reading efficiency and knowledge absorption, this paper comprehensively mines and reveals the content of academic literature retrieval result sets. [Method/Process] On the one hand, starting from the reading experience and the characteristics of the literature retrieval scenario, it designs a structured form of review expression. On the other hand, to improve the technical method and content quality, it uses deep-learning-based automatic text generation to construct a scientific and technical literature dataset, trains and optimizes a text summarization model, and on that basis uses large language model technology to generate structured review text. [Result/Conclusion] After training and optimization, the summarization model improves by an average of 2.07% in recall and F1 across the evaluation metrics. In practical evaluation, structured review generation based on a large language model effectively extracts, summarizes, and generalizes the main points of the content, which verifies the feasibility of the technical route and application practice and provides a practical guide for improving knowledge-based services for academic literature, intelligent assisted reading, and comprehensive mining and revealing of semantic content.

Keywords: literature retrieval; structured review; large language model; automatic text generation
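
The abstract reports gains in recall and F1 but does not name the underlying metrics. Assuming they are ROUGE-style n-gram overlap scores, which are common in summarization evaluation, the computation can be sketched roughly as follows. The function names and the whitespace tokenization are illustrative assumptions, not the authors' evaluation code; Chinese text would need a word or character segmenter instead of `str.split`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N style precision, recall, and F1 from clipped n-gram overlap.

    Assumption: whitespace tokenization, no stemming or stopword removal.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical example strings, not data from the paper.
    generated = "the model generates a structured review"
    gold = "the model generates a structured literature review"
    print(rouge_n(generated, gold, n=1))
    print(rouge_n(generated, gold, n=2))
```

Recall measures how much of the reference summary the generated text covers, while F1 balances that coverage against precision, which is why the two are usually reported together for summarization models.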
