[Purpose/Significance] In academic literature retrieval and reading, the volume of academic information now far exceeds users' capacity to process it, resulting in information overload. To improve users' reading efficiency and knowledge absorption, this paper comprehensively mines and presents the content of academic literature retrieval result sets. [Method/Process] On the one hand, starting from the reading experience and the characteristics of literature retrieval scenarios, a structured review format is designed. On the other hand, to improve the technical approach and content quality, deep-learning-based automatic text generation is applied: a scientific and technical literature dataset is constructed, a text summarization model is trained and optimized, and large language model technology is then used on this basis to generate structured review text. [Result/Conclusion] After training and optimization, the summarization model improves recall and F1 by an average of 2.07% across the evaluation metrics. In practical evaluation, the large-language-model-based structured review generation effectively extracts, summarizes, and synthesizes the key points of the content, which verifies the feasibility of the proposed technical route and its application, and provides practical guidance for further improving knowledge-based services for academic literature, intelligent reading assistance, and comprehensive mining and presentation of semantic content.