Knowledge Organization

Research on Automatic Summary Methods for Reportable News under the Graph Model Framework

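  • Yuan Lin,
  • Sun Wei,
  • Ma Xiaomin,
  • Li Zhoujing,
  • Xiang Rui
  • 1 Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081;
    2 Beijing Xiachu Technology Group Co., Ltd., Beijing 100020;
    3 Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081
Yuan Lin, science and technology management manager, master's degree; Sun Wei, research fellow, PhD, doctoral supervisor, corresponding author, E-mail: sunwei@caas.cn; Ma Xiaomin, associate research librarian, PhD, master's supervisor; Li Zhoujing, assistant research fellow, PhD; Xiang Rui, master's student.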

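Received date: 2023-12-25

  Revised date: 2024-02-10

  Online published: 2024-09-12

Funding

This paper is one of the research outcomes of the National Key R&D Program project "Key Technologies and Software for Deep Mining and Intelligent Analysis of Scientific and Technological Literature Content" (Grant No. 2022YFF0711900) and of the Chinese Academy of Agricultural Sciences basic scientific research funds special project "Analysis and Interpretation of Development Trends in Agricultural Science and Technology Policy" (Grant No. Y2022ZK06).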

Research on Automatic Summary Methods for Reportable News under the Graph Model Framework*

  • Yuan Lin,
  • Sun Wei,
  • Ma Xiaomin,
  • Li Zhoujing,
  • Xiang Rui
  • 1 Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081;
    2 Beijing Xiachu Technology Group Co., Ltd., Beijing 100020;
    3 Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081

Received date: 2023-12-25

  Revised date: 2024-02-10

  Online published: 2024-09-12

Supported by

This work is supported by the National Key R&D Program project "Key Technologies and Software for Deep Mining and Intelligent Analysis of Scientific and Technological Literature Content" (Grant No. 2022YFF0711900) and by the Chinese Academy of Agricultural Sciences basic scientific research funds special project "Analysis and Interpretation of Development Trends in Agricultural Science and Technology Policy" (Grant No. Y2022ZK06).

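Abstract

[Purpose/Significance] The representation of summary knowledge under the graph model framework is an important technical node in existing automatic text summarization, but it does not reveal content semantics deeply enough. To address this problem, the paper proposes an automatic summarization model scheme for reportable news, offering a reference for related fields that carry out practical research on web-based reportable news text data after summarization. [Method/Process] Using ETM (Embedded Topic Model), a topic model analysis tool that incorporates word vectors, the paper adds a topic importance feature and a semantic relevance feature to the topic construction step for target summary sentences under the graph model framework, and redesigns the inter-sentence statistical features of reportable news. Deep topic semantic information in reportable news texts is mined and filtered in this way, giving an initial extractive automatic summarization model for reportable news. Next, based on the main functional requirements of reportable news summaries, the paper proposes a quantitative index system for measuring summary topic functions and establishes a correspondence between the measurement standards and the methods for quantifying sentence statistical features, which is then used to optimize and adjust the proposed model. [Result/Conclusion] The method is tested empirically on summary extraction for reportable news covering agricultural science and technology developments. Measurement standards for automatic summarization of reportable news are established and an optimized summarization model scheme is obtained. The results show that the scheme outperforms the comparison method in overall performance on the external reportability-function measures and the internal ROUGE evaluation, and can effectively improve the accuracy of extractive automatic summarization of reportable news.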

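Cite this article

Yuan Lin, Sun Wei, Ma Xiaomin, Li Zhoujing, Xiang Rui. Research on Automatic Summary Methods for Reportable News under the Graph Model Framework[J]. Library and Information Service (图书情报工作), 2024, 68(17): 122-135. DOI: 10.13266/j.issn.0252-3116.2024.17.010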

Abstract

[Purpose/Significance] In automatic text summarization, the representation of summary knowledge under the graph model framework is an important technical node, yet current representations do not reveal content semantics in sufficient depth. To address this issue, this paper proposes an automatic summarization model for reportable news, providing a reference for practical research in related fields that uses summarized web-based reportable news text data. [Method/Process] Using ETM (Embedded Topic Model), a topic model analysis tool that integrates word vectors, this paper introduces topic importance and semantic relevance features into the topic construction step for target summary sentences in the graph model framework, and redesigns the statistical features between reportable news sentences, in order to mine and filter the in-depth topic semantic information of the texts. This yields an initial automatic summary extraction model for reportable news. Subsequently, according to the main functional requirements of reportable news summaries, the paper proposes a quantitative index system for measuring summary topic functions and establishes a correspondence between the measurement standards and the methods for quantifying sentence statistical features, which is used to optimize and adjust the proposed model. [Result/Conclusion] The method is evaluated empirically on summary extraction for reportable news on agricultural science and technology developments. A measurement standard for automatic summarization of reportable news is established, and an optimized summarization model scheme is obtained. The results show that the scheme performs better than the comparison method in overall terms on the external reportability-function measures and the internal ROUGE evaluation, and can effectively improve the accuracy of automatic summary extraction for reportable news.
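The general idea of graph-based extraction with a topic signal can be illustrated with a minimal sketch. It is not the authors' model: it assumes whitespace-tokenized sentences, uses plain lexical cosine similarity as edge weights instead of the paper's semantic relevance feature, and takes a hypothetical `topic_weights` dictionary as a stand-in for word-level topic importance that a topic model such as ETM might supply. All function and variable names are illustrative.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, topic_weights, damping=0.85, iters=50):
    """Score sentences with a topic-biased PageRank over a similarity graph.

    sentences: list of token lists.
    topic_weights: hypothetical word -> importance map (assumed input,
    standing in for the output of a topic model).
    """
    n = len(sentences)
    bags = [Counter(s) for s in sentences]
    # Edge weights: lexical cosine similarity, no self-loops.
    sim = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Prior: each sentence's share of topic importance; the small floor
    # keeps every prior strictly positive.
    raw = [sum(topic_weights.get(w, 0.0) for w in s) + 1e-6 for s in sentences]
    total = sum(raw)
    prior = [r / total for r in raw]
    out_sum = [sum(row) for row in sim]
    scores = prior[:]
    for _ in range(iters):
        new = []
        for i in range(n):
            # Score flowing into sentence i from its neighbours; sentences
            # with no edges pass nothing on in this simplified version.
            flow = sum(
                scores[j] * sim[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new.append((1 - damping) * prior[i] + damping * flow)
        scores = new
    return scores


def summarize(sentences, topic_weights, k=2):
    """Return indices of the k top-scoring sentences in document order."""
    scores = rank_sentences(sentences, topic_weights)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy example with made-up sentences and made-up topic weights.
docs = [
    "researchers release new wheat variety resistant to drought".split(),
    "the new wheat variety improves yield under drought stress".split(),
    "the ceremony was attended by local officials".split(),
    "field trials show the drought resistant wheat raises yield".split(),
]
weights = {"wheat": 1.0, "drought": 0.9, "yield": 0.8, "variety": 0.6}
print(summarize(docs, weights, k=2))
```

The topic prior plays the role of the restart distribution, so sentences rich in high-importance topic words are favoured even when their graph connectivity is similar to that of off-topic sentences. A fuller implementation would replace the lexical similarity with an embedding-based measure and tune the feature weights against an evaluation such as ROUGE.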
