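A Review on Machine Learning Oriented Automatic Summarization

Cao Yang, Cheng Ying, Pei Lei

School of Information Management, Nanjing University, Nanjing 210093

Library and Information Service (图书情报工作), 2014, Vol. 58, Issue (18): 122-130. DOI: 10.13266/j.issn.0252-3116.2014.18.018

Received: 2014-07-24 Revised: 2014-08-22 Online: 2014-09-20 Published: 2014-09-20
Corresponding author: Cheng Ying, professor, PhD, doctoral supervisor, School of Information Management, Nanjing University. E-mail: chengy@nju.edu.cn
About the authors: Cao Yang, master's student, School of Information Management, Nanjing University; Pei Lei, associate professor, PhD, School of Information Management, Nanjing University.
Funding: This paper is one of the research outcomes of the National Social Science Foundation of China major bidding project "Research on Deep Aggregation and Services of Network Information Resources Oriented to Subject Domains" (Grant No. 12&ZD221) and the National Natural Science Foundation of China project "An Integrated Theoretical Framework for Link Analysis from the Perspective of Paradigm Fusion and Its Empirical Study" (Grant No. 71273125).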

Abstract

Abstract:

This paper examines the main stages of machine-learning-based automatic summarization: feature selection, algorithm selection, model training, summary extraction, and model evaluation. It focuses on three main machine learning algorithms, Naive Bayes (NB), Hidden Markov Models (HMM), and Conditional Random Fields (CRF), explaining the basic idea of each, systematically reviewing the related research, and offering the authors' reflections. It then discusses in depth the problems common to the three algorithms, including training methods, co-training and active learning, class balance, and term distribution, and proposes the main directions for future research.

Key words: automatic summarization, machine learning, NB, HMM, CRF

CLC number: G252.7

Cite this article: Cao Yang, Cheng Ying, Pei Lei. A Review on Machine Learning Oriented Automatic Summarization[J]. Library and Information Service, 2014, 58(18): 122-130.
