Research Progress on Topic Evolution of Scientific and Technical Literatures Based on Text Mining

  • Liang Shuang,
  • Liu Xiaoping
  • 1. National Science Library, Chinese Academy of Sciences, Beijing 100190;
    2. Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190

Received date: 2022-01-12

Revised date: 2022-05-01

Online published: 2022-07-06

Abstract

[Purpose/Significance] By reviewing domestic and international research on text-mining-based topic evolution in scientific and technical literature, this paper classifies and summarizes the methods used in topic evolution analysis and identifies the shortcomings of existing research, providing new ideas and a reference for future studies of topic evolution. [Method/Process] Following the general workflow that domestic and international scholars adopt in topic evolution research, this paper compares and summarizes the advantages and disadvantages of the models, indicators and methods used at three levels of analysis: data set selection and object analysis, topic identification, and topic evolution (temporal sequence analysis, topic intensity evolution analysis and topic content evolution analysis). Finally, it points out the limitations of existing research and outlines prospects for future development. [Result/Conclusion] Current research has reached a certain scale and has a relatively mature analysis system, but the following shortcomings remain: data sources are limited to a single type; the drawbacks of LDA and its extended models have yet to be fully overcome; other machine learning and deep learning algorithms are under-explored and under-applied; and evolution analysis methods need to be combined so that they complement one another. Future work should address these problems through targeted improvement and in-depth exploration.
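
As a rough illustration of two of the analysis levels named in the abstract, the sketch below computes topic intensity per time window and a content similarity between the same topic in adjacent windows. It is a minimal example under stated assumptions, not the method of any surveyed study: the document-topic proportions and topic-word weights are hypothetical toy values of the kind an LDA run might produce, intensity is taken to be the mean topic proportion per window, and content change is measured with cosine similarity. The function names `topic_intensity` and `cosine_similarity` are introduced here for illustration only.

```python
from collections import defaultdict
from math import sqrt


def topic_intensity(doc_topics, doc_years):
    """Average each topic's share over the documents of every time window."""
    grouped = defaultdict(list)
    for theta, year in zip(doc_topics, doc_years):
        grouped[year].append(theta)
    intensity = {}
    for year, thetas in sorted(grouped.items()):
        n_topics = len(thetas[0])
        intensity[year] = [
            sum(theta[k] for theta in thetas) / len(thetas)
            for k in range(n_topics)
        ]
    return intensity


def cosine_similarity(p, q):
    """Cosine similarity between two sparse topic-word distributions (dicts)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical document-topic proportions (assumed output of a topic model).
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
doc_years = [2019, 2019, 2020, 2020]
intensity = topic_intensity(doc_topics, doc_years)

# Hypothetical topic-word weights for one topic in two adjacent windows.
topic_2019 = {"lda": 0.5, "topic": 0.3, "corpus": 0.2}
topic_2020 = {"lda": 0.4, "topic": 0.3, "embedding": 0.3}
similarity = cosine_similarity(topic_2019, topic_2020)

for year, values in intensity.items():
    print(year, [round(v, 3) for v in values])
print("content similarity:", round(similarity, 3))
```

In this toy setting, a topic whose mean share rises from one window to the next would be read as gaining intensity, while a low similarity between adjacent windows would suggest that the topic's content has shifted. Other intensity indicators and similarity measures (for example, divergence-based ones) could be substituted.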

Cite this article

Liang Shuang, Liu Xiaoping. Research Progress on Topic Evolution of Scientific and Technical Literatures Based on Text Mining[J]. Library and Information Service, 2022, 66(13): 138-149. DOI: 10.13266/j.issn.0252-3116.2022.13.013

Author contributions: Liang Shuang: literature survey and data collection, paper writing; Liu Xiaoping: topic selection, design of the paper framework, writing guidance and revision.