RESEARCH PAPERS

Research on Domain Topic Representation and Evolution Process Based on Improved BERTopic Model

  • Liu Ying,
  • Yu Chunmei,
  • Li Xiaochen,
  • Li Ye,
  • Zhao Mingyu
  • 1 School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117;
    2 Key Laboratory of Financial Technology in Jilin Province, Changchun 130117;
    3 Jilin Provincial Research Center for Business Big Data, Changchun 130117;
    4 Institute of Big Data and Cross Science, Changchun 130117

Received date: 2024-03-12

Revised date: 2024-06-15

Published online: 2025-02-11

Supported by

This work is supported by the Jilin Province Science and Technology Department project “Research on Green Supply Chain Finance Data Certification and Risk Identification Algorithms under the ‘Dual Carbon’ Goal” (Grant No. YDZJ202301ZYTS482) and the National Social Science Fund of China project “Research on Supply Chain Finance Risk Assessment Method Based on Deep Integration of Multi-source Data” (Grant No. 20BTJ062).

Abstract

[Purpose/Significance] Tracking research frontiers and predicting future research topics helps researchers discover new hotspots of academic growth and helps optimize the allocation of government resources. [Method/Process] This paper uses a pre-trained deep learning model to enhance the semantic representation of text and proposes an improved BERTopic model (SBERT-UMAP-HDBSCAN-TopMine) to enrich topic representation and topic evolution analysis. First, SBERT generates sentence embeddings, compensating for the anisotropy and non-smoothness of sentence vectors; UMAP then reduces their dimensionality, and HDBSCAN clusters the reduced vectors into topics. To address potential blind spots in semantic expression during topic identification, TopMine extracts topic phrases from the clustered topics to represent each topic. Second, Word Mover’s Distance (WMD) is used to calculate the similarity between topics in adjacent time periods, revealing the emergence and occurrence of topics and the dynamic relationships among them across time periods. Finally, the dynamic evolution of supply chain finance hotspots is analyzed. [Result/Conclusion] Taking supply chain finance as the empirical case, the paper divides the research directions into four categories: supply chain finance risk assessment, supply chain finance financing models, modern financial technology empowerment, and sustainable supply chain finance. The model improves the interpretability and recognizability of hot topics in supply chain finance. The dynamic evolution analysis shows that sustainable supply chain finance is a key area of current and future research.
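
The topic-linking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it substitutes the relaxed Word Mover's Distance (a cheaper lower bound on WMD) for exact WMD so that it runs with the standard library alone, and the 2-D word vectors, topic names, and the 0.5 distance threshold are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical 2-D word vectors; a real pipeline would use learned embeddings.
TOY_VECTORS = {
    "risk": (1.0, 0.0),
    "credit": (0.9, 0.1),
    "assessment": (0.8, 0.2),
    "green": (-1.0, 0.0),
    "carbon": (-0.9, 0.1),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Send each source word's (uniform) mass to its nearest target word."""
    weight = 1.0 / len(src)
    return sum(
        weight * min(euclidean(vectors[s], vectors[t]) for t in dst)
        for s in src
    )


def relaxed_wmd(topic_a, topic_b, vectors):
    """Relaxed WMD: the larger of the two one-sided nearest-neighbour costs."""
    return max(
        one_sided_cost(topic_a, topic_b, vectors),
        one_sided_cost(topic_b, topic_a, vectors),
    )


def link_topics(earlier, later, vectors, max_distance=0.5):
    """Match each later-period topic to its closest earlier-period topic.

    A later topic with no earlier topic within max_distance (an assumed
    threshold) is treated as newly emerged and linked to None.
    """
    links = {}
    for name, words in later.items():
        best = min(earlier, key=lambda e: relaxed_wmd(earlier[e], words, vectors))
        dist = relaxed_wmd(earlier[best], words, vectors)
        links[name] = (best if dist <= max_distance else None, dist)
    return links


# Topics (as keyword lists) from two adjacent, hypothetical time periods.
period_1 = {"risk_assessment": ["risk", "credit", "assessment"]}
period_2 = {
    "credit_risk": ["risk", "credit"],
    "green_finance": ["green", "carbon"],
}

links = link_topics(period_1, period_2, TOY_VECTORS)
for topic, (parent, dist) in links.items():
    status = f"continues {parent}" if parent else "newly emerged"
    print(f"{topic}: {status} (distance {dist:.3f})")
```

In this toy setup, a later topic that shares vocabulary with an earlier one is linked to it, while a topic whose words lie far from every earlier topic is flagged as new. Exact WMD would instead solve an optimal transport problem over the same word distances.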

Cite this article

Liu Ying, Yu Chunmei, Li Xiaochen, Li Ye, Zhao Mingyu. Research on Domain Topic Representation and Evolution Process Based on Improved BERTopic Model[J]. Library and Information Service, 2025, 69(3): 78-89. DOI: 10.13266/j.issn.0252-3116.2025.03.007
