Knowledge Organization

An Article-Topic Semantic Matching Analysis Method Based on Co-Word and Weighted Word2Vec

  • Ding Jingda,
  • Chen Yifan,
  • Liu Chao,
  • Cai Wei
  • School of Cultural Heritage and Information Management, Shanghai University, Shanghai 200444
Ding Jingda, professor, PhD, doctoral supervisor, E-mail: djdhyn@126.com; Chen Yifan, master's student; Liu Chao, doctoral student; Cai Wei, master's student.

Received date: 2021-11-10

  Revised date: 2022-03-26

  Published online: 2022-06-25

Funding

This article is one of the research outcomes of the National Social Science Fund of China project "Methods for Detecting Emerging Topics in the Social Sciences Based on Multi-Source Data Fusion and Their Empirical Study" (Grant No. 21BTQ010).

Cite this article

Ding Jingda, Chen Yifan, Liu Chao, Cai Wei. An Article-Topic Semantic Matching Analysis Method Based on Co-Word and Weighted Word2Vec[J]. Library and Information Service (图书情报工作), 2022, 66(12): 108-116. DOI: 10.13266/j.issn.0252-3116.2022.12.010

Abstract

[Purpose/Significance] Co-word analysis is an important method for topic identification, but it has limitations. Combining it with weighted Word2Vec vectors helps clarify which topic each article belongs to and supports better analysis of how topics develop and evolve. [Method/Process] After topics are clustered by co-word analysis, article vectors and cluster-topic vectors are each computed from weighted Word2Vec vectors, and articles are semantically matched to topics based on cosine similarity. [Result/Conclusion] An empirical analysis of the knowledge sharing field, covering both Chinese and international literature, shows that the method matches relevant articles to their corresponding topics well and supports dynamic analysis of topic characteristics and evolution at the article level.
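The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme, so the weights below are hypothetical, and the toy three-dimensional vectors stand in for embeddings that would normally come from a trained Word2Vec model. The function names (`weighted_vector`, `cosine`, `match_article`) and topic labels are assumptions introduced for illustration only.

```python
import numpy as np

# Toy word vectors (assumption: stand-ins for trained Word2Vec embeddings).
embeddings = {
    "knowledge": [1.0, 0.0, 0.0],
    "sharing": [0.8, 0.2, 0.0],
    "innovation": [0.0, 1.0, 0.0],
    "performance": [0.1, 0.9, 0.0],
}

def weighted_vector(words, weights, embeddings):
    """Weighted average of word vectors; words without an embedding are skipped."""
    pairs = [(embeddings[w], wt) for w, wt in zip(words, weights) if w in embeddings]
    if not pairs:
        raise ValueError("none of the words has an embedding")
    vecs = np.array([v for v, _ in pairs], dtype=float)
    wts = np.array([wt for _, wt in pairs], dtype=float)
    return (vecs * wts[:, None]).sum(axis=0) / wts.sum()

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_article(article_vec, topic_vecs):
    """Return the most similar topic and the similarity score for every topic."""
    scores = {name: cosine(article_vec, vec) for name, vec in topic_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Topic vectors built from the keywords of each co-word cluster.
# The weights are hypothetical (for example, keyword frequencies).
topic_vecs = {
    "sharing_behavior": weighted_vector(["knowledge", "sharing"], [2, 1], embeddings),
    "innovation_outcomes": weighted_vector(["innovation", "performance"], [1, 1], embeddings),
}

# Article vector built from the article's own keywords and hypothetical weights.
article_vec = weighted_vector(["knowledge", "innovation"], [3, 1], embeddings)

best, scores = match_article(article_vec, topic_vecs)
print(best, scores)
```

Each article is assigned to the topic with the highest cosine similarity. Whether a minimum similarity threshold is also applied is not stated in the abstract, so none is assumed here.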
