Intelligence Research

Topic Filtering of LDA Model Recognition Results Based on the Keywords Relevance Index (KRI)

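  • Jiang Tian,
  • Liu Xiaoping,
  • Liu Huizhou
  • National Science Library, Chinese Academy of Sciences, Beijing 100190
Jiang Tian (ORCID: 0000-0002-9065-1223), postdoctoral researcher, E-mail: jiangtian@mail.las.ac.cn; Liu Xiaoping (ORCID: 0000-0002-3342-8041), research professor, master's supervisor; Liu Huizhou (ORCID: 0000-0002-7808-8570), research professor, doctoral supervisor.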

Received date: 2019-04-16

  Revised date: 2019-07-23

  Online published: 2020-02-05

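Funding

This work is one of the research outcomes of the sub-project "Strategic Intelligence Research and Decision Consulting in Basic, Interdisciplinary, and Frontier Fields" (project No. Y8C0381005-01) under the Chinese Academy of Sciences special program for literature and information capacity building, "Construction of a Strategic Intelligence Research and Decision Consulting System in Science and Technology Fields".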


Cite this article

Jiang Tian, Liu Xiaoping, Liu Huizhou. Topic Filtering of LDA Model Recognition Results Based on the Keywords Relevance Index (KRI)[J]. Library and Information Service, 2020, 64(3): 92-99. DOI: 10.13266/j.issn.0252-3116.2020.03.010

Abstract

[Purpose/significance] Topic identification results from the LDA model often contain meaningless noise topics. An effective topic filtering method is therefore needed to eliminate these noise topics and ensure the accuracy of topic identification and subsequent evolution analysis. [Method/process] Based on the co-occurrence relationships between keywords, a keywords relevance index (KRI) was constructed to screen and filter topics quantitatively. Taking the field of single-cell research as an example, the KRI value of each topic-keyword distribution was calculated and compared with the results of manual interpretation. [Result/conclusion] The experimental results show that this method effectively eliminates noise topics from the LDA model's identification results, improving the accuracy of topic identification and of subsequent topic evolution analysis. It also reduces, to some extent, the dependence of topic identification on manual interpretation.
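The abstract does not state the KRI formula, so the following is only a rough sketch of the general idea of scoring a topic by how strongly its top keywords co-occur. As a stand-in it uses the equivalence index from co-word analysis, c_ij^2 / (c_i * c_j), averaged over keyword pairs; this is an assumption, not the paper's definition of KRI. The function names, the threshold value, and the toy documents are all hypothetical.

```python
from itertools import combinations


def cooccurrence_score(top_keywords, documents):
    """Mean pairwise equivalence index of a topic's top keywords.

    Assumption: a stand-in for KRI, not the formula used in the paper.
    """
    doc_sets = [set(doc) for doc in documents]
    # Number of documents in which each keyword appears.
    freq = {k: sum(k in d for d in doc_sets) for k in top_keywords}
    scores = []
    for a, b in combinations(top_keywords, 2):
        # Number of documents in which both keywords appear together.
        both = sum(a in d and b in d for d in doc_sets)
        denom = freq[a] * freq[b]
        scores.append(both * both / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def filter_topics(topics, documents, threshold):
    """Keep topics whose score reaches the (hypothetical) threshold."""
    return {
        name: keywords
        for name, keywords in topics.items()
        if cooccurrence_score(keywords, documents) >= threshold
    }


# Toy data: each document is a list of its keywords.
documents = [
    ["single-cell", "rna-seq", "transcriptome"],
    ["single-cell", "rna-seq", "heterogeneity"],
    ["rna-seq", "transcriptome", "single-cell"],
    ["study", "microfluidics"],
    ["result", "proteomics"],
    ["imaging", "bacteria"],
]
topics = {
    "coherent": ["single-cell", "rna-seq", "transcriptome"],
    "noise": ["study", "result", "imaging"],
}
kept = filter_topics(topics, documents, threshold=0.3)
```

In this toy data the keywords of the "noise" topic never appear together in a document, so that topic scores zero and is filtered out. In practice a threshold would need to be calibrated, for example against manual interpretation as the abstract describes.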
