Library and Information Service
Effect Analysis of Scientific Literature Topic Extraction Based on LDA Topic Model with Different Corpus
Received date: 2015-12-13
Revised date: 2016-01-04
Online published: 2016-01-20
[Purpose/significance] Latent Dirichlet Allocation (LDA) is used in scientific and technical intelligence analysis to discover subject topics, hot topics and development trends. This paper evaluates the effect of LDA topic extraction on three common types of scientific literature corpus, built from keywords, from abstracts, or from a mixture of keywords and abstracts. The aim is to improve the effectiveness of LDA in science and technology intelligence analysis. [Method/process] We analyze the effect of LDA topic extraction on the three corpora above and evaluate the results in two ways. One is quantitative analysis using the indexes precision, recall, F-score and information entropy; the other is qualitative analysis along two dimensions: extent of topic extraction and granularity of topics. [Result/conclusion] Experiments on domestic scientific and technical literature in the wind energy field show that, in both the quantitative and the qualitative analysis, LDA topic extraction with abstracts or with a mixture of keywords and abstracts performs better than LDA with keywords alone. LDA with abstracts and LDA with a mixture of keywords and abstracts suit different application scenarios: the former achieves a larger extent of topic extraction, and the latter a finer topic granularity.
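The three corpus variants and the quantitative indexes named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the indexes are computed, so everything here is an assumption: precision, recall and F-score compare a set of extracted topic labels against a manually curated reference set, and information entropy is the Shannon entropy of a probability distribution such as one topic's word distribution. The record fields `keywords` and `abstract_tokens` and all function names are hypothetical.

```python
import math


def build_corpora(records):
    """Build the three corpus variants: keywords, abstracts, and their mixture.

    Each record is assumed to hold already-tokenized text in two
    hypothetical fields, "keywords" and "abstract_tokens".
    """
    keywords = [list(r["keywords"]) for r in records]
    abstracts = [list(r["abstract_tokens"]) for r in records]
    mixed = [k + a for k, a in zip(keywords, abstracts)]
    return {"keywords": keywords, "abstracts": abstracts, "mixed": mixed}


def precision_recall_f(extracted, reference):
    """Compare extracted topic labels with a reference set (assumed setup)."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    records = [{"keywords": ["wind", "turbine"],
                "abstract_tokens": ["blade", "design"]}]
    print(build_corpora(records)["mixed"])
    print(precision_recall_f(["a", "b", "c", "d"], ["a", "b", "e"]))
    print(entropy([0.5, 0.5]))
```

Under these assumptions, each of the three token lists would be passed to an LDA implementation in turn, and the same scoring functions applied to each result, so that the corpus construction is the only thing that varies between runs.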
Key words: topic model; LDA; topic extraction; effect analysis; scientific literature
Guan Peng, Wang Yuefen, Fu Zhu. Effect Analysis of Scientific Literature Topic Extraction Based on LDA Topic Model with Different Corpus[J]. Library and Information Service, 2016, 60(2): 112-121. DOI: 10.13266/j.issn.0252-3116.2016.02.018