[Purpose/significance] Identifying the conditions under which keywords burst in scientific and technical literature, and building a burst-word detection model on that basis, helps researchers detect potential research hotspots and judge research directions more accurately. [Method/process] Keywords and their frequencies are collected for each year and arranged in a keyword-year matrix. The analysis period is divided into a standard window, an observation window, and a performance window. In the observation window, a multi-measure burst-word detection model identifies keywords with burst characteristics; in the performance window, LDA is used to mine topic words, which form the hot-word set. A burst-word coverage index is designed and, combined with the sliding time window method, used to compute the coverage between the burst-word set and the hot-word set in different time windows, thereby verifying the model's recognition accuracy. [Result/conclusion] Across three sliding time windows, the burst-word coverage exceeded 70% in every case. In a comparison experiment against the burst words detected by Citespace, the model's coverage was higher in all three windows, indicating that the proposed burst-word detection model performs well.
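The window split and coverage check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the coverage formula, so the sketch assumes coverage is the share of detected burst words that also appear in the hot-word set. The function names, window lengths, and the `shift` parameter are hypothetical. The multi-measure detection model and the LDA step are not reproduced; their outputs are treated as plain sets of words.

```python
from typing import List, Set, Tuple


def split_windows(
    years: List[int],
    standard_len: int,
    observation_len: int,
    performance_len: int,
    shift: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Slice a sorted list of years into three consecutive windows.

    `shift` slides all three windows forward by that many years.
    The window lengths are illustrative, not taken from the paper.
    """
    standard_end = shift + standard_len
    observation_end = standard_end + observation_len
    performance_end = observation_end + performance_len
    if shift < 0 or performance_end > len(years):
        raise ValueError("windows fall outside the analysis period")
    return (
        years[shift:standard_end],
        years[standard_end:observation_end],
        years[observation_end:performance_end],
    )


def coverage(burst_words: Set[str], hot_words: Set[str]) -> float:
    """Assumed definition: fraction of burst words found among the hot words."""
    if not burst_words:
        return 0.0
    return len(burst_words & hot_words) / len(burst_words)


if __name__ == "__main__":
    years = list(range(2010, 2020))
    # Three slides of the window triple, mirroring the three runs in the abstract.
    for shift in range(3):
        standard, observation, performance = split_windows(years, 3, 3, 2, shift)
        print(shift, standard, observation, performance)

    # Toy word sets, purely illustrative.
    burst = {"deep learning", "altmetrics", "blockchain", "open data"}
    hot = {"deep learning", "altmetrics", "blockchain", "knowledge graph"}
    print(coverage(burst, hot))
```

In a fuller pipeline, `burst_words` would come from the detection model applied to the observation window and `hot_words` from the LDA topics fitted on the performance window, with one coverage value computed per slide.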