Study on Topic Extraction of Literatures Based on Weighted Semantic and Citation Relation
Received date: 2016-01-04
Revised date: 2016-04-20
Online published: 2016-05-05
[Purpose/significance] Traditional topic extraction methods derive the themes of documents mainly from keywords, abstracts, or full texts, but their results are often incomplete or noisy. A method that starts from the semantics of the document content and combines it with citation content can extract document themes more accurately. [Method/process] This article proposes a multi-document topic extraction algorithm based on weighted semantics and citation relations. It builds a Labeled-LDA topic model from the citation content and keywords of the documents to obtain the document-topic probability distribution, then clusters the documents with the K-means method and extracts the topics of each cluster. [Result/conclusion] In the experiments, the test data are downloaded from the PubMed database. The results show that the method can accurately extract the themes of the documents.
Key words: Labeled-LDA Model; citation content; topic extraction
Yang Chunyan, Pan Youneng, Zhao Li. Study on Topic Extraction of Literatures Based on Weighted Semantic and Citation Relation[J]. Library and Information Service, 2016, 60(9): 131-138,146. DOI: 10.13266/j.issn.0252-3116.2016.09.018
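The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the document-topic probabilities below are hard-coded toy values standing in for the output of a Labeled-LDA model, the topic names are hypothetical, and taking the first k documents as initial centroids is an assumption made only to keep the run deterministic. The sketch shows plain K-means over document-topic vectors, followed by reading off the dominant topic of each cluster.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means; the first k points are the initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def dominant_topics(centroids, topic_names):
    """Return the name of the highest-probability topic for each cluster centroid."""
    return [topic_names[max(range(len(c)), key=c.__getitem__)] for c in centroids]


# Hypothetical topic names and toy document-topic distributions (each row sums to 1).
topic_names = ["gene expression", "clinical trial", "imaging"]
doc_topic = [
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.75, 0.15, 0.10],
    [0.15, 0.75, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.70, 0.10],
]

labels, centroids = kmeans(doc_topic, k=2)
cluster_topics = dominant_topics(centroids, topic_names)

for cluster_id, topic in enumerate(cluster_topics):
    docs = [i for i, lab in enumerate(labels) if lab == cluster_id]
    print(f"cluster {cluster_id}: documents {docs}, dominant topic: {topic}")
```

In a real pipeline the rows of `doc_topic` would come from the fitted topic model, and a randomized or k-means++ style initialization with several restarts would usually be preferable to the fixed initialization used here.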