A Research on Internal Hierarchical Topic Organization Model of the Book Based on hLDA

  • Chen Jing,
  • Xu Bo,
  • Wang Tiantian,
  • Lu Quan
  • 1. School of Information Management, Central China Normal University, Wuhan 430079;
    2. School of Information Management, Wuhan University, Wuhan 430072

Received date: 2016-07-29

Revised date: 2016-09-06

Online published: 2016-09-20

Abstract

[Purpose/significance] This paper analyzes and organizes the hierarchical topics of long, multi-topic documents, taking books as the representative case. It offers users fine-grained mining results that help them grasp the subject of a book and quickly understand the structure of, and relationships among, the topics within it. [Method/process] First, the hierarchical topic model (hLDA) is combined with context information to build a hierarchical topic organization model for the content of a book, and a prototype system is implemented. Second, an experiment is designed to evaluate the model. [Result/conclusion] The experimental results show that the internal hierarchical topic organization model of a book improves both recall and precision.

Cite this article

Chen Jing, Xu Bo, Wang Tiantian, Lu Quan. A Research on Internal Hierarchical Topic Organization Model of the Book Based on hLDA[J]. Library and Information Service, 2016, 60(18): 140-148. DOI: 10.13266/j.issn.0252-3116.2016.18.017

