Chinese Text Hierarchical Segmentation Based on Knowledge Element

  • Wang Zhongyi,
  • Shen Xueying,
  • Huang Jing
  • 1. School of Information Management, Central China Normal University, Wuhan 430079;
    2. Wuhan Polytechnic, Wuhan 430074

Received date: 2018-06-22

Revised date: 2018-10-11

Online published: 2019-04-05

Abstract

[Purpose/significance] This paper aims to help users retrieve knowledge units that are complete and appropriately sized, and to meet users' requirements at multiple levels of granularity. [Method/process] This paper proposes a hierarchical text segmentation method based on knowledge elements. First, the method analyzes the types of knowledge elements and their description rules. Second, it identifies the knowledge elements in the entity resources according to those description rules, and treats the knowledge elements together with the sentences that join them as a single class. Finally, the Fisher segmentation algorithm is applied to bisect each class repeatedly until all topics are identified and the segmentation boundaries are determined, which yields the hierarchical segmentation. [Result/conclusion] The method segments text on the basis of knowledge element recognition. On the one hand, the segmentation granularity changes from the sentence to the knowledge element, which improves segmentation efficiency. On the other hand, the unit of control in knowledge services moves from the whole document down to knowledge blocks composed of knowledge elements and knowledge element sets. This provides the necessary knowledge resources, supports the progression from data retrieval and information retrieval to knowledge retrieval, improves the efficiency of knowledge acquisition, and helps transform information services into knowledge services.
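The recursive bisection step described above can be illustrated with a minimal sketch. It assumes each unit (a knowledge element or joining sentence) has already been turned into a numeric feature vector, and it stops splitting when a segment's within-segment scatter falls below a threshold. The abstract does not specify the features or the stopping rule, so the function names, the `threshold` parameter, and the toy vectors are all hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def diameter(units: Sequence[Vector]) -> float:
    """Sum of squared deviations from the segment mean (the Fisher 'diameter')."""
    n = len(units)
    if n == 0:
        return 0.0
    dim = len(units[0])
    mean = [sum(u[d] for u in units) / n for d in range(dim)]
    return sum((u[d] - mean[d]) ** 2 for u in units for d in range(dim))


def best_bisection(units: Sequence[Vector]) -> Tuple[int, float]:
    """Return (k, loss) for the ordered split units[:k] | units[k:] with minimal total diameter."""
    best_k, best_loss = 1, float("inf")
    for k in range(1, len(units)):
        loss = diameter(units[:k]) + diameter(units[k:])
        if loss < best_loss:
            best_k, best_loss = k, loss
    return best_k, best_loss


def hierarchical_segment(
    units: Sequence[Vector],
    start: int = 0,
    threshold: float = 0.5,  # assumed stopping rule, not taken from the paper
    depth: int = 0,
) -> List[Tuple[int, int]]:
    """Recursively bisect; return (depth, boundary_index) pairs, parent before children."""
    if len(units) < 2 or diameter(units) <= threshold:
        return []
    k, _ = best_bisection(units)
    left = hierarchical_segment(units[:k], start, threshold, depth + 1)
    right = hierarchical_segment(units[k:], start + k, threshold, depth + 1)
    return [(depth, start + k)] + left + right


# Toy input: six ordered units with hypothetical 2-D feature vectors.
toy_units = [[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [10.0, 0.0], [10.0, 1.0], [10.0, 1.0]]
print(hierarchical_segment(toy_units))  # [(0, 2), (1, 4)]
```

In this toy run the top-level boundary falls before unit 2, and the right-hand segment is split again before unit 4, giving a two-level hierarchy. Because the units are ordered, only contiguous splits are considered, which is what distinguishes this kind of ordered-sample partitioning from ordinary clustering.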

Cite this article

Wang Zhongyi, Shen Xueying, Huang Jing. Chinese Text Hierarchical Segmentation Based on Knowledge Element[J]. Library and Information Service, 2019, 63(7): 105-115. DOI: 10.13266/j.issn.0252-3116.2019.07.013
