Received: 2015-05-12
Revised: 2015-05-22
Published online: 2015-06-05
Exploring Word Segmentation for Pre-Qin Literature Based on the Domain Glossary of the Sinological Index Series
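[Purpose/significance] Against the background of the rise of humanities computing, this paper explores automatic word segmentation of pre-Qin literature in order to mine knowledge from ancient classics more deeply and precisely. [Method/process] A vocabulary list is compiled from the Chunqiu Jingzhuan Zhushu Yinshu Yinde (《春秋经传注疏引书引得》) in the Sinological Index Series (《汉学引得丛刊》). On training and test corpora drawn from the Zuo Commentary (《春秋左氏传》) and Yanzi's Spring and Autumn Annals (《晏子春秋》), a conditional random field model is combined with feature templates determined by statistical and manual introspective methods to segment the pre-Qin classics automatically. [Result/conclusion] Following the full workflow for automatic segmentation of pre-Qin classics, segmentation models are obtained under simple, internal, and combined feature templates. The best model reaches an F-measure (the harmonic mean of precision and recall) of 97.47%, which indicates strong potential for wider adoption and application. During model construction, incorporating internal and external feature knowledge effectively improves the models' precision and recall.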
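As a rough illustration of the character-tagging formulation that conditional random field segmenters typically rely on, and of how precision, recall, and the F-measure are computed for segmentation, the following is a minimal sketch. It assumes a four-tag B/M/E/S scheme (begin, middle, end, single), which the abstract does not specify; the function names and the sample sentence with its segmentations are hypothetical and are not taken from the paper's corpus or models.

```python
def words_to_tags(words):
    """Map a segmented sentence to per-character tags (assumed B/M/E/S scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from a character sequence and its B/M/E/S tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(buf)
            buf = ""
    if buf:  # tolerate a tag sequence that ends without E or S
        words.append(buf)
    return words


def spans(words):
    """Return the set of (start, end) character offsets covered by each word."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F-measure (harmonic mean of the two)."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: a hand-made reference segmentation and a system output.
gold = ["晏子", "使", "楚"]
pred = ["晏", "子", "使", "楚"]
gold_tags = words_to_tags(gold)
precision, recall, f_measure = prf(gold, pred)
```

In a setup like this, the tag sequence is what the sequence model would be trained to predict per character, and the feature templates mentioned in the abstract would decide which neighbouring characters and glossary cues the model sees; those templates are not reproduced here.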
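Huang Shuiqing, Wang Dongbo, He Lin. Exploring Word Segmentation for Pre-Qin Literature Based on the Domain Glossary of the Sinological Index Series[J]. Library and Information Service (图书情报工作), 2015, 59(11): 127-133. DOI: 10.13266/j.issn.0252-3116.2015.11.018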