Library and Information Service (图书情报工作) ›› 2015, Vol. 59 ›› Issue (11): 127-133. DOI: 10.13266/j.issn.0252-3116.2015.11.018

• Knowledge Organization •

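An Exploration of Automatic Word Segmentation for Pre-Qin Classics Using the Sinological Index Series (《汉学引得丛刊》) as a Domain Glossary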

黄水清, 王东波, 何琳   

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095
  • Received: 2015-05-12 Revised: 2015-05-22 Online: 2015-06-05 Published: 2015-06-05
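  • About the authors: Huang Shuiqing (ORCID: 0000-0002-1646-9300), dean, professor, doctoral supervisor, E-mail: sqhuang@njau.edu.cn; Wang Dongbo (ORCID: 0000-0002-9894-9550), associate professor, master's supervisor; He Lin (ORCID: 0000-0002-4207-3588), associate dean, professor, master's supervisor.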

Exploring of Word Segmentation for Fore-Qin Literature Based on the Domain Glossary of Sinological Index Series

Huang Shuiqing, Wang Dongbo, He Lin   

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095
  • Received:2015-05-12 Revised:2015-05-22 Online:2015-06-05 Published:2015-06-05

Abstract:

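[Purpose/significance] Against the background of the rise of humanities computing, this paper explores automatic word segmentation of pre-Qin literature in order to mine knowledge from ancient classics more deeply and precisely. [Method/process] A vocabulary is compiled from the 《春秋经传注疏引书引得》 in the Sinological Index Series (《汉学引得丛刊》). On training and test corpora made up of the Zuo Commentary (《春秋左氏传》) and Yanzi's Spring and Autumn Annals (《晏子春秋》), a conditional random field model, combined with feature templates determined by statistical and manual introspective methods, is used to explore automatic segmentation of pre-Qin classics. [Result/conclusion] Building on the complete workflow of automatic word segmentation for pre-Qin classics, segmentation models are obtained under simple, internal and combined feature templates. The best model reaches a harmonic mean (F-measure) of 97.47% and has strong value for wider adoption and application. During model construction, incorporating internal and external feature knowledge effectively improves the model's precision and recall.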

Key words: humanities computing, Sinological Index Series (《汉学引得丛刊》), conditional random field model, feature template

Abstract:

[Purpose/significance] With the rise of humanities computing, this paper explores automatic word segmentation of Fore-Qin literature in order to mine knowledge from the ancient classics more deeply and accurately. [Method/process] Using a domain glossary for the Zuo Commentary drawn from the Sinological Index Series, the paper segments Fore-Qin literature with conditional random fields on training and test corpora consisting of the Zuo Commentary and Yanzi's Spring and Autumn Annals. The feature templates are determined by statistical and rule-based methods. [Result/conclusion] Segmentation models based on a simple feature template, an internal feature template and a combined feature template are obtained under the framework of word segmentation for Fore-Qin literature. The best segmentation model reaches an F-measure of 97.47%, which shows great potential for popularization and application. In the process of constructing the model, the precision and recall of the segmentation model are effectively improved by merging internal and external feature knowledge.
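The precision, recall and F-measure reported above can be illustrated with a minimal sketch. It assumes the common word-level convention for segmentation evaluation, in which a predicted word counts as correct only when both of its boundaries match the gold standard. The abstract does not state the exact scoring procedure, so the helper names `to_spans` and `prf` and the example segmentation are hypothetical and not taken from the paper's corpus.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall and F-measure for one sentence.

    Assumption: a predicted word is correct only if its span exactly
    matches a span in the gold segmentation.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical gold and predicted segmentations of the same four characters.
gold = ["晏子", "使", "楚"]
pred = ["晏", "子", "使", "楚"]
p, r, f = prf(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

In this toy case the prediction splits one gold word in two, so two of four predicted words and two of three gold words match. A corpus-level score would sum the counts over all sentences before dividing.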

Key words: humanities computing, Sinological Index Series, conditional random fields, feature template

CLC number: