Researches of Automatic Part-of-speech Tagging for Pre-Qin Literature Based on Multi-feature Knowledge

  • Wang Dongbo,
  • Huang Shuiqing,
  • He Lin
  • 1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
    2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095

Received date: 2017-02-13

Revised date: 2017-04-12

  Online published: 2017-06-20

Abstract

[Purpose/significance] Pre-Qin literature occupies an extremely important place among the ancient classics. To mine deep knowledge from Pre-Qin literature more accurately, automatic part-of-speech tagging is the first task, and this paper presents a method for it. [Method/process] Using a conditional random fields model and a combined feature template selected by statistical methods, the paper constructs an automatic part-of-speech tagging model for Pre-Qin literature. [Result/conclusion] Following a part-of-speech tagging workflow for Pre-Qin literature, tagging models based on a simple feature template and on a combined feature template are obtained. The part-of-speech tagging model reaches an F-measure of 94.79%, which is high enough for practical application. During model construction, merging feature knowledge such as word structure, phonetic spelling and word length effectively improves the precision and recall of the segmentation model.
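As a rough illustration of the quantities the abstract mentions, the sketch below builds per-word feature rows (word, word length, phonetic spelling, tag) of the column-style kind that conditional random field toolkits commonly take as input, and computes precision, recall and F-measure over tagged word spans. It is not the authors' implementation: the function names, the `_` placeholder, the pinyin lookup table and the tag labels `n`, `d`, `v` are all assumptions made for this example, and the paper's actual feature templates are not reproduced here.

```python
from typing import Dict, List, Set, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs; tag labels are hypothetical


def to_feature_rows(tagged: Tagged, pinyin: Dict[str, str]) -> List[str]:
    """Emit one tab-separated row per word: word, length, pinyin, POS tag."""
    rows = []
    for word, tag in tagged:
        # Word length and phonetic spelling are two feature types named in the abstract.
        # "_" is an assumed placeholder for words missing from the pinyin table.
        rows.append("\t".join([word, str(len(word)), pinyin.get(word, "_"), tag]))
    return rows


def spans(tagged: Tagged) -> Set[Tuple[int, int, str]]:
    """Convert a tagged sentence to a set of (start, end, tag) character spans."""
    out, pos = set(), 0
    for word, tag in tagged:
        out.add((pos, pos + len(word), tag))
        pos += len(word)
    return out


def precision_recall_f(gold: Tagged, pred: Tagged) -> Tuple[float, float, float]:
    """A span counts as correct only if its boundaries and tag both match."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


gold = [("君子", "n"), ("不", "d"), ("器", "v")]
pred = [("君子", "n"), ("不", "d"), ("器", "n")]  # last tag is wrong
rows = to_feature_rows(gold, {"君子": "junzi"})
p, r, f = precision_recall_f(gold, pred)
```

Scoring on character spans rather than token positions keeps the measure meaningful when the predicted segmentation differs from the gold segmentation, which matters when tagging depends on an upstream segmentation step.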

Cite this article

Wang Dongbo, Huang Shuiqing, He Lin. Researches of Automatic Part-of-speech Tagging for Pre-Qin Literature Based on Multi-feature Knowledge[J]. Library and Information Service, 2017, 61(12): 64-70. DOI: 10.13266/j.issn.0252-3116.2017.12.008
