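Special topic: Research on the Semantic Organization and Mining of Pre-Qin Classics

Research on Knowledge Ontology Construction Techniques for Pre-Qin Classics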

  • 何琳,
  • 陈雅玲,
  • 孙珂迪
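  • Department of Information Management, Nanjing Agricultural University, Nanjing 210095
He Lin (ORCID: 0000-0002-4207-3588), professor, PhD, doctoral supervisor, E-mail: helin@njau.edu.cn; Chen Yaling (ORCID: 0000-0002-7515-4843), master's student; Sun Kedi (ORCID: 0000-0003-0193-1117), master's student.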

Received: 2019-07-10

  Revised: 2019-11-24

  Published online: 2020-04-05

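Funding

This paper is one of the research outcomes of the project "Research on a Classical Chinese Ontology Based on the Sinological Index Series (《汉学引得丛刊》)" (Project No. SKCX2017004), supported by the Fundamental Research Funds for the Central Universities.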

Research on Ontology Building Methods of Chinese Ancient Books

  • He Lin,
  • Chen Yaling,
  • Sun Kedi
  • College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095

Received date: 2019-07-10

  Revised date: 2019-11-24

  Online published: 2020-04-05

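Abstract (Chinese version)

[Purpose/significance] Building a semantic ontology for classical texts can facilitate their mining and analysis. However, classical texts differ considerably from modern texts in grammar, which makes building a semantic ontology for the classics difficult. [Method/process] This paper uses natural language processing techniques to explore ontology construction methods for pre-Qin classics. Taking CIDOC CRM, which is widely used internationally in the cultural heritage domain, as the framework, it designs an ontology model for pre-Qin classics. Based on the content characteristics and syntactic features of classical texts, it combines rule-based extraction with conditional random fields to propose a set of techniques for automatically acquiring ontology instances, and tests them using the Zuo Zhuan (《左传》) as the experimental corpus. [Result/conclusion] Experiments show that the proposed ontology instance extraction techniques improve the efficiency of ontology construction for classical texts fairly well. The rule-based ontology instance extraction experiments achieve an F-score of about 93%, and the conditional random field based extraction achieves an F-score of 82.51% with the best feature template. Part-of-speech information and position information play an important role in acquiring ontology instances.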

Cite this article

He Lin, Chen Yaling, Sun Kedi. Research on Ontology Building Methods of Chinese Ancient Books[J]. Library and Information Service, 2020, 64(7): 13-19. DOI: 10.13266/j.issn.0252-3116.2020.07.002

Abstract

[Purpose/significance] Building a semantic ontology for ancient Chinese books supports text mining and text analysis of Chinese history. However, ancient and modern Chinese differ considerably in syntactic structure, which makes ontology building for ancient Chinese books difficult. [Method/process] This paper applies natural language processing (NLP) techniques to ontology building for ancient Chinese books. We design the ontology model on the basis of CIDOC CRM, an international standard for describing cultural heritage. Drawing on the syntactic structure of ancient Chinese books, we then propose a hybrid method for extracting ontology instances automatically, which combines rule-based extraction with conditional random fields (CRFs). Finally, we test the method on one ancient Chinese book, the Zuo Zhuan. [Result/conclusion] The experimental results show that the method can improve the precision of ontology instance extraction and thereby the efficiency of ontology construction from ancient Chinese books. The rule-based method reached an F-score of about 93%, and the CRF method reached an F-score of 82.51% with the best feature template. The results also indicate that position and part-of-speech features of words are important for extracting ontology instances.
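The role of part-of-speech and position features, and the F-score used to report results, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names, the example sentence, and the part-of-speech tags (`nr`, `v`, `p`, `ns`) are all assumptions made for illustration, and the paper's actual feature templates are not reproduced here.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); the tag set is hypothetical


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build a feature dict for token i from its word, POS tag, position,
    and a one-token context window on each side (illustrative template)."""
    word, pos = sent[i]
    feats: Dict[str, object] = {
        "word": word,
        "pos": pos,
        "index": i,                     # absolute position in the sentence
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
    }
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = sent[i - 1]
    if i < len(sent) - 1:
        feats["next_word"], feats["next_pos"] = sent[i + 1]
    return feats


def f_score(gold: Set[tuple], predicted: Set[tuple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of extracted instances."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical segmented and tagged sentence; tags are illustrative only.
sent: List[Token] = [("郑伯", "nr"), ("克", "v"), ("段", "nr"), ("于", "p"), ("鄢", "ns")]
features = [token_features(sent, i) for i in range(len(sent))]

# Instances are represented here as (start, end, label) spans.
gold = {(0, 1, "Person"), (2, 3, "Person"), (4, 5, "Place")}
predicted = {(0, 1, "Person"), (4, 5, "Place"), (1, 2, "Person")}
precision, recall, f1 = f_score(gold, predicted)
```

Feature dicts of this shape could be passed to a sequence labeller such as a CRF, while `f_score` compares extracted instances against a gold standard; both are sketches under the stated assumptions.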
