Library and Information Service ›› 2017, Vol. 61 ›› Issue (12): 71-76. DOI: 10.13266/j.issn.0252-3116.2017.12.009

• Special topic: Ancient Chinese Information Processing and the Sinological Index Series •

Research of Automatic Classification for Pre-Qin Philosophers Literature Based on the Support Vector Machine

Wang Dongbo1,2, He Lin1,2, Huang Shuiqing1,2

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
  2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095
  • Received: 2017-02-13 Revised: 2017-06-09 Online: 2017-06-20 Published: 2017-06-20
  • About the authors: Wang Dongbo (ORCID: 0000-0002-9894-9550), associate professor, master's supervisor; He Lin (ORCID: 0000-0002-4207-3588), vice dean, professor, doctoral supervisor.
  • Supported by:
    This paper is one of the research outcomes of the Major Project of the National Social Science Fund of China "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series" (No. 15ZDB127), the Humanities and Social Sciences Fund Project of Nanjing Agricultural University (No. SKPT2016001), and the Youth Project of the National Social Science Fund of China "Research on the Harvard-Yenching Institute Sinological Index Series" (No. 12CTQ019).

Abstract: [Purpose/significance] Against the background of the rise of humanities computing, this paper explores the automatic classification of the Pre-Qin philosophers' classics, in order to mine knowledge from ancient texts more deeply and accurately. [Method/process] Using training and test corpora built from the full texts of 9 Pre-Qin philosophical classics (the Analects of Confucius, Laozi, Guanzi, Zhuangzi, Sunzi, Han Fei Zi, Mencius, Xunzi and Mozi), the paper carries out automatic classification experiments with a support vector machine, taking TF-IDF, information gain, chi-square statistics and mutual information as features. [Result/conclusion] Classification models based on the support vector machine are obtained under the 4 feature selection methods. The best F-measure (the harmonic mean of precision and recall) reaches 99.21%, which indicates good performance and value for wider application.

Key words: Pre-Qin Literature, support vector machine, automatic classification, ancient Chinese character information processing
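
To make the feature scoring and the evaluation measure named in the abstract more concrete, here is a minimal sketch of two of the four statistics (TF-IDF and chi-square) and of the F-measure. It is not the authors' implementation. The toy corpus, the use of single characters as features, the particular TF-IDF variant (length-normalized term frequency with a natural-log inverse document frequency), and all function names are assumptions made for illustration. SVM training itself is omitted; in practice the selected and weighted features would be passed to an SVM library.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, NOT the paper's data): (label, text) pairs.
DOCS = [
    ("Analects", "学而时习之"),
    ("Analects", "温故而知新"),
    ("Laozi", "道可道非常道"),
    ("Laozi", "上善若水"),
]


def tf_idf(docs):
    """Return one {character: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each character.
    df = Counter(ch for _, text in docs for ch in set(text))
    weights = []
    for _, text in docs:
        tf = Counter(text)
        weights.append(
            {ch: (c / len(text)) * math.log(n / df[ch]) for ch, c in tf.items()}
        )
    return weights


def chi_square(docs, term, label):
    """Chi-square statistic of the 2x2 table (term present?) x (doc in label?)."""
    a = sum(1 for l, t in docs if l == label and term in t)      # in class, has term
    b = sum(1 for l, t in docs if l != label and term in t)      # other class, has term
    c = sum(1 for l, t in docs if l == label and term not in t)  # in class, no term
    d = sum(1 for l, t in docs if l != label and term not in t)  # other class, no term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

In this toy corpus, a character that appears in every document of one class and in no document of the other (such as "而" for the Analects snippets) receives a high chi-square score, which is the property that makes the statistic useful for selecting class-discriminating features.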

CLC number: