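Special topic: Ancient Text Information Processing and the Sinological Index Series (《汉学引得丛刊》)

Research of Automatic Classification for Pre-Qin Philosophers Literature Based on the Support Vector Machine

  • Wang Dongbo,
  • He Lin,
  • Huang Shuiqing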
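  • 1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095;
    2. Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095
Wang Dongbo (ORCID: 0000-0002-9894-9550), associate professor, master's supervisor; He Lin (ORCID: 0000-0002-4207-3588), associate dean, professor, doctoral supervisor.

Received: 2017-02-13

  Revised: 2017-06-09

  Published online: 2017-06-20

Funding

This work is one of the outcomes of the National Social Science Fund of China Major Project "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series (《汉学引得丛刊》)" (Grant No. 15ZDB127), the Nanjing Agricultural University Humanities and Social Sciences Fund (Grant No. SKPT2016001), and the National Social Science Fund of China Youth Project "Research on the Harvard-Yenching Institute Sinological Index Series" (Grant No. 12CTQ019).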

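Abstract

[Purpose/significance] Against the background of the rise of humanities computing, this paper explores the automatic classification of pre-Qin philosophers' classics, so that knowledge can be mined from ancient texts more deeply and precisely. [Method/process] Using training and test corpora built from nine pre-Qin philosophical classics, namely the Analects (《论语》), Laozi (《老子》), Guanzi (《管子》), Zhuangzi (《庄子》), Sunzi (《孙子》), Han Feizi (《韩非子》), Mencius (《孟子》), Xunzi (《荀子》) and Mozi (《墨子》), the paper applies a support vector machine with features extracted by TF-IDF, information gain, the chi-square statistic and mutual information to carry out automatic classification experiments. [Result/conclusion] The automatic classification model obtained on these classics reaches an F-measure (harmonic mean of precision and recall) of 99.21%, a good result with strong potential for wider adoption and application.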
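Two of the feature statistics named in the abstract, the chi-square statistic and mutual information, can be illustrated with a minimal sketch. It uses the common textbook definitions over a 2x2 term/class document-count table, which may differ in detail from the paper's own implementation. The toy documents, the class labels, and all function names are invented for illustration and are not taken from the paper's corpus or code.

```python
import math

# Toy tokenized "documents" and class labels, invented for illustration only.
DOCS = [["ren", "li"], ["ren", "xue"], ["dao", "wu"], ["dao", "ren"]]
LABELS = ["lunyu", "lunyu", "laozi", "laozi"]


def contingency(docs, labels, term, cls):
    """Return the 2x2 document counts (a, b, c, d) for a term and a class.

    a: in class, contains term      b: not in class, contains term
    c: in class, lacks term         d: not in class, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of term/class dependence (0.0 if undefined)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def pointwise_mi(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(a * n / ((a + c) * (a + b)))


if __name__ == "__main__":
    table = contingency(DOCS, LABELS, "dao", "laozi")
    print(table, chi_square(*table), pointwise_mi(*table))
```

In a feature selection step, terms would typically be ranked by such a score per class and only the highest-scoring terms kept as input dimensions for the classifier.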

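Cite this article

Wang Dongbo, He Lin, Huang Shuiqing. Research of Automatic Classification for Pre-Qin Philosophers Literature Based on the Support Vector Machine[J]. Library and Information Service (图书情报工作), 2017, 61(12): 71-76. DOI: 10.13266/j.issn.0252-3116.2017.12.009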

Abstract

[Purpose/significance] Against the background of the rise of humanities computing, the automatic classification of Pre-Qin philosophers' literature is studied in order to mine knowledge from the ancient classics more deeply and accurately. [Method/process] Using a training and testing corpus consisting of the full texts of nine works, the Analects of Confucius, Laozi, Guanzi, Zhuangzi, Sunzi, Han Fei Zi, Mencius, Xunzi and Mozi, the paper carries out automatic classification experiments with a support vector machine, using four statistical feature selection methods: TF-IDF, information gain, the chi-square statistic and mutual information. [Result/conclusion] Support vector machine classification models for Pre-Qin philosophers' literature are obtained under the four feature selection methods. The best model reaches an F-measure of 99.21%, a good result that suggests the approach is worth promoting and applying.
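The F-measure reported above is the harmonic mean of precision and recall. The sketch below shows one way such a score could be computed per class and then macro-averaged. The abstract does not say how the paper averages across the nine classes, so macro-averaging is an assumption here. The gold and predicted labels are invented toy data, and the function names are hypothetical.

```python
from collections import Counter

# Invented toy labels, for illustration only (not results from the paper).
GOLD = ["lunyu", "lunyu", "laozi", "laozi", "mozi", "mozi"]
PRED = ["lunyu", "laozi", "laozi", "laozi", "mozi", "mozi"]


def per_class_scores(gold, pred):
    """Return {class: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative
    scores = {}
    for cls in sorted(set(gold) | set(pred)):
        p_den = tp[cls] + fp[cls]
        r_den = tp[cls] + fn[cls]
        precision = tp[cls] / p_den if p_den else 0.0
        recall = tp[cls] / r_den if r_den else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / total if total else 0.0
        scores[cls] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 values (assumed averaging scheme)."""
    scores = per_class_scores(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    for cls, (p, r, f) in per_class_scores(GOLD, PRED).items():
        print(f"{cls}: P={p:.3f} R={r:.3f} F1={f:.3f}")
    print(f"macro F1 = {macro_f1(GOLD, PRED):.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a high F-measure requires both precision and recall to be high.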
