A Novel Knowledge Extraction Approach Oriented on Unstructured Information of Chinese Online Encyclopedia

doi:10.13266/j.issn.0252-3116.2016.13.016

Library and Information Service, 2016, Vol. 60, Issue 13: 126-133.


Wang Ting, Ji Fujun, Xu Tiansheng

Information School, Capital University of Economics and Business, Beijing 100070

Received: 2016-05-10 Revised: 2016-06-24 Online: 2016-07-05 Published: 2016-07-05
About the authors: Wang Ting (ORCID: 0000-0003-2481-2890), lecturer, PhD, E-mail: wangting@cueb.edu.cn; Ji Fujun (ORCID: 0000-0002-8068-4726), associate professor, master's supervisor; Xu Tiansheng (ORCID: 0000-0002-5876-9248), professor, master's supervisor.
Funding:
This work is one of the research outcomes of the Capital University of Economics and Business research project "Research on Key Technologies for Constructing Chinese Linked Data" (Grant No. 00791654490223) and the Beijing Social Science Fund project "Research on the Influence of Micro-media on Changes in the Behavior Patterns of Beijing College Students" (Grant No. 15ZHB011).

Abstract

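Abstract (translated from the Chinese): [Purpose/significance] When building a large-scale knowledge base, manual construction is inefficient and often infeasible, so automatically acquiring knowledge at scale from online encyclopedias has drawn growing attention from researchers. Existing work focuses mainly on extracting knowledge from English online encyclopedias, while knowledge extraction from Chinese encyclopedia sources is still at an early stage. [Method/process] To address the construction of large-scale Chinese knowledge bases, this paper proposes a new method for automatically constructing a large-scale knowledge base on top of the structure of Chinese online encyclopedias. In the first stage, the semantic relations between the subjects and objects of knowledge triples are learned through self-expansion. In the second stage, a collaborative classifier combining conditional random fields and a support vector machine predicts the semantic relations between the labeled attribute and attribute-value entities. [Result/conclusion] Experimental evaluation shows that, compared with previous work, the method improves entity recognition precision and recall on typical Chinese encyclopedia category pages by up to about 10% and 6%, respectively.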

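Keywords (translated from the Chinese): Chinese knowledge base, open online encyclopedia, new word discovery, conditional random fields, support vector machine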
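
The abstract reports its results as precision and recall. As a reminder of how these two measures are usually computed for extracted knowledge triples, here is a minimal sketch. The `Triple` layout, the function name `precision_recall`, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Set, Tuple

# Assumed layout for illustration: (subject, attribute, value).
Triple = Tuple[str, str, str]


def precision_recall(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted triples against a gold set."""
    true_positives = len(predicted & gold)
    # Precision: share of predicted triples that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold triples that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not from the paper's experiments.
gold = {
    ("EntityA", "attr1", "value1"),
    ("EntityA", "attr2", "value2"),
    ("EntityA", "attr3", "value3"),
}
predicted = {
    ("EntityA", "attr1", "value1"),
    ("EntityA", "attr3", "wrong"),
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

One of the two predicted triples matches the gold set, and one of the three gold triples is recovered, so the two measures differ even on this tiny example.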

Abstract: [Purpose/significance] When constructing a large-scale knowledge base, manual construction is inefficient and inflexible. Automatically extracting knowledge at scale from online encyclopedias has therefore attracted the attention of a growing number of scholars. Current research focuses mainly on extracting data from English online encyclopedias, whereas research on knowledge extraction from Chinese or other-language data sources is rare. [Method/process] This paper proposes an automatic construction scheme for large-scale knowledge bases built on Chinese online encyclopedias. (i) In the first stage, self-expanded learning is performed on the semantic relations between the subjects and objects of knowledge triples. (ii) In the second stage, the semantic relations between labeled attributes and their value entities are predicted with a classifier that combines Conditional Random Fields (CRFs) and a Support Vector Machine (SVM). [Result/conclusion] A large-scale knowledge base is constructed automatically with the scheme, and the experimental results indicate that the scheme is feasible and effective.

Key words: Chinese knowledge base, online encyclopedia, new word detection, CRF, SVM
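
The second stage labels attributes and attribute values with a sequence model before a classifier judges their relation. The sketch below covers only the step in between, under stated assumptions: it takes a BIO tag sequence of the kind a CRF labeler typically emits and turns it into candidate attribute/value pairs that a relation classifier could then score. The tag names (`B-ATTR`, `I-ATTR`, `B-VAL`, `I-VAL`, `O`), the function names, and the pairing rule are hypothetical; the paper's actual tag set and features are not given on this page.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse a BIO tag sequence into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    buffer: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; flush any span that is still open.
            if label is not None:
                spans.append((label, "".join(buffer)))
            label, buffer = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            buffer.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the open span; the current token is discarded.
            if label is not None:
                spans.append((label, "".join(buffer)))
            label, buffer = None, []
    if label is not None:
        spans.append((label, "".join(buffer)))
    return spans


def candidate_pairs(spans: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each value span with the most recent attribute span before it."""
    pairs: List[Tuple[str, str]] = []
    last_attr: Optional[str] = None
    for label, text in spans:
        if label == "ATTR":
            last_attr = text
        elif label == "VAL" and last_attr is not None:
            pairs.append((last_attr, text))
    return pairs


# Hypothetical character-level example (tags written by hand, not model output).
tokens = list("人口约两千万")
tags = ["B-ATTR", "I-ATTR", "O", "B-VAL", "I-VAL", "I-VAL"]

spans = decode_bio(tokens, tags)
pairs = candidate_pairs(spans)
print(spans)
print(pairs)
```

Character-level tokens are used here because Chinese text has no whitespace word boundaries; a word-segmented input would work the same way as long as tokens and tags stay aligned.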

CLC number: G250

Wang Ting, Ji Fujun, Xu Tiansheng. A Novel Knowledge Extraction Approach Oriented on Unstructured Information of Chinese Online Encyclopedia[J]. Library and Information Service, 2016, 60(13): 126-133.

