Knowledge Organization

Research on Metadata Fusion of Multi-Source Documents Based on the Decision Tree

  • Li Jing
  • Hu Qian
  • Li Xiang
  • Xiao Bing
  • School of Information Management, Central China Normal University, Wuhan 430079
Li Jing, PhD candidate, E-mail: lj2016122579@mails.ccnu.edu.cn; Hu Qian, professor and doctoral supervisor; Li Xiang, master's student; Xiao Bing, PhD candidate.

Received date: 2021-08-01

  Revised date: 2021-10-15

  Online published: 2022-03-30

Funding

This paper is one of the research outcomes of the National Social Science Fund of China project "Research on the Integration of Industry Information Services for Industrial Chains in the 'Internet Plus' Context" (Grant No. 16BTQ063).

Cite this article

Li Jing, Hu Qian, Li Xiang, Xiao Bing. Research on Metadata Fusion of Multi-Source Documents Based on the Decision Tree[J]. Library and Information Service, 2022, 66(6): 118-125. DOI: 10.13266/j.issn.0252-3116.2022.06.013

Abstract

[Purpose/Significance] Building a multi-source document metadata fusion model helps improve the overall quality of document metadata, supports metadata management and use in resource discovery systems, and improves the resource discovery experience for users. This paper refines the metadata duplicate-detection strategy that the authors proposed earlier, moving it from an experience-driven procedure to an automated one and raising the automation level of the whole process while preserving the quality of duplicate detection and fusion. [Method/Process] Metadata fields differ across document types, and they also differ across sources for the same document, so the duplicate-detection method has to vary accordingly. To handle this, the paper proposes an automated decision-tree-based model for multi-source document metadata fusion that recasts duplicate detection as a classification problem: features are selected according to feature similarity, a decision tree is constructed from them, and metadata duplicate detection and fusion are carried out on that basis. Experiments on metadata for different types of document resources verify the effectiveness of the strategy. [Result/Conclusion] For all five document types, the accuracy of the duplicate-detection strategy is above 99% and the recall is above 98%, so the overall performance is good. For the fusion strategy, the proportion of metadata fields whose quality improved is 15.15% for patents, 36.80% for dissertations, 15.29% for journal papers, 52.63% for conference papers, and 15.38% for books, a marked improvement in every case.
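The pipeline summarized in the abstract (per-field similarity features, a decision tree that classifies a record pair as duplicate or distinct, then fusion of the duplicates) can be illustrated with a small sketch. The abstract does not specify the features, similarity measure, tree algorithm, or fusion rule, so everything below is an assumption made for illustration: the field names in `FIELDS`, token-level Jaccard similarity, an information-gain split, the "keep the longer value" fusion rule, and the toy records are all hypothetical and are not the authors' implementation.

```python
from math import log2

FIELDS = ("title", "authors", "year")  # hypothetical metadata fields


def similarity(a, b):
    """Token-level Jaccard similarity in [0, 1]; a missing value scores 0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def features(r1, r2):
    """One similarity score per field for a candidate record pair."""
    return [similarity(r1.get(f, ""), r2.get(f, "")) for f in FIELDS]


def entropy(labels):
    """Binary entropy of a non-empty sequence of 0/1 labels."""
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def best_split(X, y):
    """Return (gain, field_index, threshold) with the highest information gain."""
    best = (0.0, None, None)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            rest = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - rest
            if gain > best[0]:
                best = (gain, j, t)
    return best


def build_tree(X, y, max_depth=3):
    """Grow a small binary tree; leaves are 1 (duplicate) or 0 (distinct)."""
    majority = int(2 * sum(y) >= len(y))
    if len(set(y)) == 1 or max_depth == 0:
        return majority
    _, j, t = best_split(X, y)
    if j is None:  # no split improves purity
        return majority
    left = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    return (j, t,
            build_tree(*zip(*left), max_depth=max_depth - 1),
            build_tree(*zip(*right), max_depth=max_depth - 1))


def predict(tree, x):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def fuse(r1, r2):
    """Illustrative fusion rule: keep the longer value of each field."""
    return {f: max(r1.get(f, ""), r2.get(f, ""), key=len)
            for f in sorted(set(r1) | set(r2))}


# Toy labelled pairs (made up for this sketch): 1 = same document, 0 = different.
a1 = {"title": "decision trees for record linkage", "authors": "li hu", "year": "2021"}
a2 = {"title": "Decision Trees for Record Linkage", "authors": "li", "year": "2021"}
b1 = {"title": "a survey of metadata quality", "authors": "wang zhao chen", "year": "2019"}
b2 = {"title": "survey of metadata quality", "authors": "wang zhao chen", "year": "2019"}
c1 = {"title": "decision trees for text classification", "authors": "sun yang", "year": "2021"}
labelled = [(a1, a2, 1), (b1, b2, 1), (a1, b1, 0), (a1, c1, 0)]
tree = build_tree([features(p, q) for p, q, _ in labelled],
                  [lab for _, _, lab in labelled])

# Two sources describing the same (hypothetical) document with different fields.
n1 = {"title": "fusion of multi-source metadata", "authors": "li hu xiao", "year": "2022"}
n2 = {"title": "Fusion of Multi-Source Metadata", "authors": "li", "year": "2022",
      "pages": "118-125"}
merged = fuse(n1, n2) if predict(tree, features(n1, n2)) == 1 else None
print(merged)
```

On this toy data the tree needs only one split, on title similarity, to separate the duplicate pairs from the distinct ones. The fused record keeps the fuller author list from one source and the page range from the other, which is the kind of field-level gain the reported quality improvement ratios measure. A real system would need far more labelled pairs, type-specific fields, and a more careful fusion rule than "longest value wins".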
