图书情报工作 ›› 2022, Vol. 66 ›› Issue (6): 118-125. DOI: 10.13266/j.issn.0252-3116.2022.06.013

• 知识组织 •

基于决策树的多源文献元数据融合研究

李静, 胡潜, 李想, 肖兵   

  1. 华中师范大学信息管理学院 武汉 430079
  • 收稿日期:2021-08-01 修回日期:2021-10-15 出版日期:2022-03-30 发布日期:2022-03-30
  • About the authors: Li Jing, PhD candidate, E-mail: lj2016122579@mails.ccnu.edu.cn; Hu Qian, professor and doctoral supervisor; Li Xiang, master's student; Xiao Bing, PhD candidate.
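  • Funding:
    This paper is one of the research outcomes of the National Social Science Fund of China project "Research on the Integration of Industry Information Services Oriented to Industrial Chains under the 'Internet Plus' Background" (project No. 16BTQ063).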

Research on Metadata Fusion of Multi-Source Documents Based on the Decision Tree

Li Jing, Hu Qian, Li Xiang, Xiao Bing   

  1. School of Information Management, Central China Normal University, Wuhan 430079
  • Received:2021-08-01 Revised:2021-10-15 Online:2022-03-30 Published:2022-03-30

摘要: [目的/意义] 构建多源文献元数据融合模型,有助于提升文献元数据整体质量,促进资源发现系统中的元数据管理与利用,优化用户资源发现服务体验。针对笔者此前提出的文献元数据判重策略进行优化,从经验为主向自动化转变,在保障判重和融合效果的前提下,提升整个过程的自动化水平。[方法/过程] 针对不同类型文献的元数据项不一样、同一文献不同来源的元数据项不一样均会使得判重方法有所区别的情况,提出一种自动化的基于决策树的多源文献元数据融合模型,将判重问题转化为分类问题,根据特征相似度选择特征并构造决策树,在此基础上实施元数据判重及融合,并以不同类型的文献资源元数据为例进行实验,对策略进行效果验证。[结果/结论] 结果显示,对于5种文献类型元数据,判重策略的准确率均达到99%以上,召回率均达到98%以上,总体效果较好。对于融合策略的效果判断,专利、学位论文、期刊论文、会议论文、图书的元数据项质量提升比例分别为15.15%、36.80%、15.29%、52.63%、15.38%,均有明显幅度的提升。

关键词: 多源元数据, 决策树, 元数据判重, 元数据融合

Abstract: [Purpose/significance] Constructing a multi-source document metadata fusion model helps improve the overall quality of document metadata, promotes metadata management and utilization in resource discovery systems, and improves users' resource discovery experience. This paper optimizes the document metadata duplication judgment (duplicate detection) strategy previously proposed by the authors, moving it from an experience-based approach to an automated one and raising the automation level of the whole process while preserving the duplication judgment and fusion effect. [Method/process] Metadata items differ across document types, and also across sources for the same document, so the appropriate duplication judgment method varies from case to case. To address this, an automated multi-source document metadata fusion model based on the decision tree is proposed. It transforms the duplication judgment problem into a classification problem, selects features according to feature similarity, and constructs a decision tree, on the basis of which metadata duplication judgment and fusion are carried out. Experiments on metadata of different document types verify the effectiveness of the strategy. [Result/conclusion] For all five document types, the accuracy of the duplication judgment strategy exceeds 99% and the recall exceeds 98%, so the overall effect is good. For the fusion strategy, the quality improvement ratios of the metadata items of patents, dissertations, journal papers, conference papers, and books are 15.15%, 36.80%, 15.29%, 52.63%, and 15.38% respectively, a clear improvement in every case.

Key words: multi-source metadata, decision tree, metadata duplication judgment, metadata fusion
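
The idea of treating duplication judgment as a classification problem can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the sample records, the placeholder DOI, the use of `difflib.SequenceMatcher` as the similarity measure, the Gini-based split selection, and the "keep the longer value" fusion rule are all assumptions chosen for illustration. Each pair of records is turned into a vector of per-item similarities, a small decision tree is learned from labeled pairs, and pairs classified as duplicates are merged.

```python
from difflib import SequenceMatcher

FIELDS = ("title", "authors", "year")  # hypothetical metadata items

def similarity(a, b):
    """String similarity in [0, 1]; 0.0 when either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(r1, r2):
    """One similarity score per metadata item for a pair of records."""
    return [similarity(r1.get(f, ""), r2.get(f, "")) for f in FIELDS]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are 0 or 1."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return int(2 * sum(y) >= len(y))  # leaf: majority label
    j, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk the tree for one feature vector; 1 means 'duplicate'."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def fuse(r1, r2):
    """Assumed fusion rule: per item, keep the longer (more complete) value."""
    return {k: max(r1.get(k, ""), r2.get(k, ""), key=len) for k in {*r1, *r2}}

# Made-up records from two hypothetical sources; the DOI is a placeholder.
a = {"title": "Metadata Fusion of Multi-Source Documents",
     "authors": "Li Jing; Hu Qian", "year": "2022"}
b = {"title": "Metadata fusion of multi-source documents",
     "authors": "Li J.; Hu Q.", "year": "2022", "doi": "10.0000/example"}
c = {"title": "Deep Learning for Image Retrieval",
     "authors": "Wang Lei", "year": "2022"}
d = {"title": "Deep learning for image retrieval.",
     "authors": "Wang L.", "year": "2022"}

# Labeled pairs: 1 = same document, 0 = different documents.
pairs = [(a, b, 1), (c, d, 1), (a, c, 0), (a, d, 0), (b, c, 0), (b, d, 0)]
X = [pair_features(p, q) for p, q, _ in pairs]
y = [label for _, _, label in pairs]
tree = build_tree(X, y)

if predict(tree, pair_features(a, b)) == 1:
    fused = fuse(a, b)
```

In this toy data the tree needs only a single split on title similarity. With real multi-source metadata, the available items vary by document type and source, which is the situation the abstract describes; a separate tree per document type would be one way to handle that, though the abstract does not say how the authors do it.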
