[1] SWARTZ N. Gartner warns firms of 'dirty data'[J]. Information management journal, 2007, 41(3): 6-6.
[2] WANG S, PENG Y W, LAN H, et al. Development and prospects of data integration methods[J]. Journal of Software, 2020, 31(3): 893-908. (in Chinese)
[3] ZONG W, LIN S T, HAO H, et al. Identifying duplicate material records in ERP systems based on distinct feature vectors[J]. China Journal of Information Systems, 2021(24): 92-101. (in Chinese)
[4] MARTÍ M, RÓDENAS C. Measurement errors in geographical labor mobility using data linkage: the Spanish case[J]. International journal of social research methodology, 2021, 24(1): 53-64.
[5] TIBBLE H, DI LAW H, et al. The importance of including aliases in data linkage with vulnerable populations[J]. BMC medical research methodology, 2018, 18(76): 1-5.
[6] DING D Y, ZHOU L L. A Bayesian hierarchical record linkage model for integrating administrative records and its application[J]. Statistics & Information Forum, 2016, 31(7): 30-35. (in Chinese)
[7] YANG Z, REN J. Normalization of institution names in Chinese bibliographic data[J]. Library and Information Service, 2020, 64(4): 95-102. (in Chinese)
[8] LIN K R, WANG H, GONG L J, et al. Disambiguation of same-name scholars in Chinese papers by fusing multiple features[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 90-102. (in Chinese)
[9] CHRISTOPHIDES V, EFTHYMIOU V, PALPANAS T, et al. End-to-end entity resolution for big data: a survey[J]. ACM computing surveys, 2021, 53(6): 1-37.
[10] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[11] HERNÁNDEZ M A, STOLFO S J. Real-world data is dirty: data cleansing and the merge/purge problem[J]. Data mining and knowledge discovery, 1998, 2(1): 9-37.
[12] XIAO C, WANG W, LIN X, et al. Efficient similarity joins for near-duplicate detection[J]. ACM transactions on database systems, 2011, 36(3): 1-41.
[13] JIANG H, HAN A Q, WANG M J, et al. A string similarity algorithm based on improved edit distance[J]. Computer Engineering, 2014, 40(1): 222-227. (in Chinese)
[14] VATSALAN D, CHRISTEN P, VERYKIOS V S. A taxonomy of privacy-preserving record linkage techniques[J]. Information systems, 2013, 38(6): 946-969.
[15] WANG R Q, YANG X M, LOU J G. Research on measuring lexical semantic relatedness[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(4): 389-404. (in Chinese)
[16] NIU F G, ZHANG Y Y. Constructing a semantic kernel based on the co-occurrence latent semantic vector space model[J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(8): 834-842. (in Chinese)
[17] BENBERNOU S, HUANG X, OUZIRI M. Semantic-based and entity-resolution fusion to enhance quality of big RDF data[J]. IEEE transactions on big data, 2020, 6(2): 382-395.
[18] ZHAO H, RAM S. Entity matching across heterogeneous data sources: an approach based on constrained cascade generalization[J]. Data & knowledge engineering, 2008, 66(3): 368-381.
[19] YANG M, NIE T Z, SHEN D R, et al. An entity identification method based on random forest[J]. Journal of Integration Technology, 2018, 7(2): 57-68. (in Chinese)
[20] CUXAC P, LAMIREL J C, BONVALLOT V. Efficient supervised and semi-supervised approaches for affiliations disambiguation[J]. Scientometrics, 2013, 97(1): 47-58.
[21] HUANG C, ZHU J, HUANG X, et al. A novel approach for entity resolution in scientific documents using context graphs[J]. Information sciences, 2018(432): 431-441.
[22] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[C]//Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing. Hong Kong: Association for Computational Linguistics, 2019.
[23] HADSELL R, CHOPRA S, LECUN Y. Dimensionality reduction by learning an invariant mapping[C]//2006 IEEE computer society conference on computer vision and pattern recognition. New York: IEEE, 2006.
[24] ZONG W, WU F, CHU L K, et al. Identification of approximately duplicate material records in ERP systems[J]. Enterprise information systems, 2017, 11(3): 434-451.

Author contributions: Zong Wei: research topic selection and research design, data processing, paper revision and review; Lin Songtao: data analysis and processing, paper writing and revision; Liu Jichang: experimental scheme design and experiment implementation.