[Purpose/significance] In current data integration scenarios, entity recognition performs poorly when the semantic information in text is not fully extracted, and traditional blocking methods cannot deliver efficient recognition. To address both problems, this paper proposes an efficient entity recognition method that takes semantic information into account, with the aim of improving both the effectiveness and the efficiency of entity recognition. [Method/process] Taking two datasets to be integrated, A and B, as an example: first, all records in A and B are preprocessed (word segmentation, stop-word removal, and similar operations), and an inverted index of dataset A is built over every word in A. Second, for each word in a record of dataset B, its importance in dataset A is calculated, and a keyword is selected according to importance to represent that record. Finally, the keyword is compared with the index words, and the Sentence-BERT model is used to calculate, in turn, the similarity between the record that the keyword represents and all records listed under the matching index word. Record pairs whose similarity exceeds the threshold are judged to describe the same entity, and the process repeats until all records in dataset B have been compared. [Result/conclusion] Experimental results show that the proposed method outperforms traditional entity recognition methods on precision, recall, stability, and response time, and offers methodological guidance for solving the entity recognition problem in data integration.
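The pipeline summarized in the abstract (inverted index over A, keyword selection for each record in B, thresholded similarity against the indexed candidates) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the importance measure, so inverse document frequency in A is assumed here; one keyword per record is assumed; and `cosine_bow` is a bag-of-words stand-in for the Sentence-BERT similarity so that the sketch runs without a model download. All function names, the stop-word list, and the threshold value are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Placeholder stop-word list; real preprocessing would use a proper segmenter.
STOP_WORDS = frozenset({"the", "a", "of", "and"})


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(records):
    """Map every word in dataset A to the ids of the records containing it."""
    index = defaultdict(set)
    for rid, text in enumerate(records):
        for word in tokenize(text):
            index[word].add(rid)
    return index


def importance(word, index, n_records):
    """ASSUMPTION: importance is the IDF of the word in A (not from the paper)."""
    df = len(index.get(word, ()))
    return math.log(n_records / df) if df else 0.0


def cosine_bow(text_a, text_b):
    """Stand-in similarity: cosine over word counts, NOT Sentence-BERT."""
    ca, cb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def match(records_a, records_b, similarity=cosine_bow, threshold=0.6):
    """Return (id_in_A, id_in_B) pairs judged to describe the same entity."""
    index = build_inverted_index(records_a)
    matches = []
    for bid, text in enumerate(records_b):
        # Only words that also occur in A can point to candidates in the index.
        candidates = sorted(w for w in set(tokenize(text)) if w in index)
        if not candidates:
            continue
        # Pick the single most important word as the record's keyword
        # (ties are broken alphabetically because candidates are sorted).
        keyword = max(candidates, key=lambda w: importance(w, index, len(records_a)))
        # Compare the B record only with A records listed under that keyword.
        for aid in sorted(index[keyword]):
            if similarity(records_a[aid], text) >= threshold:
                matches.append((aid, bid))
    return matches


records_a = ["stainless steel hex bolt m8", "copper wire 2mm", "steel hex nut m8"]
records_b = ["hex bolt m8 stainless steel", "rubber gasket 10mm"]
print(match(records_a, records_b))  # [(0, 0)]
```

Because `similarity` is a parameter, an embedding-based function built on a Sentence-BERT model could be passed in place of `cosine_bow` without changing the rest of the sketch. The inverted index plays the role of blocking: each record in B is compared only with the records in A that share its keyword, not with every record in A.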
Author contributions: Zong Wei: research topic selection and research design, data processing, paper revision and review; Lin Songtao: data analysis and processing, paper writing and revision; Liu Jichang: experimental design and implementation.