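Research on Chinese Text Similarity Calculation Based on Sequence Alignment Algorithm

Zhao Dengpeng; Xiong Huixiang; Tian Fengshou; Li Xinran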

doi:10.13266/j.issn.0252-3116.2021.11.011

Library and Information Service (图书情报工作), 2021, Vol. 65, Issue 11: 101-112


Information Research


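1. School of Information Management, Central China Normal University, Wuhan 430079;
2. Technology R&D Department, GaoXunZhenYuan Education Technology Co., Ltd., Jinan 250000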

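Zhao Dengpeng (ORCID: 0000-0002-7699-5222), master's student, E-mail: 1251508909@qq.com; Xiong Huixiang (ORCID: 0000-0001-9956-3396), professor, doctoral supervisor; Tian Fengshou (ORCID: 0000-0001-8789-4032), master's student; Li Xinran (ORCID: 0000-0002-3134-9876), master's student.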

Received: 2020-12-09

Revised: 2021-02-24

Published online: 2021-06-10

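Funding

This work is one of the research outcomes of the National Social Science Fund of China project "Research on Online Academic Resource Mining and Recommendation Integrating Knowledge Graphs and Deep Learning" (Grant No. 19BTQ005).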



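Abstract

[Purpose/Significance] This study addresses the application of sequence alignment algorithms to text similarity. It improves the global alignment algorithm to raise its accuracy, and it applies a local alignment algorithm to handle the comparison of two texts that differ substantially in content or in length. [Method/Process] First, the CRF model in HanLP is used to normalize a dataset of Chinese texts drawn from online academic resources, producing a set of Chinese sequences. Next, a Word2Vec model is trained on the latest Chinese Wikipedia corpus to build a word-pair scoring matrix. Finally, based on the scoring matrix and improved scoring rules, two Chinese sequences are aligned globally or locally to obtain the optimal alignment; backtracking from that optimum recovers the alignment path, from which the similarity of the two sequences is computed. [Results/Conclusions] Experiments show that, compared with existing work on global alignment, the word-pair scoring matrix built from part-of-speech tagging results and Word2Vec further improves the accuracy of the global alignment algorithm for text similarity. They also show that the local alignment algorithm, applied to text similarity calculation, effectively handles the comparison of two texts that differ substantially in content or in length.

Keywords: CRF model; part-of-speech tagging; Word2Vec; sequence alignment; local alignment; text similarity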
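The align-then-backtrack step described in the abstract can be illustrated with a minimal sketch of textbook global (Needleman-Wunsch) and local (Smith-Waterman) alignment over word sequences. This is not the authors' implementation: the function names `align`, `similarity`, and `toy_score`, the linear gap penalty of -1, the hand-written `TOY_SIM` table standing in for a Word2Vec-derived scoring matrix, and the length-based normalization are all assumptions made for illustration, and the paper's improved scoring rules and part-of-speech handling are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
GAP = "-"  # placeholder symbol for a gap in the alignment path


def align(a: Sequence[str], b: Sequence[str],
          score: Callable[[str, str], float],
          gap: float = -1.0, local: bool = False) -> Tuple[float, List[Pair]]:
    """Return (optimal score, aligned word pairs).

    local=False: Needleman-Wunsch global alignment.
    local=True:  Smith-Waterman local alignment.
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    if not local:  # global alignment pays for leading gaps
        for i in range(1, n + 1):
            dp[i][0] = dp[i - 1][0] + gap
        for j in range(1, m + 1):
            dp[0][j] = dp[0][j - 1] + gap

    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                       dp[i - 1][j] + gap,
                       dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0.0)  # a local alignment may restart anywhere
                if cell > best:
                    best, best_pos = cell, (i, j)
            dp[i][j] = cell

    # Backtrack from the optimum to recover the alignment path.
    i, j = best_pos if local else (n, m)
    total = dp[i][j]
    path: List[Pair] = []
    while i > 0 or j > 0:
        if local and dp[i][j] == 0.0:
            break  # start of the local alignment
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            path.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            path.append((a[i - 1], GAP))
            i -= 1
        else:
            path.append((GAP, b[j - 1]))
            j -= 1
    path.reverse()
    return total, path


def similarity(a: Sequence[str], b: Sequence[str],
               score: Callable[[str, str], float],
               gap: float = -1.0, local: bool = False) -> float:
    """Assumed normalization: score divided by a length, valid when score() <= 1."""
    total, _ = align(a, b, score, gap, local)
    norm = min(len(a), len(b)) if local else max(len(a), len(b))
    return max(total, 0.0) / norm if norm else 0.0


# Hypothetical word-pair scores; a real matrix would come from Word2Vec similarities.
TOY_SIM = {frozenset(("文本", "文档")): 0.8}  # "text" vs "document"


def toy_score(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), -1.0)


if __name__ == "__main__":
    s1 = ["计算", "文本", "相似度"]
    s2 = ["计算", "文档", "相似度"]
    print(align(s1, s2, toy_score))
    print(similarity(s1, s2, toy_score))
```

Passing `local=True` restricts the result to the best-matching sub-span, which is the property the abstract relies on when the two texts differ greatly in length or content.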

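Cite this article

Zhao Dengpeng, Xiong Huixiang, Tian Fengshou, Li Xinran. Research on Chinese Text Similarity Calculation Based on Sequence Alignment Algorithm[J]. Library and Information Service, 2021, 65(11): 101-112. DOI: 10.13266/j.issn.0252-3116.2021.11.011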


