[Purpose/significance] This study applies sequence alignment algorithms to text similarity computation. It improves the global alignment algorithm to raise its accuracy, and it uses the local alignment algorithm to compare two texts that differ in content or in length. [Method/process] First, the CRF model in HanLP was used to normalize a Chinese text data set drawn from online academic resources, producing a set of Chinese sequences. Then, a Word2Vec model was trained on the latest Chinese Wikipedia corpus to construct a word-pair scoring matrix. Finally, based on the scoring matrix and the improved scoring rules, two Chinese sequences were aligned globally or locally to obtain the optimal alignment score. Backtracking from the optimal score yielded the alignment path, from which the similarity of the two Chinese sequences was calculated. [Result/conclusion] The experimental results show that, compared with existing research on global alignment algorithms, building the word-pair scoring matrix from part-of-speech tagging results and Word2Vec further improves the accuracy of the global alignment algorithm, and that applying the local alignment algorithm to text similarity computation effectively handles comparisons between two texts that differ in content or in length.
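The alignment step described above can be illustrated with a minimal sketch of the classical Needleman-Wunsch (global) and Smith-Waterman (local) recurrences over word sequences. It is not the improved scoring scheme of this study: the function names `align` and `toy_score`, the linear gap penalty, and the exact-match scoring are all hypothetical placeholders. In practice `toy_score` would be replaced by a lookup in the Word2Vec-based word-pair scoring matrix, and the final conversion of the score into a similarity value is left out because the abstract does not specify it.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for a Word2Vec-based word-pair score."""
    return 1.0 if a == b else -1.0


def align(
    x: Sequence[str],
    y: Sequence[str],
    score: Callable[[str, str], float] = toy_score,
    gap: float = -1.0,
    local: bool = False,
) -> Tuple[float, List[Pair]]:
    """Return (optimal score, alignment path) for two word sequences.

    local=False: global alignment (Needleman-Wunsch).
    local=True:  local alignment (Smith-Waterman).
    A gap in the path is represented by None.
    """
    n, m = len(x), len(y)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment charges for leading gaps on the borders.
        for i in range(1, n + 1):
            H[i][0] = H[i - 1][0] + gap
        for j in range(1, m + 1):
            H[0][j] = H[0][j - 1] + gap

    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = max(
                H[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # match/mismatch
                H[i - 1][j] + gap,                            # gap in y
                H[i][j - 1] + gap,                            # gap in x
            )
            if local:
                cell = max(cell, 0.0)  # local alignment may restart at zero
            H[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Backtrack from the optimal cell to recover the alignment path.
    i, j = best_pos if local else (n, m)
    final = best if local else H[n][m]
    path: List[Pair] = []
    while i > 0 or j > 0:
        if local and H[i][j] == 0.0:
            break  # a local alignment ends where the score drops to zero
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            path.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or H[i][j] == H[i - 1][j] + gap):
            path.append((x[i - 1], None))
            i -= 1
        else:
            path.append((None, y[j - 1]))
            j -= 1
    path.reverse()
    return final, path
```

Calling `align(["a", "b", "c"], ["a", "c"])` aligns the full sequences and inserts one gap opposite `"b"`, whereas `align(list("abcd"), list("xbcy"), local=True)` returns only the shared segment `b c`. This is why the local variant suits texts that differ in length or overlap only in part.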