Chinese Sequence Alignment Study of Fusion Word Vectors

  • Xiong Huixiang,
  • Zhao Dengpeng,
  • Lu Chenfan
  • 1 School of Information Management, Central China Normal University, Wuhan 430079;
    2 School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433

Received date: 2019-08-22

Revised date: 2020-02-05

Online published: 2020-05-20

Abstract

[Purpose/significance] Sequence alignment, a well-known family of algorithms from bioinformatics, has been applied to text similarity. This paper improves on earlier methods of that kind to raise the accuracy of text similarity calculation. [Method/process] First, the target text is normalized to form a Chinese sequence set. Next, a trained Skip-Gram model from Word2vec is used to construct the scoring matrix for the Chinese sequence set and to formulate the scoring rules. Finally, the Chinese sequences are aligned pairwise to obtain the optimal score, the alignment path of the optimal solution is recovered by backtracking, and the similarity of the Chinese sequences is calculated. [Result/conclusion] The empirical results show that, compared with traditional methods, the proposed method, which incorporates word vectors, improves the accuracy of text similarity calculation and effectively solves the problem of repeated word pairs in traditional methods.
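The pipeline described in the abstract (word-vector scoring, pairwise alignment, backtracking, similarity) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Needleman-Wunsch style global alignment, a hand-written table of two-dimensional vectors (`TOY_VECTORS`) standing in for a trained Skip-Gram model, cosine similarity as the substitution score, a constant gap penalty of -0.5, and normalization by the length of the longer sequence. All of these names and values are hypothetical choices made for illustration.

```python
import math

# Hypothetical toy vectors; a real system would look these up in a trained
# Skip-Gram model instead.
TOY_VECTORS = {
    "图书馆": (1.0, 0.0),
    "书店": (0.8, 0.6),
    "管理": (0.0, 1.0),
    "系统": (0.6, 0.8),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_score(a, b, vectors):
    """Substitution score: 1.0 for identical words, cosine similarity when
    both words have vectors, 0.0 otherwise (an assumed scoring rule)."""
    if a == b:
        return 1.0
    if a in vectors and b in vectors:
        return cosine(vectors[a], vectors[b])
    return 0.0


def align(seq_a, seq_b, vectors, gap=-0.5):
    """Global alignment by dynamic programming; returns (score, path)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1], vectors),
                score[i - 1][j] + gap,   # word of seq_a aligned to a gap
                score[i][j - 1] + gap,   # word of seq_b aligned to a gap
            )

    # Backtrack from the bottom-right cell to recover one optimal path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and math.isclose(
            score[i][j],
            score[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1], vectors),
            abs_tol=1e-9,
        ):
            path.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif j == 0 or (
            i > 0 and math.isclose(score[i][j], score[i - 1][j] + gap, abs_tol=1e-9)
        ):
            path.append((seq_a[i - 1], None))
            i -= 1
        else:
            path.append((None, seq_b[j - 1]))
            j -= 1
    path.reverse()
    return score[n][m], path


def similarity(seq_a, seq_b, vectors, gap=-0.5):
    """Alignment score divided by the longer length, floored at 0
    (an assumed normalization, not taken from the paper)."""
    if not seq_a and not seq_b:
        return 1.0
    best, _ = align(seq_a, seq_b, vectors, gap)
    return max(0.0, best / max(len(seq_a), len(seq_b)))
```

Calling `similarity(["图书馆", "管理"], ["书店", "管理"], TOY_VECTORS)` aligns the two non-identical words with each other rather than inserting gaps, because their vector similarity outweighs the gap penalty. This is the kind of behavior that exact-match scoring cannot provide.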

Cite this article

Xiong Huixiang, Zhao Dengpeng, Lu Chenfan. Chinese Sequence Alignment Study of Fusion Word Vectors[J]. Library and Information Service, 2020, 64(10): 86-98. DOI: 10.13266/j.issn.0252-3116.2020.10.010
