图书情报工作 (Library and Information Service), 2021, Vol. 65, Issue (11): 101-112. DOI: 10.13266/j.issn.0252-3116.2021.11.011

• Information Research •

基于序列比对算法的中文文本相似度计算研究

Research on Chinese Text Similarity Calculation Based on Sequence Alignment Algorithm

Zhao Dengpeng1, Xiong Huixiang1, Tian Fengshou2, Li Xinran1

  1. School of Information Management, Central China Normal University, Wuhan 430079;
  2. Shandong Technology Center, GaoXunZhenYuan Education Technology Limited Company, Jinan 250000
  • Received: 2020-12-09 Revised: 2021-02-24 Online: 2021-06-05 Published: 2021-06-10
  • About the authors: Zhao Dengpeng (赵登鹏, ORCID: 0000-0002-7699-5222), master's student, E-mail: 1251508909@qq.com; Xiong Huixiang (熊回香, ORCID: 0000-0001-9956-3396), professor and doctoral supervisor; Tian Fengshou (田丰收, ORCID: 0000-0001-8789-4032), master's student; Li Xinran (李昕然, ORCID: 0000-0002-3134-9876), master's student.
  • Funding:
    This paper is one of the research outcomes of the National Social Science Fund of China project "Research on Mining and Recommendation of Online Academic Resources Integrating Knowledge Graph and Deep Learning" (Project No. 19BTQ005).

Abstract: [Purpose/significance] This paper studies the application of sequence alignment algorithms to text similarity calculation. It improves the global alignment algorithm to raise its accuracy, and applies a local alignment algorithm to the comparison of two texts that differ greatly in content or in length. [Method/process] First, the CRF model in HanLP is used to normalize a Chinese text dataset of online academic resources into a set of Chinese sequences. Then, a Word2Vec model is trained on the latest Chinese Wikipedia corpus to construct a word-pair scoring matrix. Finally, based on the scoring matrix and the improved scoring rules, two Chinese sequences are aligned globally or locally to obtain the optimal alignment. The optimal solution is then traced back to recover its alignment path, from which the similarity of the two Chinese sequences is calculated. [Result/conclusion] The experimental results show that, compared with existing research on global alignment algorithms, the word-pair scoring matrix built from part-of-speech tagging results and Word2Vec further improves the accuracy of the global alignment algorithm in computing text similarity. In addition, the local alignment algorithm applied to text similarity calculation can effectively handle the comparison of two texts that differ greatly in content or in length.

Key words: CRF model, part-of-speech tagging, Word2Vec, sequence alignment, local alignment, text similarity
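
The alignment step described in the abstract can be illustrated with a minimal sketch of textbook global (Needleman-Wunsch style) and local (Smith-Waterman style) alignment over segmented word sequences. This is not the authors' implementation: the abstract does not give the improved scoring rules or the similarity formula, so `toy_score`, the `RELATED` table, the linear gap penalty, and the normalization in `similarity` are all assumptions made for illustration. In the paper's setting, `toy_score` would be replaced by a lookup in the Word2Vec-based word-pair scoring matrix.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]
ScoreFn = Callable[[str, str], float]

# Hypothetical stand-in for a Word2Vec-based word-pair scoring matrix.
RELATED = {frozenset(("文本", "文档")): 0.5}


def toy_score(a: str, b: str) -> float:
    """Assumed scoring rule: 1.0 for identical words, a table value for
    related words, -1.0 otherwise."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), -1.0)


def align(x: Sequence[str], y: Sequence[str], score: ScoreFn,
          gap: float = -1.0, local: bool = False) -> Tuple[float, List[Pair]]:
    """Return (optimal score, alignment path) for two word sequences."""
    n, m = len(x), len(y)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are penalized cumulatively.
        for i in range(1, n + 1):
            H[i][0] = H[i - 1][0] + gap
        for j in range(1, m + 1):
            H[0][j] = H[0][j - 1] + gap
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h = max(H[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                    H[i - 1][j] + gap,
                    H[i][j - 1] + gap)
            if local:
                h = max(h, 0.0)  # local alignment may restart anywhere
                if h > best:
                    best, best_pos = h, (i, j)
            H[i][j] = h
    # Trace back from the optimal cell to recover the alignment path.
    i, j = best_pos if local else (n, m)
    final = H[i][j]
    path: List[Pair] = []
    while i > 0 or j > 0:
        if local and H[i][j] == 0.0:
            break  # a zero cell marks the start of the local alignment
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            path.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            path.append((x[i - 1], None))  # word of x aligned to a gap
            i -= 1
        else:
            path.append((None, y[j - 1]))  # word of y aligned to a gap
            j -= 1
    path.reverse()
    return final, path


def similarity(x: Sequence[str], y: Sequence[str], score: ScoreFn,
               local: bool = False) -> float:
    """Assumed normalization (valid when pair scores are at most 1.0):
    divide by the longer length for global alignment, by the shorter
    length for local alignment, and clip negative values to 0."""
    s, _ = align(x, y, score, local=local)
    denom = min(len(x), len(y)) if local else max(len(x), len(y))
    return max(0.0, s / denom) if denom else 0.0


if __name__ == "__main__":
    short = ["相似度", "计算"]
    long_ = ["本文", "研究", "相似度", "计算", "问题"]
    print(similarity(short, long_, toy_score))              # global
    print(similarity(short, long_, toy_score, local=True))  # local
```

With sequences of very different length, the global score in this sketch is dominated by gap penalties, while the local variant scores only the best-matching segment, which mirrors the motivation the abstract gives for using local alignment.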

CLC number: