Funding
This work is supported by the Major Project of the National Social Science Fund of China "Research on a Rapid-Response Intelligence System for Emergency Decision-Making" (No. 13&ZD174), the National Social Science Fund of China project "Research on Library Semantic Cloud Services Based on Linked Data" (No. 12CTQ009), the Jiangsu Social Science Youth Project "Research on Digital Reading Promotion Based on Semantic Cloud Services" (No. 14TQC003), the Fundamental Research Funds for the Central Universities project "Research on Knowledge Organization of Social Tags Based on User Tag Pragmatic Analysis" (No. 1435003), and the Jiangsu Higher-Education Institutions Natural Science Research General Program project "Research on Knowledge Organization of Social Tags Based on Semantic Disambiguation" (No. 15KJB520013).
An Approach for Scientific Paper Similarity Rapid Detection Based on Spark
Received date: 2015-05-10
Revised date: 2015-05-20
Online published: 2015-06-05
Keywords: paper similarity detection; Spark fast data processing; orthogonal-basis soft clustering; inverted index
Zhuo Keqiu, Tong Guoping, Yu Wei. An approach for scientific paper similarity rapid detection based on Spark[J]. Library and Information Service, 2015, 59(11): 134-142. DOI: 10.13266/j.issn.0252-3116.2015.11.019
[Purpose/significance] This paper detects texts similar to a paper under examination from a large-scale collection of known texts and computes their similarity, to meet the second-level response requirement of online paper similarity detection. [Method/process] Following a divide-and-conquer strategy, the known text sentence set is preprocessed by orthogonal-basis soft clustering, and an inverted index is built for each resulting cluster. Similarity detection is then performed on the fast data processing platform Spark, computing the similarity between the paper under examination and the known texts from character features combined with phrases. [Result/conclusion] In experiments on a collection of two million known texts, across four types of test papers, the proposed inverted index algorithm combined with soft clustering achieves a precision P of 100.0%, a recall R of 93.6%, and a harmonic mean F of 96.7%. Its F value is about 10% higher than that of the LCS similarity detection algorithm and about 23% higher than that of the Simhash algorithm. In terms of speed, detecting a paper of about 5 000 characters takes about 6.5 seconds, nearly 300 times faster than Simhash and about 4 000 times faster than LCS. In addition, the results show that the Spark-based distributed parallel similarity detection algorithm scales well.
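The pipeline the abstract describes — index the known sentences, retrieve candidates through the inverted index, then score only those candidates — can be sketched in plain Python. This is a minimal single-machine illustration under simplifying assumptions: it uses character bigrams as index terms (a stand-in for the paper's character-plus-phrase features) and Jaccard similarity as the score, and it omits the orthogonal-basis soft clustering and the Spark parallelization; all function names are hypothetical, not from the paper.

```python
from collections import defaultdict

def shingles(text, n=2):
    """Character n-grams used as index terms."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_inverted_index(sentences):
    """Map each n-gram to the ids of known sentences containing it."""
    index = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for gram in shingles(sent):
            index[gram].add(sid)
    return index

def detect(query, sentences, index, threshold=0.5):
    """Score only the candidates that share at least one n-gram with
    the query, and return (sentence id, similarity) pairs above the
    threshold, most similar first."""
    qgrams = shingles(query)
    candidates = set()
    for gram in qgrams:
        candidates |= index.get(gram, set())
    results = []
    for sid in candidates:
        sgrams = shingles(sentences[sid])
        sim = len(qgrams & sgrams) / len(qgrams | sgrams)
        if sim >= threshold:
            results.append((sid, sim))
    return sorted(results, key=lambda pair: -pair[1])
```

The point of the inverted index is visible in `detect`: sentences that share no index term with the query are never scored at all, which is what makes second-level response on millions of known sentences plausible once the work is distributed.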
[1] Apache Spark[EB/OL]. [2015-03-18]. http://spark.apache.org.
[2] Si A, Leong H V, Lau R W H. Check: A document plagiarism detection system[C]//Proceedings of the 1997 ACM Symposium on Applied Computing. New York: ACM, 1997: 70-77.
[3] Schleimer S, Wilkerson D S, Aiken A. Winnowing: Local algorithms for document fingerprinting[C]//Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2003: 76-85.
[4] Qin Xinguo. Research on document copy detection algorithms based on sentence similarity[J]. New Technology of Library and Information Service, 2007, 23(11): 63-66.
[5] Roul R K, Mittal S, Joshi P. Efficient approach for near duplicate document detection using textual and conceptual based techniques[M]//Advanced Computing, Networking and Informatics - Volume 1. Cham: Springer International Publishing, 2014: 195-203.
[6] Huang Chenghui, Yin Jian, Hou Fang. A text similarity measurement combining term semantic information and TF-IDF[J]. Chinese Journal of Computers, 2011, 34(5): 856-864.
[7] Bai Rujiang, Wang Xiaodi, Wang Xiaoyue. Research on document similarity detection based on digital fingerprints[J]. Library and Information Service, 2013, 57(15): 88-95.
[8] Li Gang, Mao Jin, Chen Jinghao. Fast deduplication of Chinese texts based on semantic fingerprints[J]. New Technology of Library and Information Service, 2013, 29(9): 41-47.
[9] Luo Xi, Najjar W, Hristidis V. Efficient near-duplicate document detection using FPGAs[C]//2013 IEEE International Conference on Big Data. Silicon Valley: IEEE, 2013: 54-61.
[10] Monostori K, Zaslavsky A, Schmidt H. Parallel and distributed document overlap detection on the Web[M]//Applied Parallel Computing: New Paradigms for HPC in Industry and Academia. London: Springer-Verlag, 2001: 206-214.
[11] Apache Hadoop[EB/OL]. [2015-03-18]. http://hadoop.apache.org.
[12] Apache Storm[EB/OL]. [2015-03-18]. http://storm.apache.org.
[13] Zhuo Keqiu, Yu Wei, Su Xinning. MapReduce-based parallel implementation of bursty event detection[J]. New Technology of Library and Information Service, 2015, 31(2): 46-54.
[14] Yu Wei, Chen Junpeng. Research on linking and matching of bibliographic data based on MapReduce[J]. New Technology of Library and Information Service, 2013, 29(9): 15-22.
[15] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Berkeley: USENIX Association, 2012: 2.
[16] Xin R S, Gonzalez J E, Franklin M J, et al. GraphX: A resilient distributed graph system on Spark[C]//First International Workshop on Graph Data Management Experiences and Systems. New York: ACM, 2013: 2.
[17] Bian Haoqiong, Chen Yueguo, Du Xiaoyong, et al. Equi-join optimization on Spark[J]. Journal of East China Normal University (Natural Science), 2014(5): 263-270.
[18] Lin Chieh-Yen, Tsai Cheng-Hao, Lee Ching-Pei, et al. Large-scale logistic regression and linear support vector machines using Spark[C]//2014 IEEE International Conference on Big Data. Washington, DC: IEEE, 2014: 519-528.
[19] Qiu Hongjian, Gu Rong, Yuan Chunfeng, et al. YAFIM: A parallel frequent itemset mining algorithm with Spark[C]//2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW). Phoenix: IEEE, 2014: 1664-1671.
[20] Sarazin T, Azzag H, Lebbah M. SOM clustering using Spark-MapReduce[C]//2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW). Phoenix: IEEE, 2014: 1727-1734.
[21] Wikipedia. K-means clustering[EB/OL]. [2015-03-18]. http://en.wikipedia.org/wiki/K-means_clustering#cite_note-macqueen1967-1.
[22] Wikipedia. Minimum spanning tree[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Minimum_spanning_tree.
[23] Wikipedia. Longest common subsequence problem[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
[24] Du Yanyan, He Yanxiang, Tian Ye, et al. Microblog bursty topic detection based on user relationship[C]//2011 6th IEEE Joint International Information Technology and Artificial Intelligence Conference (ITAIC). Chongqing: IEEE, 2011: 260-263.
[25] Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web crawling[C]//Proceedings of the 16th International Conference on World Wide Web. Banff: ACM, 2007: 141-150.