Library and Information Service >
An Approach for Scientific Paper Similarity Rapid Detection Based on Spark
Received date: 2015-05-10
Revised date: 2015-05-20
Online published: 2015-06-05
[Purpose/significance] This paper detects the texts similar with papers to be detected from the large scale known texts and computes their similarities, to meet the second response requirement of online paper similarity detection. [Method/process] It uses divide and conquer strategy to softly cluster known text sentence set, and establishes inverted index for each cluster after soft clustering. Then it performs the similarity computing between papers to be detected and known texts on the fast data processing platform-Spark, using the method of character combined with phrase. [Result/conclusion] Through the experiment of two million known texts set, the results show that the proposed inverted index algorithm combined with soft clustering has precision rate P 100.0%, recall rate R 93.6% and harmonic mean F value 96.7%, integrating four types of papers to be detected. The harmonic mean F is about 10% higher than LCS algorithm and 23% higher than Simhash algorithm. In the detection of the paper with 5 000 words, the proposed algorithm has the detection time of about 6.5 seconds, nearly 300 times faster than the Simhash algorithm, and approximately 4 000 times faster than LCS algorithm. In addition, the results also show that the Spark based distributed parallel similarity detection algorithm has better scalability.
Zhuo Keqiu , Tong Guoping , Yu Wei . An Approach for Scientific Paper Similarity Rapid Detection Based on Spark[J]. Library and Information Service, 2015 , 59(11) : 134 -142 . DOI: 10.13266/j.issn.0252-3116.2015.11.019
[1] Apache spark[EB/OL]. [2015-03-18]. http://spark.apache.org.
[2] Si A, Leong H V, Lau R W H. Check: A document plagiarism detection system[C]//Proceedings of the 1997 ACM Symposium on Applied Computing. New York:ACM, 1997: 70-77.
[3] Schleimer S, Wilkerson D S, Aiken A. Winnowing: Local algorithms for document fingerprinting[C]//Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2003: 76-85.
[4] 秦新国. 基于句子相似度的文档复制检测算法研究[J]. 现代图书情报技术, 2007, 23(11): 63-66.
[5] Roul R K, Mittal S, Joshi P. Efficient approach for near duplicate document detection using textual and conceptual based techniques[M]//Advanced Computing, Networking and Informatics-Volume 1. Springer International Publishing, 2014: 195-203.
[6] 黄承慧, 印鉴, 侯昉. 一种结合词项语义信息和TF-IDF方法的文本相似度量方法[J]. 计算机学报, 2011, 34(5): 856-864.
[7] 白如江, 王晓笛, 王效岳. 基于数字指纹的文献相似度检测研究[J]. 图书情报工作, 2013, 57(15): 88-95.
[8] 李纲, 毛进, 陈璟浩. 基于语义指纹的中文文本快速去重[J]. 现代图书情报技术, 2013, 29(9): 41-47.
[9] Luo Xi, Najjar W, Hristidis V. Efficient near-duplicate document detection using FPGAs[C]//Big Data, 2013 IEEE International Conference on.Silicon Valley: IEEE, 2013: 54-61.
[10] Monostori K, Zaslavsky A, Schmidt H. Parallel and distributed document overlap detection on the Web[M]//Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. London:Springer-Verlag London, 2001: 206-214.
[11] Apache Hadoop. Hadoop [EB/OL]. [2015-03-18]. http://hadoop.apache.org.
[12] ApacheStorm.Storm[EB/OL]. [2015-03-18]. http://storm.apache.org.
[13] 卓可秋, 虞为, 苏新宁. 突发事件检测的MapReduce并行化实现[J]. 现代图书情报技术, 2015, 31(2): 46-54.
[14] 虞为, 陈俊鹏. 基于mapreduce的书目数据关联匹配研究[J]. 现代图书情报技术, 2013,29(9):15-22.
[15] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Berkeley:USENIX Association, 2012: 2.
[16] Xin R S, Gonzalez J E, Franklin M J, et al. Graphx: A resilient distributed graph system on spark[C]//First International Workshop on Graph Data Management Experiences and Systems. New York:ACM, 2013: 2.
[17] 卞昊穹, 陈跃国, 杜小勇, 等. Spark上的等值连接优化[J]. 华东师范大学学报: 自然科学版, 2014 (5): 263-270.
[18] Lin Chieh-Yen, Tsai Cheng-Hao, Lee Ching-Pei, et al. Large-scale logistic regression and linear support vector machines using Spark[C]//2014 IEEE International Conference on Big Data. Washington, DC:IEEE, 2014: 519-528.
[19] Qiu Hongjian, Gu Rong, Yuan Chunfeng, et al. YAFIM: A parallel frequent itemset mining algorithm with Spark[C]//Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. Phoenix:IEEE, 2014: 1664-1671.
[20] Sarazin T, Azzag H, Lebbah M. SOM clustering using spark-mapReduce[C]//Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. Phoenix:IEEE, 2014: 1727-1734.
[21] Wikipedia.K-meansclustering[EB/OL]. [2015-03-18]. http://en.wikipedia.org/wiki/K-means_clustering#cite_note-macqueen1967-1.
[22] Wikipedia. Minimum spanning tree[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Minimum_spanning_tree.
[23] Wikipedia.Longestcommonsubsequenceproblem[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
[24] Du Yanyan, He Yanxiang, Tian Ye, et al. Microblog bursty topic detection based on user relationship[C]//Information Technology and Artificial Intelligence Conference(ITAIC),2011 6th IEEE Joint International. Chongqing:IEEE, 2011: 260-263.
[25] Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web crawling[C]//Proceedings of the 16th International Conference on World Wide Web. Banff:ACM, 2007: 141-150.
/
〈 |
|
〉 |