[1] Apache spark[EB/OL]. [2015-03-18]. http://spark.apache.org.
[2] Si A, Leong H V, Lau R W H. Check: A document plagiarism detection system[C]//Proceedings of the 1997 ACM Symposium on Applied Computing. New York:ACM, 1997: 70-77.
[3] Schleimer S, Wilkerson D S, Aiken A. Winnowing: Local algorithms for document fingerprinting[C]//Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2003: 76-85.
[4] 秦新国. 基于句子相似度的文档复制检测算法研究[J]. 现代图书情报技术, 2007, 23(11): 63-66.
[5] Roul R K, Mittal S, Joshi P. Efficient approach for near duplicate document detection using textual and conceptual based techniques[M]//Advanced Computing, Networking and Informatics-Volume 1. Springer International Publishing, 2014: 195-203.
[6] 黄承慧, 印鉴, 侯昉. 一种结合词项语义信息和TF-IDF方法的文本相似度量方法[J]. 计算机学报, 2011, 34(5): 856-864.
[7] 白如江, 王晓笛, 王效岳. 基于数字指纹的文献相似度检测研究[J]. 图书情报工作, 2013, 57(15): 88-95.
[8] 李纲, 毛进, 陈璟浩. 基于语义指纹的中文文本快速去重[J]. 现代图书情报技术, 2013, 29(9): 41-47.
[9] Luo Xi, Najjar W, Hristidis V. Efficient near-duplicate document detection using FPGAs[C]//Big Data, 2013 IEEE International Conference on.Silicon Valley: IEEE, 2013: 54-61.
[10] Monostori K, Zaslavsky A, Schmidt H. Parallel and distributed document overlap detection on the Web[M]//Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. London:Springer-Verlag London, 2001: 206-214.
[11] Apache Hadoop. Hadoop [EB/OL]. [2015-03-18]. http://hadoop.apache.org.
[12] ApacheStorm.Storm[EB/OL]. [2015-03-18]. http://storm.apache.org.
[13] 卓可秋, 虞为, 苏新宁. 突发事件检测的MapReduce并行化实现[J]. 现代图书情报技术, 2015, 31(2): 46-54.
[14] 虞为, 陈俊鹏. 基于mapreduce的书目数据关联匹配研究[J]. 现代图书情报技术, 2013,29(9):15-22.
[15] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Berkeley:USENIX Association, 2012: 2.
[16] Xin R S, Gonzalez J E, Franklin M J, et al. Graphx: A resilient distributed graph system on spark[C]//First International Workshop on Graph Data Management Experiences and Systems. New York:ACM, 2013: 2.
[17] 卞昊穹, 陈跃国, 杜小勇, 等. Spark上的等值连接优化[J]. 华东师范大学学报: 自然科学版, 2014 (5): 263-270.
[18] Lin Chieh-Yen, Tsai Cheng-Hao, Lee Ching-Pei, et al. Large-scale logistic regression and linear support vector machines using Spark[C]//2014 IEEE International Conference on Big Data. Washington, DC:IEEE, 2014: 519-528.
[19] Qiu Hongjian, Gu Rong, Yuan Chunfeng, et al. YAFIM: A parallel frequent itemset mining algorithm with Spark[C]//Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. Phoenix:IEEE, 2014: 1664-1671.
[20] Sarazin T, Azzag H, Lebbah M. SOM clustering using spark-mapReduce[C]//Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. Phoenix:IEEE, 2014: 1727-1734.
[21] Wikipedia.K-meansclustering[EB/OL]. [2015-03-18]. http://en.wikipedia.org/wiki/K-means_clustering#cite_note-macqueen1967-1.
[22] Wikipedia. Minimum spanning tree[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Minimum_spanning_tree.
[23] Wikipedia.Longestcommonsubsequenceproblem[EB/OL]. [2015-04-08]. http://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
[24] Du Yanyan, He Yanxiang, Tian Ye, et al. Microblog bursty topic detection based on user relationship[C]//Information Technology and Artificial Intelligence Conference(ITAIC),2011 6th IEEE Joint International. Chongqing:IEEE, 2011: 260-263.
[25] Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web crawling[C]//Proceedings of the 16th International Conference on World Wide Web. Banff:ACM, 2007: 141-150. |