Application Research of Computers (《计算机应用研究》)

Survey on Hadoop and Spark application scenarios

Authors: Feng Xingjie, Wang Wenchao
Affiliation: a. School of Computer Science & Technology; b. Information Network Center, Civil Aviation University of China, Tianjin 300300, China
Article ID: 1001-3695(2018)09-2561-06
DOI: 10.3969/j.issn.1001-3695.2018.09.001
Abstract: The rise of Spark has strongly challenged Hadoop and its ecosystem, currently the most popular solution to big data problems, and some have even argued that Spark is on track to replace Hadoop. However, Hadoop and Spark have different characteristics and therefore suit different application scenarios, so Spark cannot completely replace Hadoop. To address this question, this paper analyzes the application scenarios of Hadoop and Spark. It first introduces the technologies behind Hadoop and Spark and their respective ecosystems, then analyzes the characteristics of each in detail, and finally, based on those characteristics, describes the application scenarios to which each is suited.
Keywords: Hadoop; Spark; big data; ecosystem; application scenarios
Funding: Joint Fund of the National Natural Science Foundation of China and the Civil Aviation Administration of China (U1233113); National Natural Science Foundation of China Youth Fund (61301245, 61201414)
URL: http://www.arocmag.com/article/01-2018-09-001.html
Received: 2017-07-05
Revised: 2017-09-04
Pages: 2561-2566
CLC number: TP391
Document code: A