Application Research of Computers (《计算机应用研究》)

Survey of Scheduling Optimization for the Real-Time Stream Processing System Storm (实时流处理系统Storm的调度优化综述)

Survey of real-time processing system Storm scheduling optimization

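Authors Cai Yu (蔡宇), Zhao Guofeng (赵国锋), Guo Hang (郭航)
Affiliations 1. Institute of Electrical Information & Network Research, Chongqing University of Posts & Telecommunications, Chongqing 400065, China; 2. Chongqing Key Laboratory of Optical Communication & Network in Colleges & Universities, Chongqing 400065, China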
Article ID 1001-3695(2018)09-2567-07
DOI 10.3969/j.issn.1001-3695.2018.09.002
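Abstract With the development of big data technology, stream processing systems offer better real-time performance than traditional batch processing systems such as Hadoop. Among existing stream processing systems, Storm stands out for its stability, high scalability, and high fault tolerance. However, Storm gives little consideration to task scheduling: by default it uses a relatively simple round-robin scheduling method, which creates a performance bottleneck. In recent years, researchers have proposed various optimization schemes for the scheduling problem in Storm. Starting from scheduling optimization for the real-time stream processing system Storm, this paper divides these optimization methods into four categories, describes representative methods in each category in detail, and analyzes their advantages, disadvantages, and applicable scenarios. Finally, it discusses possible future directions for research on Storm scheduling optimization in a continually evolving environment.
Keywords streaming data processing; Apache Storm; performance optimization; scheduling
Funding National Natural Science Foundation of China (61501075)
Article URL http://www.arocmag.com/article/01-2018-09-002.html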
English title Survey of real-time processing system Storm scheduling optimization
Authors (English) Cai Yu, Zhao Guofeng, Guo Hang
Affiliations (English) 1. Institute of Electrical Information & Network Research, Chongqing University of Posts & Telecommunications, Chongqing 400065, China; 2. Chongqing Key Laboratory of Optical Communication & Network in Colleges & Universities, Chongqing 400065, China
English abstract With the development of big data technology, stream processing systems offer better real-time characteristics than Hadoop and other traditional batch processing systems. Among existing stream processing systems, Storm stands out for its good stability, high scalability, and high fault tolerance. However, Storm does not give much consideration to task scheduling; by default it uses a relatively simple round-robin scheduling method, which results in a performance bottleneck. In recent years, a variety of optimization schemes have been proposed for the scheduling problem of the Storm system. Based on the scheduling optimization of the real-time stream processing system Storm, this paper divides these methods into four categories, describes representative methods in each category in detail, and analyzes their advantages and disadvantages. Finally, it discusses possible future directions of Storm scheduling optimization in the new environment.
English keywords streaming data processing; Apache Storm; performance optimization; scheduling
Received 2017/6/19
Revised 2017/8/10
Pages 2567-2573
CLC number TP301.6
Document code A