《计算机应用研究》|Application Research of Computers

Resource efficiency optimization for big data mining algorithm with multi MapReduce collaboration scenario

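Authors Liao Bin, Zhang Tao, Yu Jiong, Huang Jinglai, Guo Binglei, Liu Yan
Affiliations 1. College of Statistics & Data Science, Xinjiang University of Finance & Economics, Urumqi 830012; 2. School of Information Science & Engineering, Xinjiang University, Urumqi 830008; 3. School of Software, Tsinghua University, Beijing 100084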
Article ID 1001-3695(2020)05-008-1321-05
DOI 10.19734/j.issn.1001-3695.2018.11.0795
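Abstract Every MapReduce job must independently perform a series of complex operations such as task scheduling and resource allocation. As a result, the multiple MapReduce jobs that cooperate to run a single algorithm incur a large amount of redundant disk I/O and repeated resource requests, which lowers resource utilization during computation. Big data mining algorithms are usually split into several cooperating MapReduce jobs. Taking the ItemBased algorithm as an example, this paper analyzes the resource efficiency problems of big data mining algorithms under multi-MapReduce-job collaboration and proposes a DistributedCache-based ItemBased algorithm. The algorithm uses DistributedCache to cache the I/O data exchanged between MapReduce jobs, which overcomes the drawback that jobs are independent of one another and reduces the waiting delay between map and reduce tasks. Experimental results show that DistributedCache improves the data read speed of MapReduce jobs, and that the algorithm restructured with DistributedCache greatly reduces the waiting delay between map and reduce tasks, improving resource efficiency more than threefold.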
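Keywords MapReduce optimization; ItemBased algorithm; in-memory file system; I/O efficiency; resource optimization
Funding Natural Science Foundation of Xinjiang Uygur Autonomous Region (2016D01B014)
Article URL http://www.arocmag.com/article/01-2020-05-008.html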
English title Resource efficiency optimization for big data mining algorithm with multi MapReduce collaboration scenario
Author names (English) Liao Bin, Zhang Tao, Yu Jiong, Huang Jinglai, Guo Binglei, Liu Yan
Affiliations (English) 1. College of Statistics & Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China; 2. School of Information Science & Engineering, Xinjiang University, Urumqi 830008, China; 3. School of Software, Tsinghua University, Beijing 100084, China
English abstract Because every MapReduce job independently requires a series of complex operations such as task scheduling and resource allocation, the multiple MapReduce jobs coordinated by a single algorithm perform a large amount of redundant disk I/O and repeated resource requests, which makes resource utilization inefficient during computation. Big data mining algorithms are usually divided into several MapReduce jobs. Taking the ItemBased algorithm as an example, this paper analyzes the resource efficiency of mining algorithms in the multi-MapReduce job collaboration scenario. It proposes an ItemBased algorithm based on DistributedCache, which uses DistributedCache to cache the I/O data exchanged between MapReduce jobs, overcomes the drawback of independence between jobs, and reduces the waiting delay between map and reduce tasks. The experimental results show that DistributedCache can improve the data reading speed of MapReduce jobs. The algorithm restructured with DistributedCache greatly reduces the waiting delay between map and reduce tasks and improves resource efficiency by more than three times.
English keywords MapReduce optimization; ItemBased algorithm; memory file system; I/O efficiency; resource optimization
Received 2018/11/8
Revised 2019/1/3
Pages 1321-1325
CLC number TP393.09
Document code A
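
The idea summarized in the abstract, sharing the output of one job with the tasks of the next through a cache instead of rereading it from disk, can be illustrated with a small simulation. The sketch below is plain Python, not Hadoop code and not the authors' implementation: the toy ratings, the `SideDataStore` class, and the two-job split (item co-occurrence counting, then scoring) are all assumptions chosen only to make the I/O difference visible.

```python
import json
import os
import tempfile
from collections import defaultdict
from itertools import permutations

# Toy (user, item, rating) triples; purely illustrative, not from the paper.
RATINGS = [
    ("u1", "A", 5.0), ("u1", "B", 3.0),
    ("u2", "A", 4.0), ("u2", "B", 2.0), ("u2", "C", 5.0),
    ("u3", "B", 4.0), ("u3", "C", 1.0),
]


def job1_cooccurrence(ratings):
    """Job 1: count how often each ordered item pair is rated by the same user."""
    items_by_user = defaultdict(set)
    for user, item, _ in ratings:
        items_by_user[user].add(item)
    cooc = defaultdict(int)
    for items in items_by_user.values():
        for a, b in permutations(sorted(items), 2):
            cooc[(a, b)] += 1
    return dict(cooc)


class SideDataStore:
    """Hypothetical stand-in for inter-job storage; counts file reads."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def write(self, cooc):
        with open(self.path, "w") as f:
            json.dump([[a, b, n] for (a, b), n in cooc.items()], f)

    def read(self):
        self.reads += 1
        with open(self.path) as f:
            return {(a, b): n for a, b, n in json.load(f)}


def job2_recommend(ratings, store, use_cache):
    """Job 2: score unrated items per user, one simulated map task per user."""
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating
    # With caching, the side data is loaded once and shared by every task.
    cached = store.read() if use_cache else None
    scores = {}
    for user, rated in sorted(by_user.items()):
        # Without caching, every task rereads the side data from disk.
        cooc = cached if use_cache else store.read()
        user_scores = defaultdict(float)
        for (a, b), n in cooc.items():
            if a in rated and b not in rated:
                user_scores[b] += n * rated[a]
        scores[user] = dict(user_scores)
    return scores


def run(use_cache):
    """Run both jobs and return (recommendation scores, number of file reads)."""
    with tempfile.TemporaryDirectory() as tmp:
        store = SideDataStore(os.path.join(tmp, "cooccurrence.json"))
        store.write(job1_cooccurrence(RATINGS))
        return job2_recommend(RATINGS, store, use_cache), store.reads


if __name__ == "__main__":
    for flag in (False, True):
        scores, reads = run(use_cache=flag)
        print(f"use_cache={flag}: reads={reads}, scores={scores}")
```

Both modes produce the same scores; only the number of reads of the intermediate file differs (one per simulated task versus one in total). In a real Hadoop job the comparable mechanism would be registering the file with `Job.addCacheFile` and reading it in each task's setup step, but how the paper wires this into the ItemBased job chain is not stated on this page.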