《计算机应用研究》|Application Research of Computers

IABS:一个基于Spark的Apriori改进算法

IABS: parallel improved Apriori algorithm based on Spark

免费全文下载 (已被下载 次)  
获取PDF全文
作者 闫梦洁,罗军,刘建英,侯传旺
机构 国防科学技术大学 计算机学院,长沙 410073
统计 摘要被查看 次,已被下载
文章编号 1001-3695(2017)08-2274-04
DOI 10.3969/j.issn.1001-3695.2017.08.007
摘要 Apriori算法是关联规则挖掘中最经典的算法之一,其核心问题是频繁项集的获取。针对经典Apriori算法存在的需多次遍历事务数据库及需产生候选项集等问题,首先通过转换存储结构、消除候选集产生过程等方法对Apriori算法进行优化;同时,随着大数据时代的到来,数据量与日俱增,传统算法面临巨大挑战,将优化的Apriori与Spark相结合,充分利用Spark的内存计算、弹性分布式数据集等优势,提出了IABS(improved Apriori algorithm based on Spark)。通过与已有的同类算法进行比较,IABS的数据可扩展性和节点可扩展性得以验证,并且在多种数据集上平均获得了23.88%的性能提升,尤其随着数据量的增长,性能提升更加明显。
关键词 Apriori算法;频繁项集;存储结构转换;Spark;内存计算
基金项目 国家“863”计划资助项目(2014AA01A302)
本文URL http://www.arocmag.com/article/01-2017-08-007.html
英文标题 IABS: parallel improved Apriori algorithm based on Spark
作者英文名 Yan Mengjie, Luo Jun, Liu Jianying, Hou Chuanwang
机构英文名 SchoolofComputer,NationalUniversityofDefenseTechnology,Changsha410073,China
英文摘要 Apriori algorithm is one of the most classical algorithm in association rule mining, the core problem is the generation process of frequent itemsets. Firstly, aimed at the existing problems of classical Apriori algorithm, such as it needed to scan the transaction global database for several times and needed to generate candidate itemsets, this paper optimized it by transforming storage structure and eliminating the process of candidate itemsets generation. Then, with the advent of the era of big data, data volume rises with the day, classical Apriori algorithm faces severe challenge. Based on the improved Apriori algorithm and combined with Spark platform, this paper proposed the IABS algorithm, which made full use of Spark, such as in-memory computation, resilient distributed datasets. Compared with already existing similar algorithms, the sizeup and node salability of IABS are validated, as well as, IABS achieves 23.88% performance improvement in average for various benchmarks. Especially, as the growth of data, its performance improvement is more obvious.
英文关键词 Apriori algorithm; frequent itemset; storage structure transformation; Spark; in-memory computation
参考文献 查看稿件参考文献
  [1] Buczak A L, Gifford C M. Fuzzy association rule mining for community crime pattern discovery[C] //Proc of ACM SIGKDD Workshop on Intelligence and Security Informatics. New York:ACM Press, 2010.
[2] Agrawal R, Imieliński T, Swami A. Mining association rule between sets of items in large databases[J] . ACM Sigmod Record, 1993, 22(2):207-216.
[3] Agrawal R, Srikant R. Fast algorithm for mining association rules[C] //Proc of the 20th International Conference on Very Large Data Bases. 1994:487-499.
[4] Tsay Y J, Chiang J Y. CBAR:an efficient method for mining association rules[J] . Knowledge-Based Systems, 2005, 18(2):99-105.
[5] Han Jiawei, Pei Jian, Yin Yiwen, et al. Mining frequent patterns without candidate generation:a frequent-pattern tree approach[J] . Data Mining and Knowledge Discovery, 2004, 8(1):53-87.
[6] Lin M Y, Lee P Y, Hsueh S C. Apriori-based frequent itemset mi-ning algorithms on MapReduce[C] //Proc of the 6th International Conference on Ubiquitous Information Management and Communication. New York:ACM Press, 2012:76.
[7] Qiu Hongjian, Gu Rong, Yuan Chunfeng, et al. YAFIM:a parallel frequent itemset mining algorithm with Spark[C] //Proc of Parallel and Distributed Processing Symposium. 2014:1664-1671.
[8] Rathee S, Kaul M, Kashyap A. R-Apriori:an efficient apriori based algorithm on Spark[C] //Proc of the 8th Workshop on Information and Knowledge Management. 2015:27-34.
[9] Frequent itemset mining dataset repository[EB/OL] . (2012-06-04). http://fimi. ua. ac. be/data/.
收稿日期 2016/5/19
修回日期 2016/7/4
页码 2274-2277
中图分类号 TP301.6
文献标志码 A