《计算机应用研究》|Application Research of Computers

Improved K-means algorithm based on MapReduce framework (基于MapReduce框架下K-means的改进算法)

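Authors 阴爱英, 吴运兵, 朱敏琛, 张莹
Affiliations 1. Department of Computer Engineering, Zhicheng College of Fuzhou University, Fuzhou 350002; 2. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116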
Article ID 1001-3695(2018)08-2295-04
DOI 10.3969/j.issn.1001-3695.2018.08.014
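Abstract To address the unstable clustering results and slow convergence of K-means on massive data, this paper proposes an improved K-means algorithm based on the MapReduce framework. First, to obtain the initial number of clusters for K-means, the dataset is clustered with agglomerative hierarchical clustering and the result is given a preliminary evaluation with the silhouette coefficient; the number of clusters obtained in this way is then used to initialize the cluster centers of K-means. Second, to suit cluster mining on massive data, the improved K-means algorithm is deployed on the MapReduce framework. Experimental results show that, on a single machine, the method achieves high precision and recall together with strong clustering stability; on a cluster, it also achieves good speedup and running speed.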
Keywords MapReduce framework; K-means algorithm; data mining; cluster analysis
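Funding Natural Science Foundation of Fujian Province (2017J01755)
Education and Research Project for Young and Middle-aged Teachers, Fujian Provincial Department of Education (JAT160658, JAT160077)
Fujian Provincial Science and Technology Plan Project (2016R0095)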
Article URL http://www.arocmag.com/article/01-2018-08-014.html
English title Improved K-means algorithm based on MapReduce framework
Authors (English) Yin Aiying, Wu Yunbing, Zhu Minchen, Zhang Ying
Affiliations (English) 1. Dept. of Computer Engineering, Zhicheng College of Fuzhou University, Fuzhou 350002, China; 2. College of Mathematics & Computer Science, Fuzhou University, Fuzhou 350116, China
Abstract (English) Focusing on the unstable results and slow convergence of the K-means clustering algorithm on huge amounts of data, this paper proposes an improved K-means algorithm based on the MapReduce framework. Firstly, to obtain the initial cluster number for K-means clustering, it clusters the dataset with an agglomerative hierarchical clustering method and evaluates the clustering result by the silhouette coefficient. The cluster number obtained for the dataset is then used to set the initial cluster centers of the K-means algorithm. Secondly, to adapt to clustering mining of massive data, it deploys the modified K-means algorithm on the MapReduce framework. The experimental results show that the proposed method has high precision and recall and strong clustering stability in single-machine performance, and also has good speedup ratio and running speed in cluster performance.
Keywords (English) MapReduce framework; K-means algorithm; data mining; clustering analysis
Received 2017/4/12
Revised 2017/5/22
Pages 2295-2298
CLC number TP391
Document code A
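
The two stages summarized in the abstract can be illustrated with a small, single-process sketch. It is not the authors' implementation: the abstract does not specify the linkage criterion, how the initial centers are derived from the hierarchical result, or the Hadoop job layout. The sketch assumes centroid linkage, assumes the centroids of the best-scoring hierarchical clustering serve as the initial centers, and simulates the map, shuffle, and reduce phases in memory. All function names (`agglomerate`, `silhouette`, `initial_centers`, `kmeans_map`, `kmeans_reduce`, `kmeans_mapreduce`) and the parameter `k_max` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def agglomerate(points, k):
    """Naive agglomerative clustering (centroid linkage, an assumption) down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the closest centroids (i < j always holds).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(mean(clusters[ij[0]]), mean(clusters[ij[1]])))
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


def silhouette(clusters):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def initial_centers(points, k_max):
    """Try k = 2..k_max, keep the clustering with the best silhouette, return its centroids."""
    best = max((agglomerate(points, k) for k in range(2, k_max + 1)), key=silhouette)
    return [mean(c) for c in best]


def kmeans_map(point, centers):
    """Map phase: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def kmeans_reduce(idx, assigned):
    """Reduce phase: the new center is the mean of the points assigned to it."""
    return idx, mean(assigned)


def kmeans_mapreduce(points, centers, max_iter=20):
    """Repeat map -> shuffle -> reduce in memory until the centers stop changing."""
    for _ in range(max_iter):
        groups = defaultdict(list)  # shuffle: group mapper output by key
        for p in points:
            idx, value = kmeans_map(p, centers)
            groups[idx].append(value)
        new_centers = list(centers)  # a center with no points keeps its position
        for idx, assigned in groups.items():
            new_centers[idx] = kmeans_reduce(idx, assigned)[1]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers = kmeans_mapreduce(points, initial_centers(points, k_max=4))
```

On this toy dataset the silhouette score selects two clusters, and the K-means loop starts from their centroids rather than from random points, which is the property the abstract credits for the improved stability. In a real Hadoop deployment the hierarchical step would have to run on a sample or be otherwise bounded, since the naive version here is far too slow for massive data.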