《计算机应用研究》|Application Research of Computers

基于MapReduce和改进密度峰值的划分聚类算法

Partition clustering algorithm based on MapReduce and improved density peak

免费全文下载 (已被下载 次)  
获取PDF全文
作者 黄学雨,向驰,陶涛
机构 江西理工大学 信息工程学院,江西 赣州 341000
统计 摘要被查看 次,已被下载
文章编号 1001-3695(2021)10-017-2988-06
DOI 10.19734/j.issn.1001-3695.2021.03.0093
摘要 对于基于划分的聚类算法随机选取初始聚类中心导致初始中心敏感,聚类结果不稳定、集群效率低等问题,提出一种基于MapReduce框架和改进的密度峰值的划分聚类算法(based on MapReduce framework and improved density peak partition clustering algorithm,MR-IDPACA)。首先,通过自然最近邻定义新的局部密度计算方式,将搜索样本密度峰值点作为划分聚类算法的初始聚类中心;其次针对算法在大规模数据下运行时间复杂,提出基于E2LSH(exact Euclidean locality sensitive hashing)的一种分区方法,即KLSH(<i>K</i> of locality sensitive hashing)。通过该方法对数据分区后结合MapReduce框架并行搜寻初始聚类中心,有效减少了算法在搜索初始聚类中心时的运行时间;对于MapReduce框架中的数据倾斜问题,提出ME(multistage equilibrium)策略对中间数据进行多段均衡分区,以提升算法运行效率;在MapReduce框架下并行聚类,得到最终聚类结果。实验得出MR-IDPACA 算法在单机环境下有着较高的准确率和较强的稳定性,集群性能上也有着较好的加速比和运行时间,聚类效果有所提升。
关键词 划分聚类算法; 密度峰值; 自然最近邻; MapReduce; 数据倾斜
基金项目 国家重点研发计划项目(2020YFB1713700)
本文URL http://www.arocmag.com/article/01-2021-10-017.html
英文标题 Partition clustering algorithm based on MapReduce and improved density peak
作者英文名 Huang Xueyu, Xiang Chi, Tao Tao
机构英文名 School of Information Engineering,Jiangxi University of Science & Technology,Ganzhou Jiangxi 341000,China
英文摘要 Aiming at clustering algorithm based on partition to randomly select the initial cluster center, which leads to the sensitivity of the initial center, unstable clustering result, low cluster efficiency, etc., this paper proposed a partition clustering algorithm based on MapReduce framework and improved density peak, named MR-IDPACA. Firstly, this paper defined a new local density calculation method by natural nearest neighbors, and then searched for the peak point of the sample density as the initial cluster center of the partitioning clustering algorithm. Secondly, in viewed of the complex running time of the algorithm under large-scale data, it proposed an algorithm based on E2LSH, named KLSH. In this method, the data was partitioned and combined with the MapReduce framework to search the initial cluster centers in parallel, which effectively reduced the running time of the algorithm when searching for the initial cluster centers. Next, for the data skew problem in the MapReduce framework, it proposed the ME strategy to divide the intermediate data into multi-segment equilibrium to improve the efficiency of the algorithm. Finally, parallel clustering under the MapReduce framework to obtain the final clustering result. The experiment shows that the MR-IDPACA algorithm has higher accuracy and stronger stability in a single-machine environment, and the cluster performance also has a better speedup ratio and running time, and the clustering effect has been improved.
英文关键词 partition clustering algorithm; density peak; nature's nearest neighbor; MapReduce; data skew
参考文献 查看稿件参考文献
 
收稿日期 2021/3/20
修回日期 2021/5/19
页码 2988-2993,3024
中图分类号 TP311
文献标志码 A