Application Research of Computers (《计算机应用研究》)

Big data mining based on around-centroid clustering algorithm

Authors Tian Hua (田华), He Yi (何翼)
Affiliation School of Data Science, Tongren University, Tongren, Guizhou 554300, China
Article ID 1001-3695(2020)12-013-3586-04
DOI 10.19734/j.issn.1001-3695.2019.09.0540
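Abstract To address the scalability of big data analysis on massively parallel distributed systems and software platforms, this paper proposes a big data mining technique based on CLUBS (clustering using binary splitting), a parameter-free divisive clustering algorithm built around centroids. The technique is fully unsupervised: it performs divisive clustering under a minimum quadratic distance criterion to separate data from noise, applies an intermediate refinement step to identify blocks that contain only outliers, and generates complete clusters from the remaining blocks. A parallel version of CLUBS is designed to cluster big data quickly and effectively. Experiments show that the parallel CLUBS algorithm is unaffected by data dimensionality and noise, and that it scales better and runs faster than existing algorithms.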
Keywords big data; divisive clustering; agglomerative clustering; data mining
Funding Major Research Project for Innovation Groups of the Guizhou Provincial Department of Education (黔教合KY字[2016]051)
URL http://www.arocmag.com/article/01-2020-12-013.html
English title Big data mining based on around-centroid clustering algorithm
Authors (English) Tian Hua, He Yi
Affiliation (English) School of Data Science, Tongren University, Tongren, Guizhou 554300, China
English abstract Aiming at the problem of scalability of big data analysis on massively parallel distributed systems and software platforms, this paper proposes a big data mining technique based on the parameter-free CLUBS (clustering using binary splitting) algorithm around centroids. The technique works in a completely unsupervised manner: it splits clusters based on a minimum quadratic distance criterion to separate data from noise, identifies blocks containing only outliers by intermediate refinement, and generates complete clusters for the remaining blocks. A parallelized version of CLUBS is designed to enable fast and efficient clustering of big data. Experiments show that the parallel CLUBS algorithm is not affected by data dimension and noise, and is better than existing algorithms in terms of scalability and execution time.
English keywords big data; split clustering; agglomerative clustering; data mining
Received 2019/9/22
Revised 2019/11/13
Pages 3586-3589
CLC number TP393
Document code A
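
The abstract describes a divisive step that splits blocks of data under a minimum quadratic distance criterion. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes axis-aligned cuts and measures each block by its sum of squared distances to the block centroid (SSQ). The function names and the `min_gain_ratio` stopping threshold are hypothetical stand-ins introduced for illustration. The real CLUBS algorithm is parameter-free, and its intermediate refinement, agglomerative, and parallel stages are omitted here.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty block of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ssq(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the block centroid."""
    if not points:
        return 0.0
    c = centroid(points)
    return sum((p[d] - c[d]) ** 2 for p in points for d in range(len(c)))


def best_split(points: List[Point]) -> Optional[Tuple[float, List[Point], List[Point]]]:
    """Try every axis-aligned cut between distinct coordinates and return
    (gain, left, right) for the cut that reduces the total SSQ the most."""
    base = ssq(points)
    best = None
    for d in range(len(points[0])):
        ordered = sorted(points, key=lambda p: p[d])
        for i in range(1, len(ordered)):
            if ordered[i - 1][d] == ordered[i][d]:
                continue  # equal coordinates cannot be separated on this axis
            left, right = ordered[:i], ordered[i:]
            gain = base - ssq(left) - ssq(right)
            if best is None or gain > best[0]:
                best = (gain, left, right)
    return best


def divisive_clusters(points: List[Point], min_gain_ratio: float = 0.9) -> List[List[Point]]:
    """Recursively split a block while the best cut removes at least
    min_gain_ratio of the block's SSQ (hypothetical stopping rule, not CLUBS's)."""
    block_ssq = ssq(points)
    split = best_split(points) if len(points) > 1 else None
    if split is None or block_ssq == 0 or split[0] < min_gain_ratio * block_ssq:
        return [points]
    _, left, right = split
    return (divisive_clusters(left, min_gain_ratio)
            + divisive_clusters(right, min_gain_ratio))


# Two well-separated groups of four points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = divisive_clusters(points)
print([len(c) for c in clusters])
```

On this toy input the first cut removes almost all of the SSQ, so the block splits once. Each resulting group would lose only half of its SSQ under any further cut, so the recursion stops there. A fixed threshold such as this is exactly what a parameter-free method is designed to avoid, so the sketch shows the splitting criterion rather than the stopping rule.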