《计算机应用研究》 | Application Research of Computers

Optimized K-medoids clustering algorithm by variance of Num-near neighbour

Authors XIE Juan-ying, GAO Rui
Affiliation School of Computer Science, Shaanxi Normal University, Xi'an 710062, China
Article ID 1001-3695(2015)01-0030-05
DOI 10.3969/j.issn.1001-3695.2015.01.007
Abstract To overcome the drawback that K-medoids is sensitive to its initial cluster centres and that its clustering results depend on them, this paper proposes a K-medoids clustering algorithm optimized by local variance, aiming to place the initial centres of K-medoids in different dense regions of the samples so that the clustering converges to the globally optimal solution as far as possible. The algorithm introduces the concept of local variance, defines the local variance of a sample from the local distribution of the samples around it, and takes the local standard deviation of a sample as its neighbourhood radius; the samples with the minimum local variance that lie in different regions are selected as the initial centres of K-medoids, making full use of the distribution information carried by the variance. Experiments were carried out on UCI datasets of various sizes and on synthetic datasets of different sizes with different proportions of noise, and the results were evaluated with six popular criteria for clustering performance. The results show that the proposed algorithm yields good clusterings, is robust to noise, and is suitable for clustering large-scale datasets. The proposed Num-nearest-neighbour variance optimized K-medoids clustering algorithm outperforms the fast K-medoids clustering algorithm and the neighbourhood-based improved K-medoids clustering algorithm.
Keywords local variance; Num-nearest neighbour; neighbourhood; initial cluster centres; clustering
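As a reading aid, the seed-selection strategy described in the abstract (a local variance computed from each sample's Num nearest neighbours, the local standard deviation used as a neighbourhood radius, and seeds drawn from different dense regions) can be sketched as follows. This is a minimal illustration only: the exact formula for the local variance, the greedy neighbourhood-exclusion rule, and the parameter `num` are assumptions made here, not the paper's published definitions.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix of shape (n, n)."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def local_variance(D, num):
    """Mean squared distance from each sample to its num nearest
    neighbours -- one plausible reading of the paper's 'local variance'."""
    nn = np.sort(D, axis=1)[:, 1:num + 1]   # skip the zero self-distance
    return (nn ** 2).mean(axis=1)

def pick_seeds(D, k, num):
    """Greedy seed choice: repeatedly take the still-available sample with
    the smallest local variance, then mark its neighbourhood (radius = its
    local standard deviation) unavailable, so that successive seeds come
    from different dense regions (assumed exclusion rule)."""
    var = local_variance(D, num)
    available = np.ones(D.shape[0], dtype=bool)
    seeds = []
    for i in np.argsort(var):               # candidates, densest first
        if available[i]:
            seeds.append(i)
            if len(seeds) == k:
                break
            available &= D[i] > np.sqrt(var[i])   # drop the seed's neighbourhood
    return np.asarray(seeds)

def kmedoids(X, k, num, iters=100):
    """Standard K-medoids iterations started from the variance-optimized seeds."""
    D = pairwise_distances(X)
    medoids = pick_seeds(D, k, num)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)     # assign to nearest medoid
        new = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            sub = D[np.ix_(members, members)]
            new.append(members[np.argmin(sub.sum(axis=1))])  # most central member
        new = np.array(new)
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels
```

With `num` set roughly to the size of a typical neighbourhood, the seeds tend to fall in distinct dense regions; the update loop is the usual assign-and-recentre K-medoids scheme, not the paper's exact procedure.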
Funding Science and Technology Research Program of Shaanxi Province (2013K12-03-24)
National Natural Science Foundation of China (31372250)
Fundamental Research Funds for the Central Universities (GK201102007)
URL http://www.arocmag.com/article/01-2015-01-007.html
References
[1] SUN Ji-gui, LIU Jie, ZHAO Lian-yu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1): 48-61. (in Chinese)
[2] HAN Jia-wei, KAMBER M, PEI Jian. Data mining: concepts and techniques[M]. San Francisco: Morgan Kaufmann Publishers, 2006.
[3] HUANG Zhe-xue. Clustering large data sets with mixed numeric and categorical values[C]// Proc of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining. 1997: 21-34.
[4] HUANG Zhe-xue. Extensions to the K-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3): 283-304.
[5] HUANG Zhe-xue, NG M K, RONG Hong-qiang, et al. Automated variable weighting in K-means type clustering[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2005, 27(5): 657-668.
[6] CHEN Xiao-jun, YE Yun-ming, XU Xiao-fei, et al. A feature group weighting method for subspace clustering of high-dimensional data[J]. Pattern Recognition, 2012, 45(1): 434-446.
[7] XIE Juan-ying, JIANG Shuai, WANG Chun-xia, et al. An improved global K-means clustering algorithm[J]. Journal of Shaanxi Normal University: Natural Science Edition, 2010, 38(2): 18-22. (in Chinese)
[8] XIE Juan-ying, ZHANG Yan, XIE Wei-xin, et al. A new density-weighted rough K-means clustering algorithm[J]. Journal of Shandong University: Natural Science, 2010, 45(7): 1-6. (in Chinese)
[9] XIE Juan-ying, MA Qing, XIE Wei-xin. A new algorithm for determining the optimal number of clusters[J]. Journal of Shaanxi Normal University: Natural Science Edition, 2012, 40(1): 13-18. (in Chinese)
[10] XIE Juan-ying, GUO Wen-juan, XIE Wei-xin, et al. Improved rival penalized competitive learning algorithm based on the spatial distribution density of samples[J]. Journal of Computer Applications, 2012, 32(3): 638-642. (in Chinese)
[11] KAUFMAN L, ROUSSEEUW P J. Finding groups in data: an introduction to cluster analysis[M]. Hoboken: Wiley, 1990.
[12] LUCASIUS C B, DANE A D, KATEMAN G. On K-medoid clustering of large data sets with the aid of a genetic algorithm: background, feasibility and comparison[J]. Analytica Chimica Acta, 1993, 282(3): 647-669.
[13] NG R T, HAN Jia-wei. Efficient and effective clustering methods for spatial data mining[C]// Proc of the 20th International Conference on Very Large Databases. 1994: 144-155.
[14] WEI C P, LEE Y H, HSU C M. Empirical comparison of fast partitioning-based clustering algorithms for large data sets[J]. Expert Systems with Applications, 2003, 24(4): 351-363.
[15] ZHANG Qiao-ping, COULOIGNER I. A new and efficient K-medoid algorithm for spatial clustering[C]// Computational Science and Its Applications. Berlin: Springer, 2005: 181-189.
[16] PARK H S, JUN C H. A simple and fast algorithm for K-medoids clustering[J]. Expert Systems with Applications, 2009, 36(2): 3336-3341.
[17] XIE Juan-ying, GUO Wen-juan, XIE Wei-xin. Neighbourhood-based K-medoids clustering algorithm[J]. Journal of Shaanxi Normal University: Natural Science Edition, 2012, 40(4). (in Chinese)
[18] SHENG Zhou, XIE Shi-qian. Probability theory and mathematical statistics[M]. 4th ed. Beijing: Higher Education Press, 2010. (in Chinese)
[19] YANG Yan, JIN Fan, MOHAMED K. Survey of clustering validity evaluation[J]. Application Research of Computers, 2008, 25(6): 1630-1632. (in Chinese)
[20] ZHANG Wei-jiao, LIU Chun-huang, LI Fang-yu. Evaluation methods for clustering quality[J]. Computer Engineering, 2005, 31(20): 10-12. (in Chinese)
[21] VINH N X, EPPS J, BAILEY J. Information theoretic measures for clustering comparison: is a correction for chance necessary?[C]// Proc of the 26th International Conference on Machine Learning. New York: ACM Press, 2009: 1073-1080.
[22] TAN Pang-ning, STEINBACH M, KUMAR V. Introduction to data mining[M]. Translated by FAN Ming, FAN Jian-hong. Beijing: Posts and Telecom Press, 2006.
[23] FRANK A, ASUNCION A. UCI machine learning repository[EB/OL]. (2010). http://archive.ics.uci.edu/ml.
Received 2013/12/1
Revised 2014/1/18
Pages 30-34
CLC number TP181; TP301.6
Document code A