《计算机应用研究》|Application Research of Computers

Data streams classification approach based on distance and sampling

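Authors: Hu Xuegang (胡学钢), He Junhong (何俊宏), Li Peipei (李培培)
Affiliation: School of Computer and Information, Hefei University of Technology (合肥工业大学 计算机与信息学院), Hefei 230009, China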
Article ID: 1001-3695(2018)04-0992-04
DOI: 10.3969/j.issn.1001-3695.2018.04.007
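Abstract: Data stream classification is widely used in practical domains such as sensor networks and network monitoring. In real data streams, however, imbalanced class distributions and large numbers of missing class labels make the classification problem considerably harder. To address these two problems, this paper proposes an ensemble classification method based on distance and a sampling mechanism. The method first computes the distance from each unlabeled instance to the center points of the labeled positive and negative data chunks in order to label the instance as positive or negative. It then reorganizes the data stream chunk by over-sampling positive instances and under-sampling negative instances to balance the chunk's class distribution, and builds an ensemble classification model on the rebalanced chunks. Experiments on simulated, incompletely labeled data streams with imbalanced class distributions show that, compared with classical algorithms of the same kind, the proposed method improves classification accuracy on incompletely labeled data streams while reducing the influence of the imbalanced class distribution.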
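The distance-based labeling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the distance measure or the tie-breaking rule, so Euclidean distance and "ties go to the positive class" are assumptions, and the names `centroid` and `label_by_nearest_center` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def label_by_nearest_center(
    unlabeled: List[Vector],
    positives: List[Vector],
    negatives: List[Vector],
) -> List[Tuple[Vector, int]]:
    """Assign each unlabeled instance the class of the closer class center.

    Assumptions (not stated in the abstract): Euclidean distance, and a tie
    is resolved in favor of the positive (minority) class. 1 = positive,
    0 = negative.
    """
    pos_center = centroid(positives)
    neg_center = centroid(negatives)
    labeled = []
    for x in unlabeled:
        d_pos = math.dist(x, pos_center)
        d_neg = math.dist(x, neg_center)
        labeled.append((x, 1 if d_pos <= d_neg else 0))
    return labeled
```

For example, with labeled positives around (0, 1) and labeled negatives around (11, 10), an unlabeled point at (1, 1) would be marked positive and one at (9, 9) negative.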
Keywords: classification; ensemble learning; imbalanced class distribution; missing class labels
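Funding: National Key Research and Development Program of China (2016YFC0801406)
Young Scientists Fund of the National Natural Science Foundation of China (61503112)
National Natural Science Foundation of China (61673152)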
Article URL: http://www.arocmag.com/article/01-2018-04-007.html
Authors (English): Hu Xuegang, He Junhong, Li Peipei
Affiliation (English): School of Computer & Information, Hefei University of Technology, Hefei 230009, China
Abstract (English): Data stream classification is widely used in sensor networks, network monitoring, and other real-world applications. However, class imbalance and missing labels in data streams make data stream classification considerably more difficult. This paper therefore proposes an ensemble classification method based on distance evaluation and sampling for classifying incompletely labeled data streams with imbalanced class distributions. The proposed method first calculates the distance between each unlabeled instance and the center points of the labeled data chunks to separate positive from negative instances. Then, to balance the class distribution of the current data chunk, it reconstructs the chunk by over-sampling positive instances and under-sampling negative instances, and uses the reconstructed chunk to build an ensemble classification model. Experiments on simulated incompletely labeled data streams with class imbalance show that, compared with classical algorithms of the same kind, the proposed method improves classification accuracy while reducing the influence of the imbalanced class distribution.
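The chunk rebalancing and ensemble steps can be sketched as follows. This is only an illustration under stated assumptions, not the published algorithm: the abstract does not give the sampling ratios or the combination rule, so the sketch assumes random over-sampling with replacement, random under-sampling without replacement, a per-class target of half the chunk size, and an unweighted majority vote. The names `rebalance_chunk` and `majority_vote` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (features, label) with 1 = positive (minority), 0 = negative (majority)
Instance = Tuple[Sequence[float], int]


def rebalance_chunk(chunk: List[Instance], rng: random.Random) -> List[Instance]:
    """Over-sample positives and under-sample negatives within one chunk.

    Assumption: each class is brought to half the chunk size; the paper's
    actual sampling rates are not stated in the abstract.
    """
    positives = [inst for inst in chunk if inst[1] == 1]
    negatives = [inst for inst in chunk if inst[1] == 0]
    if not positives or not negatives:
        return list(chunk)  # only one class present, nothing to balance
    target = max(len(chunk) // 2, 1)
    if len(positives) < target:
        # Random over-sampling: duplicate positives drawn with replacement.
        positives = positives + rng.choices(positives, k=target - len(positives))
    if len(negatives) > target:
        # Random under-sampling: keep a subset drawn without replacement.
        negatives = rng.sample(negatives, target)
    balanced = positives + negatives
    rng.shuffle(balanced)
    return balanced


def majority_vote(models: List[Callable[[Sequence[float]], int]],
                  x: Sequence[float]) -> int:
    """Unweighted vote over per-chunk classifiers; a tie counts as positive."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes >= len(models) else 0
```

In a streaming setting one would presumably train one base classifier per rebalanced chunk and keep a bounded set of recent models for voting; that bookkeeping is omitted here because the abstract does not describe it.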
Keywords (English): classification; ensemble learning; class imbalance; label missing
Received: 2016/12/5
Revised: 2017/1/18
Pages: 992-995, 1000
CLC number: TP391
Document code: A