Application Research of Computers (《计算机应用研究》)

KFDA-Boosting algorithm oriented to imbalanced data classification

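Authors Wang Lai (王来), Fan Chongjun (樊重俊), Yang Yunpeng (杨云鹏), Yuan Guanghui (袁光辉)
Affiliations 1. Business School, University of Shanghai for Science & Technology, Shanghai 200093, China; 2. a. School of Information Management & Engineering, b. Experimental Center, Shanghai University of Finance & Economics, Shanghai 200433, China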
Article ID 1001-3695(2019)03-033-0807-05
DOI 10.19734/j.issn.1001-3695.2017.10.0978
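Abstract Imbalanced data distributions and nonlinear data features make classification harder. In particular, the minority class in imbalanced data is difficult to identify, which degrades overall classification performance. To address this problem, this paper proposes the KFDA-Boosting algorithm, which combines the ability of KFDA (kernel Fisher discriminant analysis) to extract nonlinear sample features effectively with the idea of the Boosting algorithm from ensemble learning. To verify the effectiveness and superiority of the algorithm for imbalanced data classification, the G-mean value and the precision and recall of the minority class are used as evaluation metrics, 10 datasets from UCI are selected to test the performance of KFDA-Boosting, and comparative experiments are run against six classification algorithms, including the support vector machine. The results show that, for imbalanced data classification, and especially for data that are highly imbalanced or have nonlinear features, KFDA-Boosting identifies the minority class effectively compared with the other classification algorithms, and overall achieves significant classification performance and good stability.
Keywords kernel Fisher discriminant analysis; ensemble learning; imbalanced data; classification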
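Funding National Natural Science Foundation of China (71303157)
Key Scientific Research Innovation Project of the Shanghai Municipal Education Commission (14ZZ131)
Shanghai First-Class Discipline Funding Project (S1205YLXK)
Shanghai Social Science Planning Youth Project (2014EGL007)
Hujiang Foundation Project (D14008)
Article URL http://www.arocmag.com/article/01-2019-03-033.html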
Received 2017/10/24
Revised 2017/12/25
Pages 807-811
CLC number TP301.6
Document code A
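
The abstract names three evaluation metrics: the G-mean value and the precision and recall of the minority class. The sketch below shows how these are commonly computed from a binary confusion matrix. It assumes the usual definition of G-mean as the geometric mean of minority-class recall and majority-class recall (specificity); the record above does not spell out the formula, so that definition, the function name `minority_metrics`, and the toy labels are illustrative assumptions, not the authors' code.

```python
import math


def minority_metrics(y_true, y_pred, minority=1):
    """Return (precision, recall, g_mean), treating `minority` as the positive class.

    Assumes G-mean = sqrt(recall * specificity), the common definition.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)

    # Guard against empty denominators (e.g. no sample predicted as minority).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, math.sqrt(recall * specificity)


# Hypothetical toy data: 4 minority samples (label 1) and 16 majority samples (label 0).
y_true = [1] * 4 + [0] * 16
# A classifier that finds 3 of the 4 minority samples and raises 1 false alarm.
y_pred = [1, 1, 1, 0] + [1] + [0] * 15
# A degenerate classifier that always predicts the majority class.
y_all_majority = [0] * 20

print(minority_metrics(y_true, y_pred))
print(minority_metrics(y_true, y_all_majority))
```

The second call illustrates why these metrics suit imbalanced data: always predicting the majority class is 80% accurate on this toy set, yet its minority recall and G-mean are both zero.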