《计算机应用研究》|Application Research of Computers

Research on text categorization based on LDA-wSVM model

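Authors: LI Feng-gang (李锋刚), LIANG Yu (梁钰), GAO Xiao-zhi, ZENGER Kai
Affiliations: 1. School of Management, Hefei University of Technology, Hefei 230009, China; 2. Department of Automation and Systems Technology, Aalto University, FI00076, Finland; 3. Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education, Hefei 230009, China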
Article ID: 1001-3695(2015)01-0021-05
DOI: 10.3969/j.issn.1001-3695.2015.01.005
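Abstract: The SVM classification algorithm handles high-dimensional data well, but it does not account for semantic similarity. The LDA topic model, by contrast, addresses the similarity-measurement and single-topic problems of traditional text classification. To combine the strengths of SVM and LDA and improve classification accuracy, this paper proposes a new LDA-wSVM classification model. The LDA topic model is used for modeling and feature selection, which determines the number of topics and the latent topic-document matrix. The classical weight calculation method is then improved by taking into account the association between each feature term and the categories, yielding a new weighting scheme. Finally, a wSVM classifier based on this weighting is applied in the feature-word space. In experiments on the R platform, news texts from Sogou Labs were classified with a macro-average of 0.943. The results indicate that the proposed LDA-wSVM model performs well in automatic text classification.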
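Keywords: text categorization; latent Dirichlet allocation; support vector machine; weight calculation; Gibbs sampling
Funding: National Natural Science Foundation of China (71301041)
Nanjing Science and Technology Plan Project (2012sf542010)
China State Scholarship Fund
Article URL: http://www.arocmag.com/article/01-2015-01-005.html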
Authors (English): LI Feng-gang, LIANG Yu, GAO Xiao-zhi, ZENGER Kai
Affiliations (English): 1. School of Management, Hefei University of Technology, Hefei 230009, China; 2. Dept. of Automation & Systems Technology, Aalto University, Aalto FI00076, Finland; 3. Key Laboratory of Process Optimization & Intelligent Decision-making of Ministry of Education, Hefei 230009, China
English abstract: The SVM algorithm has great advantages in dealing with high-dimensional data, but it does not consider semantic similarity measurement. The LDA topic model, however, can solve the problems of similarity measurement and single theme. To obtain more precise classification and make use of the advantages of both SVM and LDA, this paper proposes a new efficient method. First, it uses the LDA topic model for modeling and feature selection in order to determine the number of hidden topics and the topic-document matrix. Second, it proposes a new weight calculation method, based on classical weight calculation, that considers the correlation between features and categories. Finally, it applies this new method in the subsequent SVM classifier (wSVM). The experiments used R to categorize data obtained from Sogou Labs, reaching a macro_F1 of 0.943. The results verify that the LDA-wSVM model is superior in text categorization.
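The reported macro_F1 figure is a macro-averaged score. As a minimal sketch, assuming the common definition (the unweighted mean of per-class F1 scores), it could be computed as below. The paper's exact formula is not given here, the experiments used R rather than Python, and the function name `macro_f1` and the category labels in the usage note are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (an assumed definition).

    Classes are taken from the union of true and predicted labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative

    scores = []
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero
        # for a label that occurs in either list, but guard anyway.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels `["sports", "sports", "finance", "finance"]` and predictions `["sports", "finance", "finance", "finance"]`, the per-class F1 scores are 2/3 and 4/5, so the macro average is their mean. Because every class counts equally, a small class that is classified poorly lowers the score as much as a large one.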
English keywords: text categorization; latent Dirichlet allocation (LDA); support vector machine (SVM); weight calculation; Gibbs sampling
Received: 2014/1/7
Revised: 2014/2/21
Pages: 21-25
CLC number: TP391.1
Document code: A