《计算机应用研究》|Application Research of Computers

LSI_LDA: a hybrid method for feature dimensionality reduction

Authors: Shi Qingwei, Cong Shiyuan, Tang Xiaoliang
Affiliation: College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
Article ID: 1001-3695(2017)08-2269-05
DOI: 10.3969/j.issn.1001-3695.2017.08.006
Abstract: LDA does not take the structure of its input into account: it assigns topic labels to every word in the original term space, including non-content words, which blurs the resulting topic distributions. To address this shortcoming, this paper proposes a feature dimensionality reduction method that combines LSI and LDA. LSI first maps the original term space into a latent semantic space; key features are then selected from the original feature set according to their semantic relations; finally, the LDA model samples and builds topics over the smaller, more topical document subset. In text classification experiments on the Fudan University Chinese corpus, the new method improves classification accuracy by 1.50% over using LDA alone. The experiments show that the proposed LSI_LDA model achieves better classification performance in text categorization.
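The pipeline described in the abstract (LSI projection, semantic feature selection, then LDA on the reduced vocabulary) can be sketched roughly as follows. The paper does not publish code, so the function name, matrix layout, and term-scoring rule below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def lsi_select_terms(X, k_dims, n_keep):
    """Sketch of the LSI feature-selection step (illustrative only).

    X      : (docs x terms) count matrix
    k_dims : number of latent semantic dimensions to retain
    n_keep : number of terms to keep for the downstream LDA model
    """
    # LSI amounts to a truncated SVD of the document-term matrix
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = S[:k_dims, None] * Vt[:k_dims]   # singular-value-weighted term loadings
    # Score each term by its energy in the latent semantic space;
    # low-scoring terms carry little semantic signal (non-content words)
    scores = np.sum(V_k ** 2, axis=0)
    keep = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep)

# Toy corpus: 4 documents over 4 terms; term 3 is a rare low-signal term
X = np.array([[5., 5., 0., 1.],
              [4., 6., 0., 0.],
              [0., 0., 3., 1.],
              [0., 1., 4., 0.]])
kept = lsi_select_terms(X, k_dims=2, n_keep=3)
# The reduced matrix X[:, kept] would then be fed to an ordinary LDA model
```

Scoring terms by their loading energy in the top singular directions is one plausible reading of "screening key features by semantic relation"; the downstream LDA step (e.g. Gibbs sampling on the reduced matrix) is unchanged from standard LDA.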
Keywords: text categorization; feature dimensionality reduction; latent semantic indexing (LSI); latent Dirichlet allocation (LDA)
Funding: National Natural Science Foundation of China, Youth Science Fund (61401185); Science Research General Project of the Liaoning Provincial Department of Education (L2013133)
URL: http://www.arocmag.com/article/01-2017-08-006.html
Received: 2016-04-20
Revised: 2016-07-11
Pages: 2269-2273
CLC number: TP391.1
Document code: A