《计算机应用研究》|Application Research of Computers

多类型分类器融合的文本分类方法研究

Research on text classification method of multi-class classifier fusion

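Authors Li Huifu (李惠富), Lu Guang (陆光)
Affiliation College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China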
Article ID 1001-3695(2019)03-022-0752-04
DOI 10.19734/j.issn.1001-3695.2017.09.0908
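Abstract Most traditional text classification methods rely on a single classifier, but different classifiers emphasize different aspects of the classification task, so any single method has inherent limitations; likewise, each feature extraction method weighs feature words from a different angle. To address these problems, this paper proposes a text classification method based on the fusion of multiple types of classifiers. The model uses word2vec, principal component analysis, latent semantic indexing, and TFIDF as the feature extraction methods for the fused classifiers. Because the weighted voting scheme for multi-type classifiers ignores category information, a category-weighted method for computing classifier weights is also proposed. Experimental results show that the multi-type classifier fusion method performs well on binary corpora, multi-class corpora, and domain-specific corpora, and that the category-weighted classifier weight calculation improves classification performance by 1.19% over the multi-type classifier fusion method.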
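Keywords text classification; classifier fusion; principal component analysis; latent semantic indexing
Funding Natural Science Foundation of Heilongjiang Province (F201201)
Article URL http://www.arocmag.com/article/01-2019-03-022.html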
Authors (English) Li Huifu, Lu Guang
Affiliation (English) College of Information & Computer Engineering, Northeast Forestry University, Harbin 150040, China
English abstract Most traditional text classification methods use a single classifier, and different classifiers place different emphasis on the classification task, which limits any single classification method. At the same time, each feature extraction method considers feature words from a different angle. To address these problems, this paper proposes a text classification method based on multi-type classifier fusion, which combines word2vec, principal component analysis, latent semantic indexing, and TFIDF as the feature extraction methods for the fusion. Weighted voting over multi-type classifiers ignores category information, so this paper also proposes a category-weighted method for calculating classifier weights. The experimental results show that the multi-type classifier fusion method achieves good performance on binary corpora, multi-class corpora, and domain-specific corpora, and that the category-weighted classifier weighting improves classification performance by 1.19% compared with the multi-type classifier fusion method.
English keywords text classification; classifier fusion; principal component analysis; latent semantic indexing
Received 2017/9/4
Revised 2017/11/20
Pages 752-755
CLC number TP301.6
Document code A
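
The abstract describes weighted voting across several classifiers, with classifier weights that take the category into account, but this record does not give the weighting formula. The sketch below is therefore only an illustration under stated assumptions: each classifier's weight for a category is taken to be its precision for that category on held-out data, and the classifier names (`tfidf`, `lsi`, `word2vec`), the labels, and the helper functions `per_class_precision` and `fuse_votes` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def per_class_precision(y_true, y_pred, labels):
    """Precision of one classifier for each label on held-out data.

    Assumption: per-class precision stands in for the paper's
    category-weighted classifier weight, whose formula is not given here.
    """
    weights = {}
    for label in labels:
        # True labels of the samples this classifier assigned to `label`.
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        if predicted:
            weights[label] = sum(t == label for t in predicted) / len(predicted)
        else:
            # The classifier never predicts this label, so it gets no say.
            weights[label] = 0.0
    return weights


def fuse_votes(votes, class_weights):
    """Combine one vote per classifier into a single label.

    votes:         {classifier_name: predicted_label}
    class_weights: {classifier_name: {label: weight}}
    Each vote counts with the weight its classifier holds for the label it chose.
    """
    scores = defaultdict(float)
    for name, label in votes.items():
        scores[label] += class_weights[name].get(label, 0.0)
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical held-out labels and predictions from three classifiers,
# each imagined as trained on a different feature representation.
labels = ["finance", "sports"]
y_val = ["sports", "sports", "finance", "finance", "finance", "sports"]
val_predictions = {
    "tfidf": ["sports", "sports", "finance", "sports", "finance", "sports"],
    "lsi": ["sports", "finance", "finance", "finance", "finance", "finance"],
    "word2vec": ["finance", "sports", "finance", "finance", "sports", "sports"],
}
weights = {
    name: per_class_precision(y_val, preds, labels)
    for name, preds in val_predictions.items()
}

# Fuse the three classifiers' votes for one new document.
fused = fuse_votes(
    {"tfidf": "finance", "lsi": "sports", "word2vec": "sports"}, weights
)
```

With a single weight per classifier, a vote counts the same whichever category it names; with per-category weights, a classifier that is reliable for one category and unreliable for another contributes accordingly, which is the kind of category information the abstract says plain weighted voting ignores.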