《计算机应用研究》|Application Research of Computers

文本分类中一种特征选择方法研究

Study on feature selection method in text classification

Authors 赵婧, 邵雄凯, 刘建舟, 王春枝
Affiliation 湖北工业大学 计算机学院, 武汉 430068
Article ID 1001-3695(2019)08-004-2261-05
DOI 10.19734/j.issn.1001-3695.2018.01.0078
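Abstract This paper analyzes the shortcomings of two traditional feature selection methods for text classification, the chi-square statistic and information gain, and concludes that the key to feature selection is to pick feature words that are concentrated in one class of documents and that occur evenly and frequently within that class. Accordingly, it considers the document frequency and term frequency of each feature word together with its inter-class concentration and intra-class dispersion, and proposes a feature selection evaluation function based on intra-class and inter-class document frequency and term frequency statistics. The function is used to select a given proportion of feature words from each class of the training set to form that class's feature word library, and the feature word library of the training set is the union of the per-class libraries. Chinese text classification experiments with an SVM show that, compared with the traditional chi-square statistic and information gain, the method improves classification performance to some extent.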
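The selection procedure summarized in the abstract (score each word per class, keep a proportion of the best words in every class, take the union) can be sketched as follows. The abstract does not give the evaluation function itself, so the score below is a hypothetical stand-in built from the same ingredients: `concentration`, `dispersion`, the log-scaled term frequency, and the names `class_scores`, `select_features` and `ratio` are all assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter, defaultdict


def class_scores(docs):
    """Score every (class, term) pair.

    docs: list of (label, tokens) pairs.
    The score is an illustrative stand-in, NOT the paper's function:
    concentration * dispersion * log(1 + term frequency in the class).
    """
    n_docs = Counter()             # class -> number of documents
    df_in = defaultdict(Counter)   # class -> term -> document frequency
    tf_in = defaultdict(Counter)   # class -> term -> term frequency
    df_all = Counter()             # term -> document frequency over all classes

    for label, tokens in docs:
        n_docs[label] += 1
        tf_in[label].update(tokens)
        for term in set(tokens):
            df_in[label][term] += 1
            df_all[term] += 1

    scores = {}
    for label in n_docs:
        scores[label] = {}
        for term, df in df_in[label].items():
            # Share of the term's documents that fall in this class.
            concentration = df / df_all[term]
            # Share of this class's documents that contain the term.
            dispersion = df / n_docs[label]
            # Dampened count of the term's occurrences in this class.
            frequency = math.log1p(tf_in[label][term])
            scores[label][term] = concentration * dispersion * frequency
    return scores


def select_features(docs, ratio=0.5):
    """Keep the top `ratio` of terms in each class and return their union."""
    vocabulary = set()
    for term_scores in class_scores(docs).values():
        k = max(1, math.ceil(ratio * len(term_scores)))
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(term_scores, key=lambda t: (-term_scores[t], t))
        vocabulary.update(ranked[:k])
    return vocabulary


sample = [
    ("sports", ["match", "goal", "team", "the"]),
    ("sports", ["goal", "team", "goal", "the"]),
    ("finance", ["stock", "market", "the"]),
    ("finance", ["stock", "bank", "stock", "the"]),
]
print(sorted(select_features(sample, ratio=0.25)))  # ['goal', 'stock']
```

In this toy corpus "the" appears in every document of both classes, so its concentration is only 0.5 and it ranks below "goal" and "stock", which are confined to one class and present in all of that class's documents. The resulting vocabulary would then feed a classifier such as the SVM mentioned in the abstract.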
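Keywords text classification; feature selection; dispersion; concentration; frequency
Funding National Natural Science Foundation of China, General Program (61772180)
Article URL http://www.arocmag.com/article/01-2019-08-004.html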
English title Study on feature selection method in text classification
Authors (English) Zhao Jing, Shao Xiongkai, Liu Jianzhou, Wang Chunzhi
Affiliation (English) School of Computer Science, Hubei University of Technology, Wuhan 430068, China
English abstract The traditional feature selection methods in text classification, the chi-square test and information gain, have inherent defects. This paper analyzes them and finds that the key to feature selection in text classification is to select feature words that are concentrated in one class of documents and that are distributed evenly and occur frequently within that class. Feature selection should therefore consider not only the document frequency and term frequency of feature words, but also their inter-class concentration degree and intra-class scatter degree. The paper proposes a feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics. The function selects a certain proportion of the feature words in each category of the training set to form the feature word library of that category, and the feature word library of the whole training set is the union of these per-category libraries. Chinese text classification experiments based on SVM show that the proposed method improves the effectiveness of text classification to a certain extent compared with the traditional chi-square test and information gain.
English keywords text classification; feature selection; distribution; concentration; frequency
Received 2018/1/31
Revised 2018/3/21
Pages 2261-2265
CLC number TP391
Document code A