《计算机应用研究》|Application Research of Computers

结合改进的CHI统计方法的TF-IDF算法优化

Optimization of TF-IDF algorithm combined with improved CHI statistical method

Authors 马莹,赵辉,李万龙,庞海龙,崔岩
Affiliation 长春工业大学 计算机科学与工程学院,长春 130012
Article ID 1001-3695(2019)09-008-2596-03
DOI 10.19734/j.issn.1001-3695.2018.01.0136
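Abstract The traditional CHI statistical method has two weaknesses: the frequency of a feature item can be negatively correlated with a category, and the method does not account for the probability that a given feature item occurs in a given text. To address them, the traditional CHI statistical method is improved by introducing factors such as a negative-correlation check and term frequency, and the TF-IDF algorithm is optimized by combining it with a semantic similarity calculation. A KNN (K-nearest neighbor) classifier and a support vector machine (SVM) classifier are each used in WEKA to classify a Weibo sentiment corpus. The experimental results show that the new method clearly improves text classification accuracy.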
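Keywords text classification; CHI statistics; TF-IDF algorithm; feature selection
Funding National Natural Science Foundation of China (61472049)
Science and Technology Research Project of the Jilin Provincial Department of Education for the 12th Five-Year Plan (2014132)
Article URL http://www.arocmag.com/article/01-2019-09-008.html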
English title Optimization of TF-IDF algorithm combined with improved CHI statistical method
Author names (English) Ma Ying, Zhao Hui, Li Wanlong, Pang Hailong, Cui Yan
Affiliation (English) College of Computer Science & Engineering, Changchun University of Technology, Changchun 130012, China
English abstract To overcome two problems of the traditional CHI statistical method, namely that the frequency of a feature item may be negatively correlated with the category, and the question of the probability that a feature item occurs in a given text, the traditional CHI statistical method was improved by introducing important factors such as negative-correlation judgment and frequency, and the TF-IDF algorithm was optimized by combining it with a semantic similarity calculation. The K-nearest neighbor (KNN) classifier and the support vector machine (SVM) classifier were each used in WEKA to classify a Weibo sentiment corpus. The experimental results show that the new method clearly improves the accuracy of text classification.
English keywords text categorization; CHI statistics; TF-IDF algorithm; feature selection
Received 2018/1/27
Revised 2018/4/10
Pages 2596-2598,2603
CLC number TP301.6
Document code A
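
For readers unfamiliar with the two baseline techniques named in the abstract, the following is a minimal sketch of the textbook CHI (chi-square) statistic for a term/category pair and of plain TF-IDF weighting. It is not the authors' improved method, whose formulas are not given in this record. The function names, the contingency-count parameters `a`, `b`, `c`, `d`, and the use of the sign of `a*d - b*c` as a negative-correlation check are assumptions made for illustration only.

```python
import math


def chi_square(a, b, c, d):
    """Textbook CHI statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    (Parameter names are illustrative assumptions.)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def is_negatively_correlated(a, b, c, d):
    """Assumed check: the squared term above hides the sign of a*d - b*c,
    so a term that mostly occurs outside the category still scores high."""
    return a * d - b * c < 0


def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """Plain TF-IDF: relative term frequency times log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / doc_freq)
    return tf * idf
```

In this sketch, a term that appears mainly inside the category and a term that appears mainly outside it receive the same CHI score, which illustrates the negative-correlation issue the abstract mentions; the separate sign check is one simple way to tell the two cases apart.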