《计算机应用研究》|Application Research of Computers

Study about effect of relevant quantitative indexes of training set in text classification

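Authors LI Xiang-dong (李湘东), CAO Huan (曹环), HUANG Li (黄莉)
Affiliation Wuhan University: a. School of Information Management; b. Center for the Studies of Information Resources; c. Library, Wuhan 430072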
Article ID 1001-3695(2014)11-3324-04
DOI 10.3969/j.issn.1001-3695.2014.11.028
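Abstract To examine how the training set affects classification performance, this study considers three quantitative indexes of the training set: the number of texts, the number of categories, and the number of features. Multifactor analysis of variance is applied to several corpora to quantify how these three indexes influence classification performance. The results show that the effect of the number of features differs across different numbers of texts and categories, so classification performance is subject to the interaction of the three indexes. By optimizing these three indexes of the training set, the paper proposes a way to improve classification performance that is separate from the choice of classification algorithm or feature selection method. Experiments on real data show that the method effectively improves classification performance.
Keywords training set optimization; text classification; multifactor analysis of variance; corpus; relevant quantitative indexes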
Article URL http://www.arocmag.com/article/01-2014-11-028.html
English title Study about effect of relevant quantitative indexes of training set in text classification
Authors (English) LI Xiang-dong, CAO Huan, HUANG Li
Affiliation (English) a. School of Information Management, b. Center for the Studies of Information Resources (CSIR), c. Library, Wuhan University, Wuhan 430072, China
English abstract This paper studies how three quantitative indexes of the training set (the number of features, the number of categories, and the number of texts in each category) affect the performance of an automatic text categorization system. It uses multifactor analysis of variance (multiple ANOVA) on several corpora to explore how the three indexes influence system performance. The results show that the effect of the number of features on classification accuracy depends on the number of texts and the number of categories, and that the three indexes affect classification accuracy interactively. The paper proposes a new way to improve classification performance by optimizing these quantitative indexes of the training set. Experiments on real-world data show that the proposed method performs relatively well for text categorization.
English keywords training set optimization; text classification; multiple ANOVA; corpus; relevant quantitative indexes
Received 2013/10/7
Revised 2013/12/2
Pages 3324-3327, 3332
CLC number TP391
Document code A
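
The abstract describes a multifactor analysis of variance over three training-set indexes (number of features, texts, and categories) and reports that they interact. As a rough illustration of what such a decomposition computes, here is a minimal sketch in plain Python. It assumes a balanced full-factorial design. The function names (`factorial_sums_of_squares`, `synthetic_accuracy`), the factor levels, and the accuracy values are all hypothetical and made up for illustration; none of them come from the paper. The sketch only splits the variation into main-effect and interaction sums of squares and does not run F-tests.

```python
from collections import defaultdict
from itertools import combinations, product
from statistics import mean


def factorial_sums_of_squares(rows, factors, response):
    """Sum of squares for every main effect and interaction.

    Assumes a balanced full-factorial design: every combination of
    factor levels appears the same number of times in `rows`.
    """
    def cell_means(subset):
        # Mean response and run count for each level combination of `subset`.
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[f] for f in subset)].append(row[response])
        return {key: (mean(vals), len(vals)) for key, vals in groups.items()}

    # Marginal means for every subset of factors (the empty subset is the grand mean).
    means = {
        subset: cell_means(subset)
        for k in range(len(factors) + 1)
        for subset in combinations(factors, k)
    }

    ss = {}
    for k in range(1, len(factors) + 1):
        for subset in combinations(factors, k):
            total = 0.0
            for key, (_, n) in means[subset].items():
                level = dict(zip(subset, key))
                # Inclusion-exclusion over lower-order means isolates this effect.
                effect = 0.0
                for j in range(k + 1):
                    for sub in combinations(subset, j):
                        sub_key = tuple(level[f] for f in sub)
                        effect += (-1) ** (k - j) * means[sub][sub_key][0]
                total += n * effect ** 2
            ss[subset] = total
    return ss


def synthetic_accuracy(n_features, n_texts, n_categories):
    """Made-up response surface (NOT results from the paper).

    Only features and texts interact; categories act additively.
    """
    additive = 0.60 + 0.00002 * n_features + 0.0005 * n_texts - 0.01 * n_categories
    interaction = 0.0000002 * n_features * n_texts
    return additive + interaction


FACTORS = ("features", "texts", "categories")
rows = [
    {"features": f, "texts": t, "categories": c,
     "accuracy": synthetic_accuracy(f, t, c)}
    for f, t, c in product((500, 1000, 2000), (50, 100, 200), (4, 8))
]

ss = factorial_sums_of_squares(rows, FACTORS, "accuracy")
for effect, value in ss.items():
    print(" x ".join(effect), f"{value:.6f}")
```

In this synthetic setup the features x texts term comes out clearly non-zero, while the terms involving categories in an interaction are zero up to rounding. That is the kind of pattern an interaction analysis is meant to expose. Testing significance would additionally require replicated runs per cell and an F distribution, which this sketch leaves out.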