《计算机应用研究》|Application Research of Computers

Text filtering method for Uyghur forums based on term selection and a Rocchio classifier

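Authors Ruxianguli·Abudurexiti (如先姑力·阿布都热西提), Yasen·Aizezi (亚森·艾则孜), Aishan·Wumaier (艾山·吾买尔), Alimujiang·Aisha (阿力木江·艾沙)
Affiliations 1. Department of Information Security Engineering, Xinjiang Police College, Urumqi 830013; 2. a. College of Information Science and Engineering, b. Network Centre, Xinjiang University, Urumqi 830046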
Article number 1001-3695(2019)03-057-0925-05
DOI 10.19734/j.issn.1001-3695.2017.10.0941
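Abstract To address text filtering in Uyghur web forums, this paper proposes a text filtering method based on term selection and a Rocchio classifier. First, the forum text is preprocessed to remove useless words, and stems (terms) are extracted using an N-gram statistical model. Next, a balanced mutual information term selection method (BMITS), which gives balanced weight to relevance and redundancy, reduces the dimensionality of the initial term set to produce a compact term set. Finally, the text feature terms are fed to a Rocchio classifier, and its classifications are used to filter objectionable text out of the forum. Experiments on a relevant dataset show that the proposed method accurately identifies objectionable text, which demonstrates its effectiveness.
Keywords Uyghur forum; text filtering; N-gram statistical model; term selection; Rocchio classifier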
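Funding National Natural Science Foundation of China (61762086)
National Social Science Foundation of China (13CFX055)
Key Project of the Scientific Research Program of Colleges and Universities of the Xinjiang Uygur Autonomous Region (XJEDU2017M046)
Article URL http://www.arocmag.com/article/01-2019-03-057.html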
Authors (English) Ruxianguli·Abudurexiti, Yasen·Aizezi, Aishan·Wumaier, Alimujiang·Aisha
Affiliations (English) 1. Dept. of Information Security Engineering, Xinjiang Police College, Urumqi 830013, China; 2. a. College of Information Science & Engineering, b. Network Centre, Xinjiang University, Urumqi 830046, China
English abstract For the problem of text filtering in Uyghur web forums, this paper proposes a text filtering method based on term selection and a Rocchio classifier. First, it preprocesses the forum text to remove useless words and extracts stems (terms) based on an N-gram statistical model. Then, it proposes a balanced mutual information term selection method (BMITS), which balances relevance and redundancy, to reduce the dimensionality of the initial term set and obtain a reduced term set. Finally, it takes the text feature terms as input and uses a Rocchio classifier to filter out bad text. The experimental results show that the proposed method accurately identifies bad text and is effective.
English keywords Uyghur forum; text filtering; N-gram statistical model; term selection; Rocchio classifier
Received 2017/10/26
Revised 2017/12/7
Pages 925-929
CLC number TP391
Document code A
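
The pipeline summarized in the abstract (term selection followed by Rocchio classification) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: documents are already lists of stems (the N-gram stemming step is skipped), the greedy "relevance minus `beta` times mean redundancy" score is a generic mRMR-style stand-in for BMITS (whose exact formula the abstract does not give), the Rocchio classifier is reduced to its plain class-centroid form with cosine similarity, and the toy tokens, labels, and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def select_terms(docs, labels, k, beta=0.5):
    """Greedily pick k terms by relevance minus beta * mean redundancy.

    ASSUMPTION: this mRMR-style score stands in for BMITS; the paper's
    actual criterion may differ.
    """
    vocab = sorted({t for d in docs for t in d})
    presence = {t: [int(t in d) for d in docs] for t in vocab}
    relevance = {t: mutual_information(presence[t], labels) for t in vocab}
    selected = []

    def score(t):
        if not selected:
            return relevance[t]
        redundancy = sum(mutual_information(presence[t], presence[s])
                         for s in selected) / len(selected)
        return relevance[t] - beta * redundancy

    while len(selected) < min(k, len(vocab)):
        selected.append(max((t for t in vocab if t not in selected), key=score))
    return selected


def vectorize(doc, terms):
    """L2-normalized term-count vector restricted to the selected terms."""
    counts = Counter(doc)
    vec = [float(counts[t]) for t in terms]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def train_rocchio(docs, labels, terms):
    """Centroid-only Rocchio: one mean vector per class."""
    sums, sizes = {}, Counter(labels)
    for doc, y in zip(docs, labels):
        acc = sums.setdefault(y, [0.0] * len(terms))
        for i, v in enumerate(vectorize(doc, terms)):
            acc[i] += v
    return {y: [v / sizes[y] for v in acc] for y, acc in sums.items()}


def classify(doc, centroids, terms):
    """Return the class whose centroid has the highest cosine similarity."""
    vec = vectorize(doc, terms)

    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

    return max(centroids, key=lambda y: cosine(vec, centroids[y]))


# Hypothetical toy corpus: English placeholder tokens, not real Uyghur stems.
docs = [["spam", "offer", "click"], ["spam", "click", "now"],
        ["spam", "offer", "win"], ["match", "football", "score"],
        ["match", "team", "score"], ["match", "football", "news"]]
labels = ["bad", "bad", "bad", "ok", "ok", "ok"]

terms = select_terms(docs, labels, k=2)
centroids = train_rocchio(docs, labels, terms)
posts = [["spam", "win"], ["football", "match"]]
kept = [p for p in posts if classify(p, centroids, terms) != "bad"]
print(terms, kept)
```

Filtering here simply drops every post assigned to the `bad` class. A full Rocchio formulation would also subtract a weighted centroid of the other classes, which this sketch omits to stay short.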