Application Research of Computers (《计算机应用研究》)

Classification model based on support vector machine for Chinese extremely short text (基于支持向量机的中文极短文本分类模型)

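Authors Wang Yang (王杨), Xu Shanshan (许闪闪), Li Chang (李昌), Ai Shicheng (艾世成), Zhang Weidong (张卫东), Zhen Lei (甄磊), Meng Dan (孟丹)
Affiliation School of Computer and Information, Anhui Normal University (安徽师范大学 计算机与信息学院), Wuhu, Anhui 241000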
Article ID 1001-3695(2020)02-006-0347-04
DOI 10.19734/j.issn.1001-3695.2018.06.0514
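Abstract To extract key feature information from extremely short texts effectively, this paper proposes an extremely short text classification model based on the support vector machine (SVM). The raw data is first cleaned and then segmented with jieba; the processed data is stored in a database, and text features are extracted with TF-IDF; the SVM is then used to classify the extremely short texts. A 1-0 test verifies the validity of the model. The experiments use 9,906 extremely short texts from the Wuhu city community management platform as samples to test and analyze the algorithm. The results show that the method achieves higher classification accuracy than traditional methods such as naive Bayes, logistic regression, and decision tree, and that its results are more balanced in terms of misclassification and precision.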
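The pipeline described in the abstract (segmentation, TF-IDF feature extraction, SVM classification) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation, and several things here are assumptions: the token lists are hand-written stand-ins for what a segmenter such as `jieba.lcut` would return, the TF-IDF variant is the plain `tf * log(N / df)` form (the abstract does not say which variant was used), the classifier is a binary linear SVM without a bias term trained by subgradient descent on the regularized hinge loss, and the toy data, labels, and hyperparameters (`lam`, `lr`, `epochs`) are invented for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute idf(t) = log(N / df(t)) over tokenized training documents."""
    n_docs = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(n_docs / count) for token, count in df.items()}


def tfidf_vector(tokens, idf):
    """Map a token list to a sparse {token: tf * idf} dict; unseen tokens are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(t, 0.0) * v for t, v in x.items())


def train_linear_svm(vectors, labels, lam=0.01, lr=0.5, epochs=50):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss (labels are +1 / -1)."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            violated = y * dot(w, x) < 1.0  # margin check uses the current weights
            for t in w:                     # L2 shrinkage step
                w[t] *= 1.0 - lr * lam
            if violated:                    # hinge-loss subgradient step
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * v
    return w


def predict(w, x):
    """Return +1 or -1 from the sign of the decision score."""
    return 1 if dot(w, x) > 0 else -1


# Hypothetical pre-segmented complaints: +1 = street lighting, -1 = noise.
train_tokens = [
    ["路灯", "损坏"],
    ["路灯", "不亮"],
    ["噪音", "扰民"],
    ["夜间", "噪音"],
]
train_labels = [1, 1, -1, -1]

idf = fit_idf(train_tokens)
train_vectors = [tfidf_vector(doc, idf) for doc in train_tokens]
weights = train_linear_svm(train_vectors, train_labels)

# "小区" never appears in training, so it contributes nothing to the score.
print(predict(weights, tfidf_vector(["小区", "路灯", "不亮"], idf)))
print(predict(weights, tfidf_vector(["夜间", "扰民"], idf)))
```

Extremely short texts yield very sparse vectors, which is why the sketch stores features as dictionaries rather than dense arrays. A multi-class setting, as a real complaint-routing task would likely need, could be built from such binary classifiers with a one-vs-rest scheme, though the abstract does not state which scheme the paper used.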
Keywords support vector machine; jieba segmentation; extremely short text classification; TF-IDF
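Funding National Natural Science Foundation of China (61871412)
Natural Science Foundation of Anhui Province (1808085MF178)
Anhui Province Humanities and Social Sciences Foundation (SK2014ZD033, AHSKY2017D42)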
Article URL http://www.arocmag.com/article/01-2020-02-006.html
English title Classification model based on support vector machine for Chinese extremely short text
Author names (English) Wang Yang, Xu Shanshan, Li Chang, Ai Shicheng, Zhang Weidong, Zhen Lei, Meng Dan
Affiliation (English) School of Information & Computer Science, Anhui Normal University, Wuhu, Anhui 241000, China
English abstract In order to extract the key features from extremely short texts effectively, this paper proposes an extremely short text classification model based on SVM. First, the original data is cleaned, and the cleaned data is processed by jieba segmentation and TF-IDF. Then a 1-0 test verifies the validity of the model. Finally, 9,906 extremely short texts from the Wuhu city community management platform are used as samples in the experiment. The results show that the proposed method effectively improves classification accuracy compared with traditional methods such as naive Bayes, logistic regression, and decision tree. At the same time, the matching results in terms of misclassification and accuracy are more balanced.
English keywords support vector machine (SVM); jieba segmentation; extremely short text; TF-IDF
Received 2018/6/29
Revised 2018/8/28
Pages 347-350
CLC number TP391.1
Document code A