Application Research of Computers (《计算机应用研究》)

New approach of financial text classification based on semantic annotation features

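Authors 罗明 (Luo Ming), 黄海量 (Huang Hailiang)
Affiliation Shanghai University of Finance and Economics: a. College of Information Management and Engineering; b. Shanghai Key Laboratory of Financial Information Technology, Shanghai 200433, China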
Article ID 1001-3695(2018)08-2281-04
DOI 10.3969/j.issn.1001-3695.2018.08.010
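Abstract Bag-of-words machine learning methods for text classification suffer from high dimensionality, high sparsity, an inability to recognize synonyms, and missing semantic information, while text classification based on rule patterns achieves relatively high accuracy but is less robust. To address both sets of problems, this paper proposes a new method that uses lexical-semantic rule patterns to extract event semantic annotation information from financial news text and then uses that information as classification features in machine learning text classification. Experiments show that, compared with bag-of-words text classification using the same feature selection algorithm and the same classification algorithm, the method improves the F1 score by 8.6%, precision by 7.7%, and recall by 8.8%. The method combines the strengths of knowledge-driven and data-driven approaches to text classification while avoiding their main weaknesses, and it has significant practical and research reference value.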
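The idea in the abstract, pattern-derived event annotations used as classification features in place of raw words, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the gazetteer entries, the event labels (`EVENT_ACQUISITION`, `EVENT_PROFIT_UP`), the English sample sentences, and the helper names are all hypothetical, and plain regular expressions stand in for whatever pattern engine the paper actually uses.

```python
import re
from collections import Counter

# Hypothetical gazetteer: semantic class -> surface forms (illustrative only).
GAZETTEER = {
    "COMPANY": ["Acme Corp", "Globex", "Initech"],
    "ACQUIRE_VERB": ["acquires", "buys", "takes over"],
    "PROFIT_UP_VERB": ["profit rises", "profit jumps", "earnings surge"],
}


def cls(name):
    """Build a regex alternation matching any surface form of a semantic class."""
    forms = sorted(GAZETTEER[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(form) for form in forms) + ")"


# Hypothetical lexical-semantic patterns: event label -> sequence of classes.
PATTERNS = {
    "EVENT_ACQUISITION": re.compile(
        cls("COMPANY") + r"\s+" + cls("ACQUIRE_VERB") + r"\s+" + cls("COMPANY")
    ),
    "EVENT_PROFIT_UP": re.compile(cls("COMPANY") + r"\s+" + cls("PROFIT_UP_VERB")),
}


def annotate(text):
    """Count how often each event pattern matches in the text."""
    counts = Counter()
    for label, pattern in PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[label] = hits
    return counts


def feature_vector(text):
    """One dimension per event label (sorted by name), not one per word."""
    counts = annotate(text)
    return [counts.get(label, 0) for label in sorted(PATTERNS)]


if __name__ == "__main__":
    for headline in ["Globex buys Initech", "Acme Corp acquires Globex"]:
        print(headline, feature_vector(headline))
```

In this sketch the two headlines use different verbs ("buys", "acquires") yet map to the same feature, and the vector has one dimension per event type instead of one per vocabulary word. The resulting vectors could then be passed to any standard feature selection and classification step.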
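Keywords text classification; financial text; semantic annotation; lexical-semantic pattern; finite state machine
Funding Shanghai Science and Technology Talent Program (14XD1421000)
Shanghai Science and Technology Innovation Action Plan (16511102900)
Shanghai University of Finance and Economics 2014 Graduate Innovation Fund (CXJJ-2014-438)
Article URL http://www.arocmag.com/article/01-2018-08-010.html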
Author names (English) Luo Ming, Huang Hailiang
Affiliation (English) a. College of Information Management & Engineering, b. Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance & Economics, Shanghai 200433, China
English abstract Traditional machine learning text classification methods based on BOW (bag of words) suffer from high dimensionality and high sparseness, cannot identify synonyms, and lack semantic information. Meanwhile, rule-based methods have high precision but weaker robustness. To solve these problems, this paper proposes a novel method that uses lexical-semantic patterns to extract event semantic annotations from financial news text and applies these annotations as features in machine learning classification. Experiments show that, with the same feature selection algorithm and classification algorithm, the method raises the F1 value by 8.6%, precision by 7.7%, and recall by 8.8% over BOW. The method combines the advantages of knowledge-driven and data-driven approaches to text classification while avoiding their major drawbacks, and it has good practical and research reference value.
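For reference, the three reported metrics have the standard definitions below. The symbols TP, FP, and FN (true positives, false positives, false negatives for a given class) are introduced here for illustration and do not appear in the abstract, which also does not state whether the reported gains are absolute percentage points or relative improvements.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```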
Keywords (English) text classification; financial text; semantic annotation; lexical-semantic pattern; finite state machine
Received 2017/4/9
Revised 2017/5/24
Pages 2281-2284, 2288
CLC number TP391.1
Document code A