《计算机应用研究》|Application Research of Computers

基于变系数词性空间权值定义的英文句子相似度算法研究

Research on English sentence similarity algorithm based on variable modulus part of speech space definition

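Authors 黄贤英, 张金鹏, 赵明军, 刘英涛
Affiliation 重庆理工大学 计算机科学与工程学院 (College of Computer Science and Engineering, Chongqing University of Technology), Chongqing 400054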
Article ID 1001-3695(2015)04-0996-04
DOI 10.3969/j.issn.1001-3695.2015.04.008
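Abstract The terms in a short text are partitioned by part of speech to build part-of-speech vectors, and the terms in each part-of-speech vector are merged to build a part-of-speech space. The paper proposes, for the first time, defining the weights of the part-of-speech spaces dynamically. The weight with which a term maps into a part-of-speech space is obtained from the term's frequency information and the WordNet semantic dictionary, and the similarity computation between short texts is converted into a joint computation over the similarities of the individual part-of-speech spaces. The improved text similarity algorithm was applied to the Microsoft Research Paraphrase Corpus; the experimental results show that it considerably improves the accuracy and stability of text similarity computation.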
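The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's actual formulas, so everything specific here is an assumption: the input is taken to be already POS-tagged, `word_sim` is a hypothetical exact-match stand-in for a WordNet-based word similarity, a term's mapped weight is taken as its frequency times that similarity, and each part-of-speech space is weighted by its number of distinct terms. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def word_sim(a, b):
    """Hypothetical stand-in for a WordNet-based similarity in [0, 1].

    Assumption: exact match only; the paper uses WordNet here.
    """
    return 1.0 if a == b else 0.0


def pos_groups(tagged):
    """Group the terms of one tagged sentence by part of speech, with counts."""
    groups = {}
    for word, pos in tagged:
        groups.setdefault(pos, Counter())[word.lower()] += 1
    return groups


def project(terms, space):
    """Map a sentence's terms onto a POS space (an ordered list of terms).

    Assumed weighting: best (term frequency * word similarity) per dimension.
    """
    return [
        max((tf * word_sim(t, s) for t, tf in terms.items()), default=0.0)
        for s in space
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_similarity(tagged_a, tagged_b):
    """Combine per-POS-space similarities using space-dependent weights."""
    ga, gb = pos_groups(tagged_a), pos_groups(tagged_b)
    weighted_sum = 0.0
    weight_total = 0
    for pos in set(ga) | set(gb):
        ta, tb = ga.get(pos, Counter()), gb.get(pos, Counter())
        space = sorted(set(ta) | set(tb))  # merged terms form the POS space
        weight = len(space)  # assumed dynamic weight: size of the space
        weighted_sum += weight * cosine(project(ta, space), project(tb, space))
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


if __name__ == "__main__":
    s1 = [("dog", "NN"), ("runs", "VB")]
    s2 = [("dog", "NN"), ("sleeps", "VB")]
    print(sentence_similarity(s1, s2))
```

In this sketch the weights depend on the sentence pair rather than being fixed per part of speech, which is one possible reading of "dynamic" weighting; replacing `word_sim` with a real WordNet measure would let near-synonyms contribute partial similarity.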
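Keywords WordNet semantic dictionary; term semantic space mapping; variable part-of-speech space weights; term frequency; sentence similarity algorithm
Funding National Natural Science Foundation of China (61173184)
Science and Technology Program of the Chongqing Municipal Education Commission (KJ100821)
Chongqing University of Technology Graduate Innovation Fund (YCX2012317)
Article URL http://www.arocmag.com/article/01-2015-04-008.html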
Authors (English) HUANG Xian-ying, ZHANG Jin-peng, ZHAO Ming-jun, LIU Ying-tao
Affiliation (English) College of Computer Science & Engineering, Chongqing University of Technology, Chongqing 400054, China
English abstract This paper divides a short text into several part-of-speech vectors according to the part of speech of each term, and merges the terms in each part-of-speech vector to form a part-of-speech space. It proposes, for the first time, a strategy for dynamically defining the weights of the part-of-speech spaces. The weight of a term in a part-of-speech space is obtained from the term's frequency in the short text and the WordNet semantic library, and the similarity calculation between short texts is converted into a joint calculation over the similarities of the part-of-speech spaces. Experimental results on an open benchmark dataset, the Microsoft Research Paraphrase Corpus (MSRP), show that the proposed algorithm achieves higher accuracy and stability than the traditional algorithm.
English keywords WordNet semantic library; term semantic space mapping; changing weight of part of speech space; term frequency; sentence similarity algorithm
Received 2014/3/11
Revised 2014/4/28
Pages 996-999
CLC number TP391.43
Document code A