《计算机应用研究》|Application Research of Computers

Unsupervised text simplification with sequence-to-sequence model

Authors Li Tianyu, Li Yun, Qian Zhenyu
Affiliation School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225137, China
Article ID 1001-3695(2021)01-018-0093-04
DOI 10.19734/j.issn.1001-3695.2019.11.0611
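Abstract Training a text simplification model based on sequence-to-sequence (seq2seq) learning requires a large-scale parallel corpus, but corpora that are both large and well annotated are hard to obtain. To address this, the paper proposes an unsupervised text simplification method in which the model learns only from unlabeled corpora of complex sentences and simple sentences. First, denoising autoencoders are trained separately on the simple-sentence corpus and the complex-sentence corpus, yielding a simple-sentence autoencoder and a complex-sentence autoencoder. Next, the two autoencoders are combined to form an initial text simplification model and an initial text complication model. Finally, a back-translation strategy converts the unsupervised text simplification problem into a supervised one, and the text simplification model is optimized iteratively. Experiments on a standard dataset show that the method outperforms existing unsupervised models on the common metrics BLEU and SARI, and that it simplifies text at both the lexical and the syntactic level.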
Keywords text simplification; unsupervised; sequence-to-sequence model; denoising autoencoder
Funding National Natural Science Foundation of China (61703362)
Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX19_0888)
Article URL http://www.arocmag.com/article/01-2021-01-018.html
English title Unsupervised text simplification with sequence-to-sequence model
Authors (English) Li Tianyu, Li Yun, Qian Zhenyu
Affiliation (English) School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225137, China
English abstract Training a seq2seq-based text simplification model requires large-scale parallel corpora. However, the task currently lacks parallel corpora that are both large and well labeled. To address this issue, this paper proposes an unsupervised text simplification algorithm whose model needs only unlabeled datasets of simple and complex sentences. First, the method uses denoising autoencoders to learn from the simple-sentence corpus and the complex-sentence corpus separately, obtaining a simple-sentence autoencoder and a complex-sentence autoencoder. Then, it combines the two autoencoders to form an initial text simplification model and an initial text complication model. Finally, it uses back-translation to convert the unsupervised text simplification problem into a supervised one and iteratively optimizes the text simplification model. Experiments on a standard dataset show that the method outperforms existing unsupervised models on the common metrics BLEU and SARI, and that the model simplifies text at both the lexical and the syntactic level.
English keywords text simplification; unsupervised; sequence-to-sequence (seq2seq) model; denoising autoencoder
Received 2019/11/22
Revised 2020/1/7
Pages 93-96,100
CLC number TP391
Document code A
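
Illustrative sketch The abstract names two ingredients, denoising autoencoding and back-translation, without giving implementation details. The Python sketch below shows one plausible shape for each and should not be read as the authors' implementation. The noise model (word dropout plus a bounded local shuffle) is an assumption, since the abstract does not say which corruption is used. The names `add_noise` and `back_translation_round` are hypothetical, and the two models are stand-in callables rather than trained seq2seq networks.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shuffle_dist=3, rng=None):
    """Corrupt a token list for denoising-autoencoder training.

    Assumed noise model (not taken from the abstract): each token is dropped
    with probability drop_prob, then the survivors are locally shuffled so
    that each token stays within roughly max_shuffle_dist positions.
    """
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    # Sort by original position plus a random offset in [0, max_shuffle_dist).
    keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


def back_translation_round(simple_sents, complex_sents, simplify, complicate):
    """Build pseudo-parallel (source, target) pairs for one round.

    simplify and complicate are callables mapping a token list to a token
    list. The target side of every pair is a real sentence; only the source
    side is synthetic, which is what turns the task into a supervised one.
    """
    # Real simple sentence -> synthetic complex source, for the simplifier.
    pairs_for_simplifier = [(complicate(s), s) for s in simple_sents]
    # Real complex sentence -> synthetic simple source, for the complicator.
    pairs_for_complicator = [(simplify(c), c) for c in complex_sents]
    return pairs_for_simplifier, pairs_for_complicator


def identity(tokens):
    """Stand-in model that returns its input unchanged."""
    return list(tokens)


simple_corpus = [["the", "cat", "sat"]]
complex_corpus = [["the", "feline", "was", "seated", "upon", "the", "mat"]]
to_simplifier, to_complicator = back_translation_round(
    simple_corpus, complex_corpus, identity, identity
)
```

In a full training loop, each round would retrain both models on the pairs it produced and then regenerate the synthetic sources with the improved models; that training step is omitted here because it depends on the seq2seq architecture, which the abstract does not specify.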