Application Research of Computers (《计算机应用研究》)

Research on direct speech generation from lip video (由嘴唇视频直接生成语音的研究)

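Author 贾振堂 (Jia Zhentang)
Affiliation School of Electronic & Information Engineering, Shanghai University of Electric Power, Shanghai 200090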
Article ID 1001-3695(2020)06-060-1890-05
DOI 10.19734/j.issn.1001-3695.2018.11.0912
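Abstract To make lip-to-speech conversion more convenient, this paper proposes a method that generates speech directly from lip video and studies the related problems. First, lip motion features and the corresponding LPC10 speech coding parameters are computed synchronously from the video file, and an LSTM neural network is then trained end to end. The trained network maps lip motion features to speech coding parameters, and a speech synthesis technique converts these parameters into playable speech samples. Because the method skips the intermediate text stage, it is called direct generation; training samples are easy to obtain and need no manual annotation, and the pronunciation ambiguity of methods that reconstruct speech from text is avoided. Test results show that, in applications with a limited vocabulary, the method can reconstruct reasonably clear and intelligible speech from lip video.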
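The synchronous pairing of lip features with speech coding parameters can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the video frame rate nor the alignment rule, so the 25 fps rate in the usage note and the "video frame covering the centre of each speech frame" rule are hypothetical, as is the function name `align_lip_to_lpc`. The 22.5 ms frame length is the standard LPC-10 analysis frame (180 samples at 8 kHz).

```python
# Sketch only: pair each LPC-10 analysis frame with one video frame by time.
# Assumption (not from the paper): a speech frame is matched to the video
# frame whose display interval contains the speech frame's centre.

LPC10_FRAME_SEC = 0.0225  # LPC-10 analysis frame: 180 samples at 8 kHz


def align_lip_to_lpc(num_video_frames, video_fps, num_lpc_frames):
    """Return (video_index, lpc_index) pairs usable as training samples."""
    pairs = []
    for j in range(num_lpc_frames):
        centre = (j + 0.5) * LPC10_FRAME_SEC  # centre time of speech frame j
        i = int(centre * video_fps)           # video frame shown at that time
        if i >= num_video_frames:
            break                             # speech runs past the video
        pairs.append((i, j))
    return pairs
```

For example, `align_lip_to_lpc(25, 25.0, 44)` pairs one second of 25 fps video with 44 speech frames, so one or two consecutive speech frames share each video frame. A real pipeline might instead interpolate the lip features to the speech frame rate.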
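Keywords lip motion features; speech analysis and synthesis; LPC10; direct generation; LSTM
Funding Young Scientists Fund of the National Natural Science Foundation of China (61401269)
Article URL http://www.arocmag.com/article/01-2020-06-060.html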
English title Research on direct speech generation from lip video
Author (English name) Jia Zhentang
Affiliation (English name) School of Electronic & Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China
English abstract To realize lip-to-speech conversion more conveniently, this paper proposes a method to generate speech directly from lip video and studies the related problems. First, it computes lip motion features and the corresponding LPC10 speech coding parameters from the video file synchronously, and then carries out end-to-end training using an LSTM artificial neural network. The trained network model maps the lip motion features to LPC10 coded parameters, and a speech synthesis algorithm finally converts the coded parameters into speech samples. The advantages of this method are that training samples are easy to obtain without time-consuming manual annotation, and that it avoids the pronunciation ambiguity of methods that reconstruct speech from text. The test results show that, in applications with limited vocabularies, the proposed method can reconstruct clear and understandable speech from lip video.
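To give a sense of what linear-prediction speech parameters are, the sketch below runs a textbook autocorrelation-method LPC analysis with the Levinson-Durbin recursion. It is not the LPC10 codec used in the paper, which also handles pitch, voicing, gain, and quantization and may use a different analysis method. The function names `autocorr` and `levinson_durbin` are illustrative.

```python
# Sketch only: textbook LPC analysis, not the LPC10 codec itself.

def autocorr(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame (list of floats)."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, k, err): a[0] == 1 and e[t] = sum_i a[i] * x[t - i] is the
    prediction error; k holds the reflection coefficients; err is the
    final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame: stop early
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        err *= 1.0 - km * km
        k.append(km)
    return a, k, err
```

With `order=10`, the ten coefficients per frame play the role of the spectral part of the network's regression target. For the autocorrelation sequence `[1.0, 0.5, 0.25]` and `order=2`, the recursion yields `a[1] = -0.5` and `a[2] = 0`, as expected for a first-order autoregressive signal.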
English keywords lip motion features; speech analysis and synthesis; LPC10; direct generation; LSTM
Received 2018/11/19
Revised 2019/2/10
Pages 1890-1894
CLC number TP391.4
Document code A