Application Research of Computers (《计算机应用研究》)

A survey of research progress on deep learning in speech recognition (深度学习在语音识别中的研究进展综述)

Overview of speech recognition based on deep learning

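Authors: Hou Yimin (侯一民), Zhou Huiqiong (周慧琼), Wang Zhengyi (王政一)
Affiliations: 1. School of Automation Engineering, Northeast Dianli University (东北电力大学 自动化工程学院), Jilin, Jilin 132012; 2. China Aviation Planning & Design Institute Co., Ltd. (中国航空规划设计研究总院有限公司), Beijing 100120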
Article ID: 1001-3695(2017)08-2241-06
DOI: 10.3969/j.issn.1001-3695.2017.08.001
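Abstract: In today's era of big data, many traditional machine learning algorithms are no longer suitable for processing large volumes of unlabeled raw speech data. Meanwhile, deep learning models, with their strong capacity for modeling massive data, can process unlabeled data directly and have become a research hotspot in speech recognition. This paper analyzes and summarizes several representative deep learning models, describes their application to speech feature extraction and acoustic modeling in speech recognition, and concludes with the problems currently faced and directions for future development.
Keywords: machine learning; deep learning; speech data; speech recognition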
Funding: National Natural Science Foundation of China (61403075)
Jilin Province Science and Technology Development Program (20150414051GH)
Article URL: http://www.arocmag.com/article/01-2017-08-001.html
English title: Overview of speech recognition based on deep learning
Author names (English): Hou Yimin, Zhou Huiqiong, Wang Zhengyi
Affiliations (English): 1. School of Automation Engineering, Northeast Dianli University, Jilin, Jilin 132012, China; 2. China Aviation Planning & Design Institute Co., Ltd., Beijing 100120, China
English abstract: In the era of big data, many traditional machine learning methods for handling unlabeled raw voice data have become less applicable. At the same time, deep learning models can process unlabeled data directly because of their powerful capacity for modeling massive data, and have become a hot research topic in the field of speech recognition. To begin with, this paper analyzes and summarizes state-of-the-art deep learning models. It then discusses their applications to speech feature extraction and acoustic modeling in speech recognition. Finally, it summarizes the problems faced and the directions for development.
English keywords: machine learning; deep learning; voice data; speech recognition
Received: 2016/9/19
Revised: 2016/11/22
Pages: 2241-2246
CLC number: TP181
Document code: A