TECHNOLOGY FOR INCREASING THE ROBUSTNESS OF THE ACOUSTIC MODEL IN THE PROBLEM OF SPEECH RECOGNITION

  • Y.S. Pikaliov, State Institution «Institute of Artificial Intelligence Problems»
  • T.V. Yermolenko, Donetsk National University
Keywords: Automatic speech recognition, hidden Markov models, Gaussian mixture models, discriminative learning, informative acoustic features, deep neural networks

Abstract

This paper proposes a technology for increasing the robustness of an acoustic model in the speech recognition problem using deep machine learning. The technology is based on informative acoustic features extracted with hierarchical neural network models, as well as on hybrid acoustic models trained by deep learning with a discriminative approach. The conditions in which automatic speech recognition systems operate almost never coincide with the conditions under which the acoustic models were trained; as a consequence, the constructed models are not optimal for the operating conditions. The speech signal is affected by the following factors: additive noise; the speaker's vocal tract and manner of speaking; reverberation; the amplitude-frequency characteristic of the microphone and transmission channel; Nyquist-filter signal conversion and quantization noise. The proposed technology is aimed at increasing the model's robustness to these factors. One way to increase robustness is to extract informative acoustic features from the recordings using neural networks. Mel-cepstral coefficients, their first and second derivatives, and perceptual linear prediction coefficients are used as acoustic features. An informative feature extraction scheme is proposed that consists of three connected bottleneck neural network blocks (with contexts of 2, 5, and 10 frames) and a ResBlock based on the ResNet-50 architecture. The additional ResBlock transformation makes it possible to identify the patterns that strongly influence the model, i.e., the key features. The presented neural network architecture for phoneme classification consists of time-delay neural network layers and a bidirectional long short-term memory network with an attention mechanism. The input features for this network are filter-bank features transformed by linear discriminant analysis, together with the features extracted by the neural network. A distinctive property of this approach is that high model accuracy (good class separability) is achieved without the voluminous training set of audio data required by end-to-end systems. In addition, the model is invariant to changes in the input features. A series of numerical experiments was conducted for the task of Russian speech recognition using the VoxForge and SpokenCorpora speech corpora. The experimental results demonstrate high recognition accuracy for Russian speech.
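
To make the feature-extraction scheme concrete, the following is a minimal PyTorch sketch of a hierarchical bottleneck extractor with frame contexts of 2, 5, and 10 and a residual transformation. All layer sizes, module names, and the 39-dimensional MFCC(+Δ,ΔΔ) input are illustrative assumptions, not the authors' exact configuration.

    # Hedged sketch: hierarchical bottleneck feature extraction with a
    # residual block; sizes and names are assumptions for illustration.
    import torch
    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """One stage: splices +/-context frames, then compresses them
        through a narrow ('bottleneck') layer whose activations are
        used as features for the next stage."""
        def __init__(self, feat_dim, context, hidden=512, bottleneck=64):
            super().__init__()
            self.context = context
            in_dim = feat_dim * (2 * context + 1)   # spliced frame window
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, bottleneck), nn.ReLU(),  # narrow layer
            )

        def forward(self, x):                       # x: (time, feat_dim)
            # replicate edge frames so every frame has a full context
            left = x[:1].repeat(self.context, 1)
            right = x[-1:].repeat(self.context, 1)
            x = torch.cat([left, x, right], dim=0)
            windows = x.unfold(0, 2 * self.context + 1, 1)  # (time, feat, win)
            return self.net(windows.reshape(windows.size(0), -1))

    class ResBlock(nn.Module):
        """Residual transform in the spirit of ResNet-50's identity
        shortcut, re-weighting bottleneck features so that salient
        patterns (the key features) are emphasized."""
        def __init__(self, dim, hidden=256):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, x):
            return torch.relu(x + self.body(x))     # identity shortcut

    class HierarchicalExtractor(nn.Module):
        """Three bottleneck blocks with contexts 2, 5, 10, then ResBlock."""
        def __init__(self, feat_dim=39):            # e.g. 13 MFCC + Δ + ΔΔ
            super().__init__()
            self.b1 = BottleneckBlock(feat_dim, context=2)
            self.b2 = BottleneckBlock(64, context=5)
            self.b3 = BottleneckBlock(64, context=10)
            self.res = ResBlock(64)

        def forward(self, x):
            return self.res(self.b3(self.b2(self.b1(x))))

    feats = torch.randn(200, 39)                    # 200 frames of MFCC(+Δ,ΔΔ)
    print(HierarchicalExtractor()(feats).shape)     # torch.Size([200, 64])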
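
Similarly, below is a hedged sketch of the described phoneme classifier: TDNN layers (realized here as dilated 1-D convolutions over time), a bidirectional LSTM, and an attention mechanism over frames. The input dimensionality (LDA-transformed filter-bank features concatenated with the 64 bottleneck features above) and the number of phoneme classes are assumptions for illustration.

    # Hedged sketch: TDNN + BLSTM + attention phoneme classifier.
    import torch
    import torch.nn as nn

    class TDNNBLSTMAttention(nn.Module):
        def __init__(self, in_dim=104, n_phones=48):  # e.g. 40 LDA fbank + 64 BN
            super().__init__()
            # TDNN layers: 1-D convolutions with growing dilation (context)
            self.tdnn = nn.Sequential(
                nn.Conv1d(in_dim, 256, kernel_size=3, dilation=1, padding=1),
                nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=3, dilation=3, padding=3),
                nn.ReLU(),
            )
            self.blstm = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
            self.attn = nn.Linear(256, 1)       # per-frame attention score
            self.out = nn.Linear(256, n_phones)

        def forward(self, x):                   # x: (batch, time, in_dim)
            h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)
            h, _ = self.blstm(h)                # (batch, time, 256)
            w = torch.softmax(self.attn(h), dim=1)      # weights over time
            context = (w * h).sum(dim=1, keepdim=True)  # utterance summary
            # condition every frame on the attended summary, then classify
            return self.out(h + context)        # (batch, time, n_phones)

    x = torch.randn(4, 300, 104)                # 4 utterances, 300 frames each
    print(TDNNBLSTMAttention()(x).shape)        # torch.Size([4, 300, 48])

Frame-level phoneme posteriors from such a hybrid model would then feed a conventional HMM decoder, which is what distinguishes this approach from end-to-end systems.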

References

1. Amodei D., Ananthanarayanan S., Anubhai R. Deep speech 2: End-to-end speech recognition in English and Mandarin, International conference on machine learning, 2016, pp. 173-182.
2. Markovnikov N.M., Kipyatkova I.S. Issledovanie metodov postroeniya modeley koder-dekoder dlya raspoznavaniya russkoy rechi [Research of methods for constructing encoder-decoder models for Russian speech recognition], Informatsionno-upravlyayushchie sistemy [Information and control systems], 2019, No. 4, pp. 45-53.
3. Tampel' I.B., Karpov A.A. Avtomaticheskoe raspoznavanie rechi: ucheb. posobie [Automatic speech recognition: tutorial]. Saint Petersburg: Universitet ITMO, 2016.
4. Yu D., Seltzer M., Li J. et al. Feature Learning in Deep Neural Networks – studies on Speech Recognition Tasks, Proc. ICLR-2013. Available at: https://arxiv.org/abs/1301.3605 (accessed 15 January 2020).
5. Hermansky H., Ellis D.P.W., Sharma S. Tandem connectionist feature extraction for conventional HMM systems, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100). IEEE, 2000, Vol. 3, pp. 1635-1638.
6. Grézl F. et al. Probabilistic and bottle-neck features for LVCSR of meetings, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07. IEEE, 2007, Vol. 4, pp. 757-760.
7. Sainath T., Kingsbury B., Ramabhadran B. Auto-encoder bottleneck features using deep belief networks, 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2012, pp. 4153-4156.
8. Gehring J. et al. Extracting deep bottleneck features using stacked auto-encoders, 2013 IEEE inter-national conference on acoustics, speech and signal processing. IEEE, 2013, pp. 3377-3381.
9. Saon G. et al. Speaker adaptation of neural network acoustic models using i-vectors, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 55-59.
10. Zhang Y., Chuangsuwanich E., Glass J. Extracting deep neural network bottleneck features using low-rank matrix factorization, 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 185-189.
11. Povey D. et al. Subspace Gaussian mixture models for speech recognition, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010, pp. 4330-4333.
12. Medennikov I.P. Dvukhetapnyy algoritm initsializatsii obucheniya akusticheskikh modeley na osnove glubokikh neyronnykh setey [Two-stage algorithm for initialization of acoustic model training based on deep neural networks], Nauchno-tekhnicheskiy vestnik informatsionnykh tekhnologiy, mekhaniki i optiki [Scientific and technical Bulletin of information technologies, mechanics and optics], 2016, Vol. 16, No. 2, pp. 379-381.
13. Xue J., Li J., Gong Y. Restructuring of deep neural network acoustic models with singular value decomposition, Interspeech, 2013, pp. 2365-2369.
14. He K. et al. Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
15. Xu K. et al. Show, attend and tell: Neural image caption generation with visual attention, International conference on machine learning, 2015, pp. 2048-2057.
16. Sawai H. TDNN-LR continuous speech recognition system using adaptive incremental TDNN training, ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1991, pp. 53-56.
17. Kipyatkova I., Karpov A. DNN-based acoustic modeling for Russian speech recognition using Kaldi, International Conference on Speech and Computer. Springer, Cham, 2016, pp. 246-253.
18. Graves A., Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural networks, 2005, Vol. 18, No. 5-6, pp. 602-610.
19. Shmyrev N.V. Svobodnye rechevye bazy dannykh voxforge.org [Free speech databases voxforge.org], Komp'yuternaya lingvistika i intellektual'nye tekhnologii: Po materialam ezhegodnoy Mezhdunarodnoy konferentsii «Dialog» (Bekasovo, 4–8 iyunya 2008 g.) [Computational linguistics and intelligent technologies: proceedings of the annual international conference "Dialogue" (Bekasovo, June 4-8, 2008)], Issue 7 (14). Moscow: RGGU, 2008, pp. 585-517.
20. Fedorova O.V. Rasskazy o snovideniyakh: Korpusnoe issledovanie ustnogo russkogo diskursa [Stories about dreams: a corpus study of spoken Russian discourse], ed. by Kibrik A.A. and Podlesskaya V.I. Moscow: Yazyki slavyanskikh kul'tur, 2009, 736 p. Russkiy yazyk v nauchnom osveshchenii [Russian language in scientific coverage], 2010, No. 2, pp. 305-312.
Published
2020-05-02
Section
SECTION I. MODELS, METHODS AND TECHNOLOGIES OF INTELLIGENT MANAGEMENT