RECOGNITION OF EMOTIONAL STATES IN RUSSIAN SPEECH USING MFCC FEATURES AND A BLSTM MODEL ON THE DUSHA DATASET
Abstract
This paper investigates automatic emotion recognition from speech signals using contemporary deep learning techniques. The relevance of the study arises from the growing demand for intelligent systems capable of assessing human emotional states, with potential applications in medicine, psychology, information systems, and personnel management. The primary objective is to develop an efficient neural network model for emotion recognition in Russian speech that outperforms existing state-of-the-art architectures. The experiments were conducted on the open-source Russian-language dataset Dusha, which contains 300,000 audio recordings. A total of 183,055 samples from the Crowd subset, annotated with four emotional categories (joy, sadness, anger, and a neutral state), were used for training. Mel-frequency cepstral coefficients (MFCCs) were extracted as input features (20 coefficients computed over a 20 ms window with a 10 ms overlap) and then normalized. The baseline architecture was a bidirectional long short-term memory network (BLSTM), which models both past and future temporal dependencies. To improve generalization and mitigate overfitting, the model was extended with convolutional (CNN) layers, MaxPooling layers, and regularization mechanisms, namely Dropout and Batch Normalization. The resulting hybrid CNN-BLSTM architecture achieved 62.9% accuracy on the test set, exceeding the baseline performance (56.2%) by 6.7 percentage points. The results were further compared with state-of-the-art architectures such as MobileNetV2, HuBERT, and WavLM. The analysis outlines future directions for improving model performance: structural optimization of the network, class balancing, and the incorporation of additional acoustic features.
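To make the feature-extraction step concrete, the following minimal sketch computes 20 MFCCs with a 20 ms analysis window and a 10 ms overlap and then normalizes them. It assumes the librosa library, a 16 kHz sampling rate, and per-coefficient z-score normalization; the sampling rate and the normalization scheme are illustrative assumptions, as the abstract specifies only the coefficient count and window parameters.

```python
# A minimal sketch of the MFCC extraction step, assuming librosa,
# a 16 kHz sampling rate, and per-coefficient z-score normalization.
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, n_mfcc=20):
    """Load an audio file and return normalized MFCCs, shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    win = int(0.020 * sr)   # 20 ms analysis window
    hop = int(0.010 * sr)   # 10 ms step, i.e. a 10 ms overlap between windows
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, win_length=win, hop_length=hop)
    mfcc = mfcc.T                                      # time-major layout
    mfcc = (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)  # z-score
    return mfcc
```

A call such as extract_mfcc("sample.wav") (a hypothetical file name) yields a frame-by-coefficient matrix suitable as input to the recurrent model described above.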
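Similarly, the sketch below outlines one plausible form of the hybrid CNN-BLSTM classifier in Keras. The specific layer widths, kernel size, pooling factor, and dropout rates are assumptions made for illustration: the abstract names only the layer types (Conv, MaxPooling, Dropout, Batch Normalization, BLSTM) and the four output classes.

```python
# A hedged sketch of a CNN-BLSTM classifier of the kind described above;
# layer sizes and dropout rates are illustrative, not the paper's values.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_blstm(n_frames=300, n_mfcc=20, n_classes=4):
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_mfcc)),
        # 1-D convolution over time extracts local spectral patterns
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),        # halve temporal resolution
        layers.Dropout(0.3),
        # bidirectional LSTM models past and future temporal dependencies
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),  # joy/sadness/anger/neutral
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The sketch assumes utterances padded or truncated to a fixed number of frames; variable-length batching with masking would be a natural alternative.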