Deep Learning Based Human Emotional State Recognition in a Video

A. A. Moskvin; A. G. Shishkin

doi:10.32732/jmo.2020.12.1.51



Keywords: Artificial neural networks; Deep learning; Emotion recognition; Video; Speech signal.

Abstract

Human emotions play a significant role in everyday life. Automatic emotion recognition has many applications in medicine, e-learning, monitoring, marketing, etc. This paper proposes a method and a neural network architecture for real-time human emotion recognition from audio-visual data. To classify one of seven emotions, deep neural networks, namely convolutional and recurrent neural networks, are used. Visual information is represented by a sequence of 16 frames of 96 × 96 pixels, and audio information by 140 features for each of a sequence of 37 temporal windows. An autoencoder is used to reduce the number of audio features. Combining audio information with visual information is shown to increase recognition accuracy by up to 12%. The developed system is undemanding of computing resources and flexible: its parameters can be adjusted, the number of emotion classes can be reduced or increased, and information from other external devices can easily be added, accumulated, and used to further improve classification accuracy.
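To make the input dimensions stated in the abstract concrete, the following minimal Python sketch records them and computes how many values one sample contains. Only the numbers 16, 96 × 96, 37, 140, and 7 come from the abstract. The class name `InputSpec`, the single-channel (grayscale) default, and the autoencoder code size of 32 are assumptions made purely for illustration; the abstract does not state them, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class InputSpec:
    """Per-sample input dimensions as stated in the abstract."""
    num_frames: int = 16            # frames per video sample
    frame_height: int = 96          # pixels
    frame_width: int = 96           # pixels
    num_windows: int = 37           # temporal windows per audio sample
    features_per_window: int = 140  # audio features before reduction
    num_emotions: int = 7           # output classes

    def video_shape(self, channels: int = 1) -> tuple:
        # ASSUMPTION: the channel count is not given in the abstract;
        # channels=1 (grayscale) is used here only as an example.
        return (self.num_frames, self.frame_height, self.frame_width, channels)

    def audio_shape(self) -> tuple:
        return (self.num_windows, self.features_per_window)

    def compressed_audio_shape(self, code_size: int) -> tuple:
        # ASSUMPTION: code_size is a hypothetical autoencoder bottleneck
        # width; the abstract only says the feature count is reduced.
        if not 0 < code_size < self.features_per_window:
            raise ValueError("code_size must be between 1 and features_per_window - 1")
        return (self.num_windows, code_size)


spec = InputSpec()
video_values = prod(spec.video_shape())                    # 16 * 96 * 96 * 1
audio_values = prod(spec.audio_shape())                    # 37 * 140
compressed_values = prod(spec.compressed_audio_shape(32))  # 37 * 32 (hypothetical)

print(f"video values per sample: {video_values}")
print(f"audio values per sample: {audio_values}")
print(f"audio values after hypothetical reduction to 32 features: {compressed_values}")
```

Under these assumptions the video stream carries far more raw values per sample than the audio stream, which is one plausible reason to compress the audio features separately before combining the two modalities.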