Real-Time Emotion Recognition with CNN and LSTM


Abstract

I present two coupled real-time emotion recognition pipelines: (1) a spatial-attention-augmented convolutional neural network (CNN) for facial emotion recognition, and (2) a temporal-attention-supported bidirectional long short-term memory (Bi-LSTM) network for speech emotion recognition using Mel-frequency cepstral coefficients (MFCCs). On the benchmark datasets FER-2013 and RAVDESS, I apply state-of-the-art data augmentation techniques (MixUp, CutMix), attention mechanisms, and noise-robust preprocessing. The facial pipeline achieves 70%–74% accuracy on FER-2013 and remains robust under varying illumination and occlusion. The speech pipeline achieves 82%–85% accuracy on RAVDESS, aided by additional perturbations such as vocal tract length perturbation and speech-enhancement filtering. I also report precision, recall, and class-wise F1-scores, analyze confusion matrices, and compare against vision-transformer and hybrid CNN-Transformer baselines. A comprehensive discussion covers class-imbalance remedies, ethical considerations in emotion AI, multimodal fusion techniques, and lifelong-learning paradigms. I conclude with directions toward culturally adaptive, lightweight edge deployment and real-world evaluation protocols.
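The abstract cites MixUp among its augmentation techniques but does not show an implementation. A minimal NumPy sketch of standard MixUp is given below; the `mixup` function, the 48×48 image size, and the two-class labels are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Standard MixUp: blend two samples and their one-hot labels
    with a single weight drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2       # pixel-wise image blend
    y = lam * y1 + (1.0 - lam) * y2       # soft label blend
    return x, y, lam

# Illustrative example: mix two synthetic 48x48 grayscale "face crops"
# (FER-2013 uses 48x48 images) with one-hot labels for two classes.
img_a, img_b = np.zeros((48, 48)), np.ones((48, 48))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup(img_a, lab_a, img_b, lab_b, alpha=0.2)
```

Because the same weight blends both the inputs and the labels, the soft label remains a valid probability distribution, which is what makes MixUp a regularizer rather than a label-noise source.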
