Multi-Corpus Emotion Recognition Method based on Cross-Modal Gated Attention Fusion

Authors are hidden for peer review ¹
¹ Affiliation is hidden for peer review
INTERSPEECH 2024 (submitted)

Abstract

Automatic emotion recognition is critical to human-computer interaction. However, current methods tend to overfit to a single corpus, which limits their applicability and reduces their real-world effectiveness when they are faced with new corpora. We propose the first multi-corpus emotion recognition method with high generalizability, evaluated using the leave-one-corpus-out protocol. The method uses three modality-specific encoders (for audio, video, and text) and a decoder that fuses the features of all three modalities with a gated attention mechanism. The method is evaluated on four multimodal corpora: CMU-MOSEI, MELD, IEMOCAP, and AFEW. It achieves state-of-the-art results on these corpora and establishes the first baselines for multi-corpus studies. Our results also demonstrate that models trained on MELD exhibit the best generalizability to new data.

Figure: Pipeline of the proposed multimodal ER method.
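
As an illustration of the fusion step in the pipeline above, the sketch below shows one possible PyTorch implementation of cross-modal gated attention over pre-extracted audio, video, and text features. The layer sizes, the gating formulation, and the number of classes are illustrative assumptions, not the exact configuration of the proposed decoder.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Minimal sketch of cross-modal gated attention fusion (illustrative,
    not the exact architecture of the proposed decoder)."""

    def __init__(self, dim_a=256, dim_v=256, dim_t=256, dim_h=256, n_classes=7):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.proj = nn.ModuleDict({
            "a": nn.Linear(dim_a, dim_h),
            "v": nn.Linear(dim_v, dim_h),
            "t": nn.Linear(dim_t, dim_h),
        })
        # One sigmoid gate per modality, conditioned on all modalities jointly.
        self.gate = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(3 * dim_h, dim_h), nn.Sigmoid())
            for m in ("a", "v", "t")
        })
        self.classifier = nn.Linear(3 * dim_h, n_classes)

    def forward(self, feat_a, feat_v, feat_t):
        h = {m: torch.tanh(self.proj[m](x))
             for m, x in zip("avt", (feat_a, feat_v, feat_t))}
        joint = torch.cat([h["a"], h["v"], h["t"]], dim=-1)
        # Each gate weights how much its modality contributes to the fused vector.
        fused = torch.cat([self.gate[m](joint) * h[m] for m in "avt"], dim=-1)
        return self.classifier(fused)

# Example: batch of 4 utterances with pre-extracted 256-dim features per modality.
model = GatedAttentionFusion()
logits = model(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```
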
Train, Validation, and Test subsets, number of emotion classes, class imbalance, and performance measures of the research corpora. Acc refers to Accuracy, wAcc to Weighted Accuracy, F1 to F1-score, and UAR to Unweighted Average Recall.

Corpus | Train | Val. | Test | Emotion classes | Class imbalance (mean / STD samples per class) | Evaluation measures
IEMOCAP | 5229 | 581 | 1623 | 6 | 959.7 / 344.4 | Acc / F1 / UAR
MOSEI | 16216 | 1835 | 4625 | 6 | 3734.3 / 2441.1 | Acc / wAcc / F1 / UAR
MELD | 9989 | 1109 | 2610 | 7 | 1426.9 / 1427.0 | Acc / F1 / UAR
AFEW | 773 | 383 | – | 7 | 110.4 / 31.1 | Acc / F1 / UAR
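
For reference, the snippet below shows how the evaluation measures used throughout the tables are commonly computed with scikit-learn. The toy labels are purely illustrative, and the weighted accuracy reported for MOSEI follows a corpus-specific definition that is not reproduced here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = np.array([0, 1, 2, 2, 1, 0])   # toy ground-truth emotion labels
y_pred = np.array([0, 1, 1, 2, 1, 2])   # toy predictions

acc = accuracy_score(y_true, y_pred)                  # Acc
uar = recall_score(y_true, y_pred, average="macro")   # UAR: mean of per-class recalls
mf1 = f1_score(y_true, y_pred, average="macro")       # macro F1 (mF1)
wf1 = f1_score(y_true, y_pred, average="weighted")    # weighted F1 (wF1)
```
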
Single-corpus results (Recall, %) of single-corpus models trained with the weighted loss. A, V, and T refer to the audio, video, and text modalities, respectively. ↓ and ↑ mark the most and the least informative modality, respectively.
IEMOCAP
Modality | Neutral | Happy | Sad | Angry | Excited | Frustrated | UAR
A+V+T | 76.6 | 51.7 | 78.0 | 78.8 | 75.3 | 69.6 | 71.6
w/o A | 69.5 | 42.7 | 74.3 | 60.0 ↓ | 72.6 | 68.5 | 64.6
w/o V | 57.0 ↓ | 58.0 ↑ | 69.0 | 72.4 | 51.5 ↓ | 77.4 ↑ | 64.2 ↓
w/o T | 82.8 ↑ | 14.7 ↓ | 66.1 ↓ | 77.6 | 72.9 | 3.4 ↓ | 52.9 ↓
MOSEI
Modality | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR
A+V+T | 56.9 | 68.9 | 48.4 | 54.0 | 77.4 | 57.3 | 60.5
w/o A | 69.7 ↑ | 68.1 | 44.3 | 52.2 | 72.8 | 52.7 ↓ | 60.0
w/o V | 55.7 | 72.5 ↑ | 50.7 | 51.7 | 71.6 ↓ | 53.2 | 59.2
w/o T | 45.8 ↓ | 56.3 ↓ | 45.0 | 49.9 ↓ | 72.2 | 62.9 ↑ | 55.3 ↓
MELD
Modality | Neutral | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR
A+V+T | 68.7 | 65.9 | 53.8 | 59.7 | 69.0 | 44.0 | 42.6 | 57.7
w/o A | 64.1 | 67.4 ↑ | 52.4 | 48.7 | 71.5 ↑ | 40.0 | 30.9 ↓ | 53.6
w/o V | 58.2 | 65.9 | 59.1 ↑ | 38.6 | 70.5 ↑ | 40.0 | 35.3 | 52.5
w/o T | 37.9 ↓ | 2.7 ↓ | 10.6 ↓ | 84.9 ↑ | 1.1 ↓ | 20.0 ↓ | 41.2 | 28.3 ↓
AFEW
Modality | Neutral | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR
A+V+T | 84.1 | 88.7 | 74.2 | 70.3 | 48.9 | 60.9 | 53.7 | 68.7
w/o A | 76.2 | 90.3 ↑ | 53.2 | 43.8 ↓ | 51.1 ↑ | 54.3 | 46.3 | 59.3
w/o V | 49.2 ↓ | 40.3 ↓ | 35.5 ↓ | 54.7 | 26.7 ↓ | 50.0 | 29.3 ↓ | 40.8 ↓
w/o T | 60.3 | 74.4 | 75.8 ↑ | 75.0 | 42.2 | 32.6 ↓ | 34.1 | 56.8
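
A common way to implement the weighted loss mentioned in the captions is a class-weighted cross-entropy. The inverse-frequency weighting and the per-class counts below are illustrative assumptions, not necessarily the exact scheme used in the experiments.

```python
import torch
import torch.nn as nn

# Illustrative per-class training counts for a 7-class corpus such as MELD.
class_counts = torch.tensor([4710., 1743., 1205., 1109., 683., 271., 268.])

# Inverse-frequency weights: rare classes receive larger weights.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 7)            # model outputs for a batch of 8 samples
targets = torch.randint(0, 7, (8,))   # ground-truth emotion indices
loss = criterion(logits, targets)
```
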
Multi-corpus experimental results (UAR, %) of single-corpus trained models. Δ denotes the difference between the diagonal value and the average value of the same column.

Training corpus | Test: IEMOCAP | Test: MOSEI | Test: MELD | Test: AFEW | Average
IEMOCAP | 71.3 | 15.5 | 22.9 | 34.1 | 24.7
MOSEI | 38.2 | 38.5 | 34.3 | 44.3 | 38.9
MELD | 42.3 | 30.4 | 62.0 | 42.1 | 38.3
AFEW | 47.3 | 28.4 | 33.3 | 79.3 | 37.6
Average | 42.6 (Δ 28.7) | 24.8 (Δ 13.7) | 30.2 (Δ 31.8) | 40.2 (Δ 39.1) | –
Multi-corpus results (UAR, %) of multi-corpus trained models with different encoders. W/o a corpus (e.g., w/o MELD) denotes performance on the Test subsets of all training corpora excluding that corpus; LOCO denotes the leave-one-corpus-out setup, i.e., performance on the Test subset of the corpus excluded from training. ↓ marks the least generalization ability.

Encoder | W/o IEMOCAP | LOCO IEMOCAP | W/o MOSEI | LOCO MOSEI | W/o MELD | LOCO MELD | W/o AFEW | LOCO AFEW
IEMOCAP | – | – | 59.7 | 31.7 | 49.6 | 32.8 | 50.7 | 42.9 ↓
MOSEI | 43.0 ↓ | 44.5 | – | – | 43.2 ↓ | 33.0 | 49.4 | 48.6
MELD | 50.1 | 44.4 | 59.9 | 31.6 | – | – | 53.1 | 49.0
AFEW | 48.2 | 44.1 | 59.8 | 28.5 ↓ | 52.3 | 32.2 | – | –
Average | 47.1 | 44.3 | 59.8 | 30.6 | 48.4 | 32.7 | 51.1 | 46.8
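
The LOCO columns follow the protocol sketched below: for each corpus, a model is trained on the remaining three corpora and tested on the held-out one. The callables `load_corpus`, `train_model`, and `evaluate_uar` are hypothetical stand-ins for the actual data-loading, training, and evaluation code.

```python
CORPORA = ["IEMOCAP", "MOSEI", "MELD", "AFEW"]

def leave_one_corpus_out(load_corpus, train_model, evaluate_uar):
    """Driver for the leave-one-corpus-out (LOCO) protocol; the three callables
    are hypothetical stand-ins for the pipeline's data loading, training, and
    UAR evaluation."""
    results = {}
    for held_out in CORPORA:
        # Train on every corpus except the held-out one.
        train_sets = [load_corpus(c, split="train") for c in CORPORA if c != held_out]
        model = train_model(train_sets)
        # Evaluate on the corpus the model has never seen.
        results[held_out] = evaluate_uar(model, load_corpus(held_out, split="test"))
    return results
```
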
Comparison with existing ER methods. wF1 refers to weighted F1, mF1 to macro F1, WL to models trained with the weighted loss
MOSEI
Method | Year | Modality | wAcc | wF1 | Acc | F1
Le et al. | 2023 | A+V+T | 67.8 | – | – | 47.6
MAGDRA | 2024 | A+V+T | – | – | 48.8 | 56.3
TAILOR | 2022 | A+V+T | – | – | 48.8 | 56.9
Ours (w/o WL) | 2024 | A+V+T | 61.4 | 80.4 | 49.7 | 54.0
Ours (WL) | 2024 | A+V+T | 69.3 | 77.7 | 46.2 | 53.4
IEMOCAP
Method | Year | Modality | Acc | wF1 | mF1 | UAR
TelME | 2024 | A+V+T | – | 70.5 | 68.6 | –
CORECT | 2023 | A+V+T | 69.9 | 70.0 | 70.9 | –
M3Net | 2023 | A+V+T | 72.5 | 72.5 | 71.5 | –
Ours (w/o WL) | 2024 | A+V+T | 71.9 | 71.7 | 70.5 | 70.2
Ours (WL) | 2024 | A+V+T | 72.9 | 72.8 | 72.0 | 71.6
MELD
Method | Year | Modality | Acc | wF1 | mF1 | UAR
SDT | 2023 | A+V+T | 66.6 | 67.5 | 49.8 | 48.0
M3Net | 2023 | A+V+T | 68.3 | 67.1 | 51.0 | –
TelME | 2024 | A+V+T | – | 67.4 | 51.4 | 50.0
Ours (w/o WL) | 2024 | A+V+T | 68.8 | 67.7 | 50.8 | 48.8
Ours (WL) | 2024 | A+V+T | 64.8 | 65.7 | 54.5 | 57.7
AFEW
Method | Year | Modality | Acc | wF1 | mF1 | UAR
Nguyen et al. | 2019 | A+V | 62.3 | – | – | –
Zhou et al. | 2019 | A+V | 65.5 | – | – | –
Abdrahimov et al. | 2022 | A+V | 67.8 | 62.0 | – | –
Ours (w/o WL) | 2024 | A+V+T | 70.2 | 69.6 | 67.2 | 67.3
Ours (WL) | 2024 | A+V+T | 70.8 | 70.5 | 68.7 | 68.7

Conclusion

In this paper, we proposed a multi-corpus ER method designed to exhibit high generalizability to new, unseen data. The method incorporates three encoders to extract features from the audio, video, and text modalities. To capture the context of the audio and video signals, it extracts statistical information from both modalities. A gated attention mechanism then efficiently fuses the information from all three modalities. The method was evaluated on IEMOCAP, MOSEI, MELD, and AFEW in both single- and multi-corpus setups, and new baselines were obtained on all four corpora.
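
As a small illustration of the statistics-based context modelling mentioned above, the sketch below pools a variable-length sequence of frame-level features into a fixed-size vector using its mean and standard deviation. The choice of these two statistics and the feature dimensions are assumptions for illustration, not necessarily the exact functionals used by the method.

```python
import torch

def stat_pooling(frames: torch.Tensor) -> torch.Tensor:
    """Summarise a (T, D) sequence of frame-level features by concatenating
    its per-dimension mean and standard deviation into a (2*D,) vector."""
    return torch.cat([frames.mean(dim=0), frames.std(dim=0)], dim=-1)

# Example: 120 video frames with 256-dim features -> one 512-dim utterance vector.
pooled = stat_pooling(torch.randn(120, 256))
```
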