Automatic emotion recognition is critical to human-computer interaction. However, current methods have limited applicability: they tend to overfit to single-corpus datasets, which reduces their real-world effectiveness on new corpora. We propose the first multi-corpus emotion recognition method whose generalizability is evaluated with the leave-one-corpus-out protocol. The method uses one encoder per modality (audio, video, and text) and a decoder that combines the features of all three modalities with a gated attention mechanism. The method is evaluated on four multimodal corpora: CMU-MOSEI, MELD, IEMOCAP, and AFEW. It achieves state-of-the-art results on these corpora and establishes the first baselines for multi-corpus studies. Our results also demonstrate that models trained on MELD generalize best to new data.
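The gated attention fusion mentioned above can be sketched as follows. This is an illustrative simplification, not the authors' exact architecture: the per-modality gate parameters `W_gate` and `b_gate` are hypothetical stand-ins for learned weights, and the sigmoid element-wise gating is one common realization of the idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feats, W_gate, b_gate):
    """Combine per-modality feature vectors with learned gates.

    feats:  dict mapping modality name -> (d,) feature vector
    W_gate: dict mapping modality name -> (d, d) gate weight matrix
    b_gate: dict mapping modality name -> (d,) gate bias

    Each modality is scaled element-wise by its own gate before summation,
    so uninformative modalities can be suppressed for a given sample.
    """
    fused = np.zeros_like(next(iter(feats.values())))
    for name, f in feats.items():
        gate = sigmoid(W_gate[name] @ f + b_gate[name])  # values in (0, 1)
        fused = fused + gate * f
    return fused

# Toy usage with random stand-in parameters
rng = np.random.default_rng(0)
feats = {m: rng.standard_normal(8) for m in ("audio", "video", "text")}
params_W = {m: rng.standard_normal((8, 8)) * 0.1 for m in feats}
params_b = {m: np.zeros(8) for m in feats}
fused = gated_fusion(feats, params_W, params_b)
print(fused.shape)  # (8,)
```

In a trained model the gates would be optimized end-to-end together with the encoders; here they only illustrate the data flow.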
| Corpus | Train | Val. | Test | Emotion classes | Class imbalance (mean / STD) | Evaluation measures |
|---|---|---|---|---|---|---|
| IEMOCAP | 5229 | 581 | 1623 | 6 | 959.7 / 344.4 | Acc / F1 / UAR |
| MOSEI | 16216 | 1835 | 4625 | 6 | 3734.3 / 2441.1 | Acc / wAcc / F1 / UAR |
| MELD | 9989 | 1109 | 2610 | 7 | 1426.9 / 1427.0 | Acc / F1 / UAR |
| AFEW | 773 | 383 | — | 7 | 110.4 / 31.1 | Acc / F1 / UAR |
| Modality | Neutral | Happy | Sad | Angry | Excited | Frustrated | UAR |
|---|---|---|---|---|---|---|---|
| IEMOCAP | | | | | | | |
| A+V+T | 76.6 | 51.7 | 78.0 | 78.8 | 75.3 | 69.6 | 71.6 |
| w/o A | 69.5 | 42.7 | 74.3 | 60.0 β | 72.6 | 68.5 | 64.6 |
| w/o V | 57.0 β | 58.0 β | 69.0 | 72.4 | 51.5 β | 77.4 β | 64.2 β |
| w/o T | 82.8 β | 14.7 β | 66.1 β | 77.6 | 72.9 | 3.4 β | 52.9 β |

| Modality | Neutral | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR |
|---|---|---|---|---|---|---|---|---|
| MOSEI | | | | | | | | |
| A+V+T | — | 56.9 | 68.9 | 48.4 | 54.0 | 77.4 | 57.3 | 60.5 |
| w/o A | — | 69.7 β | 68.1 | 44.3 | 52.2 | 72.8 | 52.7 β | 60.0 |
| w/o V | — | 55.7 | 72.5 β | 50.7 | 51.7 | 71.6 β | 53.2 | 59.2 |
| w/o T | — | 45.8 β | 56.3 β | 45.0 | 49.9 β | 72.2 | 62.9 β | 55.3 β |
| MELD | | | | | | | | |
| A+V+T | 68.7 | 65.9 | 53.8 | 59.7 | 69.0 | 44.0 | 42.6 | 57.7 |
| w/o A | 64.1 | 67.4 β | 52.4 | 48.7 | 71.5 β | 40.0 | 30.9 β | 53.6 |
| w/o V | 58.2 | 65.9 | 59.1 β | 38.6 | 70.5 β | 40.0 | 35.3 | 52.5 |
| w/o T | 37.9 β | 2.7 β | 10.6 β | 84.9 β | 1.1 β | 20.0 β | 41.2 | 28.3 β |
| AFEW | | | | | | | | |
| A+V+T | 84.1 | 88.7 | 74.2 | 70.3 | 48.9 | 60.9 | 53.7 | 68.7 |
| w/o A | 76.2 | 90.3 β | 53.2 | 43.8 β | 51.1 β | 54.3 | 46.3 | 59.3 |
| w/o V | 49.2 β | 40.3 β | 35.5 β | 54.7 | 26.7 β | 50.0 | 29.3 β | 40.8 β |
| w/o T | 60.3 | 74.4 | 75.8 β | 75.0 | 42.2 | 32.6 β | 34.1 | 56.8 |
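The UAR (unweighted average recall) reported in the tables above is the mean of per-class recalls, so minority emotion classes count as much as majority ones. A minimal implementation:

```python
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Class "neu": recall 1/2; class "hap": recall 2/2 -> UAR = 0.75
print(uar(["neu", "neu", "hap", "hap"], ["neu", "hap", "hap", "hap"]))  # 0.75
```

This is the same quantity scikit-learn exposes as `balanced_accuracy_score` for the multi-class case.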
| Training corpus | Test: IEMOCAP | Test: MOSEI | Test: MELD | Test: AFEW | Average |
|---|---|---|---|---|---|
| IEMOCAP | 71.3 | 15.5 | 22.9 | 34.1 | 24.7 |
| MOSEI | 38.2 | 38.5 | 34.3 | 44.3 | 38.9 |
| MELD | 42.3 | 30.4 | 62.0 | 42.1 | 38.3 |
| AFEW | 47.3 | 28.4 | 33.3 | 79.3 | 37.6 |
| Average | 42.6 (Δ 28.7) | 24.8 (Δ 13.7) | 30.2 (Δ 31.8) | 40.2 (Δ 39.1) | — |

The averages exclude the matched (diagonal) train/test pairs; Δ is the drop from the in-corpus score to the corresponding cross-corpus column average.
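The leave-one-corpus-out (LOCO) protocol behind these cross-corpus results can be sketched as a simple split generator; the corpus names follow the paper, while the training step itself is omitted:

```python
def leave_one_corpus_out(corpora):
    """Yield (train_corpora, test_corpus) splits: each corpus is
    held out once while the model trains on the remaining ones."""
    for held_out in corpora:
        train = [c for c in corpora if c != held_out]
        yield train, held_out

for train, test in leave_one_corpus_out(["IEMOCAP", "MOSEI", "MELD", "AFEW"]):
    print(f"train on {train}, test on {test}")
```

With four corpora this produces four splits, each training on three corpora and evaluating on the one that was never seen during training.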
| Encoder | W/o IEMOCAP | LOCO IEMOCAP | W/o MOSEI | LOCO MOSEI | W/o MELD | LOCO MELD | W/o AFEW | LOCO AFEW |
|---|---|---|---|---|---|---|---|---|
| IEMOCAP | — | — | 59.7 | 31.7 | 49.6 | 32.8 | 50.7 | 42.9 β |
| MOSEI | 43.0 β | 44.5 | — | — | 43.2 β | 33.0 | 49.4 | 48.6 |
| MELD | 50.1 | 44.4 | 59.9 | 31.6 | — | — | 53.1 | 49.0 |
| AFEW | 48.2 | 44.1 | 59.8 | 28.5 β | 52.3 | 32.2 | — | — |
| Average | 47.1 | 44.3 | 59.8 | 30.6 | 48.4 | 32.7 | 51.1 | 46.8 |
| Method | Year | Corpus | Modality | wAcc | wF1 | Acc | F1 |
|---|---|---|---|---|---|---|---|
| Le et al. | 2023 | MOSEI | A+V+T | 67.8 | — | — | 47.6 |
| MAGDRA | 2024 | MOSEI | A+V+T | — | — | 48.8 | 56.3 |
| TAILOR | 2022 | MOSEI | A+V+T | — | — | 48.8 | 56.9 |
| Ours (w/o WL) | 2024 | MOSEI | A+V+T | 61.4 | 80.4 | 49.7 | 54.0 |
| Ours (WL) | 2024 | MOSEI | A+V+T | 69.3 | 77.7 | 46.2 | 53.4 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| TelME | 2024 | IEMOCAP | A+V+T | — | 70.5 | — | 68.6 |
| CORECT | 2023 | IEMOCAP | A+V+T | 69.9 | 70.0 | — | 70.9 |
| M3Net | 2023 | IEMOCAP | A+V+T | 72.5 | 72.5 | 71.5 | — |
| Ours (w/o WL) | 2024 | IEMOCAP | A+V+T | 71.9 | 71.7 | 70.5 | 70.2 |
| Ours (WL) | 2024 | IEMOCAP | A+V+T | 72.9 | 72.8 | 72.0 | 71.6 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| SDT | 2023 | MELD | A+V+T | 66.6 | 67.5 | 49.8 | 48.0 |
| M3Net | 2023 | MELD | A+V+T | 68.3 | 67.1 | 51.0 | — |
| TelME | 2024 | MELD | A+V+T | — | 67.4 | 51.4 | 50.0 |
| Ours (w/o WL) | 2024 | MELD | A+V+T | 68.8 | 67.7 | 50.8 | 48.8 |
| Ours (WL) | 2024 | MELD | A+V+T | 64.8 | 65.7 | 54.5 | 57.7 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| Nguyen et al. | 2019 | AFEW | A+V | 62.3 | — | — | — |
| Zhou et al. | 2019 | AFEW | A+V | 65.5 | — | — | — |
| Abdrahimov et al. | 2022 | AFEW | A+V | 67.8 | — | — | 62.0 |
| Ours (w/o WL) | 2024 | AFEW | A+V+T | 70.2 | 69.6 | 67.2 | 67.3 |
| Ours (WL) | 2024 | AFEW | A+V+T | 70.8 | 70.5 | 68.7 | 68.7 |
In this paper, we proposed a multi-corpus emotion recognition (ER) method designed for high generalizability to new, unseen data. The method incorporates three encoders that extract features from the audio, video, and text modalities. To capture the context of the audio and video signals, it computes statistical information over both modalities, and a gated attention mechanism efficiently fuses the information from all three modalities. The method was evaluated on IEMOCAP, MOSEI, MELD, and AFEW in both single- and multi-corpus setups, and new baselines were established on all four corpora.
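The extraction of statistical information over audio and video frames mentioned above can be sketched as a simple mean/std pooling functional. This is an illustrative assumption about the shape of that step, not the authors' exact feature set; it only shows how a variable-length sequence of frame-level features is collapsed into one fixed-size clip vector:

```python
import numpy as np

def stat_pooling(frame_feats):
    """Collapse a (T, d) sequence of frame-level features into a
    fixed (2*d,) vector of per-dimension mean and standard deviation,
    capturing both the average signal and its variability over time."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])

# e.g. 100 video frames with 512-dim features -> one 1024-dim clip vector
clip = stat_pooling(np.random.default_rng(1).standard_normal((100, 512)))
print(clip.shape)  # (1024,)
```

Because the output dimensionality no longer depends on the number of frames, such pooled vectors can be fed to a fixed-size fusion decoder regardless of utterance length.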