Automatic emotion recognition is critical to human-computer interaction. However, current methods have limited applicability: they tend to overfit to single-corpus datasets, which reduces their real-world effectiveness on new corpora. We propose the first multi-corpus emotion recognition method whose generalizability is evaluated with the leave-one-corpus-out protocol. The method uses one encoder per modality (audio, video, and text) and a decoder that combines the features of all three modalities with a gated attention mechanism. The method is evaluated on four multimodal corpora: CMU-MOSEI, MELD, IEMOCAP, and AFEW. It achieves state-of-the-art results on these corpora and establishes the first baselines for multi-corpus studies. Our results also demonstrate that models trained on MELD generalize best to new data.
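The gated attention fusion mentioned above can be sketched as follows. This is an illustrative simplification, not the authors' exact architecture: the per-modality gate parameters `W_gate` and `b_gate` are hypothetical stand-ins for learned weights, and the sigmoid element-wise gating is one common realization of the idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feats, W_gate, b_gate):
    """Combine per-modality feature vectors with learned gates.

    feats:  dict mapping modality name -> (d,) feature vector
    W_gate: dict mapping modality name -> (d, d) gate weight matrix
    b_gate: dict mapping modality name -> (d,) gate bias

    Each modality is scaled element-wise by its own gate before summation,
    so uninformative modalities can be suppressed for a given sample.
    """
    fused = np.zeros_like(next(iter(feats.values())))
    for name, f in feats.items():
        gate = sigmoid(W_gate[name] @ f + b_gate[name])  # values in (0, 1)
        fused = fused + gate * f
    return fused

# Toy usage with random stand-in parameters
rng = np.random.default_rng(0)
feats = {m: rng.standard_normal(8) for m in ("audio", "video", "text")}
params_W = {m: rng.standard_normal((8, 8)) * 0.1 for m in feats}
params_b = {m: np.zeros(8) for m in feats}
fused = gated_fusion(feats, params_W, params_b)
print(fused.shape)  # (8,)
```

In a trained model the gates would be optimized end-to-end together with the encoders; here they only illustrate the data flow.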
| Corpus | Train | Val. | Test | Emotion classes | Class imbalance (mean / STD) | Evaluation measures |
|---|---|---|---|---|---|---|
| IEMOCAP | 5229 | 581 | 1623 | 6 | 959.7 / 344.4 | Acc / F1 / UAR |
| MOSEI | 16216 | 1835 | 4625 | 6 | 3734.3 / 2441.1 | Acc / wAcc / F1 / UAR |
| MELD | 9989 | 1109 | 2610 | 7 | 1426.9 / 1427.0 | Acc / F1 / UAR |
| AFEW | 773 | 383 | — | 7 | 110.4 / 31.1 | Acc / F1 / UAR |
| Modality | Neutral | Happy | Sad | Angry | Excited | Frustrated | UAR |
|---|---|---|---|---|---|---|---|
| IEMOCAP | | | | | | | |
| A+V+T | 76.6 | 51.7 | 78.0 | 78.8 | 75.3 | 69.6 | 71.6 |
| w/o A | 69.5 | 42.7 | 74.3 | 60.0 β | 72.6 | 68.5 | 64.6 |
| w/o V | 57.0 β | 58.0 β | 69.0 | 72.4 | 51.5 β | 77.4 β | 64.2 β |
| w/o T | 82.8 β | 14.7 β | 66.1 β | 77.6 | 72.9 | 3.4 β | 52.9 β |

| Modality | Neutral | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR |
|---|---|---|---|---|---|---|---|---|
| MOSEI | | | | | | | | |
| A+V+T | — | 56.9 | 68.9 | 48.4 | 54.0 | 77.4 | 57.3 | 60.5 |
| w/o A | — | 69.7 β | 68.1 | 44.3 | 52.2 | 72.8 | 52.7 β | 60.0 |
| w/o V | — | 55.7 | 72.5 β | 50.7 | 51.7 | 71.6 β | 53.2 | 59.2 |
| w/o T | — | 45.8 β | 56.3 β | 45.0 | 49.9 β | 72.2 | 62.9 β | 55.3 β |
| MELD | | | | | | | | |
| A+V+T | 68.7 | 65.9 | 53.8 | 59.7 | 69.0 | 44.0 | 42.6 | 57.7 |
| w/o A | 64.1 | 67.4 β | 52.4 | 48.7 | 71.5 β | 40.0 | 30.9 β | 53.6 |
| w/o V | 58.2 | 65.9 | 59.1 β | 38.6 | 70.5 β | 40.0 | 35.3 | 52.5 |
| w/o T | 37.9 β | 2.7 β | 10.6 β | 84.9 β | 1.1 β | 20.0 β | 41.2 | 28.3 β |
| AFEW | | | | | | | | |
| A+V+T | 84.1 | 88.7 | 74.2 | 70.3 | 48.9 | 60.9 | 53.7 | 68.7 |
| w/o A | 76.2 | 90.3 β | 53.2 | 43.8 β | 51.1 β | 54.3 | 46.3 | 59.3 |
| w/o V | 49.2 β | 40.3 β | 35.5 β | 54.7 | 26.7 β | 50.0 | 29.3 β | 40.8 β |
| w/o T | 60.3 | 74.4 | 75.8 β | 75.0 | 42.2 | 32.6 β | 34.1 | 56.8 |
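The UAR (unweighted average recall) reported in the tables above is the mean of per-class recalls, so minority emotion classes count as much as majority ones. A minimal implementation:

```python
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Class "neu": recall 1/2; class "hap": recall 2/2 -> UAR = 0.75
print(uar(["neu", "neu", "hap", "hap"], ["neu", "hap", "hap", "hap"]))  # 0.75
```

This is the same quantity scikit-learn exposes as `balanced_accuracy_score` for the multi-class case.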
| Training corpus | Test: IEMOCAP | Test: MOSEI | Test: MELD | Test: AFEW | Average |
|---|---|---|---|---|---|
| IEMOCAP | 71.3 | 15.5 | 22.9 | 34.1 | 24.7 |
| MOSEI | 38.2 | 38.5 | 34.3 | 44.3 | 38.9 |
| MELD | 42.3 | 30.4 | 62.0 | 42.1 | 38.3 |
| AFEW | 47.3 | 28.4 | 33.3 | 79.3 | 37.6 |
| Average | 42.6 (Δ 28.7) | 24.8 (Δ 13.7) | 30.2 (Δ 31.8) | 40.2 (Δ 39.1) | — |

The averages exclude the matched (diagonal) train/test pairs; Δ is the drop from the in-corpus score to the corresponding cross-corpus column average.
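The leave-one-corpus-out (LOCO) protocol behind these cross-corpus results can be sketched as a simple split generator; the corpus names follow the paper, while the training step itself is omitted:

```python
def leave_one_corpus_out(corpora):
    """Yield (train_corpora, test_corpus) splits: each corpus is
    held out once while the model trains on the remaining ones."""
    for held_out in corpora:
        train = [c for c in corpora if c != held_out]
        yield train, held_out

for train, test in leave_one_corpus_out(["IEMOCAP", "MOSEI", "MELD", "AFEW"]):
    print(f"train on {train}, test on {test}")
```

With four corpora this produces four splits, each training on three corpora and evaluating on the one that was never seen during training.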
| Encoder | W/o IEMOCAP | LOCO IEMOCAP | W/o MOSEI | LOCO MOSEI | W/o MELD | LOCO MELD | W/o AFEW | LOCO AFEW |
|---|---|---|---|---|---|---|---|---|
| IEMOCAP | — | — | 59.7 | 31.7 | 49.6 | 32.8 | 50.7 | 42.9 β |
| MOSEI | 43.0 β | 44.5 | — | — | 43.2 β | 33.0 | 49.4 | 48.6 |
| MELD | 50.1 | 44.4 | 59.9 | 31.6 | — | — | 53.1 | 49.0 |
| AFEW | 48.2 | 44.1 | 59.8 | 28.5 β | 52.3 | 32.2 | — | — |
| Average | 47.1 | 44.3 | 59.8 | 30.6 | 48.4 | 32.7 | 51.1 | 46.8 |
| Method | Year | Corpus | Modality | wAcc | wF1 | Acc | F1 |
|---|---|---|---|---|---|---|---|
| Le et al. | 2023 | MOSEI | A+V+T | 67.8 | — | — | 47.6 |
| MAGDRA | 2024 | MOSEI | A+V+T | — | — | 48.8 | 56.3 |
| TAILOR | 2022 | MOSEI | A+V+T | — | — | 48.8 | 56.9 |
| Ours (w/o WL) | 2024 | MOSEI | A+V+T | 61.4 | 80.4 | 49.7 | 54.0 |
| Ours (WL) | 2024 | MOSEI | A+V+T | 69.3 | 77.7 | 46.2 | 53.4 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| TelME | 2024 | IEMOCAP | A+V+T | — | 70.5 | — | 68.6 |
| CORECT | 2023 | IEMOCAP | A+V+T | 69.9 | 70.0 | — | 70.9 |
| M3Net | 2023 | IEMOCAP | A+V+T | 72.5 | 72.5 | 71.5 | — |
| Ours (w/o WL) | 2024 | IEMOCAP | A+V+T | 71.9 | 71.7 | 70.5 | 70.2 |
| Ours (WL) | 2024 | IEMOCAP | A+V+T | 72.9 | 72.8 | 72.0 | 71.6 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| SDT | 2023 | MELD | A+V+T | 66.6 | 67.5 | 49.8 | 48.0 |
| M3Net | 2023 | MELD | A+V+T | 68.3 | 67.1 | 51.0 | — |
| TelME | 2024 | MELD | A+V+T | — | 67.4 | 51.4 | 50.0 |
| Ours (w/o WL) | 2024 | MELD | A+V+T | 68.8 | 67.7 | 50.8 | 48.8 |
| Ours (WL) | 2024 | MELD | A+V+T | 64.8 | 65.7 | 54.5 | 57.7 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| Nguyen et al. | 2019 | AFEW | A+V | 62.3 | — | — | — |
| Zhou et al. | 2019 | AFEW | A+V | 65.5 | — | — | — |
| Abdrahimov et al. | 2022 | AFEW | A+V | 67.8 | — | — | 62.0 |
| Ours (w/o WL) | 2024 | AFEW | A+V+T | 70.2 | 69.6 | 67.2 | 67.3 |
| Ours (WL) | 2024 | AFEW | A+V+T | 70.8 | 70.5 | 68.7 | 68.7 |
In this paper, we proposed a multi-corpus emotion recognition (ER) method designed for high generalizability to new, unseen data. The method incorporates three encoders that extract features from the audio, video, and text modalities. To capture the context of the audio and video signals, it computes statistical information over both modalities, and a gated attention mechanism efficiently fuses the information from all three modalities. The method was evaluated on IEMOCAP, MOSEI, MELD, and AFEW in both single- and multi-corpus setups, and new baselines were established on all four corpora.
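The extraction of statistical information over audio and video frames mentioned above can be sketched as a simple mean/std pooling functional. This is an illustrative assumption about the shape of that step, not the authors' exact feature set; it only shows how a variable-length sequence of frame-level features is collapsed into one fixed-size clip vector:

```python
import numpy as np

def stat_pooling(frame_feats):
    """Collapse a (T, d) sequence of frame-level features into a
    fixed (2*d,) vector of per-dimension mean and standard deviation,
    capturing both the average signal and its variability over time."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])

# e.g. 100 video frames with 512-dim features -> one 1024-dim clip vector
clip = stat_pooling(np.random.default_rng(1).standard_normal((100, 512)))
print(clip.shape)  # (1024,)
```

Because the output dimensionality no longer depends on the number of frames, such pooled vectors can be fed to a fixed-size fusion decoder regardless of utterance length.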