This repository contains the code and resources for a Multimodal SER (Speech Emotion Recognition) model that recognizes emotions from speech by combining textual and acoustic information. One Multilayer Perceptron (MLP) is fine-tuned on text features extracted by DeBERTaV3, another on acoustic features extracted by Wav2Vec2, and a third MLP fuses their features and predictions for improved emotion classification.
The Multimodal SER Model leverages both textual and acoustic features to classify emotions more accurately. The architecture consists of:
- Initial Classification: MLP 1 classifies the emotion from the acoustic features (extracted with Wav2Vec2), and MLP 2 from the text features (extracted with DeBERTaV3).
- Fusion and Final Classification: The extracted features and initial predictions are combined by a third Multilayer Perceptron (MLP 3), which produces the final emotion classification.
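The two-stage pipeline above can be sketched with NumPy. This is a minimal illustration, not the repository's implementation: the feature dimensions, hidden sizes, number of emotion classes, and the random weights are all assumptions standing in for the fine-tuned encoders and trained MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EMOTIONS = 4                 # assumed number of emotion classes
TEXT_DIM, AUDIO_DIM = 768, 768   # assumed DeBERTaV3 / Wav2Vec2 feature sizes

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer with ReLU, then a linear output layer."""
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2

# Placeholder vectors standing in for the encoder outputs.
audio_feat = rng.standard_normal(AUDIO_DIM)
text_feat = rng.standard_normal(TEXT_DIM)

# MLP 1 / MLP 2: per-modality emotion logits (random weights for illustration).
w1a, b1a = rng.standard_normal((AUDIO_DIM, 64)), np.zeros(64)
w2a, b2a = rng.standard_normal((64, NUM_EMOTIONS)), np.zeros(NUM_EMOTIONS)
audio_logits = mlp_forward(audio_feat, w1a, b1a, w2a, b2a)

w1t, b1t = rng.standard_normal((TEXT_DIM, 64)), np.zeros(64)
w2t, b2t = rng.standard_normal((64, NUM_EMOTIONS)), np.zeros(NUM_EMOTIONS)
text_logits = mlp_forward(text_feat, w1t, b1t, w2t, b2t)

# MLP 3: fuse both feature vectors and both sets of initial predictions.
fused = np.concatenate([text_feat, audio_feat, text_logits, audio_logits])
w1f, b1f = rng.standard_normal((fused.size, 128)), np.zeros(128)
w2f, b2f = rng.standard_normal((128, NUM_EMOTIONS)), np.zeros(NUM_EMOTIONS)
final_logits = mlp_forward(fused, w1f, b1f, w2f, b2f)
predicted_emotion = int(np.argmax(final_logits))
```

Feeding the initial logits into MLP 3 alongside the raw features lets the fusion layer weigh each modality's confidence rather than only its representation.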
A research report detailing the development and evaluation of this architecture can be found at research-raport.pdf.
- Clone this repository.
- Create a Python 3.11 virtual environment.
- Pin the Python package manager to version 23.3.1, using:
$ pip install --upgrade pip==23.3.1
- Install the required libraries using:
$ pip install -r requirements.txt
The dataset can be downloaded from:
- Kaggle - unofficial version
- official website - official version, available upon request
This project is licensed under the MIT License - see the LICENSE file for details.