Speech emotion recognition using LSTM, SVM and MLP.
The feature extraction method has been improved, raising accuracy to about 80%. The original version is kept in the First-Version branch.
Python 3.6.7
├── Common_Model.py // Common part of all models
├── ML_Model.py // SVM & MLP
├── DNN_Model.py // LSTM
├── Utils.py // Load models, plot graphs
├── Opensmile_Feature.py // Extract features with Opensmile
├── Librosa_Feature.py // Extract features with librosa
├── SER.py // Using different models for speech emotion recognition
├── File.py // Organize dataset (classify and rename)
├── Config.py // Configuration parameters
├── cmd.py // Use argparse for getting args from command line
├── cmd_example.sh // Examples of command line input
├── Models // Store trained models
└── Feature // Store extracted features
- scikit-learn: SVM & MLP, split data into training set and testing set
- Keras: LSTM
- TensorFlow: Backend of keras
- librosa: Extract features, waveform
- SciPy: Spectrogram
- pandas: Load features
- Matplotlib: Plot graphs
- numpy
- Opensmile: Extract features
English, around 1500 audio clips from 24 speakers (12 male and 12 female) covering 8 emotions (the third number in each file name encodes the emotion): 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised.
English, around 500 audio clips from 4 male speakers covering 7 emotions (the leading letters of each file name encode the emotion): a = anger, d = disgust, f = fear, h = happiness, n = neutral, sa = sadness, su = surprise.
German, around 500 audio clips from 10 speakers (5 male and 5 female) covering 7 emotions (the second-to-last letter of each file name encodes the emotion): N = neutral, W = angry, A = fear, F = happy, T = sad, E = disgust, L = boredom.
Chinese, around 1200 audio clips from 4 speakers (2 male and 2 female) covering 6 emotions: neutral, happy, sad, angry, fearful and surprised.
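The file-name conventions described above can be turned into labels with a few lines of Python. This is a minimal sketch, not the code in File.py: the helper names are hypothetical, the example file names are made up, and it assumes that the numbers in the first dataset's file names are separated by hyphens.

```python
from pathlib import Path

# Code-to-emotion tables copied from the dataset descriptions above.
NUMERIC_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
GERMAN_CODES = {
    "N": "neutral", "W": "angry", "A": "fear", "F": "happy",
    "T": "sad", "E": "disgust", "L": "boredom",
}


def emotion_from_numeric_name(file_name):
    """Return the emotion encoded by the third number of the file name.

    Assumption: the numbers are separated by hyphens.
    """
    fields = Path(file_name).stem.split("-")
    return NUMERIC_CODES[fields[2]]


def emotion_from_german_name(file_name):
    """Return the emotion encoded by the second-to-last letter of the name."""
    return GERMAN_CODES[Path(file_name).stem[-2]]


# Made-up file names, used only to exercise the two helpers.
print(emotion_from_numeric_name("03-01-05-01-02-01-12.wav"))  # third field is 05
print(emotion_from_german_name("03a01Fa.wav"))  # second-to-last letter is F
```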
Install dependencies:
pip install -r requirements.txt
Install Opensmile.
Parameters can be configured in Config.py
Only the following six Opensmile standard feature sets are currently supported:
- IS09_emotion: The INTERSPEECH 2009 Emotion Challenge, 384 features;
- IS10_paraling: The INTERSPEECH 2010 Paralinguistic Challenge, 1582 features;
- IS11_speaker_state: The INTERSPEECH 2011 Speaker State Challenge, 4368 features;
- IS12_speaker_trait: The INTERSPEECH 2012 Speaker Trait Challenge, 6125 features;
- IS13_ComParE: The INTERSPEECH 2013 ComParE Challenge, 6373 features;
- ComParE_2016: The INTERSPEECH 2016 Computational Paralinguistics Challenge, 6373 features.
You should modify the FEATURE_NUM parameter if you need to use other feature sets.
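One way to keep the feature count in step with the chosen feature set is a small lookup table. The sketch below is illustrative only: `FEATURE_NUMS` and `feature_num_for` are hypothetical names that are not claimed to exist in Config.py, and the counts are the ones listed for the six supported feature sets.

```python
# Feature counts for the six supported Opensmile feature sets.
# FEATURE_NUMS and feature_num_for are hypothetical helpers, not part of Config.py.
FEATURE_NUMS = {
    "IS09_emotion": 384,
    "IS10_paraling": 1582,
    "IS11_speaker_state": 4368,
    "IS12_speaker_trait": 6125,
    "IS13_ComParE": 6373,
    "ComParE_2016": 6373,
}


def feature_num_for(feature_set):
    """Return the number of features of a supported feature set."""
    try:
        return FEATURE_NUMS[feature_set]
    except KeyError:
        supported = ", ".join(sorted(FEATURE_NUMS))
        raise ValueError(
            f"Unsupported feature set {feature_set!r}; choose one of: {supported}"
        ) from None


print(feature_num_for("IS10_paraling"))
```

A feature set outside this table would need its own count added before use.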
| Long option | Option | Description |
| --- | --- | --- |
| --option | -o | Option [ p : predict / t : train ] [ required ] |
| --model_type | -mt | Model type [ svm / mlp / lstm ] [ default is svm ] |
| --model_name | -mn | Name of the model file which will be saved or loaded [ default is default ] |
| --load | -l | Load existing features or not [ 0 : no / 1 : yes ] [ default is 1 ] |
| --feature | -f | How to extract features [ o : Opensmile / l : librosa ] [ default is o ] |
| --audio | -a | Path of audio which will be predicted [ default is default.wav ] |
python3 cmd.py -o t -mt 'svm' -mn 'SVM' -l 1 -f 'o'
python3 cmd.py -o p -mt 'svm' -mn 'SVM' -f 'o' -a [audio path]
More examples can be found in cmd_example.sh
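For readers who want to see how such options map onto argparse, here is a minimal sketch that mirrors the option table. It is an assumption about how a parser for these flags could look, not a copy of cmd.py; the function name `build_parser` and the help strings are illustrative.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (sketch, not cmd.py)."""
    parser = argparse.ArgumentParser(description="Speech emotion recognition")
    parser.add_argument("-o", "--option", required=True, choices=["p", "t"],
                        help="p: predict / t: train")
    parser.add_argument("-mt", "--model_type", default="svm",
                        choices=["svm", "mlp", "lstm"], help="model type")
    parser.add_argument("-mn", "--model_name", default="default",
                        help="name of the model file to save or load")
    parser.add_argument("-l", "--load", type=int, default=1, choices=[0, 1],
                        help="load existing features: 0 = no, 1 = yes")
    parser.add_argument("-f", "--feature", default="o", choices=["o", "l"],
                        help="o: Opensmile / l: librosa")
    parser.add_argument("-a", "--audio", default="default.wav",
                        help="path of the audio file to predict")
    return parser


# Parse the training example from above instead of reading sys.argv.
args = build_parser().parse_args(
    ["-o", "t", "-mt", "svm", "-mn", "SVM", "-l", "1", "-f", "o"]
)
print(args.option, args.model_type, args.model_name, args.load, args.feature, args.audio)
```

Declaring `choices` lets argparse reject an unknown model type or feature method before any training starts.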
The path of the datasets can be configured in Config.py. Audio files that express the same emotion should be put in the same folder (File.py can be used to organize the data), for example:
└── Datasets
├── Angry
├── Happy
├── Sad
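With one folder per emotion, labels can be read straight from the folder names. The following is a minimal sketch under the assumption that the audio files use the .wav extension; `list_labelled_audio` is a hypothetical helper and is not part of the repository.

```python
import tempfile
from pathlib import Path


def list_labelled_audio(dataset_dir):
    """Return sorted (audio path, emotion label) pairs.

    The label is the name of the folder that contains the file.
    Assumption: audio files end in .wav.
    """
    samples = []
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for audio_path in sorted(class_dir.glob("*.wav")):
            samples.append((audio_path, class_dir.name))
    return samples


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for emotion, name in [("Angry", "a1.wav"), ("Happy", "h1.wav"), ("Sad", "s1.wav")]:
        folder = Path(tmp) / emotion
        folder.mkdir()
        (folder / name).touch()
    demo_samples = [(path.name, label) for path, label in list_labelled_audio(tmp)]

print(demo_samples)
```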
from SER import Train
model_name: model type (SVM / MLP / LSTM)
save_model_name: name of the model file
if_load: whether to load existing features (True / False)
feature_method: how to extract features ('o': Opensmile / 'l': librosa)
model: a trained model
model = Train(model_name, save_model_name, if_load, feature_method)
from Utils import load_model
load_model_name: name of the model file which will be loaded
model_name: model type (SVM / MLP / LSTM)
model: a model
model = load_model(load_model_name, model_name)
from SER import Predict
model: a trained or loaded model
model_name: model type (SVM / MLP / LSTM)
file_path: path of audio which will be predicted
feature_method: how to extract features ('o': Opensmile / 'l': librosa)
predicted result and probability
Predict(model, model_name, file_path, feature_method)
Features extracted by Opensmile will be saved in .csv files, and features extracted by librosa will be saved in .p files.
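To make the two storage formats concrete, the sketch below writes the same made-up features once as CSV and once as a pickle file, then reads both back. It assumes that the .p extension denotes a pickle file and that each CSV row holds a label followed by the feature values; the actual layout used by Opensmile_Feature.py and Librosa_Feature.py may differ.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Made-up feature vectors and labels, for illustration only.
features = [[0.12, 0.56, 0.33], [0.91, 0.08, 0.47]]
labels = ["angry", "happy"]

out_dir = Path(tempfile.mkdtemp())

# CSV: one row per audio file, label first, then the feature values.
csv_path = out_dir / "features.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    for label, row in zip(labels, features):
        writer.writerow([label, *row])

# Pickle: the Python objects are serialized as they are.
p_path = out_dir / "features.p"
with open(p_path, "wb") as f:
    pickle.dump((features, labels), f)

# Read both files back.
with open(csv_path, newline="") as f:
    csv_rows = [(row[0], [float(v) for v in row[1:]]) for row in csv.reader(f)]
with open(p_path, "rb") as f:
    loaded_features, loaded_labels = pickle.load(f)

print(csv_rows[0])
print(loaded_labels)
```

CSV stays readable in a spreadsheet or with pandas, while pickle preserves nested Python structures without any parsing step.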
import Librosa_Feature as lf
import Opensmile_Feature as of
data_path: path of dataset / audio which will be predicted
feature_path: path for saving features
train: training data or not
Training data:
Output: samples of training data, samples of testing data and their labels
# Opensmile
x_train, x_test, y_train, y_test = of.get_data(data_path, feature_path, train = True)
# librosa
x_train, x_test, y_train, y_test = lf.get_data(data_path, feature_path, train = True)
Predicting data:
Output: features of audio
# Opensmile
test_feature = of.get_data(data_path, feature_path, train = False)
# librosa
test_feature = lf.get_data(data_path, feature_path, train = False)
import Librosa_Feature as lf
import Opensmile_Feature as of
feature_path: path for loading features
train: training data or not
Training data:
Output: samples of training data, samples of testing data and their labels
# Opensmile
x_train, x_test, y_train, y_test = of.load_feature(feature_path, train = True)
# librosa
x_train, x_test, y_train, y_test = lf.load_feature(feature_path, train = True)
Predicting data:
Output: features of audio
# Opensmile
test_feature = of.load_feature(feature_path, train = False)
# librosa
test_feature = lf.load_feature(feature_path, train = False)
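The two return shapes controlled by the `train` flag can be illustrated with a small stand-alone function. This is a sketch of the idea only: `split_or_passthrough` is a hypothetical name, and it uses a plain shuffled split from the standard library instead of the scikit-learn split used by the project.

```python
import random


def split_or_passthrough(features, labels=None, train=True, test_size=0.2, seed=0):
    """Mimic the two return shapes selected by the train flag (sketch).

    train=True  -> (x_train, x_test, y_train, y_test)
    train=False -> the features unchanged, ready for prediction
    """
    if not train:
        return features
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Ten made-up samples with alternating labels.
demo_features = [[float(i), float(i) / 2] for i in range(10)]
demo_labels = ["happy" if i % 2 else "sad" for i in range(10)]

x_train, x_test, y_train, y_test = split_or_passthrough(demo_features, demo_labels, train=True)
print(len(x_train), len(x_test))

test_feature = split_or_passthrough(demo_features[:1], train=False)
print(test_feature)
```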
Plot a radar chart of probability.
Source: Radar
from Utils import Radar
data_prob: probability
Plot a waveform of an audio.
from Utils import Waveform
Plot a spectrogram of an audio.
from Utils import Spectrogram
@Zhaofan-Su and @Guo Hui.