Speech Emotion Recognition using CNN

In this article we provide an implementation of Speech Emotion Recognition using a CNN (convolutional neural network). Urdu is the primary language of the speech samples used to detect emotions.

Hello, my name is Muhammad Umar. I’m a Lecturer of Computer Science. Email me if you need any further help: mohammadumar146@gmail.com. Please share the article if you appreciate my work.

We use four basic emotions: sad, happy, angry, and neutral. The emotions are assigned the labels 01, 02, 03, and 04, as we will see in the code section.

IDE used for Speech Emotion Recognition using CNN

We used Google Colab as the development environment.

Place your dataset into your Google Drive
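
Before the dataset can be read, the Drive has to be mounted in the Colab session. The following is a minimal sketch, assuming the dataset sits in a Drive folder called data (as the path used later in the code suggests). The Drive root may show up as "My Drive" or "MyDrive" depending on the Colab version, so both are checked; the variable names in_colab and data_dir are only illustrative.

```python
import os

# google.colab is only importable inside a Colab runtime,
# so fall back gracefully when the code runs anywhere else.
try:
    from google.colab import drive
    drive.mount("/content/drive")
    in_colab = True
except ImportError:
    in_colab = False

# Assumption: the dataset lives in a Drive folder named "data".
candidates = ["/content/drive/My Drive/data", "/content/drive/MyDrive/data"]
data_dir = next((p for p in candidates if os.path.isdir(p)), None)
print("In Colab:", in_colab, "| dataset folder:", data_dir)
```

If data_dir comes back as None, the folder name or location on your Drive probably differs from the one assumed here.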

So let’s dig into code:

The following code builds a machine learning model that recognizes emotions in sound files. Emotion recognition is an application of artificial intelligence that can be used in a variety of fields, including speech recognition, music analysis, and mental health diagnosis.

Here is a step-by-step description of the code:

  1. The first line, a !pip install command, installs the required libraries (librosa, soundfile, numpy, scikit-learn, and pyaudio) using pip.
  2. The import statements bring in the necessary libraries and modules.
  3. The extract_feature function takes a file name as input, reads the sound file, and extracts the MFCC, chroma, and mel features from it. Each feature is averaged over time, and the results are joined into one feature vector.
  4. A dictionary named emotions is defined, which maps the emotion codes found in the file names to the labels used for training.
  5. A list named observed_emotions is defined that includes the emotions to be observed.
  6. The load_data function loads the data from the sound files, extracts the features, and prepares the data for training the model. It loops through all the files matching “./drive/My Drive/data/Emotion_*/*.wav” and extracts the emotion code from the file name. If the emotion is not in the observed_emotions list, the loop continues to the next file. Otherwise, the features are extracted using the extract_feature function, and the features and the emotion code are appended to the x and y lists respectively.
  7. The load_data function then splits the data into training and testing sets using the train_test_split function from the sklearn.model_selection module.
  8. The training and testing sets are converted to NumPy arrays, and the labels are one-hot encoded using LabelEncoder from sklearn.preprocessing and to_categorical from keras.utils.
  9. The feature arrays are then reshaped using the np.expand_dims function so that they can be used as input to the convolutional neural network.
  10. A Sequential model is created using the Conv1D, MaxPooling1D, Flatten, Dropout, Dense, and Activation layers from the keras.layers module.
  11. The model is compiled with an RMSprop optimizer, the categorical_crossentropy loss function, and the accuracy metric.
  12. The model is then trained on the training data using the fit method with a batch size of 16 for 500 epochs, with the testing data passed as validation data.
  13. Finally, the model is saved, reloaded, and evaluated on the testing data using the evaluate method, and the accuracy of the model is printed.

Install Libraries

!pip install librosa soundfile numpy scikit-learn pyaudio

Import Libraries

import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Feature Extraction Function

# Extract features (mfcc, chroma, mel) from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    # librosa.load returns mono float32 audio, resampled to 22050 Hz by default
    X, sample_rate = librosa.load(file_name)
    result = np.array([])
    if mfcc:
        # 40 MFCCs, averaged over all time frames
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        # Chroma is computed from the magnitude of the short-time Fourier transform
        stft = np.abs(librosa.stft(X))
        chroma1 = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma1))
    if mel:
        # The signal is passed as y= because recent librosa releases require keyword arguments here
        mel1 = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel1))
    return result
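
To see why every clip ends up as one fixed-length vector regardless of its duration, here is a small NumPy-only sketch. It uses random matrices as stand-ins for the librosa outputs, so the values are meaningless; only the shapes matter. The row counts assume 40 MFCCs (as in the code above) plus librosa's default 12 chroma bins and 128 mel bands, and n_frames is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 50  # hypothetical number of time frames for one clip

# Stand-ins for the librosa outputs, each shaped (n_features, n_frames)
fake_mfcc = rng.normal(size=(40, n_frames))
fake_chroma = rng.random(size=(12, n_frames))
fake_mel = rng.random(size=(128, n_frames))

def pool_over_time(feature_matrix):
    """Average a (n_features, n_frames) matrix over its frames,
    exactly as np.mean(matrix.T, axis=0) does in extract_feature."""
    return np.mean(feature_matrix.T, axis=0)

# Concatenate the three pooled features into one vector per clip
vector = np.hstack([pool_over_time(m) for m in (fake_mfcc, fake_chroma, fake_mel)])
print(vector.shape)  # (180,)
```

Changing n_frames leaves the output shape untouched, which is what lets clips of different lengths share one input size.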

Labeling Emotions From Dataset

# Emotion codes in the dataset (each code maps to itself and is used directly as the class label)
emotions={
  '03':'03',
  '02':'02',
  '04':'04',
  '01':'01',
}
# - Emotions to observe
observed_emotions=['01', '02', '03', '04']

Dataset Loading from Google Drive

# - Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("./drive/My Drive/data/Emotion_*/*.wav"):
        file_name=os.path.basename(file)
        emotion=emotions[file_name.split("-")[1]]
        if emotion not in observed_emotions:
            continue
        feature=extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)

    return train_test_split(np.asarray(x), np.asarray(y), test_size=test_size, random_state=15)
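
The label lookup in load_data depends entirely on the file name. As a rough illustration, the helper below (emotion_code_from_path is a hypothetical name, not part of the original code) pulls out the second hyphen-separated field the same way. Unlike the original one-liner, it strips the extension first and fails loudly on names that do not match the pattern. The example file names are made up.

```python
import os

def emotion_code_from_path(path):
    """Return the second hyphen-separated field of the file name, e.g. '01'."""
    file_name = os.path.basename(path)
    stem, _ext = os.path.splitext(file_name)
    parts = stem.split("-")
    if len(parts) < 2:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    return parts[1]

print(emotion_code_from_path("./drive/My Drive/data/Emotion_01/01-01-00.wav"))  # 01
print(emotion_code_from_path("./drive/My Drive/data/Emotion_02/02-02-07.wav"))  # 02
```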

Dividing Dataset For Training and Testing

x_train, x_test, y_train, y_test = load_data(test_size=0.25)
X_train = np.array(x_train)
y_train = np.array(y_train).ravel()
X_test = np.array(x_test)
y_test = np.array(y_test).ravel()

One Hot Encoding

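# One-Hot Encoding
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical  # np_utils is not available in recent Keras releases
lb = LabelEncoder()
y_train = to_categorical(lb.fit_transform(y_train), num_classes=len(lb.classes_))
# Reuse the encoder fitted on the training labels so both sets share one mapping
y_test = to_categorical(lb.transform(y_test), num_classes=len(lb.classes_))

# Add a channel axis: (samples, features) -> (samples, features, 1) for Conv1D
x_traincnn = np.expand_dims(X_train, axis=2)
x_testcnn = np.expand_dims(X_test, axis=2)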
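
For readers who want to see what the encoding step does to the data, this is a NumPy-only sketch of the same idea. It is not the scikit-learn or Keras implementation, just an equivalent computation on made-up labels, and it assumes 180 features per clip.

```python
import numpy as np

labels = np.array(["03", "01", "04", "02", "01"])  # hypothetical emotion codes

# Label encoding: sorted unique classes plus one integer index per sample
classes, encoded = np.unique(labels, return_inverse=True)

# One-hot encoding: pick the matching rows of an identity matrix
one_hot = np.eye(len(classes))[encoded]

# Channel axis for Conv1D: (samples, features) -> (samples, features, 1)
features = np.zeros((5, 180))
features_cnn = np.expand_dims(features, axis=2)

print(classes, encoded)
print(one_hot.shape, features_cnn.shape)
```

Each row of one_hot contains a single 1 in the column of its class, which is the target format that categorical_crossentropy expects.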

Build CNN Model

# Build the neural network and create the desired model
import keras
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Flatten, Dropout, Activation
from keras.layers import Dense
model = Sequential()

model.add(Conv1D(256, 5,padding='same',input_shape=(x_traincnn.shape[1],x_traincnn.shape[2])))
model.add(Activation('relu'))
model.add(Conv1D(128, 5,padding='same'))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(MaxPooling1D(pool_size=(8)))
model.add(Conv1D(128, 5,padding='same',))
model.add(Activation('relu'))
model.add(Conv1D(128, 5,padding='same',))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(y_train.shape[1]))
model.add(Activation('softmax'))
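# learning_rate replaces the deprecated lr argument; the original call also passed
# decay=1e-6, which recent Keras optimizers no longer accept
opt = keras.optimizers.RMSprop(learning_rate=0.00001)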

model.summary()

model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])
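
The numbers printed by model.summary() can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming 180 input features per clip, a single input channel, and 4 output classes; the helper names are illustrative only. It relies on two Keras behaviors: Conv1D with padding='same' keeps the sequence length, and MaxPooling1D uses a stride equal to its pool size by default.

```python
def conv1d_same(length, in_channels, filters, kernel_size):
    """Conv1D with padding='same' and stride 1: the length is unchanged."""
    params = kernel_size * in_channels * filters + filters  # weights + biases
    return length, filters, params

def maxpool1d(length, pool_size):
    """MaxPooling1D with 'valid' padding and stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1

length, channels = 180, 1  # assumption: 180 features per clip, one channel
total_params = 0

for filters in (256, 128):          # first two convolution layers
    length, channels, p = conv1d_same(length, channels, filters, 5)
    total_params += p

length = maxpool1d(length, 8)       # 180 -> 22

for filters in (128, 128):          # last two convolution layers
    length, channels, p = conv1d_same(length, channels, filters, 5)
    total_params += p

flat = length * channels            # size after Flatten
total_params += flat * 4 + 4        # Dense layer with 4 outputs

print(length, flat, total_params)   # 22 2816 340868
```

Activation, Dropout, MaxPooling1D, and Flatten add no trainable parameters, so they do not appear in the total.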

Fitting Model

cnnhistory=model.fit(x_traincnn, y_train, batch_size=16, epochs=500, validation_data=(x_testcnn, y_test))
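
Since fit returns a History object, the per-epoch metrics can be inspected after training. A minimal sketch follows, using a made-up dictionary in place of cnnhistory.history. The key name val_accuracy is an assumption that holds for recent Keras versions when metrics=['accuracy'] is used (older versions used val_acc), and best_epoch is a hypothetical helper.

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch number, value) for the highest value of a metric."""
    values = history[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Hypothetical values shaped like cnnhistory.history
example_history = {
    "accuracy": [0.40, 0.55, 0.70, 0.82],
    "val_accuracy": [0.38, 0.52, 0.61, 0.58],
}

epoch, value = best_epoch(example_history)
print(epoch, value)  # 3 0.61
```

With the real training run, the call would be best_epoch(cnnhistory.history). A gap between accuracy and val_accuracy that keeps widening over the 500 epochs is a sign of overfitting.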

Saving Model

model_name = 'umar.h5'
save_dir = os.path.join(os.getcwd(), 'Trained_Models')
# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)

import json
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

Loading JSON and creating model

# loading json and creating model
from keras.models import model_from_json
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("./Trained_Models/umar.h5")
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
score = loaded_model.evaluate(x_testcnn, y_test, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))
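
Beyond the overall accuracy, the loaded model can also be used to label individual clips. The sketch below shows how softmax outputs map back to emotion codes. The probabilities are made up, and classes stands in for lb.classes_, which holds the sorted codes after fitting the LabelEncoder.

```python
import numpy as np

classes = np.array(["01", "02", "03", "04"])  # stand-in for lb.classes_

# Hypothetical softmax outputs, one row per clip, one column per class
probs = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# The predicted class is the column with the highest probability
predicted_codes = classes[np.argmax(probs, axis=1)]
print(predicted_codes.tolist())  # ['02', '01', '04']
```

In the real pipeline, probs would come from loaded_model.predict(x_testcnn).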

Accuracy

[Figure: Accuracy of Model]

If you need any help, please leave a comment. I’m happy to help.

Explaining Dataset

Your dataset should be organized as follows:

Create four folders, one for each of the 4 emotions (Sad, Happy, Neutral, and Angry), named like this:

Sad: Emotion_01

Happy: Emotion_02

as shown in the picture below

[Figure: Dataset Description]

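Then, in the Emotion_01 folder, label the emotion audio files as 01-01-00 for the first file, and so on. In the Emotion_02 folder, name the files 02-02-00, and so on, as shown in the picture below. The code reads the emotion code from the second hyphen-separated field of the file name.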

[Figure: Dataset Description 2]
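
Before training, it can help to confirm that the folders are laid out as described. The sketch below counts the .wav files in each Emotion_* folder using the same glob pattern as load_data. The function name count_clips_per_folder is hypothetical, and the root path assumes the Drive location used in the code.

```python
import glob
import os
from collections import Counter

def count_clips_per_folder(data_root):
    """Count the .wav files inside each Emotion_* folder under data_root."""
    pattern = os.path.join(data_root, "Emotion_*", "*.wav")
    counts = Counter()
    for path in glob.glob(pattern):
        folder = os.path.basename(os.path.dirname(path))
        counts[folder] += 1
    return dict(sorted(counts.items()))

# Assumption: the dataset root matches the path used in load_data
print(count_clips_per_folder("./drive/My Drive/data"))
```

An empty dictionary means that no files matched, which usually points to a wrong path or folder name. Very uneven counts indicate an imbalanced dataset.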

Download Dataset

Here are some useful links related to the code provided:

  1. librosa documentation: https://librosa.org/doc/latest/index.html
  2. soundfile documentation: https://pysoundfile.readthedocs.io/en/latest/
  3. numpy documentation: https://numpy.org/doc/stable/
  4. scikit-learn documentation: https://scikit-learn.org/stable/
  5. pyaudio documentation: https://people.csail.mit.edu/hubert/pyaudio/docs/
  6. How to install Python packages using pip: https://realpython.com/what-is-pip/
  7. How to use glob to find files in Python: https://www.geeksforgeeks.org/how-to-use-glob-to-find-files-recursively-in-python/
  8. One-hot encoding in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
  9. Label encoding in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
  10. Keras documentation: https://keras.io/api/
  11. The Sequential model in Keras (used here to build the CNN): https://keras.io/guides/sequential_model/
  12. RMSprop optimizer in Keras: https://keras.io/api/optimizers/rmsprop/