This project is an individual assignment for the "Speech and Audio Processing" course, offered in the 8th semester of the 2024 academic year at the University of Piraeus, Department of Informatics. The project focuses on developing a Python program that segments speech recordings into words by classifying portions of the audio as background or foreground sound with machine learning classifiers. The system must process the input audio and identify the word boundaries without any prior knowledge of the number of words.
- Institution: University of Piraeus
- Department: Department of Informatics
- Course: Speech and Audio Processing (2024)
- Semester: 8th
- Python
- Libraries:
  - TensorFlow: For building and training neural networks.
  - NumPy: For numerical computations.
  - librosa: For audio feature extraction.
  - pandas: For handling datasets.
  - inquirer: For interactive CLI options.
  - speechrecognition: For basic speech processing.
  - pydub: For audio playback and processing.
This project implements four different classifiers for segmenting and classifying the audio:
- Support Vector Machines (SVM)
- Multilayer Perceptron (MLP)
- Recurrent Neural Networks (RNN)
- Least Squares
Each classifier is trained to identify whether a portion of the audio corresponds to background noise or speech. The classifiers are trained and tested using features extracted from the audio dataset.
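As a rough illustration of the idea, the sketch below fits a least-squares classifier on a single per-frame feature. It is not the code in least_squares.py: the log-energy feature, the 25 ms / 10 ms framing at 16 kHz, the synthetic signals, and all function names are assumptions made for this example.

```python
import math

def frame_log_energy(samples, frame_len=400, hop=160):
    """Split a mono signal into overlapping frames and return log-energy per frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small constant avoids log(0)
    return feats

def fit_least_squares(x, y):
    """Fit w, b minimising sum((w*x_i + b - y_i)**2) for labels y_i in {-1, +1}."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

def predict(x, w, b):
    """Label a frame foreground (1) when the fitted line is positive, else background (0)."""
    return [1 if w * xi + b > 0 else 0 for xi in x]

# Synthetic stand-ins (assumption): a quiet signal as "background", a louder one as "foreground".
background = [0.01 * math.sin(0.3 * n) for n in range(4000)]
foreground = [0.5 * math.sin(0.3 * n) for n in range(4000)]

x_bg = frame_log_energy(background)
x_fg = frame_log_energy(foreground)
x_train = x_bg + x_fg
y_train = [-1] * len(x_bg) + [1] * len(x_fg)

w, b = fit_least_squares(x_train, y_train)
predictions = predict(x_train, w, b)
```

A real feature set would likely be multi-dimensional (for example, features computed with librosa), in which case the same least-squares fit is normally solved with a matrix routine rather than the one-dimensional closed form used here.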
The VOiCES dataset is used for training and testing. It contains various audio samples divided into:
- Background sound: Contains only noise without speech.
- Foreground sound: Contains speech, with or without background noise.
The dataset is organized into the following structure:
auxiliary2024/input
└── VOiCES_devkit
    ├── distant-16k
    ├── references
    └── source-16k
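To check that the dataset was extracted where the program expects it, the audio files can be enumerated with a few lines of Python. This is a minimal sketch, assuming that distant-16k and source-16k sit directly under VOiCES_devkit and that the recordings use the .wav extension; the function name list_wav_files is hypothetical and not part of the project.

```python
from pathlib import Path

def list_wav_files(root="auxiliary2024/input/VOiCES_devkit"):
    """Return sorted .wav paths found under each audio subfolder of the devkit."""
    root = Path(root)
    found = {}
    for sub in ("distant-16k", "source-16k"):
        folder = root / sub
        # A missing folder yields an empty list instead of raising an error.
        found[sub] = sorted(folder.rglob("*.wav")) if folder.is_dir() else []
    return found

if __name__ == "__main__":
    for name, files in list_wav_files().items():
        print(f"{name}: {len(files)} wav files")
```

An empty count for either folder usually means the archive was extracted one level too deep or too shallow.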
The repository is structured as follows:
/source2024/
    svm.py                              # SVM classifier implementation
    mlp.py                              # MLP classifier implementation
    rnn.py                              # RNN classifier implementation
    least_squares.py                    # Least Squares classifier implementation
    main.py                             # Main script for full process execution
/docs/
    Project-description.pdf             # Description of the project
    Project-documentation.pdf           # Detailed project documentation
/images/
    program_generations.png             # Generation messages
    program_solution_exists.png         # Solution found message
    program_solution_exists_image.png   # Graph visualization
    program_start.png                   # Program start message
/auxiliary2024/
    input/                              # Contains the VOiCES dataset
    output/                             # Output directory for classifier predictions
Additionally:
- main_menu.py: Located in the root directory, this script provides an interactive menu for selecting which part of the project to execute (e.g., dataset loading, feature extraction, model training, etc.).
Ensure you have Anaconda or Miniconda installed on your system.
- Clone the repository and navigate to the project directory:
    git clone https://github.com/thkox/speech-and-audio-processing
    cd speech-and-audio-processing
- Download the VOiCES dataset and extract it to the auxiliary2024/input/ directory.
- Install the required libraries:
    python setup.py
- Activate the conda environment:
    conda activate speech-and-audio-processing
Once the environment is activated, you can execute the program using either of the following scripts:
- main.py: Executes the entire process, including loading the dataset, extracting features, training the models, and making predictions.
- main_menu.py: Opens an interactive menu where you can choose specific tasks, such as loading the dataset, extracting features, training models, and transcribing audio files.
Upon successful execution, the following will be displayed:
- The progression of training and testing, along with the best classifier performances.
- The final solution showing the word boundaries detected by the classifiers.
- A graphical representation of the speech waveform with detected word intervals.
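The step from per-frame classifier output to word boundaries can be sketched as follows. This is an illustrative assumption about how such a conversion might work, not the project's documented method: the 10 ms hop, the minimum run length, and the function name frames_to_intervals are all hypothetical.

```python
def frames_to_intervals(labels, hop_s=0.010, min_frames=5):
    """Turn per-frame 0/1 labels into (start, end) intervals in seconds.

    Consecutive foreground frames (label 1) are merged into one interval;
    runs shorter than min_frames are discarded as likely noise.
    """
    intervals = []
    start = None
    # A trailing 0 closes a run that extends to the last frame.
    for i, label in enumerate(list(labels) + [0]):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            if i - start >= min_frames:
                intervals.append((start * hop_s, i * hop_s))
            start = None
    return intervals

# Toy labels (assumption): two long foreground runs and one isolated frame.
labels = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]
words = frames_to_intervals(labels)
print(f"Detected {len(words)} candidate word(s)")
```

Because the number of intervals falls out of the labels themselves, no prior knowledge of the word count is needed, which matches the requirement stated at the top of this document.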
For detailed explanations of the code, classifiers, and algorithms used in this project, refer to Project-documentation.pdf, located in the /docs directory. This document contains thorough explanations of all steps, including dataset preparation, feature extraction, model training, and evaluation.
This project is licensed under the MIT License - see the LICENSE file for details.