`train.py` is a Python script for training audio tagging models. It supports various data transformations and augmentations and uses a custom training loop encapsulated in a `Trainer` class. The script is highly configurable through command-line arguments, allowing flexible experimentation with different model architectures, data preprocessing techniques, and training parameters.
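Below is a minimal sketch of how such a training loop is commonly structured; the constructor arguments, method names, and loss choice are illustrative assumptions, not the exact interface of the `Trainer` class in this repository.

```python
# Illustrative sketch only -- the actual Trainer in train.py may differ.
import torch


class Trainer:
    def __init__(self, model, train_loader, val_loader, learning_rate, device="cuda"):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.device = device
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        # BCEWithLogitsLoss suits multi-label tagging (e.g. MTAT);
        # a single-label genre task (e.g. GTZAN) would use CrossEntropyLoss instead.
        self.criterion = torch.nn.BCEWithLogitsLoss()

    def train(self, epochs):
        for epoch in range(epochs):
            self.model.train()
            for inputs, targets in self.train_loader:
                inputs = inputs.to(self.device)
                targets = targets.to(self.device)
                self.optimizer.zero_grad()
                loss = self.criterion(self.model(inputs), targets)
                loss.backward()
                self.optimizer.step()
            print(f"epoch {epoch + 1}/{epochs}, last batch loss {loss.item():.4f}")
```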
Ensure that all required Python packages are installed:

```bash
pip install -r requirements.txt
```

Then start training by pointing the script at your data directory and choosing a model class:

```bash
python train.py --data_dir "path/to/data" --model_class_name "ModelClassName" --epochs 20 --learning_rate 0.001 --batch_size 32
```

The available command-line arguments are:
- `--data_dir`: Directory containing the audio data.
- `--train_annotations`: Path to the training annotations file.
- `--val_annotations`: Path to the validation annotations file.
- `--test_annotations`: Path to the test annotations file.
- `--sample_rate`: Sample rate for audio processing.
- `--target_length`: Target length of audio samples in seconds.
- `--batch_size`: Batch size for training.
- `--num_workers`: Number of workers for data loading.
- `--apply_augmentations`: Apply pitch-shift and time-stretch augmentations.
- `--model_class_name`: Class name of the model to be used.
- `--learning_rate`: Learning rate for training.
- `--epochs`: Number of training epochs.
- `--model_path`: Directory to save the trained model.
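For reference, here is a hedged sketch of how these flags could be declared with `argparse`; the types and default values are assumptions for illustration, so check `train.py` for the authoritative definitions.

```python
# Illustrative argument parser -- types and defaults are assumptions,
# not the script's actual configuration.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train an audio tagging model.")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="Directory containing the audio data")
    parser.add_argument("--train_annotations", type=str,
                        help="Path to the training annotations file")
    parser.add_argument("--val_annotations", type=str,
                        help="Path to the validation annotations file")
    parser.add_argument("--test_annotations", type=str,
                        help="Path to the test annotations file")
    parser.add_argument("--sample_rate", type=int, default=16000,
                        help="Sample rate for audio processing")
    parser.add_argument("--target_length", type=float, default=30.0,
                        help="Target length of audio samples in seconds")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("--apply_augmentations", action="store_true",
                        help="Apply pitch-shift and time-stretch augmentations")
    parser.add_argument("--model_class_name", type=str, required=True,
                        help="Class name of the model to be used")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--model_path", type=str, default="models/",
                        help="Directory to save the trained model")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```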
- Log in to the HPC.
- Git setup on HPC:
  - Run `cat .ssh/id_rsa.pub` from inside your home directory and copy the output (your public key) to the clipboard.
  - Go to your GitHub settings and create a new SSH key from the copied public key.
  - Clone our repo: `git clone [email protected]:syeon0928/Tagging-Music-Sequences.git`
  - Update the repo as usual (see below under Git commands).
- Run `sbatch setup_conda_env` to set up the environment (install packages etc.).
- Run `sbatch run_jupyter_notebook.sh` to start the Jupyter notebook.
- Run `cat jupyter-notebook-{your job number}.log` to show the output of the running script.
- Copy the SSH command from the log file and run it in another terminal, e.g. `ssh -N -L 8248:desktop2:8248 [[email protected]]`
- Open the URL from the log file (the last link).
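Once the notebook is open, it can be worth checking that the job actually sees a GPU before starting a long training run. The snippet below assumes PyTorch is among the packages installed by the conda setup script; adapt it to whichever framework the environment provides.

```python
# Quick sanity check from inside the Jupyter notebook
# (assumes PyTorch is installed by the conda environment setup).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```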
Music plays an important role in our lives, and the landscape of contemporary music is vast. In order to understand music taste and build recommender systems for music, we need to learn to tag music first. In this project, we want to build a classifier that can tag music pieces with a genre or category after listening to an arbitrarily long example. For this, we want to consider the following datasets:
- GTZAN
- The MagnaTagATune Dataset (MTAT), and
- for advanced studies: the Free Music Archive (FMA).
- Research literature about sound and music pre-processing, transformation, and representation. What type of pre-processing is best for music pieces, i.e. what is the state-of-the-art of spectrograms vs. raw waveform? (A minimal spectrogram sketch follows this list.)
- Train an encoding model (deep recurrent and/or CNN network) with appropriate representation to classify sequences of music pieces. Your options are vast, as you can consider all the tools that we covered in class: GRUs? CNNs? Variational Encoders? Combinations thereof? Make use of recent examples from the literature! Can you identify an architecture (and meta-parameter settings) that can be trained to tag/classify reasonably well?
- Study the performance for edge cases, such as particularly short input sequences or music pieces for rare genres/categories. Can you identify characteristics of such edge cases that make performance particularly high or low?
- Identify differences in quantitative performance and qualitative characteristics (look into how your model decides in edge cases) between different pre-processing options.
- Build your music tagger by training only on one of the datasets and comparing generalisation on the other. Given that you took good care of appropriate representation and pre-processing for both, can you explain the performance differences?
- Look into pre-trained options (e.g. from paperswithcode.com) and finetune your extended models. How is performance (quantitative and qualitative) different?
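As a starting point for the pre-processing question above, here is a minimal sketch of turning a waveform into a log-mel spectrogram with torchaudio; the parameter values are common defaults rather than a recommendation tuned to these datasets. Comparing a model fed this representation against one fed the raw waveform directly is exactly the spectrogram-vs-waveform comparison asked for in the first task.

```python
# Minimal log-mel spectrogram sketch (torchaudio); parameter values are
# common defaults and only meant as a starting point.
import torchaudio


def waveform_to_logmel(path, sample_rate=16000, n_mels=128):
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)  # (1, n_mels, frames)
```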