This repository is the result of a research project in the Bachelor's Degree in Data Science and Engineering at the Universitat Politècnica de Catalunya (UPC).
It is an end-to-end approach to mono-to-binaural conversion that takes 2.5D Visual Sound as its baseline and builds on the Conv-TasNet architecture.
More information can be found in paper_mono2binaural_tasnet.pdf.
(The code has been tested under the following system environment: Ubuntu 18.04.5 LTS, CUDA 11.1, Python 3.6.9, PyTorch 1.6.0)
Download the FAIR-Play dataset.
Generate the frames from the mp4 videos with the script
. -
Set relative path to the splits with the script
. -
[OPTIONAL] Preprocess the audio files using
to accelerate the training process. -
Use the following command to train a model:
python3 --hdf5FolderPath /YOUR_CODE_PATH/2.5d_visual_sound/hdf5/ --name mono2binaural --model MODEL_NAME --checkpoints_dir /YOUR_CHECKPOINT_PATH/ --save_epoch_freq 50 --display_freq 10 --save_latest_freq 100 --batchSize 32 --learning_rate_decrease_itr 10 --niter 1000 --lr_visual 0.0001 --lr_audio 0.001 --nThreads 32 --gpu_ids 0,1,2,3,4,5,6,7 --validation_on --validation_freq 100 --validation_batches 50 --tensorboard True --use_visual_info |& tee -a training.log
The model parameter (MODEL_NAME in the command above) refers to either tasnet or audioVisual.
If the training batch does not fit into GPU memory, use the stepBatchSize parameter.
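The idea behind a smaller step batch can be illustrated with a toy example. The sketch below is not taken from this repository: it assumes that stepBatchSize splits each batch into smaller chunks whose gradients are accumulated before a single optimizer update, which is the usual way to trade memory for time. The functions `mse_gradient` and `accumulated_gradient` and the scalar model `y = w * x` are hypothetical and only show that the accumulated gradient matches the full-batch one.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, step_batch_size):
    """Process the batch in chunks of step_batch_size and accumulate gradients.

    Assumption: this mirrors what a stepBatchSize option typically does.
    Each chunk's gradient is weighted by its share of the full batch, so the
    accumulated value equals the gradient of the whole batch computed at once.
    """
    n = len(xs)
    total = 0.0
    for start in range(0, n, step_batch_size):
        chunk_x = xs[start:start + step_batch_size]
        chunk_y = ys[start:start + step_batch_size]
        total += mse_gradient(w, chunk_x, chunk_y) * len(chunk_x) / n
    return total


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 1.5, -0.25, 3.0, 0.75, -2.0]
    ys = [1.0, -2.0, 4.0, 3.0, -0.5, 6.0, 1.5, -4.0]
    full = mse_gradient(0.3, xs, ys)
    chunked = accumulated_gradient(0.3, xs, ys, step_batch_size=3)
    print("full batch:", full, "accumulated:", chunked)
```

Only one chunk has to be held in memory at a time, so a batch of 32 can be processed in smaller steps while the parameter update stays, up to floating-point rounding, the same.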
- Use the following command to test your trained mono2binaural model:
python3 --input_audio_path /BINAURAL_AUDIO_PATH --video_frame_path /VIDEO_FRAME_PATH --weights_visual /VISUAL_MODEL_PATH --weights_audio /AUDIO_MODEL_PATH --output_dir_root /YOUR_OUTPUT_DIR/ --input_audio_length 10 --hop_size 0.05 --model MODEL_NAME --use_visual_info
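The --hop_size option suggests that the clip is processed in overlapping windows. The following sketch shows one common way to do that, under stated assumptions: the hop is given in seconds, windows slide over the waveform, and overlapping predictions are averaged. The helper names `window_starts` and `overlap_average`, the sample rate, and the window length are hypothetical and are not taken from the test script.

```python
def window_starts(total_length, window_length, hop):
    """Start offsets (in samples) of sliding windows over a signal.

    The last window is aligned to the end of the signal so that no trailing
    samples are left uncovered.
    """
    if window_length > total_length:
        raise ValueError("window is longer than the signal")
    last = total_length - window_length
    starts = list(range(0, last + 1, hop))
    if starts[-1] != last:
        starts.append(last)
    return starts


def overlap_average(signal, window_length, hop, process):
    """Apply process() to every window and average overlapping outputs.

    process() stands in for the trained model; it must return a sequence of
    the same length as the window it receives.
    """
    total = len(signal)
    summed = [0.0] * total
    counts = [0] * total
    for start in window_starts(total, window_length, hop):
        output = process(signal[start:start + window_length])
        for i, value in enumerate(output):
            summed[start + i] += value
            counts[start + i] += 1
    return [s / c for s, c in zip(summed, counts)]


if __name__ == "__main__":
    sample_rate = 16000                  # assumed sample rate, for illustration
    hop = round(0.05 * sample_rate)      # --hop_size 0.05, read as seconds
    window = 4 * hop                     # hypothetical window length
    clip = [0.0] * (10 * sample_rate)    # --input_audio_length 10
    print(len(window_starts(len(clip), window, hop)), "windows")
```

A smaller hop gives more overlap and smoother output at the cost of more forward passes.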
- Use the following command for evaluation:
python --results_root /YOUR_RESULTS --normalization True
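As a rough illustration of what the --normalization flag may do, the sketch below peak-normalizes a predicted and a reference waveform before comparing them. This is an assumption about the flag's meaning, not a description of the evaluation script. The mean squared distance is chosen only to keep the example short and is not claimed to be the metric the script reports. Both function names are hypothetical.

```python
def peak_normalize(samples, eps=1e-20):
    """Scale a waveform so that its largest absolute sample is 1.

    eps guards against division by zero for an all-silent signal.
    """
    peak = max(max(abs(s) for s in samples), eps)
    return [s / peak for s in samples]


def mean_squared_distance(predicted, reference):
    """Mean squared difference between two waveforms of equal length."""
    if len(predicted) != len(reference):
        raise ValueError("waveforms must have the same length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


if __name__ == "__main__":
    reference = [0.5, -2.0, 1.0, 0.0]
    predicted = [0.25, -1.0, 0.5, 0.0]   # same shape, half the gain
    raw = mean_squared_distance(predicted, reference)
    normalized = mean_squared_distance(
        peak_normalize(predicted), peak_normalize(reference)
    )
    print("raw:", raw, "normalized:", normalized)
```

With normalization, a prediction that differs from the reference only by overall gain is not penalized, so the score reflects the shape of the waveform rather than its level.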
This code is mainly based on 2.5D Visual Sound.
The Conv-TasNet implementation is based on Demucs.
The code is CC BY 4.0 licensed, as found in the LICENSE file.