This is a PyTorch implementation of a self-attentive speaker embedding.
VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
|   | dev | test |
|---|---|---|
| # of speakers | 1,211 | 40 |
| # of utterances | 148,642 | 4,874 |
Download the following files into the `data` folder:
- vox1_dev_wav_partaa
- vox1_dev_wav_partab
- vox1_dev_wav_partac
- vox1_dev_wav_partad
- vox1_test_wav.zip
Then concatenate the files using the command:
$ cat vox1_dev* > vox1_dev_wav.zip
Dependencies:
- Python 3.5.2
- PyTorch 1.0.0
Extract vox1_dev_wav.zip & vox1_test_wav.zip:
$ python extract.py
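A minimal sketch of what this extraction step might look like, assuming `extract.py` simply unpacks both archives into the `data` folder (the actual script may do more, e.g. clean up the archives afterwards):

```python
import os
import zipfile

DATA_DIR = 'data'  # assumed location of the downloaded archives

def extract(zip_name):
    """Unpack one archive into the data folder."""
    zip_path = os.path.join(DATA_DIR, zip_name)
    print('Extracting {}...'.format(zip_path))
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(DATA_DIR)

if __name__ == '__main__':
    for name in ['vox1_dev_wav.zip', 'vox1_test_wav.zip']:
        extract(name)
```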
Split the dev set into train and valid samples:
$ python pre_process.py
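A hedged sketch of such a split, assuming the extracted WAV files live under `data/vox1_dev_wav/<speaker_id>/...` and that a small fraction of each speaker's utterances is held out for validation (the real `pre_process.py` may use different paths and ratios, and may also compute features):

```python
import os
import random

DEV_DIR = 'data/vox1_dev_wav'  # assumed layout: <speaker_id>/<video_id>/<clip>.wav
VALID_RATIO = 0.05             # hypothetical held-out fraction per speaker

def split_dev():
    train, valid = [], []
    for speaker in sorted(os.listdir(DEV_DIR)):
        # Collect every utterance belonging to this speaker.
        wavs = []
        for root, _, files in os.walk(os.path.join(DEV_DIR, speaker)):
            wavs += [os.path.join(root, f) for f in files if f.endswith('.wav')]
        random.shuffle(wavs)
        n_valid = max(1, int(len(wavs) * VALID_RATIO))
        valid += [(w, speaker) for w in wavs[:n_valid]]
        train += [(w, speaker) for w in wavs[n_valid:]]
    return train, valid

if __name__ == '__main__':
    train, valid = split_dev()
    print('train: {}, valid: {}'.format(len(train), len(valid)))
```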
Train the model:
$ python train.py
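The model's core idea is self-attentive pooling: frame-level features are weighted by learned attention scores and summed into a fixed-length speaker embedding. Below is a minimal sketch of such a pooling layer, with hidden sizes chosen arbitrarily; the actual architecture in `train.py` may differ (e.g. multi-head attention or a different encoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Collapse a sequence of frame-level features into one utterance-level embedding."""
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, attn_dim)
        self.w2 = nn.Linear(attn_dim, 1)

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level features from the encoder
        scores = self.w2(torch.tanh(self.w1(x)))   # (batch, time, 1)
        alphas = F.softmax(scores, dim=1)          # attention weights over time
        return torch.sum(alphas * x, dim=1)        # (batch, feat_dim) embedding

# Example: pool a batch of 4 utterances, 200 frames each, 512-dim features
pooling = SelfAttentivePooling(feat_dim=512)
frames = torch.randn(4, 200, 512)
embedding = pooling(frames)  # shape: (4, 512)
```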
To monitor training progress with TensorBoard, run in your terminal:
$ tensorboard --logdir runs
Model | Margin-s | Margin-m | Test accuracy (%) | Inference time |
---|---|---|---|---|
1 | 10.0 | 0.2 | 88.48 | 18.18 ms |
Visualize speaker embeddings from the test set:
$ python visualize.py
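A hedged sketch of one way to do this, assuming the embeddings and speaker labels have already been collected into arrays and that scikit-learn and matplotlib are available (the actual `visualize.py` may plot them differently); the random arrays below are stand-ins for real data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one embedding per test utterance plus its speaker label.
embeddings = np.random.randn(200, 512)       # stand-in for real embeddings
labels = np.random.randint(0, 40, size=200)  # 40 test speakers

# Project the high-dimensional embeddings to 2-D for plotting.
points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='tab20', s=10)
plt.title('Speaker embeddings (t-SNE)')
plt.savefig('embeddings_tsne.png')
```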