
What's the capacity of this network? #9

Open
nkcdy opened this issue Aug 16, 2019 · 14 comments

Comments


nkcdy commented Aug 16, 2019

What's the maximum number of speakers during training? The paper uses 17 speakers. What will happen if the number of speakers is larger than 17?


bshall commented Aug 16, 2019

Hi @nkcdy, the pretrained model I uploaded was trained on 102 speakers and it works really well. The model seems to learn a very general way to invert mel spectrograms. If you check out the samples here, you'll see that it generalizes well to out-of-domain speakers, and I would expect it to improve with training on more speakers.


nkcdy commented Aug 17, 2019

@bshall In fact, I trained a network on 400 Mandarin speakers but it failed: the quality was still very poor after 350k steps with the default hparams. I then retrained the network on another Mandarin corpus with 20 speakers in total. The loss stays at 2.7 after 120k steps. The results are good for audio from within the corpus, but for outside audio (such as my own voice) the quality is poor. Now I've dropped the scheduler and am retraining the network to see what happens.
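For anyone trying the same experiment, a minimal sketch of training with the learning rate schedule dropped in favour of a constant rate; the optimizer, learning rate, milestones, and the toy model/data below are assumptions for illustration, not the repo's actual defaults:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data pipeline.
model = nn.Linear(80, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

# Step-wise decay in the spirit of a MultiStepLR schedule
# (milestones and gamma are made up for illustration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200_000, 300_000], gamma=0.5)

use_scheduler = False  # False == constant learning rate, i.e. schedule dropped

for step in range(1000):
    mels = torch.randn(8, 80)      # fake batch of conditioning features
    target = torch.randn(8, 256)   # fake regression target
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(mels), target)
    loss.backward()
    optimizer.step()
    if use_scheduler:
        scheduler.step()
```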


nkcdy commented Aug 18, 2019

@bshall By the way, what's the final loss of your pretrained model?


bshall commented Aug 18, 2019

@nkcdy, that's very interesting, thanks for sharing your findings. I did tune the hyperparameters on the ZeroSpeech dataset, so it's a good idea to try a different schedule. I'm very interested to know if you get it working. I might try to train on LibriSpeech this week as well, and if I get anything interesting I'll let you know.

I can't remember the exact loss right now; I'll check tomorrow morning. I did notice that the final loss is very dataset-specific, so it may not be all that helpful.


nkcdy commented Aug 19, 2019

@bshall As expected, the loss never goes below 2.66, even after 350k steps, and the quality of the generated audio isn't good enough either: some clips sound good, but others have a lot of glitches in them.


bshall commented Aug 19, 2019

@nkcdy, that's unfortunate. Is the Mandarin dataset you are using open? Maybe I could take a look to see if I can find any issues.


nkcdy commented Aug 19, 2019

@bshall Yes, it is free: http://www.openslr.org/resources/18/data_thchs30.tgz. There are 40 speakers in total, 31 female voices and 9 male voices. I've added some initialization to the GRU cell and will try again...
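For reference, one common way to add explicit initialization to a GRU in PyTorch; the comment above doesn't say which scheme was actually tried, so the recipe and layer sizes below are illustrative assumptions:

```python
import torch.nn as nn

def init_gru(gru: nn.GRU) -> None:
    # Orthogonal init for the recurrent weights, Xavier for the input weights,
    # zeros for the biases -- a common recipe, not necessarily the one used here.
    for name, param in gru.named_parameters():
        if "weight_hh" in name:
            nn.init.orthogonal_(param)
        elif "weight_ih" in name:
            nn.init.xavier_uniform_(param)
        elif "bias" in name:
            nn.init.zeros_(param)

rnn = nn.GRU(input_size=80, hidden_size=896, batch_first=True)  # sizes are placeholders
init_gru(rnn)
```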


nkcdy commented Aug 20, 2019

@bshall Why is only a small slice of frames picked as the mel spectrogram condition? What will happen if a silent segment is selected?


bshall commented Aug 20, 2019

@nkcdy, I've downloaded the Mandarin dataset and will play around with it today and let you know.

I chose a slice of the output of the conditioning network to speed up training. If a silent section is selected, then the network should learn to generate silence for that section of the spectrogram. I think it's most likely that the preprocessing step isn't tuned well for your dataset, but I'll investigate that now.


nkcdy commented Aug 20, 2019

@bshall I found that the loss may not be a big deal for this network. I retrained it on the ZeroSpeech corpus: the loss decreases quickly to 3.0 and then gets stuck there, even after 160k steps, and the quality of the test audio (my own voice) isn't good either. Now I'm wondering whether the English test wavs are really as good as they seem, because the more familiar you are with a language, the more attention you pay to the details. I'm not familiar with English, so I'm not sure the details are as good as expected.


bshall commented Aug 21, 2019

@nkcdy, I've trained a model on the Mandarin dataset for 60k steps so far. The audio isn't bad, but I have noticed some glitches too. It seems like the glitches are happening on plosive and fricative sounds. After listening to a few of the original samples from the dataset, you can hear a lot of plosive pops (the sound you hear on P, T, and B when the microphone is too close), so the model may have a hard time synthesizing them. Something that might help is increasing audio_slice_frames, but it'll make training slower (see the sketch below the attachment).

As far as generalization goes, I found an utterance from the test set without any plosive pops and I think it sounds okay (see attachment)
I'll do some more testing when the model is fully trained.
gen_D4_751_model_steps_60000.wav.zip.
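A rough sketch of the audio_slice_frames change mentioned above; only audio_slice_frames is named in this thread, so hop_length, the values, and the dict layout are assumptions rather than the repo's actual hparams:

```python
# Trade-off when raising audio_slice_frames: more context per example,
# but more computation per step and slower training.
hparams = dict(
    hop_length=200,          # audio samples per mel frame (assumed value)
    audio_slice_frames=40,   # try e.g. 64 for more context per training example
)

samples_per_example = hparams["audio_slice_frames"] * hparams["hop_length"]
print(samples_per_example)  # 8000 at the values above
```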


nkcdy commented Aug 22, 2019

@bshall Here are my results.
UniversalVocoding.zip
There are five test wavs picked from the training corpus. The files "cdy.wav" and "cdy_long.wav" are my original voice, and the corresponding generated wavs are also included. What concerns me is the generalization capability of this structure: I don't think the generated version of my own voice is as good as the voices the network saw during training. Maybe the number of speakers is the main reason, as the paper says in section 4.2.1.


bshall commented Aug 22, 2019

@nkcdy, there definitely is a difference in quality when you move to out-of-domain speakers, even on the ZeroSpeech corpus. I do think your model may be overfitting a little; I get slightly better results on your voice after 100k steps.
samples.zip

One thing I tried that does improve the quality a little is to multiply the logits on this line by 1.5 (those are the clips with "_x1.5" at the end). This sharpens the distribution and dampens the noise a bit. If you use too large a factor, though, it doesn't work and may cause problems on silences.
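As a rough illustration of that trick (the function below is a sketch, not the repo's actual generation code; only the 1.5 factor comes from the comment above):

```python
import torch
import torch.nn.functional as F

def sample_sharpened(logits, factor=1.5):
    # Scaling the logits by factor > 1 is equivalent to sampling at a lower
    # temperature (1 / factor): the distribution gets sharper, which dampens
    # background noise, but too large a factor can misbehave on silences.
    probs = F.softmax(logits * factor, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: logits over 256 mu-law classes for a single time step.
logits = torch.randn(1, 256)
sample = sample_sharpened(logits, factor=1.5)
```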

There are also a number of other things that are worth investigating:

  1. The Mandarin dataset has fewer male than female speakers; does this affect generalization to male vs. female voices?
  2. All the audio in the dataset is presumably recorded on the same microphone. Does this affect generalization when the audio is recorded on a different setup?
  3. The paper uses 24kHz audio and 10-bit predictions. I would guess that matching this would improve the quality as well (a rough config sketch follows this list).
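A small sketch of what matching the paper's setup in point 3 could look like; the variable names here are assumptions, not the repo's actual config fields:

```python
# Paper setup: 24 kHz audio with 10-bit mu-law quantization.
sample_rate = 24_000
bits = 10
num_classes = 2 ** bits  # 1024 output classes for the sample predictions
```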

bshall closed this as completed Aug 22, 2019
bshall reopened this Aug 22, 2019

nkcdy commented Aug 23, 2019

@bshall I tested another out-of-domain female voice, and it is still not good.
OutOfDomainFemaleVoice.zip

So, it does overfit the training corpus. I plan to add more speakers from different scenarios to the corpus, keep training, and will let you know if I find something.
