What's the capacity of this network? #9
Hi @nkcdy, the pretrained model I uploaded was trained on 102 speakers and it seems to work really well. The model appears to learn a very general way to invert the mel spectrograms. If you check out the samples here you'll see that the model generalizes well to out-of-domain speakers, and I would expect it to improve by training on more speakers.
@bshall In fact, I trained a network with 400 Mandarin speakers but it failed; the quality was very poor after 350k steps with the default hparams. Then I retrained the network on another Mandarin corpus with 20 speakers in total. The loss stays at 2.7 after 120k steps. It sounds good for audio within the corpus, but for outside audio (such as my own voice) the quality is poor. Now I've dropped the scheduler and am retraining the network to see what happens.
@bshall By the way, what's the final loss of your pretrained model?
@nkcdy, that's very interesting. Thanks for sharing your findings. I did tune the hyperparameters on the ZeroSpeech dataset, so it's a good idea to try a different schedule. I'm very interested to know if you get it working. I might try to train on LibriSpeech this week as well; if I get anything interesting I'll let you know. I can't remember the exact loss right now, I'll check tomorrow morning. I did notice that the final loss is very dataset specific, so it may not be all that helpful.
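For anyone following along, a rough sketch of what trying a different learning-rate schedule could look like in PyTorch. The optimizer, milestones, gamma, and layer sizes below are illustrative guesses, not the repo's actual hyperparameters.

```python
import torch

# Placeholder model and optimizer; the repo's actual setup will differ.
model = torch.nn.GRU(input_size=80, hidden_size=896)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

# One option when the loss plateaus early: decay later and more gently than
# the default schedule, then judge by validation audio rather than loss alone.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300_000, 400_000], gamma=0.5)

for step in range(500_000):
    # ... forward pass, loss.backward(), optimizer.step() go here ...
    scheduler.step()
```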
@bshall As expected, the loss never goes below 2.66, even after 350k steps, and the quality of the generated audio isn't good enough either. Some samples sound good, but others have a lot of glitches in them.
@nkcdy, that's unfortunate. Is the Mandarin dataset you are using open? Maybe I could take a look to see if I can find any issues.
@bshall Yes, it is free: http://www.openslr.org/resources/18/data_thchs30.tgz. There are 40 speakers in total, including 31 female voices and 9 male voices. I've added some initialization to the GRU cell and will try again...
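For reference, one common way to add initialization to a GRU in PyTorch is sketched below; this is only an assumption about what "some initialization" might look like here, with placeholder layer sizes.

```python
import torch.nn as nn

# Placeholder GRU; sizes are illustrative, not the repo's actual configuration.
gru = nn.GRU(input_size=80, hidden_size=896, batch_first=True)

# One common recipe: orthogonal recurrent weights, Xavier input weights,
# zero biases. Whether this helps convergence on this dataset is untested.
for name, param in gru.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(param)
    elif "bias" in name:
        nn.init.zeros_(param)
```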
@bshall Why is only a small slice of frames picked as the mel spectrogram condition? What will happen if a silent segment is selected?
@nkcdy, I've downloaded the Mandarin dataset and will play around with it today and let you know. I chose a slice of the output of the conditioning network to speed up training. If a silent section is selected then the network should learn to generate silence for that section of the spectrogram. I think it's most likely that the preprocessing step isn't tuned well for your dataset, but I'll investigate that now.
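To make the slicing idea concrete, here is a minimal sketch of sampling a short conditioning window per training example. The frame count and hop length are placeholders, not the repo's actual preprocessing values.

```python
import numpy as np

def sample_training_slice(mel, audio, sample_frames=24, hop_length=200):
    """Pick a random window of mel frames plus the matching audio samples.

    mel:   (num_frames, num_mels) spectrogram
    audio: waveform aligned with mel at hop_length samples per frame
    The frame count and hop length here are placeholders.
    """
    max_start = mel.shape[0] - sample_frames
    start = np.random.randint(0, max_start)
    mel_slice = mel[start:start + sample_frames]
    audio_slice = audio[start * hop_length:(start + sample_frames) * hop_length]
    return mel_slice, audio_slice
```

If the random window lands on silence, the target waveform for that window is also silence, which is why the network should simply learn to generate silence there.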
@bshall I found that the loss maybe isn't a big deal for this network. I retrained it on the ZeroSpeech corpus; the loss decreases quickly to 3.0 and gets stuck there even after 160k steps, and the quality of the test audio (my own voice) isn't good either. Now I'm wondering whether the English test wavs are really good enough, because the more familiar you are with a language, the more attention you pay to the details. I am not familiar with English, so I'm not sure the details are as good as expected.
@nkcdy, I've trained a model on the Mandarin dataset for 60k steps so far. The audio isn't bad, but I have noticed some glitches too. It seems like the glitches are happening on plosive and fricative sounds. After listening to a few of the original samples from the dataset you can hear a lot of plosive pops (that sound you hear on P, T and B when the microphone is too close), and the model may have a hard time synthesizing them. Something that might help is increasing [...]. As far as generalization goes, I found an utterance from the test set without any plosive pops and I think it sounds okay (see attachment).
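Not a suggestion made in the thread itself, but as an illustration of one way to check whether low-frequency plosive pops are contributing: high-pass filter the audio before preprocessing and listen again. The filename and 60 Hz cutoff below are guesses, not values from the repo.

```python
import librosa
from scipy.signal import butter, sosfiltfilt

def highpass(wav, sr, cutoff_hz=60, order=4):
    """Remove very low-frequency energy (e.g. microphone pops) before feature
    extraction. The cutoff is a guess; listen to the result before using it."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wav)

wav, sr = librosa.load("example.wav", sr=16000)
clean = highpass(wav, sr)
```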
@bshall Here are my results.
@nkcdy, there definitely is a difference in quality when you move to out-of-domain speakers, even on the ZeroSpeech corpus. I do think your model may be overfitting a little bit; I get slightly better results on your voice after 100k steps. One thing I tried that does improve the quality a little is to multiply the [...]. There are also a number of other things that are worth investigating: [...]
@bshall I tested another out-of-domain female voice and it is still not good, so it does overfit the training corpus. I plan to add more speakers from different recording scenarios into the corpus and keep training, and I will let you know if I find something.
What's the maximum number of speakers during training? The paper uses 17 speakers; what will happen if the number of speakers is larger than 17?