
What's the capacity of this network? #9

Open
nkcdy opened this issue Aug 16, 2019 · 14 comments

Comments


nkcdy commented Aug 16, 2019

What's the maximum number of speakers during training? The paper uses 17 speakers. What will happen if the number of speakers is larger than 17?


bshall commented Aug 16, 2019

Hi @nkcdy, the pretrained model I uploaded was trained on 102 speakers and it works really well. The model seems to learn a very general way to invert mel spectrograms. If you check out the samples here, you'll see that it generalizes well to out-of-domain speakers, and I would expect it to improve with training on more speakers.


nkcdy commented Aug 17, 2019

@bshall In fact, I trained a network on 400 Mandarin speakers but it failed: the quality was still very poor after 350k steps with the default hparams. I then retrained the network on another Mandarin corpus with 20 speakers in total. The loss stays at 2.7 after 120k steps. The results are good for audio from within the corpus, but for outside audio (such as my own voice) the quality is poor. Now I've dropped the scheduler and am retraining the network to see what happens.
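For anyone trying the same experiment, a minimal sketch of training with the learning rate schedule dropped in favour of a constant rate; the optimizer, learning rate, milestones, and the toy model/data below are assumptions for illustration, not the repo's actual defaults:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data pipeline.
model = nn.Linear(80, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

# Step-wise decay in the spirit of a MultiStepLR schedule
# (milestones and gamma are made up for illustration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200_000, 300_000], gamma=0.5)

use_scheduler = False  # False == constant learning rate, i.e. schedule dropped

for step in range(1000):
    mels = torch.randn(8, 80)      # fake batch of conditioning features
    target = torch.randn(8, 256)   # fake regression target
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(mels), target)
    loss.backward()
    optimizer.step()
    if use_scheduler:
        scheduler.step()
```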


nkcdy commented Aug 18, 2019

@bshall By the way, what's the final loss of your pretrained model?


bshall commented Aug 18, 2019

@nkcdy, that's very interesting, thanks for sharing your findings. I did tune the hyperparameters on the ZeroSpeech dataset, so it's a good idea to try a different schedule. I'm very interested to know if you get it working. I might try to train on LibriSpeech this week as well, and if I get anything interesting I'll let you know.

I can't remember the exact loss right now; I'll check tomorrow morning. I did notice that the final loss is very dataset-specific, so it may not be all that helpful.


nkcdy commented Aug 19, 2019

@bshall As expected, the loss never goes below 2.66, even after 350k steps, and the quality of the generated audio isn't good enough either: some clips sound good, but others have a lot of glitches in them.


bshall commented Aug 19, 2019

@nkcdy, that's unfortunate. Is the Mandarin dataset you are using open? Maybe I could take a look to see if I can find any issues.


nkcdy commented Aug 19, 2019

@bshall Yes, it is free: http://www.openslr.org/resources/18/data_thchs30.tgz. There are 40 speakers in total, 31 female voices and 9 male voices. I've added some initialization to the GRU cell and will try again...
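For reference, one common way to add explicit initialization to a GRU in PyTorch; the comment above doesn't say which scheme was actually tried, so the recipe and layer sizes below are illustrative assumptions:

```python
import torch.nn as nn

def init_gru(gru: nn.GRU) -> None:
    # Orthogonal init for the recurrent weights, Xavier for the input weights,
    # zeros for the biases -- a common recipe, not necessarily the one used here.
    for name, param in gru.named_parameters():
        if "weight_hh" in name:
            nn.init.orthogonal_(param)
        elif "weight_ih" in name:
            nn.init.xavier_uniform_(param)
        elif "bias" in name:
            nn.init.zeros_(param)

rnn = nn.GRU(input_size=80, hidden_size=896, batch_first=True)  # sizes are placeholders
init_gru(rnn)
```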


nkcdy commented Aug 20, 2019

@bshall Why is only a small slice of frames picked as the mel spectrogram condition? What will happen if a silent segment is selected?


bshall commented Aug 20, 2019

@nkcdy, I've downloaded the Mandarin dataset and will play around with it today and let you know.

I chose a slice of the output of the conditioning network to speed up training. If a silent section is selected, then the network should learn to generate silence for that section of the spectrogram. I think it's most likely that the preprocessing step isn't tuned well for your dataset, but I'll investigate that now.


nkcdy commented Aug 20, 2019

@bshall I found that the loss may not be a big deal for this network. I retrained it on the ZeroSpeech corpus: the loss decreases quickly to 3.0 and then gets stuck there, even after 160k steps, and the quality of the test audio (my own voice) isn't good either. Now I'm wondering whether the English test wavs are really as good as they seem, because the more familiar you are with a language, the more attention you pay to the details. I'm not familiar with English, so I'm not sure the details are as good as expected.


bshall commented Aug 21, 2019

@nkcdy, I've trained a model on the Mandarin dataset for 60k steps so far. The audio isn't bad, but I have noticed some glitches too. It seems like the glitches are happening on plosive and fricative sounds. After listening to a few of the original samples from the dataset, you can hear a lot of plosive pops (the sound you hear on P, T, and B when the microphone is too close), so the model may have a hard time synthesizing them. Something that might help is increasing audio_slice_frames, but it'll make training slower (see the sketch below the attachment).

As far as generalization goes, I found an utterance from the test set without any plosive pops and I think it sounds okay (see attachment)
I'll do some more testing when the model is fully trained.
gen_D4_751_model_steps_60000.wav.zip.
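A rough sketch of the audio_slice_frames change mentioned above; only audio_slice_frames is named in this thread, so hop_length, the values, and the dict layout are assumptions rather than the repo's actual hparams:

```python
# Trade-off when raising audio_slice_frames: more context per example,
# but more computation per step and slower training.
hparams = dict(
    hop_length=200,          # audio samples per mel frame (assumed value)
    audio_slice_frames=40,   # try e.g. 64 for more context per training example
)

samples_per_example = hparams["audio_slice_frames"] * hparams["hop_length"]
print(samples_per_example)  # 8000 at the values above
```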


nkcdy commented Aug 22, 2019

@bshall Here are my results.
UniversalVocoding.zip
There are five test wavs picked from the training corpus. The files "cdy.wav" and "cdy_long.wav" are my original voice, and the corresponding generated wavs are also included. What concerns me is the generalization capability of this structure: I don't think the generated version of my own voice is as good as the voices the network saw during training. Maybe the number of speakers is the main reason, as the paper says in section 4.2.1.


bshall commented Aug 22, 2019

@nkcdy, there definitely is a difference in quality when you move to out-of-domain speakers, even on the ZeroSpeech corpus. I do think your model may be overfitting a little; I get slightly better results on your voice after 100k steps.
samples.zip

One thing I tried that does improve the quality a little is to multiply the logits on this line by 1.5 (those are the clips with "_x1.5" at the end). This sharpens the distribution and dampens the noise a bit. If you use too large a factor, though, it doesn't work and may cause problems on silences.
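As a rough illustration of that trick (the function below is a sketch, not the repo's actual generation code; only the 1.5 factor comes from the comment above):

```python
import torch
import torch.nn.functional as F

def sample_sharpened(logits, factor=1.5):
    # Scaling the logits by factor > 1 is equivalent to sampling at a lower
    # temperature (1 / factor): the distribution gets sharper, which dampens
    # background noise, but too large a factor can misbehave on silences.
    probs = F.softmax(logits * factor, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: logits over 256 mu-law classes for a single time step.
logits = torch.randn(1, 256)
sample = sample_sharpened(logits, factor=1.5)
```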

There are also a number of other things that are worth investigating:

  1. The Mandarin dataset has fewer male than female speakers; does this affect generalization to male vs. female voices?
  2. All the audio in the dataset is presumably recorded on the same microphone. Does this affect generalization when the audio is recorded on a different setup?
  3. The paper uses 24kHz audio and 10-bit predictions. I would guess that matching this would improve the quality as well (a rough config sketch follows this list).
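A small sketch of what matching the paper's setup in point 3 could look like; the variable names here are assumptions, not the repo's actual config fields:

```python
# Paper setup: 24 kHz audio with 10-bit mu-law quantization.
sample_rate = 24_000
bits = 10
num_classes = 2 ** bits  # 1024 output classes for the sample predictions
```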

bshall closed this as completed Aug 22, 2019
bshall reopened this Aug 22, 2019

nkcdy commented Aug 23, 2019

@bshall I tested another out-of-domain female voice, and it is still not good.
OutOfDomainFemaleVoice.zip

So, it does overfit the training corpus. I plan to add more speakers from different scenarios to the corpus, keep training, and will let you know if I find something.
