
Audios cannot be longer than 12 seconds #4

Open
jordimas opened this issue Mar 5, 2021 · 1 comment

jordimas commented Mar 5, 2021

It seems that the generated audio cannot be longer than 12 seconds. You can try, for example, the text "VilaWeb fou el primer mitjà digital català en incorporar una plataforma de blogs personals fàcilment gestionable pels mateixos usuaris, el 2004 oferí als lectors i col·laboradors la possibilitat de crear els seus propis blogs, que aconseguiren cert protagonisme i activitat els anys següents."

I see a warning: "Warning! Reached max decoder steps". I do not know whether this is related.


gullabi commented Mar 7, 2021

This is a general problem with the architecture of neural TTS. The maximum length of the synthesized audio is fixed at training time: since the model is trained on 12-second segments, it can only synthesize up to 12 seconds of audio. The reason is mostly memory restrictions during training, since everything is computed in memory. Although the limit can be raised by training the models on GPUs with more memory, the gain would be marginal and would never reach audiobook lengths.
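For intuition, here is a minimal sketch of why the warning appears, assuming a Tacotron-style autoregressive decoder; the `decoder.initial_frame` / `decoder.step` / stop-token interface is a hypothetical stand-in, not the actual API of this repo. The decoder emits one spectrogram frame per step, and inference is hard-capped at `max_decoder_steps`, so any utterance that needs more steps gets truncated:

```python
import torch

def decode(decoder, encoder_outputs, max_decoder_steps=1000):
    """Autoregressive inference loop, hard-capped at max_decoder_steps.

    `decoder` and its .initial_frame/.step methods are hypothetical
    stand-ins for a Tacotron-style model's real interface; the point
    is the step cap, which bounds the length of the output audio.
    """
    frames = []
    frame = decoder.initial_frame(encoder_outputs)
    for _ in range(max_decoder_steps):
        frame, stop_prob = decoder.step(frame, encoder_outputs)
        frames.append(frame)
        if stop_prob > 0.5:  # model predicted the end of speech
            break
    else:
        # The stop token was never predicted within the cap, so the
        # audio is cut off here; this is the condition the
        # "Warning! Reached max decoder steps" message reports.
        print("Warning! Reached max decoder steps")
    return torch.stack(frames)  # (n_frames, n_mel) mel spectrogram
```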

There are currently better alternatives to this architecture, which are able to synthesize longer text/audio with better performance.

Having said that, even these architectures do not solve the problem of very long synthesis; at best they reach a few minutes of audio. For a more thorough discussion of how to handle this architectural variety and evolution, see the "future of the repo" issue.

But for now the solution would be to use a text parser and synthesize the audio sequentially in chunks, as is done in the mycroft catotron plugin; a minimal sketch of this approach follows below. In fact, one positive outcome of this would be the possibility of parallelization, which would address the other problem, latency.
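As a rough illustration of the chunking approach, here is a minimal sketch. It assumes a `synthesize(text)` inference call that returns a 1-D numpy array of samples, plus a fixed sample rate; both the `my_tts` module and the sample rate are placeholder assumptions, not the actual catotron API:

```python
import re
import numpy as np

from my_tts import synthesize  # hypothetical inference call: str -> np.ndarray

SAMPLE_RATE = 22050  # assumed output sample rate of the model
PAUSE = np.zeros(int(0.3 * SAMPLE_RATE), dtype=np.float32)  # 300 ms of silence

def split_sentences(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def synthesize_long(text):
    """Synthesize arbitrarily long text by running the length-limited
    model on each sentence and concatenating the resulting chunks
    with a short pause in between."""
    chunks = [synthesize(s) for s in split_sentences(text)]
    if not chunks:
        return np.zeros(0, dtype=np.float32)
    audio = []
    for i, chunk in enumerate(chunks):
        audio.append(chunk)
        if i < len(chunks) - 1:
            audio.append(PAUSE)
    return np.concatenate(audio)
```

Since the sentences are independent, the list comprehension could also be replaced with something like `concurrent.futures.ProcessPoolExecutor().map(synthesize, split_sentences(text))` to synthesize the chunks in parallel, which is the latency gain mentioned above.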
