yourTTS streaming? #1764
-
Hello. I've made an application that streams audio from an input in chunks into modified versions of the transfer_voice and tts functions from the coqui-ai TTS repository, using the YourTTS model. However, where the chunks join, the converted audio doesn't continue cleanly, presumably because each chunk doesn't have the data from the previous one to continue the audio smoothly. Is there a way to solve this? I don't know much about AI or the tensor libraries themselves; I built this from modified versions of the existing functions in the Coqui TTS utils files. I did try saving the previous chunk's audio, prepending it to the current audio, pushing it through the TTS functions, and then cutting the output in half. That does sound better, but unless I did it wrong, it still isn't completely smooth, so I'm guessing that's not the correct solution either.
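For reference, the "prepend the previous chunk, then cut the output in half" idea can be sketched as below. This is a minimal sketch, not the poster's code: `convert` is a hypothetical stand-in for the modified transfer_voice/tts functions (it is not a Coqui TTS API), and the sketch assumes the converted output's length is proportional to the input length, so that slicing by proportion maps back onto the current chunk. With equal-sized chunks this is exactly "keep the second half".

```python
from typing import Callable, Iterable, Iterator, Optional

import numpy as np

# Hypothetical converter type: stands in for the modified transfer_voice/tts
# functions (not a real Coqui TTS API). Takes a mono waveform, returns one.
ConvertFn = Callable[[np.ndarray], np.ndarray]


def slice_output(out: np.ndarray, start: int, end: int, n_in: int) -> np.ndarray:
    """Return the part of `out` that corresponds to input samples [start, end).

    Assumes the output length is proportional to the input length
    (same duration, possibly at a different sample rate).
    """
    if n_in == 0:
        return out[:0]
    scale = len(out) / n_in
    return out[round(start * scale):round(end * scale)]


def stream_with_previous_context(
    chunks: Iterable[np.ndarray], convert: ConvertFn
) -> Iterator[np.ndarray]:
    """Convert each chunk with the previous input chunk prepended as context,
    then keep only the output that belongs to the current chunk."""
    prev: Optional[np.ndarray] = None
    for chunk in chunks:
        if prev is None:
            # First chunk: there is no earlier audio to use as context.
            yield convert(chunk)
        else:
            window = np.concatenate([prev, chunk])
            out = convert(window)
            # With equal-sized chunks this is exactly "the second half".
            yield slice_output(out, len(prev), len(window), len(window))
        prev = chunk
```

Note that the context is the previous *input* chunk, not the previous converted output, which matches the description above of prepending the previous chunk's audio before conversion.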
-
In case anyone is wondering, I ended up getting it to sound pretty decent by expanding on the idea above: I send three chunks each time and extract only the output for the middle chunk. To generate chunk 2's audio, you send chunks 1, 2, and 3. I don't know whether there is an actual function somewhere that generates chunk output while continuing smoothly from the previous chunk, but this alternative method sounds relatively smooth. There are still a few clicks here and there, but those could be due to other causes. Anyway, I'm going to mark this as the answer for now.
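The three-chunk approach can be sketched as a generator that keeps one chunk of look-ahead. As before, this is only a sketch under assumptions: `convert` is a hypothetical stand-in for the modified transfer_voice/tts functions (not a Coqui TTS API), and the output length is assumed to be proportional to the input length. The first chunk only gets right-hand context and the last chunk only gets left-hand context, since nothing exists on the other side.

```python
from typing import Callable, Iterable, Iterator, Optional

import numpy as np

# Hypothetical converter type standing in for the modified transfer_voice/tts
# functions; it is not a Coqui TTS API.
ConvertFn = Callable[[np.ndarray], np.ndarray]


def convert_middle(
    prev: Optional[np.ndarray],
    cur: np.ndarray,
    nxt: Optional[np.ndarray],
    convert: ConvertFn,
) -> np.ndarray:
    """Convert prev + cur + nxt in one call and return only cur's output.

    Assumes the output length is proportional to the input length.
    """
    parts = [p for p in (prev, cur, nxt) if p is not None]
    window = np.concatenate(parts)
    out = convert(window)
    if len(window) == 0:
        return out[:0]
    start = 0 if prev is None else len(prev)
    end = start + len(cur)
    scale = len(out) / len(window)
    return out[round(start * scale):round(end * scale)]


def stream_middle_of_three(
    chunks: Iterable[np.ndarray], convert: ConvertFn
) -> Iterator[np.ndarray]:
    """Emit the converted audio for chunk k using chunks k-1, k, k+1 as input.

    Chunk k is only emitted after chunk k+1 arrives, so the output lags the
    input by one extra chunk length.
    """
    prev: Optional[np.ndarray] = None
    cur: Optional[np.ndarray] = None
    for nxt in chunks:
        if cur is not None:
            yield convert_middle(prev, cur, nxt, convert)
        prev, cur = cur, nxt
    if cur is not None:
        # Flush the last chunk, which has no right-hand context.
        yield convert_middle(prev, cur, None, convert)
```

Each yielded array can be played as soon as it arrives. This reduces the seams but does not guarantee they disappear: if the model renders the overlapping audio slightly differently in consecutive windows, the cut points may still click, as noted in the thread.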
-
Has anyone found the function for streaming Coqui TTS speech output that @Disastorm asked about?
-
Do we have stable streaming yet?
I got it to sound pretty decent by sending three chunks each time and extracting only the output for the middle chunk:
To generate chunk 2's audio, you send chunks 1, 2, and 3.
To generate chunk 3's audio, you send chunks 2, 3, and 4.
Of course, this means the output is delayed by an additional chunk length.