yourTTS streaming? #1764

Disastorm · 2022-07-21T03:42:57Z

Disastorm
Jul 21, 2022

Hello.

I've made an application that essentially streams audio from an input in chunks into modified versions of the transfer_voice and tts functions from the coqui-ai TTS repository files using the yourTTS model.

However, at the points where the chunks join, the converted audio doesn't continue cleanly, I guess because each chunk doesn't have the data from the previous one to continue the audio smoothly. Is there a way to solve this?

I don't actually know much about AI or the tensor libraries themselves, I just did this using modified versions of the existing functions in the coquiTTS utils files.

I did try saving the audio of the previous chunk, prepending it to the current audio, pushing it through the tts functions, and then cutting the output in half. Unless I did it wrong, that method sounds better but still isn't completely smooth, so I'm guessing that's not the correct solution either.
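For readers following along, the "prepend the previous chunk and cut the output in half" idea can be sketched roughly as below. This is only an illustration, not the poster's code: `convert` is a hypothetical placeholder for the modified transfer_voice/tts call (here it returns its input unchanged so the example runs on its own), and the sketch assumes equal-length chunks and an output that is the same length as the input.

```python
import numpy as np


def convert(audio: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the voice-conversion call (identity here)."""
    return audio.copy()


def convert_with_previous(prev_chunk, chunk, convert_fn=convert):
    """Convert `chunk` with `prev_chunk` prepended as context.

    Only the second half of the converted audio is returned, which
    corresponds to `chunk` when both chunks have the same length.
    """
    if prev_chunk is None:
        # First chunk of the stream: there is no context to prepend.
        return convert_fn(chunk)
    joined = np.concatenate([prev_chunk, chunk])
    converted = convert_fn(joined)
    return converted[len(converted) // 2:]


# Example with two tiny "chunks" of four samples each.
first = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
second = np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32)
second_out = convert_with_previous(first, second)
```

With a real model the converted chunk only has context on its left side, which may be one reason the joins still aren't completely smooth.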

Disastorm · 2022-07-22T02:55:55Z

Disastorm
Jul 22, 2022
Author

In case anyone is wondering, I actually ended up getting it to sound pretty decent by expanding my above idea and sending 3 chunks each time and extracting just the output for the middle chunk.

To generate chunk 2's audio you send chunk 1, 2, and 3.
To generate chunk 3's audio you send chunk 2, 3, and 4.
Although of course this means the output gets delayed by an additional chunk length.

I don't know if there is an actual function somewhere that generates the output for a chunk while continuing smoothly from the previous chunk. The alternate method I mentioned above sounds relatively smooth, although it still has a few clicks and whatnot here and there, but that could be due to other reasons. Anyway, I'm going to mark this as the answer for now.
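A minimal sketch of the three-chunk window described above, for anyone who wants to try it. It is an illustration under assumptions, not the original implementation: `convert` is a hypothetical placeholder for the yourTTS conversion call (identity here so the sketch is runnable), all chunks are assumed to be the same length, and the converted audio is assumed to be the same length as the input.

```python
from collections import deque

import numpy as np


def convert(audio: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the voice-conversion call (identity here)."""
    return audio.copy()


def stream_middle_chunks(chunks, convert_fn=convert):
    """Yield converted audio for each chunk that has a neighbour on both sides.

    A window of three consecutive chunks is converted at a time and only
    the middle third of the result is kept.
    """
    window = deque(maxlen=3)  # the oldest chunk drops out automatically
    for chunk in chunks:
        window.append(chunk)
        if len(window) < 3:
            continue  # still waiting for enough context
        converted = convert_fn(np.concatenate(list(window)))
        third = len(converted) // 3
        yield converted[third:2 * third]


# Five chunks filled with 1.0, 2.0, ... so the outputs are easy to identify.
chunks = [np.full(4, float(i), dtype=np.float32) for i in range(1, 6)]
outputs = list(stream_middle_chunks(chunks))
```

Output for chunk 2 only appears once chunk 3 has arrived, which is the extra chunk of delay mentioned above. In this sketch the very first and very last chunks of the stream are never emitted, so they would need separate handling.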

14 replies

Daniel-Kelvich Jun 30, 2023

Hey @Disastorm! Have you by any chance released the code somewhere? I would like to take a look at your implementation.

Disastorm Jul 5, 2023
Author

Sorry, I haven't, and my code was pretty sloppy and confusing, too. I think there are a lot of solutions for realtime AI voice cloning these days, like the RVC projects, so you might be able to check some of those.

catyung Sep 12, 2023

Hi @Disastorm, thanks for your advice. I am trying to replicate your idea in order to create a low-latency TTS approach. However, I am wondering how, in your suggested approach, you locate and extract the middle chunk, since no timestamps are provided in the log.

Appreciate your help in advance. ^^

Cat

Disastorm Sep 12, 2023
Author

@catyung I feel like the RVC stuff sounds better than yourTTS these days, but if you really still want to try this out, what I was doing is this: each chunk is a specific size in bytes, so you just use that size to calculate the slice, something like
outputs = outputs[outputSize:(outputSize*2)]
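To make the byte arithmetic concrete, here is a small sketch of how that slice could be computed without timestamps. The sample rate, sample width, and chunk duration are assumptions chosen for illustration (use whatever your output audio actually has), and `chunk_size_bytes` and `middle_chunk` are hypothetical helper names.

```python
SAMPLE_RATE = 16000    # assumed output sample rate in Hz
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM
CHUNK_SECONDS = 1.0    # assumed chunk duration


def chunk_size_bytes(sample_rate=SAMPLE_RATE,
                     seconds=CHUNK_SECONDS,
                     bytes_per_sample=BYTES_PER_SAMPLE):
    """Number of bytes occupied by one chunk of raw PCM audio."""
    return int(sample_rate * seconds) * bytes_per_sample


def middle_chunk(outputs: bytes, output_size: int) -> bytes:
    """Return the second of three equally sized chunks in `outputs`."""
    return outputs[output_size:output_size * 2]


output_size = chunk_size_bytes()
# Three fake converted chunks, each filled with a different byte value.
outputs = b"\x01" * output_size + b"\x02" * output_size + b"\x03" * output_size
middle = middle_chunk(outputs, output_size)
```

This only lines up if the converted audio is exactly three chunks long. If the model returns a slightly different length, the slice boundaries would have to be derived from the actual output length instead.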

catyung Sep 13, 2023

@Disastorm Thanks! Will try both suggestions :)

agilebean · 2023-11-11T15:07:38Z

agilebean
Nov 11, 2023

Has anyone found that function for streaming CoquiTTS speech output which @Disastorm asked about?

3 replies

erogol Nov 13, 2023
Maintainer

This is the only streaming model: https://tts.readthedocs.io/en/latest/models/xtts.html#streaming-inference

agilebean Nov 14, 2023

@erogol that's the only thing I found too - thanks for the confirmation

Disastorm Jan 14, 2024
Author

I actually wasn't using a streaming model. Basically I was streaming the audio from the microphone via standard Python libraries, dividing it into chunks, processing each one with yourTTS, and then merging the outputs. I guess maybe you can say it was a custom streaming function? I made a quick diagram for it just now (using 1s chunks in the example image below):
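The "dividing it into chunks" step from this comment could look something like the sketch below. It is not the author's code: `fixed_size_chunks` is a hypothetical helper, and the microphone is replaced by a plain list of byte strings so the example runs without audio hardware.

```python
def fixed_size_chunks(reads, chunk_bytes):
    """Regroup arbitrarily sized byte reads into chunks of exactly `chunk_bytes`.

    Leftover bytes that never fill a whole chunk are not emitted.
    """
    buffer = bytearray()
    for data in reads:
        buffer.extend(data)
        # Emit as many complete chunks as the buffer currently holds.
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]


# Simulated microphone reads of uneven size, regrouped into 3-byte chunks.
reads = [b"ab", b"cde", b"f", b"gh"]
chunks = list(fixed_size_chunks(reads, 3))
```

With real audio, `chunk_bytes` would be the byte size of one chunk (for example 1 s of samples), and each emitted chunk would then be passed on to the conversion step.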

cod3r0k · 2025-01-08T14:36:05Z

cod3r0k
Jan 8, 2025

Do we have stable streaming?

0 replies
