Creating consistent hybrid voices #60
wavymulder started this conversation in Show and tell
Using Tortoise to create hybrid voices yields some cool results, but I've found it also introduces some "inconsistency". I'm not talking about the typical accent or tone variation, but entirely different voices between lines; it even switches to a woman's voice despite the pool containing only male voices. In an effort to create more consistent hybrids, I have been experimenting with sampling successfully combined voices and then feeding those samples back into Tortoise as new, separate voices. This produces a much more consistent version of the hybrid voice, meaning a higher success rate and less rerunning of the program to stitch together good results.
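For anyone who wants to try the "feed the samples back in" step, it can be sketched as a small helper. The paths and naming scheme below are just placeholders; the only real assumption is that tortoise-tts discovers custom voices from folders under tortoise/voices/:

```python
import shutil
from pathlib import Path

def promote_to_voice(curated_dir: Path, voices_root: Path, name: str) -> Path:
    """Copy hand-picked hybrid outputs into a new reference-voice folder.

    tortoise-tts picks up custom voices from tortoise/voices/<name>/,
    so pass that directory as voices_root. The curated_dir is wherever
    you collected the "eerily human" generations worth keeping.
    """
    voice_dir = voices_root / name
    voice_dir.mkdir(parents=True, exist_ok=True)
    # Number the clips so the new voice folder stays tidy and ordered.
    for i, clip in enumerate(sorted(curated_dir.glob("*.wav"))):
        shutil.copy(clip, voice_dir / f"{i:02d}.wav")
    return voice_dir

# e.g. promote_to_voice(Path("results/curated"), Path("tortoise/voices"), "terry")
# then generate with the new voice like any built-in one:
#   python tortoise/do_tts.py --text "..." --voice terry --preset high_quality
```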
Picking good samples for your new voice is incredibly important, more so than with typical voices. Ideally, you want to keep running traditional hybrids until you have a collection of samples that sound truly, eerily human. For this stage of the process I use the high_quality preset, as the difference (albeit slight) matters to me for reducing robotic sounds. Typical artifacts, like Tortoise drawing out a word such as "I", or particularly robotic-sounding vowels, will be heavily present in your new voice if you don't cull them. I have also been attempting to remove reverb from the clips with varying success, as reverb seems to be one of the most noticeable feedback artifacts (not real reverb, of course, but I can't think of a better name for it).
With this method I crafted two voices. The first, Terry, is a combination of my own voice, train_lescault, and the voice of John Lithgow (whose voice works exceptionally well in Tortoise; he's either well represented in the data set or he's got the perfect voice). The second voice I created, Layton, is a combination of G. Wilson, C. Allen, and H. Leyva, three audiobook narrators. I'm quite pleased with how both these voices turned out and have attached them below.
Once I had two voices made with this method, naturally I had to combine them. It took longer to find good samples for this second-generation voice than it did for the first two, I assume because of some sort of "feedback". In my testing, this third voice, which I've named Jordan (in an attempt to keep with the desert theme... wrong desert, I know, but it was the best I could do lol), has mostly pleased me. When compared directly against a traditional hybrid (voice1&voice2&etc) of the six component voices, there's really no competition. Even if we ignore the traditional hybrid switching to a woman's voice, it still jumps around nearly every line in pitch and "weight" (i.e., more like voice1, then more like voice3, then more like voice2). I'm not very well versed in programming and ML, but the culprit here might be the same one behind random voice latents not consistently reproducing the same voice, which I've seen mentioned before.
Mostly unrelated finding: in an attempt to reduce reverb, I took the voice samples I was using for Jordan and EQ'd them to be more neutral. For some reason, this made the voice consistently British. This really surprised me, as they are the exact same sample clips, just with EQ applied. I wonder what sort of dataset magic caused this.
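The post doesn't say which EQ curve was used, so as a stand-in illustration of "neutralizing" a dark, reverb-heavy clip, here is a classic first-order pre-emphasis filter, which tilts the spectrum toward the highs. This is only one simple example of the kind of processing involved, not the author's actual chain:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, coeff: float = 0.95) -> np.ndarray:
    """First-order spectral tilt: y[n] = x[n] - coeff * x[n-1].

    Boosts high frequencies relative to lows (attenuating the DC/low
    end), which brightens a dull clip. coeff=0.95 is a common default
    for speech; x is a 1-D array of audio samples.
    """
    return np.append(x[0], x[1:] - coeff * x[:-1])
```

Running the filtered clips back through Tortoise is then the same promote-and-generate loop as before.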
Jordan really wanted to have a super Southern accent for some reason. It took several attempts at culling extremely drawn-out vowels to get it where it is now (still Southern).
On accents: in a way, Tortoise has recreated the classic Trans-Atlantic accent, where Americans think it sounds British and Brits think it sounds American.
Attachments:
My three custom voices: "jordan.zip", "terry.zip", and "layton.zip":
jordan.zip
layton.zip
terry.zip
Comparisons between my method of hybrid voices and the traditional method of using &: "jordancompare.zip":
jordancompare.zip
Extra samples, included because I thought they were interesting:
extras.zip
In the future, I'm going to mess with making hybrids via raw conditioning latents rather than sample clips, as it may produce a different result.
Thanks neonbjb for making and sharing such a cool program for us all to mess around with.