Hallucinations in results #186

Open
travisapple opened this issue Jan 12, 2025 · 2 comments

@travisapple

First off, I love this project. THANK YOU for your time here.

I'm seeing some instances of hallucinations (if that's even the right word for it here), even on simple text like "hi there" with the large data set (rev 9). It gives me 20 seconds of nightmare-fuel audio: slow, whispered, repeating words at low quality. If I run the same string with the mini language set it works, though the pronunciation could be better.

I have a large amount of text that I want to break up into small chunks, generate speech for each chunk, and then stitch the clips back together. My plan is to try each chunk with the large data set and, if it fails, re-generate it with the mini data set (roughly the flow sketched below).
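
Roughly, the loop I have in mind looks like this; synthesize() and looks_hallucinated() are placeholders for project-specific code, and the hallucination check is exactly the part I don't know how to write yet:

    # Sketch only: `synthesize(which_model, text)` returns audio for one chunk and
    # `looks_hallucinated(audio, text)` is the failure test this issue is asking about.
    def generate_with_fallback(chunks, synthesize, looks_hallucinated):
        clips = []
        for chunk in chunks:
            audio = synthesize("large", chunk)    # try the large data set first
            if looks_hallucinated(audio, chunk):
                audio = synthesize("mini", chunk) # fall back to the mini data set
            clips.append(audio)
        return clips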

My question is: how can I programmatically detect a failure? How can I test for hallucinations?

SaKiEQ commented Jan 23, 2025

In my limited experience so far, garbled, unintelligible audio relates to the padding scheme and max_new_tokens being set too low.
If you increase max_new_tokens in your args config, generation works better, but slower.
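
For example, instead of hard-coding it, something like this rough scaling might help; the ratio here is a made-up starting point, not a documented value, and would need tuning for this model:

    # Assumption: the amount of audio to generate scales roughly with the text length.
    # 40 tokens per word is a guess; tune it empirically and cap it to a known-safe value.
    def pick_max_new_tokens(text, tokens_per_word=40, cap=2580):
        return min(tokens_per_word * max(len(text.split()), 1), cap)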

travisapple commented Jan 23, 2025

Tokenizer:

prompt = tokenizer(renderString.strip(), return_tensors="pt", padding=True).to(device)

My generate() looks like this now:

    generation = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        prompt_input_ids=prompt.input_ids,
        min_new_tokens=10,
        max_new_tokens=2580,
        pad_token_id=1024,
        do_sample=True,
        temperature=0.8,  # 1.0 is more diverse, 0.0 more deterministic; smaller values take longer to generate
    )

I've added max_new_tokens, but I have no clue what it should be. I took this number from an example I found, but really, who knows.

I manually cap the length of the prompt text input based on the length of prompt.input_ids[0]. 35 tokens seems to be about where this thing starts to die.
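
Something like this is the kind of splitter I mean; split_by_token_budget is my own helper, not anything from the library, and 35 is just the empirical cutoff mentioned above:

    # Hypothetical helper: greedily pack words into chunks whose tokenized
    # length stays under a budget (~35 tokens, per the observation above).
    def split_by_token_budget(text, tokenizer, budget=35):
        chunks, current = [], []
        for word in text.split():
            candidate = " ".join(current + [word])
            if current and len(tokenizer(candidate).input_ids) > budget:
                chunks.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        if current:
            chunks.append(" ".join(current))
        return chunks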

After that, I run some audio tests on the output to see whether it died, but it seems like there should be a better way to tell how confident we are in the output.
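
Something like the following is the kind of rough check I mean; the thresholds are guesses and would need tuning, nothing here comes from the model itself:

    import numpy as np

    # Heuristics only: flag output that is far too long for the text (looping /
    # hallucinating) or that is essentially silence. Assumes audio samples in [-1, 1].
    def audio_looks_broken(audio, sample_rate, text,
                           max_secs_per_char=0.12, silence_db=-40.0):
        duration = len(audio) / sample_rate
        if duration > max_secs_per_char * max(len(text), 1):
            return True
        rms = np.sqrt(np.mean(np.square(audio), dtype=np.float64))
        if 20 * np.log10(rms + 1e-9) < silence_db:
            return True
        return False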

Should padding be set to something other than True?

Also, sometimes the last few words are cut off. I'm assuming this has something to do with the truncation setting, but I don't know how.
