
issues when testing wikitext #57

Open
yangfuwei opened this issue Feb 26, 2024 · 9 comments · Fixed by #78
Labels
bug Something isn't working

Comments

@yangfuwei

Hi, I'm using lighteval to test several benchmarks, but I ran into issues when testing the following two benchmarks:

  1. lighteval|wikitext|0|0
  2. helm|wikitext:103|0|0

When testing wikitext, I got this error:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)  #LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'
```

I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to `LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)`, but that ends up with NaN perplexity.
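As a side note, a quick way to check which keyword arguments the request class accepts in a given version is to inspect its dataclass fields; this is only a diagnostic sketch, and the import path is my assumption based on the package layout shown in the traceback:

```python
# Diagnostic sketch: print the field names of LoglikelihoodRollingRequest so
# the call in lighteval_task.py can be matched to the current definition.
# The import path below is an assumption.
from dataclasses import fields, is_dataclass

from lighteval.tasks.requests import LoglikelihoodRollingRequest

if is_dataclass(LoglikelihoodRollingRequest):
    print([f.name for f in fields(LoglikelihoodRollingRequest)])
```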

When running wikitext:103, I got an error like:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'
```
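For reference, here is a minimal sketch of what a fixed formatter could look like (this is only my guess, not necessarily the actual patch): since a perplexity task has no real choices, placeholder values can be passed just to satisfy Doc's required arguments.

```python
# Hedged sketch only, not the actual fix. The import path is an assumption.
from lighteval.tasks.requests import Doc


def wikitext_103(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["text"],
        choices=[""],   # placeholder, unused by perplexity metrics
        gold_index=0,   # placeholder, unused by perplexity metrics
    )
```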

The commands I used are:

```
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```

Any advice on the issues? Thanks.

@clefourrier
Member

I'm investigating - it's likely that we missed changing the perplexity evals in our last refactor (before the release), so I'll add a patch for that - thanks a lot for the report!

@clefourrier clefourrier self-assigned this Feb 26, 2024
@clefourrier clefourrier added the bug Something isn't working label Feb 27, 2024
@yangfuwei
Author

Hi @clefourrier , I saw the PR and tried the new code; however, I'm confused about the test results. For example, for GPT2 the word perplexity is 19.1188, which does not match the result in the GPT2 paper (29.41):

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 19.1188 | ± 0.2441 |
| | | byte_perplexity | 1.7364 | ± 0.0034 |
| | | bits_per_byte | 0.7961 | ± 0.0028 |

I also tested google/gemma-2b; the results are as follows, and it seems unreasonable that the perplexity is so bad:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 574.7692 | ± 17.9129 |
| | | byte_perplexity | 3.2812 | ± 0.0145 |
| | | bits_per_byte | 1.7142 | ± 0.0064 |
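For reference, my understanding of how the three metrics relate (a rough sketch following the usual convention, not lighteval's exact code): they are all derived from the same summed log-likelihood, just normalised per word, per byte, or in bits per byte.

```python
import math

# Rough sketch, assuming total_logprob is the summed natural-log likelihood
# over the corpus and n_words / n_bytes are the corresponding counts.
def perplexity_metrics(total_logprob: float, n_words: int, n_bytes: int):
    word_perplexity = math.exp(-total_logprob / n_words)
    byte_perplexity = math.exp(-total_logprob / n_bytes)
    bits_per_byte = -total_logprob / (n_bytes * math.log(2))
    return word_perplexity, byte_perplexity, bits_per_byte
```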

Do you have corresponding test results on your side? Thanks.

@clefourrier
Member

clefourrier commented Feb 28, 2024

Hi!
It's very hard to reproduce this paper, as they don't describe precisely which subset of Wikitext they use (the raw form? the filtered form?), nor do they say (unless I missed it) whether the perplexity they report is per word, per byte, or some other average. Do you have another model with more information that I could test against?

@clefourrier
Member

However, I can confirm the mismatch with gemma-2b, investigating.

@clefourrier
Member

clefourrier commented Feb 28, 2024

I added more choices to make it more explicit what each call is doing.

The Helm and Harness tasks use an aggregated version of the text at the document level, with the harness applying normalisation as preprocessing. The lighteval task uses the documents as they are split in the dataset, so it looks at the paragraph level rather than the document level, with no preprocessing.
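Roughly, the difference between the two setups looks like this (a very rough sketch of the idea, not the actual implementation; `dataset` is assumed to be an iterable of `{"text": ...}` rows as on the Hub):

```python
def paragraph_level_docs(dataset):
    # lighteval-style: one doc per dataset row (paragraph), no preprocessing
    return [row["text"] for row in dataset if row["text"].strip()]


def document_level_docs(dataset):
    # helm/harness-style idea: rows are aggregated back into larger documents
    # before perplexity is computed (the harness additionally normalises the
    # text); here everything is naively joined into a single document.
    return ["".join(row["text"] for row in dataset)]
```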

I merged the above patch to make sure the evals are running, but I'm OK with taking more time to investigate the results if you feel there is a need.

@yangfuwei
Author

Hi @clefourrier , thanks for your reply. I don't have other models to recommend because it seems that recent models don't report wikitext results. Could you please help with the Gemma model? The results of gemma on wikitext are abnormal, thank you.

@clefourrier
Member

Thanks a lot for your message! It seems to come from situations where the perplexity is supposed to be computed on something longer than the context length; I'm investigating.
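For anyone following along, the usual way to handle this is to split the token sequence into windows that fit the model's context and sum the log-likelihoods across windows; below is a rough sketch of that idea, not necessarily what lighteval ends up doing (`score_window` stands in for whatever call returns the summed token log-probs of a window).

```python
def rolling_loglikelihood(tokens, max_length, score_window):
    """Sum log-likelihoods over disjoint windows no longer than max_length."""
    total = 0.0
    for start in range(0, len(tokens), max_length):
        window = tokens[start : start + max_length]
        total += score_window(window)
    return total
```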

@yangfuwei
Author

Hi @clefourrier , is the code merged into the main branch? I used the latest version to test google/gemma-2b and gemma-7b on lighteval|wikitext:2|0|0, but I still get weird results. For gemma-2b, word_perplexity is 65.32650153145742, while for gemma-7b, word_perplexity is 359323042.28705126.

@clefourrier
Member

Hi @yangfuwei ,
Yes, it was.

Interesting bug on gemma-7b, I'll investigate again to see if I can reproduce, thanks!

@clefourrier clefourrier reopened this Mar 14, 2024
@clefourrier clefourrier removed their assignment Apr 4, 2024