Issues when testing wikitext #57
Comments
I'm investigating - it's likely that we missed updating the perplexity evals in our last refactor (before the release), so I'll add a patch for that. Thanks a lot for the report!
Hi @clefourrier, I saw the PR and tried the new code, but I'm confused about the results. For GPT-2, for example, the result is 19.1188, which does not match the result in the GPT-2 paper (29.41).
I also tested google/gemma-2b, and the perplexity it gets is unreasonably bad.
Do you have reference results for these models? Thanks.
Hi!
However, I can confirm the mismatch with gemma-2b; investigating.
I added more choices to make it more explicit what each call is doing. The Helm and Harness tasks use an aggregated version of the text at the document level, with the Harness variant applying a normalisation preprocessing step. The lighteval task uses the documents as split in the dataset, so it looks at the paragraph level rather than the document level, with no preprocessing. I merged the above patch to make sure the evals are running, but I'm OK with taking more time to investigate the results if you feel there is a need.
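To illustrate the difference, here is a rough sketch only; the helper names below are made up and do not mirror the actual lighteval or Harness implementations:

```python
# Rough illustration of the two aggregation strategies -- not the actual
# lighteval/Harness code; helper names are made up.
from typing import Iterable, List


def paragraph_level_docs(rows: Iterable[dict]) -> List[str]:
    """One perplexity document per dataset row, no preprocessing
    (roughly what the lighteval wikitext task does)."""
    return [row["text"] for row in rows if row["text"].strip()]


def document_level_docs(rows: Iterable[dict], normalise: bool = False) -> List[str]:
    """Concatenate every row into a single document before scoring
    (roughly what the Helm/Harness-style tasks do). The Harness variant
    additionally applies a normalisation pass, stubbed out here."""
    full_text = "".join(row["text"] for row in rows)
    if normalise:
        # stand-in for the Harness whitespace/punctuation normalisation
        full_text = " ".join(full_text.split())
    return [full_text]
```

Since the final perplexity is normalised by the number of words actually scored, the different splits and preprocessing give different denominators, so the three task variants are not expected to produce identical numbers.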
Hi @clefourrier, thanks for your reply. I don't have other models to recommend, since recent models don't seem to report wikitext results. Could you please look into the Gemma model? Its results on wikitext are abnormal, thank you.
Thanks a lot for your message! It seems to come from situations where the perplexity is supposed to be computed on something longer than the model's context length; I'm investigating.
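For reference, a common way to handle a text longer than the context window is to score it in overlapping windows and only sum the log-likelihoods of the newly covered tokens. The sketch below is an illustration of that idea with a Hugging Face causal LM, not the lighteval implementation; the window size and stride are arbitrary:

```python
# Illustration of rolling log-likelihood over a long text -- not the lighteval code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def rolling_loglikelihood(text: str, max_len: int = 1024, stride: int = 512) -> float:
    """Sum token log-likelihoods over overlapping windows so that texts
    longer than the context window are still fully scored."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_logprob, prev_end = 0.0, 0
    for start in range(0, ids.size(0), stride):
        end = min(start + max_len, ids.size(0))
        window = ids[start:end].unsqueeze(0)
        with torch.no_grad():
            logits = model(window).logits
        # logits at position t predict token t+1; only score tokens that
        # were not already scored by a previous window
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        labels = window[0, 1:]
        positions = torch.arange(start + 1, end)
        keep = positions >= prev_end
        token_lp = logprobs.gather(1, labels.unsqueeze(1)).squeeze(1)
        total_logprob += token_lp[keep].sum().item()
        prev_end = end
        if end == ids.size(0):
            break
    return total_logprob
```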
Hi @clefourrier, is the code merged into the main branch? I used the latest version to test google/gemma-2b and gemma-7b on lighteval|wikitext:2|0|0, but I still get weird results: for gemma-2b the word_perplexity is 65.32650153145742, while for gemma-7b it is 359323042.28705126.
Hi @yangfuwei, interesting bug on gemma-7b, I'll investigate again to see if I can reproduce, thanks!
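As a side note on the scale of these numbers: assuming lighteval follows the usual definition (as in the lm-evaluation-harness), word_perplexity is the exponential of the average negative log-likelihood per word, so small per-word differences blow up exponentially. A quick sketch:

```python
import math


def word_perplexity(total_logprob: float, num_words: int) -> float:
    """exp(-sum of log-likelihoods / number of words) -- the usual definition."""
    return math.exp(-total_logprob / num_words)


# An average of ~4.2 nats per word gives a perplexity near 65,
# while ~19.7 nats per word already gives ~3.6e8:
print(word_perplexity(-4.18 * 1_000, 1_000))  # ~65.4
print(word_perplexity(-19.7 * 1_000, 1_000))  # ~3.6e8
```

So a word_perplexity of 359323042 corresponds to roughly 19.7 nats per word on average, which points to a scoring or tokenisation problem rather than the model simply being worse.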
Hi, I'm using lighteval to test several benchmarks, but I ran into issues with the following two.
When testing wikitext, I got this error:
```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)  #LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'
```
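(For what it's worth, the TypeError only means that the keyword names at this call site no longer match the request class's fields; going by the commented-out call on the same source line, the class presumably expects something like the sketch below, where the field names are inferred from that comment rather than copied from the lighteval source.)

```python
from dataclasses import dataclass


# Field names inferred from the commented-out call in the traceback above;
# the real lighteval class may differ.
@dataclass
class LoglikelihoodRollingRequest:
    task_name: str
    example_index: int
    request_index: int
    context: str


# The failing call passes doc_id= and ctx=, which this signature does not accept:
# LoglikelihoodRollingRequest(task_name=..., doc_id=..., ctx=...)  # -> TypeError
```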
I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to

```python
LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
```
but this ends up with NaN perplexity.

When running wikitext:103, I got an error like:
```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'
```
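(The error here is just that the wikitext_103 formatter builds a Doc without the two required positional fields. A hypothetical workaround, assuming Doc is the dataclass from lighteval.tasks.requests and that placeholder values are acceptable for a rolling-perplexity task, could look like the following; whether that is the intended fix is for the maintainers to confirm.)

```python
# Hypothetical workaround, not the official fix: fill the required Doc fields
# with placeholders, since a perplexity task has no real multiple choices.
from lighteval.tasks.requests import Doc  # import path may differ by version


def wikitext_103(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["text"],
        choices=[""],   # placeholder, unused for rolling log-likelihood
        gold_index=0,   # placeholder, unused for rolling log-likelihood
    )
```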
The commands I used are:

```
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```

and

```
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```
Any advice on the issues? Thanks.