
issues when testing wikitext #57

Open
yangfuwei opened this issue Feb 26, 2024 · 9 comments · Fixed by #78
Labels
bug Something isn't working

Comments

@yangfuwei

Hi, I'm using lighteval to test several benchmarks, but I ran into issues when testing the following two benchmarks:

  1. lighteval|wikitext|0|0
  2. helm|wikitext:103|0|0

When testing wikitext, I got this error:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)  #LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'
```

I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to `LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)`, but that ends up with NaN perplexity.
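As a side note, a quick way to check which keyword arguments the request class accepts in a given version is to inspect its dataclass fields; this is only a diagnostic sketch, and the import path is my assumption based on the package layout shown in the traceback:

```python
# Diagnostic sketch: print the field names of LoglikelihoodRollingRequest so
# the call in lighteval_task.py can be matched to the current definition.
# The import path below is an assumption.
from dataclasses import fields, is_dataclass

from lighteval.tasks.requests import LoglikelihoodRollingRequest

if is_dataclass(LoglikelihoodRollingRequest):
    print([f.name for f in fields(LoglikelihoodRollingRequest)])
```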

When running wikitext:103, I got an error like:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'
```
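For reference, here is a minimal sketch of what a fixed formatter could look like (this is only my guess, not necessarily the actual patch): since a perplexity task has no real choices, placeholder values can be passed just to satisfy Doc's required arguments.

```python
# Hedged sketch only, not the actual fix. The import path is an assumption.
from lighteval.tasks.requests import Doc


def wikitext_103(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["text"],
        choices=[""],   # placeholder, unused by perplexity metrics
        gold_index=0,   # placeholder, unused by perplexity metrics
    )
```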

The commands I used are:

```
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```

Any advice on the issues? Thanks.

@clefourrier
Member

I'm investigating - it's likely that we missed changing the perplexity evals in our last refactor (before the release), so I'll add a patch for that - thanks a lot for the report!

@clefourrier clefourrier self-assigned this Feb 26, 2024
@clefourrier clefourrier added the bug Something isn't working label Feb 27, 2024
@yangfuwei
Author

Hi @clefourrier , I saw the PR and tried the new code; however, I'm confused about the test results. For example, for GPT2 the word perplexity is 19.1188, which does not match the result in the GPT2 paper (29.41):

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 19.1188 | ± 0.2441 |
| | | byte_perplexity | 1.7364 | ± 0.0034 |
| | | bits_per_byte | 0.7961 | ± 0.0028 |

I also tested google/gemma-2b; the results are as follows, and it seems unreasonable that the perplexity is so bad:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 574.7692 | ± 17.9129 |
| | | byte_perplexity | 3.2812 | ± 0.0145 |
| | | bits_per_byte | 1.7142 | ± 0.0064 |
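For reference, my understanding of how the three metrics relate (a rough sketch following the usual convention, not lighteval's exact code): they are all derived from the same summed log-likelihood, just normalised per word, per byte, or in bits per byte.

```python
import math

# Rough sketch, assuming total_logprob is the summed natural-log likelihood
# over the corpus and n_words / n_bytes are the corresponding counts.
def perplexity_metrics(total_logprob: float, n_words: int, n_bytes: int):
    word_perplexity = math.exp(-total_logprob / n_words)
    byte_perplexity = math.exp(-total_logprob / n_bytes)
    bits_per_byte = -total_logprob / (n_bytes * math.log(2))
    return word_perplexity, byte_perplexity, bits_per_byte
```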

Do you have corresponding test results on your side? Thanks.

@clefourrier
Member

clefourrier commented Feb 28, 2024

Hi!
It's very hard to reproduce this paper, as they don't describe precisely which subset of Wikitext they use (the raw form? the filtered form?), nor do they say (unless I missed it) whether the perplexity they report is per word, per byte, or some other average. Do you have another model with more information that I could test against?

@clefourrier
Member

However, I can confirm the mismatch with gemma-2b, investigating.

@clefourrier
Member

clefourrier commented Feb 28, 2024

I added more choices to make it more explicit what each call is doing.

The Helm and Harness tasks use an aggregated version of the text at the document level, with the harness applying normalisation as preprocessing. The lighteval task uses the documents as they are split in the dataset, so it looks at the paragraph level rather than the document level, with no preprocessing.
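Roughly, the difference between the two setups looks like this (a very rough sketch of the idea, not the actual implementation; `dataset` is assumed to be an iterable of `{"text": ...}` rows as on the Hub):

```python
def paragraph_level_docs(dataset):
    # lighteval-style: one doc per dataset row (paragraph), no preprocessing
    return [row["text"] for row in dataset if row["text"].strip()]


def document_level_docs(dataset):
    # helm/harness-style idea: rows are aggregated back into larger documents
    # before perplexity is computed (the harness additionally normalises the
    # text); here everything is naively joined into a single document.
    return ["".join(row["text"] for row in dataset)]
```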

I merged the above patch to make sure the evals are running, but I'm OK with taking more time to investigate the results if you feel there is a need.

@yangfuwei
Author

Hi @clefourrier , thanks for your reply. I don't have other models to recommend because it seems that recent models don't report wikitext results. Could you please help with the Gemma model? The results of gemma on wikitext are abnormal, thank you.

@clefourrier
Member

Thanks a lot for your message! It seems to come from situations where the perplexity is supposed to be computed on something longer than the context length; I'm investigating.
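For anyone following along, the usual way to handle this is to split the token sequence into windows that fit the model's context and sum the log-likelihoods across windows; below is a rough sketch of that idea, not necessarily what lighteval ends up doing (`score_window` stands in for whatever call returns the summed token log-probs of a window).

```python
def rolling_loglikelihood(tokens, max_length, score_window):
    """Sum log-likelihoods over disjoint windows no longer than max_length."""
    total = 0.0
    for start in range(0, len(tokens), max_length):
        window = tokens[start : start + max_length]
        total += score_window(window)
    return total
```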

@yangfuwei
Author

Hi @clefourrier , is the code merged into the main branch? I used the latest version to test google/gemma-2b and gemma-7b on lighteval|wikitext:2|0|0, but I still get weird results. For gemma-2b, word_perplexity is 65.32650153145742, while for gemma-7b, word_perplexity is 359323042.28705126.

@clefourrier
Member

Hi @yangfuwei ,
Yes, it was.

Interesting bug on gemma-7b, I'll investigate again to see if I can reproduce, thanks!

@clefourrier clefourrier reopened this Mar 14, 2024
@clefourrier clefourrier removed their assignment Apr 4, 2024