GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:
What is the correct answer to this question: {QUESTION}
Choices:
(A) {CHOICE_A}
(B) {CHOICE_B}
(C) {CHOICE_C}
(D) {CHOICE_D}
Format your response as follows: "The correct answer is (insert answer here)".
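For concreteness, here is a minimal sketch of how that template could be assembled (the function name and field names are just placeholders for illustration, not lighteval's actual API):

```python
# Illustrative sketch only: assembles the paper's zero-shot prompt template.
# The function name and argument names are placeholders, not lighteval's API.
def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    labels = ["A", "B", "C", "D"]
    lines = [f"What is the correct answer to this question: {question}", "Choices:"]
    lines += [f"({label}) {choice}" for label, choice in zip(labels, choices)]
    lines.append('Format your response as follows: "The correct answer is (insert answer here)".')
    return "\n".join(lines)
```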
In particular, note the prompt's final instruction on how to format the answer, and that the authors mention using a regex parser to extract it:
We extracted answers from the model response using a simple regex matching phrases like ‘answer is’, ‘answer:’ etc.
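A sketch of that kind of extraction might look like the following (the exact patterns the authors used aren't reproduced here, so treat this as an approximation):

```python
import re

# Approximate sketch of the paper's regex-based answer extraction;
# the actual patterns used by the authors may differ.
ANSWER_PATTERN = re.compile(r"(?:answer is|answer:)\s*\(?\s*([ABCD])\)?", re.IGNORECASE)

def extract_answer(response: str) -> str | None:
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None

# e.g. extract_answer("After some reasoning, the correct answer is (B).") -> "B"
```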
However, inspecting the details from lighteval, I see we have the following for zero-shot:
Select the correct answer to the following questions.
Question: Identify the final product produced when cyclobutyl(cyclopropyl)methanol reacts with phosphoric acid in water.
A. spiro[3.4]oct-5-ene
B. 1,2-dimethylcyclohexa-1,4-diene
C. 1,2,3,4,5,6-hexahydropentalene
D. [1,1'-bi(cyclobutan)]-1-ene
Answer:
The trouble with this format is that it heavily penalises chat models, which typically produce a long-winded explanation and thus fail to emit the bare letter (A, B, C, D) that a base model typically will.
Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link), which can be adapted to pure few-shot by removing the reasoning steps. However, it seems that lighteval samples few-shot prompts from the dataset, so I wonder if it makes sense to align the evaluation in both cases (zero-shot / few-shot) with the paper?
Happy to take a stab at this one if you agree!
Cool points! We could definitely have 2 versions: one multichoice looking at logprobs (which is cool because it's very, very fast), and the other following the original implementation as closely as possible, therefore being generative if I understood correctly.
You can add the second one under the original keyword if you want 😃
Regarding the few-shot CoT prompt, let's add it to #8 and do it in another PR - we'll notably need to change the format a bit if we want to allow passing fixed few-shot example files, for example. Wdyt?
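For reference, a rough sketch of what the logprob-based multichoice scoring could look like, written directly against transformers rather than lighteval's internals (illustrative only):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_choices(model, tokenizer, prompt: str, letters=("A", "B", "C", "D")) -> str:
    """Pick the letter whose continuation gets the highest log-probability after the prompt.
    Illustrative sketch, not lighteval's actual implementation."""
    scores = {}
    for letter in letters:
        continuation = f" {letter}"
        inputs = tokenizer(prompt + continuation, return_tensors="pt")
        cont_len = len(tokenizer(continuation, add_special_tokens=False)["input_ids"])
        with torch.no_grad():
            logits = model(**inputs).logits
        # log-probability of each token given the tokens before it
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        target_ids = inputs["input_ids"][0, 1:]
        token_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
        # sum over just the continuation tokens
        scores[letter] = token_log_probs[-cont_len:].sum().item()
    return max(scores, key=scores.get)

# usage sketch (any causal LM works; gpt2 shown only as an example):
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# print(score_choices(model, tokenizer, prompt))
```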