
Align GPQA zero-shot / few-shot prompts with paper? #70

Open
lewtun opened this issue Feb 27, 2024 · 3 comments
lewtun commented Feb 27, 2024

GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:

What is the correct answer to this question: {QUESTION}
Choices:
(A) {CHOICE_A}
(B) {CHOICE_B}
(C) {CHOICE_C}
(D) {CHOICE_D}

Format your response as follows: "The correct answer is (insert answer here)".

In particular, note the final instruction on how to format the answer. The authors also mention that they use a regex parser to extract the desired answer:

We extracted answers from the model response using a simple regex matching phrases like ‘answer is’, ‘answer:’ etc.
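
For concreteness, here's a minimal sketch of that extraction step in Python. The paper only says the regex matches phrases like "answer is" / "answer:", so the exact pattern below is my assumption, not the paper's actual regex:

```python
import re

# Minimal sketch of the answer-extraction step described above. The precise
# pattern is an assumption; the paper only names the phrases it matches.
ANSWER_PATTERN = re.compile(r"answer\s*(?:is|:)\s*\(?([ABCD])\)?", re.IGNORECASE)

def extract_answer(response: str) -> str | None:
    """Return the predicted choice letter, or None if nothing matches."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None

print(extract_answer("The correct answer is (C)."))  # -> C
```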

However, inspecting the details from lighteval, I see we have the following for zero-shot:

Select the correct answer to the following questions.

Question: Identify the final product produced when cyclobutyl(cyclopropyl)methanol reacts with phosphoric acid in water.
A. spiro[3.4]oct-5-ene
B. 1,2-dimethylcyclohexa-1,4-diene
C. 1,2,3,4,5,6-hexahydropentalene
D. [1,1'-bi(cyclobutan)]-1-ene
Answer: 

The trouble with this format is that it heavily penalises chat models, which typically produce a long-winded explanation and thus fail to emit the single-letter answer (A, B, C, D) that a base model typically will.
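
For concreteness, a paper-aligned zero-shot query could be built with something like the sketch below. The function name and signature are purely illustrative (not lighteval's actual prompt-function API); only the template itself comes from the paper:

```python
# Illustrative sketch: build the paper's zero-shot query from a question and
# its four choices. Name and arguments are hypothetical, not lighteval's API.
def gpqa_zero_shot_query(question: str, choices: list[str]) -> str:
    lines = [f"What is the correct answer to this question: {question}", "Choices:"]
    lines += [f"({letter}) {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append('\nFormat your response as follows: "The correct answer is (insert answer here)".')
    return "\n".join(lines)
```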

Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link), which can be adapted to pure few-shot by removing the reasoning steps. However, lighteval seems to sample few-shot prompts from the dataset, so I wonder if it makes sense to align the evaluation in both cases (zero-shot / few-shot) with the paper?
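
To illustrate the pure few-shot adaptation, a fixed few-shot prompt could be assembled roughly like this, reusing the `gpqa_zero_shot_query` sketch above. The example dict keys (`question`, `choices`, `answer`) are assumptions about the schema, not the actual dataset fields:

```python
# Hypothetical sketch: a fixed few-shot prompt built by prepending worked
# examples (gold answers kept, CoT reasoning stripped) to the target query.
def gpqa_few_shot_query(examples: list[dict], target: dict) -> str:
    blocks = [
        gpqa_zero_shot_query(ex["question"], ex["choices"])
        + f'\nThe correct answer is ({ex["answer"]})'
        for ex in examples
    ]
    blocks.append(gpqa_zero_shot_query(target["question"], target["choices"]))
    return "\n\n".join(blocks)
```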

Happy to take a stab at this one if you agree!

@clefourrier
Cool points! We could definitely have two versions: one multichoice, looking at logprobs (which is cool because it's very, very fast), and the other following the original implementation as closely as possible, i.e. generative, if I understood correctly.

You can add the second one under the original keyword if you want 😃

Regarding the few-shot CoT prompt, let's add it to #8 and do it in another PR - notably, we'll need to change the format a bit if we want to allow passing fixed few-shot example files. Wdyt?

lewtun commented Feb 27, 2024

Yes, a generative version sounds great! I can start with the vanilla zero-shot and few-shot prompts, and we can add the CoT ones later as you suggest :)

@clefourrier
Sounds good, I'll assign this to you then :)

@clefourrier clefourrier added the feature request New feature/request label Mar 2, 2024