Why use different judge models? #716

Open
kydxh opened this issue Jan 9, 2025 · 1 comment
kydxh commented Jan 9, 2025

I notice that in run.py, the judge model differs across datasets. For example, the MCQ datasets use 'chatgpt-0125', while MathVista uses 'gpt-4-turbo'. Why not use the same judge model for a fair comparison?
[Screenshot: judge model assignments in run.py]
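For reference, a minimal sketch of the kind of per-benchmark judge mapping the screenshot refers to. All names here (the function, the dataset set, the fallback) are hypothetical illustrations, not VLMEvalKit's actual run.py code:

```python
# Hypothetical sketch of per-dataset judge selection, as discussed above.
# Dataset names and the fallback are assumptions for illustration; see
# run.py in the repository for the real logic.

MCQ_DATASETS = {'MMBench', 'SEEDBench_IMG'}  # assumed examples of MCQ benchmarks

def pick_judge_model(dataset_name: str) -> str:
    """Return the judge model a benchmark's original paper prescribes."""
    if dataset_name in MCQ_DATASETS:
        return 'chatgpt-0125'   # MCQ answer extraction, per the question above
    if dataset_name == 'MathVista':
        return 'gpt-4-turbo'    # judge cited for MathVista in this thread
    return 'chatgpt-0125'       # assumed fallback for this sketch

print(pick_judge_model('MathVista'))  # -> 'gpt-4-turbo'
```

The point of such a mapping is that each benchmark pins the judge its original paper used, rather than sharing one judge across all benchmarks.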

PhoenixZ810 (Collaborator) commented
Hi,

Thank you for your interest and feedback.

The judge model used for each benchmark is deliberately aligned with the configuration described in that benchmark's original paper, which keeps results directly comparable with published numbers. Changing the judge model can lead to significantly different scores.

We appreciate your understanding and are happy to discuss any further questions.

PhoenixZ810 self-assigned this Jan 10, 2025