I notice that in run.py, the judge models differ across datasets. For example, MCQ datasets use 'chatgpt-0125', while MathVista uses 'gpt-4-turbo'. Why not use the same judge model for a fair comparison?
It's important to note that the judge model used for each benchmark is aligned with the configuration detailed in that benchmark's original paper, which is what ensures a fair and consistent comparison with previously reported results. Changing the judge model can lead to significantly different scores. A rough sketch of this per-dataset selection is shown below.
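For illustration, here is a minimal Python sketch of how a per-dataset judge mapping could be wired up. This is not the actual run.py logic: the dataset keys, the helper name, and the default fallback are illustrative assumptions; only the judge model names ('chatgpt-0125', 'gpt-4-turbo') come from this thread.

```python
# Hypothetical sketch of per-dataset judge selection (not the real run.py code).
# Each benchmark keeps the judge model specified in its original paper.

JUDGE_MODEL_BY_DATASET = {
    # MCQ-style benchmarks (example entries): judged with ChatGPT-0125.
    'MMBench': 'chatgpt-0125',
    'SEEDBench_IMG': 'chatgpt-0125',
    # MathVista: judged with GPT-4 Turbo, following its original paper.
    'MathVista': 'gpt-4-turbo',
}

def pick_judge(dataset_name: str) -> str:
    """Return the judge model configured for a dataset (fallback is an assumption)."""
    return JUDGE_MODEL_BY_DATASET.get(dataset_name, 'chatgpt-0125')

if __name__ == '__main__':
    print(pick_judge('MathVista'))  # gpt-4-turbo
    print(pick_judge('MMBench'))    # chatgpt-0125
```

The point of keeping the mapping fixed per benchmark is reproducibility: scores stay comparable to the numbers reported in each benchmark's paper, rather than being comparable only within this toolkit.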
We appreciate your understanding of this matter and welcome any further questions or discussion.