Reproduction Fails on Llama 3 #3
Hi! Thank you so much for trying out our codebase and for your interest in our work! :) The metrics you shared vary from our reported results by 1–3 points, with no clear pattern (i.e., neither all metrics increase nor all decrease). This discrepancy is likely due to the inherent variance in the model's outputs caused by the non-deterministic nature of the generation process. Specifically, with the default temperature setting of 0.5, the model introduces some randomness, which can lead to slightly different results across runs. Even in our internal testing, we observed that the model's responses can vary from run to run, resulting in a variance of approximately 1–3 points in evaluation scores. This behavior is entirely normal and expected. We hope this clarifies!
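For concreteness, here is a minimal sketch of the generation settings being discussed, assuming vLLM is used for inference (as in the environment reported below); the model path and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder model path; substitute the checkpoint being evaluated.
llm = LLM(model="path/to/llama-3-8b-instruct-dpo")

# Default setting discussed above: temperature 0.5 introduces sampling
# randomness, so repeated runs can produce slightly different outputs
# (and hence 1-3 point swings in downstream metrics).
sampled = SamplingParams(temperature=0.5, max_tokens=512)

# Greedy decoding: temperature 0 removes the sampling randomness.
greedy = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["An example evaluation prompt."], sampled)
```

Re-running the evaluation with the greedy settings is one way to check whether the gap is explained by generation variance rather than by the checkpoint or the eval pipeline.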
Hi there, apologies for the delayed response! I'd like to clarify a few points:
Even without this information, differences in results could stem from the specific checkpoint used for testing. I assume you followed the implementation precisely and evaluated the output after two epochs. However, we trained for a maximum of two epochs and then selected the best checkpoint based on the training curves. To reproduce our results, you can use the models we have open-sourced here:
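For illustration only (the actual link is not reproduced above, so the repository ID below is a placeholder), loading a released checkpoint with Hugging Face transformers could look like:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID; replace with the checkpoint actually
# linked in the comment above.
repo_id = "your-org/llama-3-8b-instruct-dpo-best"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
```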
Hello, thank you for the tremendous contribution your team has made to trustworthy inference for large models. I encountered some issues while reproducing your experiments and would appreciate your help in resolving them.
When using the eval method you mentioned to test the metrics, we found discrepancies between our results and yours. Specifically, when testing the Llama-3-8b-Instruct model trained with DPO on asqa_eval_top100_calibrated.json, the metrics did not match those reported in the paper. The results I obtained are as follows:
My testing environment is: Ubuntu 20.04, Torch 2.5.0, vLLM 0.6.5, with 4 A100 80G GPUs. The DPO training parameters are consistent with those in your code (only the dataset and model load paths were modified). Could you help explain the possible cause of this discrepancy?
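As a side note, a small version check like the sketch below can help confirm that two environments match before comparing metrics; it uses only standard attributes of the packages named above:

```python
import torch
import vllm

# Print the versions relevant to the reported setup
# (Ubuntu 20.04, Torch 2.5.0, vLLM 0.6.5, 4x A100 80G).
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("vllm:", vllm.__version__)
print("gpus:", torch.cuda.device_count(),
      torch.cuda.get_device_name(0) if torch.cuda.is_available() else "n/a")
```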