
Reproduction Failure on Llama 3 #3

Open · jeremery opened this issue Jan 23, 2025 · 3 comments

@jeremery

Hello, thank you for the tremendous contribution your team has made to trustworthy inference for large models. I encountered some issues while reproducing your experiments and would appreciate your help in resolving them.

When using the eval method you describe to compute the metrics, we found discrepancies between our results and yours. Specifically, when testing a Llama-3-8B-Instruct model trained with DPO on asqa_eval_top100_calibrated.json, the metrics did not match those reported in the paper.

The results I obtained are as follows:

[Image: screenshot of evaluation results]

My testing environment is Ubuntu 20.04, PyTorch 2.5.0, vLLM 0.6.5, and 4× A100 80 GB GPUs. The DPO training parameters are consistent with those in your code (only the dataset and model load paths were modified). Could you help explain the possible cause of this discrepancy?

@shanghongsim
Collaborator

Hi! Thank you so much for trying out our codebase and for your interest in our work! :)

The metrics you shared differ from our reported results by 1–3 points, with no clear pattern (i.e., they are not uniformly higher or lower). This is most likely due to the inherent variance of the generation process: with the default temperature of 0.5, sampling introduces randomness, so outputs (and hence scores) can differ slightly between runs. Even in our own internal testing we observed run-to-run variation of roughly 1–3 points in the evaluation scores. This behavior is entirely normal and expected. We hope this clarifies things!
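
For reference, here is a minimal sketch of where the variance comes from, assuming a vLLM-based generation loop like ours (the model path, prompt, and `max_tokens` value below are placeholders, not our exact eval settings). Re-running the stochastic configuration produces slightly different outputs each time; greedy decoding (temperature 0) is a quick way to check how much of the gap sampling randomness alone explains.

```python
# Minimal sketch (placeholder model/prompt, not our exact eval script):
# illustrates why metrics can shift by a few points between runs when
# sampling with temperature > 0.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder checkpoint

prompts = ["Answer the question using the provided passages: ..."]  # placeholder prompt

# Default setting in our eval: temperature 0.5 -> generations (and hence
# the resulting scores) vary from run to run.
stochastic = SamplingParams(temperature=0.5, top_p=1.0, max_tokens=512)

# Greedy decoding: deterministic for a fixed prompt, useful as a sanity
# check on how much of the discrepancy comes from sampling randomness.
greedy = SamplingParams(temperature=0.0, max_tokens=512)

for params in (stochastic, greedy):
    outputs = llm.generate(prompts, params)
    print(f"temperature={params.temperature}: {outputs[0].outputs[0].text[:200]!r}")
```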

@jeremery
Author

Thank you very much for taking the time to reply. Regarding the DPO training method described in the paper, I ran corresponding experiments on Llama2-7b-chat-hf, Llama2-13b-chat-hf, and Llama3.1-8b; the results are as follows:

[Image: screenshot of evaluation results for the three models]

These results were all obtained on the asqa_eval_top100_calibrated.json dataset, with experimental parameters consistent with yours. They show that the DPO-trained models score lower on the corresponding metrics. How can this result be explained?

@shanghongsim
Collaborator

Hi there, apologies for the delayed response! I’d like to clarify a few points:

  • Did you perform SFT before running DPO?
  • Was llama2-7b-chat-hf tested with or without ICL?
  • Was llama2-7b-chat-hf-dpo tested with or without ICL?

Even without this information, the difference could stem from the specific checkpoint used for testing. I assume you followed the implementation precisely and evaluated the model after two full epochs; however, we trained for at most two epochs and then selected the best checkpoint based on the training curves.
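
To make the checkpoint-selection point concrete, here is a minimal sketch assuming a Hugging Face Trainer-style DPO setup (e.g., TRL's `DPOTrainer`); the values below are placeholders, not our exact training configuration, and selecting the best checkpoint by validation loss is only an approximation of picking the best point on the training curves by hand.

```python
# Minimal sketch (placeholder values, not our exact DPO config): save a
# checkpoint per epoch and keep the best one rather than the final weights.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dpo_llama3_8b",          # placeholder output path
    num_train_epochs=2,                  # we train for at most two epochs
    eval_strategy="epoch",               # named `evaluation_strategy` on older transformers versions
    save_strategy="epoch",               # keep a checkpoint at every epoch boundary
    load_best_model_at_end=True,         # restore the best checkpoint, not the last one
    metric_for_best_model="eval_loss",   # rough proxy for "best point on the training curve"
    greater_is_better=False,
)
# Pass `args` to your DPO trainer together with the model, tokenizer, and
# preference data, then evaluate the checkpoint restored at the end of training.
```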

To reproduce our results, you can use the models we have open-sourced here:
DeCLaRe-Lab Trust-Align Collection.
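
For example, a minimal sketch of loading one of the released checkpoints with `transformers` (the repository id below is a placeholder; substitute the exact model name from the Trust-Align collection):

```python
# Minimal sketch: evaluate against a released checkpoint instead of a
# locally trained one.
# NOTE: the repo id is a placeholder; copy the exact id from the
# Trust-Align collection on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "declare-lab/<trust-align-checkpoint>"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate` to shard across your GPUs
)
```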
