
Reproduction Failure on Llama 3 #3

Open · jeremery opened this issue Jan 23, 2025 · 3 comments

@jeremery

Hello, thank you for the tremendous contribution your team has made to trustworthy inference for large models. I encountered some issues while reproducing your experiments and would appreciate your help in resolving them.

When using the eval method you describe to compute the metrics, we found discrepancies between our results and yours. Specifically, when testing a Llama-3-8B-Instruct model trained with DPO on asqa_eval_top100_calibrated.json, the metrics did not match those reported in the paper.

The results I obtained are as follows:

[Image: screenshot of evaluation results]

My testing environment is Ubuntu 20.04, PyTorch 2.5.0, vLLM 0.6.5, and 4× A100 80 GB GPUs. The DPO training parameters are consistent with those in your code (only the dataset and model load paths were modified). Could you help explain the possible cause of this discrepancy?

@shanghongsim
Collaborator

Hi! Thank you so much for trying out our codebase and for your interest in our work! :)

The metrics you shared differ from our reported results by 1–3 points, with no clear pattern (i.e., they are not uniformly higher or lower). This is most likely due to the inherent variance of the generation process: with the default temperature of 0.5, sampling introduces randomness, so outputs (and hence scores) can differ slightly between runs. Even in our own internal testing we observed run-to-run variation of roughly 1–3 points in the evaluation scores. This behavior is entirely normal and expected. We hope this clarifies things!
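
For reference, here is a minimal sketch of where the variance comes from, assuming a vLLM-based generation loop like ours (the model path, prompt, and `max_tokens` value below are placeholders, not our exact eval settings). Re-running the stochastic configuration produces slightly different outputs each time; greedy decoding (temperature 0) is a quick way to check how much of the gap sampling randomness alone explains.

```python
# Minimal sketch (placeholder model/prompt, not our exact eval script):
# illustrates why metrics can shift by a few points between runs when
# sampling with temperature > 0.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder checkpoint

prompts = ["Answer the question using the provided passages: ..."]  # placeholder prompt

# Default setting in our eval: temperature 0.5 -> generations (and hence
# the resulting scores) vary from run to run.
stochastic = SamplingParams(temperature=0.5, top_p=1.0, max_tokens=512)

# Greedy decoding: deterministic for a fixed prompt, useful as a sanity
# check on how much of the discrepancy comes from sampling randomness.
greedy = SamplingParams(temperature=0.0, max_tokens=512)

for params in (stochastic, greedy):
    outputs = llm.generate(prompts, params)
    print(f"temperature={params.temperature}: {outputs[0].outputs[0].text[:200]!r}")
```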

@jeremery
Author

Thank you very much for taking the time to reply. Regarding the DPO training method described in the paper, I ran corresponding experiments on Llama2-7b-chat-hf, Llama2-13b-chat-hf, and Llama3.1-8b; the results are as follows:

[Image: screenshot of evaluation results for the three models]

These results were all obtained on the asqa_eval_top100_calibrated.json dataset, with experimental parameters consistent with yours. They show that the DPO-trained models score lower on the corresponding metrics. How can this result be explained?

@shanghongsim
Collaborator

Hi there, apologies for the delayed response! I’d like to clarify a few points:

  • Did you perform SFT before running DPO?
  • Was llama2-7b-chat-hf tested with or without ICL?
  • Was llama2-7b-chat-hf-dpo tested with or without ICL?

Even without this information, the difference could stem from the specific checkpoint used for testing. I assume you followed the implementation precisely and evaluated the model after two full epochs; however, we trained for at most two epochs and then selected the best checkpoint based on the training curves.
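
To make the checkpoint-selection point concrete, here is a minimal sketch assuming a Hugging Face Trainer-style DPO setup (e.g., TRL's `DPOTrainer`); the values below are placeholders, not our exact training configuration, and selecting the best checkpoint by validation loss is only an approximation of picking the best point on the training curves by hand.

```python
# Minimal sketch (placeholder values, not our exact DPO config): save a
# checkpoint per epoch and keep the best one rather than the final weights.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dpo_llama3_8b",          # placeholder output path
    num_train_epochs=2,                  # we train for at most two epochs
    eval_strategy="epoch",               # named `evaluation_strategy` on older transformers versions
    save_strategy="epoch",               # keep a checkpoint at every epoch boundary
    load_best_model_at_end=True,         # restore the best checkpoint, not the last one
    metric_for_best_model="eval_loss",   # rough proxy for "best point on the training curve"
    greater_is_better=False,
)
# Pass `args` to your DPO trainer together with the model, tokenizer, and
# preference data, then evaluate the checkpoint restored at the end of training.
```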

To reproduce our results, you can use the models we have open-sourced here:
DeCLaRe-Lab Trust-Align Collection.
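
For example, a minimal sketch of loading one of the released checkpoints with `transformers` (the repository id below is a placeholder; substitute the exact model name from the Trust-Align collection):

```python
# Minimal sketch: evaluate against a released checkpoint instead of a
# locally trained one.
# NOTE: the repo id is a placeholder; copy the exact id from the
# Trust-Align collection on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "declare-lab/<trust-align-checkpoint>"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate` to shard across your GPUs
)
```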
