[SUBMISSION] December 2024 Completion of Module 2 #153
December 2024 Student Submission
Module Completed
Changes Made
Describe what you've done in this PR:
What concepts did you learn?
I learned about the DPO and ORPO methods, which I was not aware of before, and how they can be applied to align a base LLM with human preferences instead of going through full RLHF. I also learned that the dataset format required by each method is critical.
In the process I came across the Argilla and distilabel tools, which can be leveraged to prepare the preference (DPO-style) datasets these methods need.
I also found that, in my runs, plain DPO alignment was not quite up to the mark, while ORPO worked better and is more optimized than DPO; the trade-off is that ORPO is meant to start from a base (non-SFT) model rather than an already fine-tuned one.
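For context, this is my understanding of the two objectives (notation simplified from the DPO and ORPO papers):

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}, \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

Since the SFT term is built into the ORPO loss and there is no reference policy $\pi_{\text{ref}}$, it makes sense that ORPO starts from the base model, whereas DPO expects an SFT checkpoint plus a frozen reference copy.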
What changes or additions did you make?
For DPO I tried the truthy-dpo-v0.1 dataset and also added an inference call for the newly trained SmolLM-135M model.
For ORPO I kept the same ultrafeedback_binarized dataset.
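Roughly, the DPO run and the inference call have the shape sketched below (simplified from memory rather than the exact notebook code; the model id and some argument names are assumptions and may vary across trl versions):

```python
# Simplified sketch of the DPO run plus an inference call (not the exact
# notebook code; the model id and argument names are assumptions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed SFT checkpoint; DPO expects one
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# truthy-dpo-v0.1 already ships explicit prompt / chosen / rejected columns;
# keep only the columns the trainer needs.
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")
dataset = dataset.select_columns(["prompt", "chosen", "rejected"])

config = DPOConfig(output_dir="smollm-135m-dpo", beta=0.1, num_train_epochs=1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions call this argument `tokenizer`
)
trainer.train()

# Inference with the newly trained model
prompt = "What happens if you stare at the sun?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = trainer.model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The ORPO run follows the same pattern with `ORPOConfig`/`ORPOTrainer`, except it starts from the base (non-instruct) checkpoint and points at ultrafeedback_binarized.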
Yes, for ORPO with the ultrafeedback_binarized dataset the alignment run took ~3 hours, but I'm happy it completed.

Notebooks Added/Modified
List any notebooks you've added or modified:
2_preference_alignment/student_examples/ShankarChavan/dpo_finetuning_example.ipynb
Checklist
december-2024 branch

Questions or Discussion Points
Add any questions you have or points you'd like to discuss:
What is the difference between the dataset formats prompt|chosen|rejected (truthy_dpo) and chosen|rejected (ultrafeedback), and how do we decide which one to use? If possible, could you point me to a link or blog that explains how to choose based on the base LLM we are using for alignment?
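To make the question concrete, this is the kind of conversion I have in mind, assuming the ultrafeedback chosen/rejected fields are message lists that share an implicit prompt (the repo id and column layout here are assumptions based on the dataset card):

```python
# Sketch: turn an implicit-prompt preference dataset (chosen/rejected message
# lists) into the explicit prompt|chosen|rejected layout that truthy_dpo uses.
# Repo id and column layout are assumptions, not verified against my notebook.
from datasets import load_dataset

ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

def make_explicit(example):
    # first message is the shared user prompt, last message is the assistant reply
    return {
        "prompt": example["chosen"][0]["content"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

explicit_ds = ds.map(make_explicit, remove_columns=ds.column_names)
print(explicit_ds[0])
```

With a conversion like this both datasets end up in the same prompt|chosen|rejected shape, so my question is mainly whether this is the intended approach or whether the trainers are expected to handle the implicit-prompt format directly.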
Additional Notes
Any other information that might be helpful for reviewers: