[SUBMISSION] December 2024 Completion of Module 2 #153

Open · wants to merge 5 commits into base: december-2024
Conversation

@ShankarChavan commented Dec 31, 2024

December 2024 Student Submission

Module Completed

  • Module 1: Instruction Tuning
  • Module 2: Preference Alignment
  • Module 3: Parameter-efficient Fine-tuning
  • Module 4: Evaluation
  • Module 5: Vision-language Models
  • Module 6: Synthetic Datasets
  • Module 7: Inference
  • Module 8: Deployment

Changes Made

Describe what you've done in this PR:

  1. What concepts did you learn?
    I learned about the DPO and ORPO methods, which I was not previously aware of, and how they can be applied to align a base LLM instead of using RLHF. I also learned that the dataset format these methods require is critical.

    In the process I came across the Argilla and distilabel tools, which can be leveraged to prepare the preference (DPO) datasets needed for these methods.

    I also found that the DPO alignment method on its own was not up to the mark, while ORPO is better optimized than DPO; however, ORPO requires a non-SFT (base) LLM for alignment.

  2. What changes or additions did you make?
    I tried the truthy-dpo-v0.1 dataset with the DPO method and also added an inference call for the newly trained smol-135M LLM; a rough sketch of this setup is shown after this list.

    For ORPO I used the same ultrafeedback_binarized dataset as in the module notebook.

  3. Any challenges you faced?
    Yes, aligning with ORPO on the ultrafeedback_binarized dataset took about 3 hours, but I'm happy it completed. (The ORPO setup is also sketched after this list.)
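
For context, the DPO setup looks roughly like the sketch below. The hub ids (`HuggingFaceTB/SmolLM2-135M-Instruct` for the smol-135M model, `jondurbin/truthy-dpo-v0.1` for the dataset) and the hyperparameters are assumed placeholders for illustration, not the exact values from the notebook.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed hub id for the smol-135M model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# truthy-dpo already ships explicit prompt | chosen | rejected columns;
# keep only the three the trainer needs (drops extra metadata like system/source)
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")
dataset = dataset.select_columns(["prompt", "chosen", "rejected"])

args = DPOConfig(
    output_dir="smollm-135m-dpo",
    beta=0.1,                       # illustrative preference-loss strength
    per_device_train_batch_size=2,  # illustrative, not the exact run settings
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
)
trainer.train()

# quick inference check on the aligned model (prompt is just an example)
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```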
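
The ORPO run follows the same pattern but starts from the base (non-SFT) model and the ultrafeedback_binarized preferences. Again, the hub ids, split name, and hyperparameters are assumptions for illustration; the conversational chosen/rejected pairs are flattened to plain strings so the sketch stays independent of the TRL version.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "HuggingFaceTB/SmolLM2-135M"  # base (non-SFT) model; hub id assumed
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def flatten(example):
    # chosen/rejected here are message lists; keep only plain strings
    return {
        "prompt": example["chosen"][0]["content"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dataset = dataset.map(flatten, remove_columns=dataset.column_names)

args = ORPOConfig(
    output_dir="smollm-135m-orpo",
    per_device_train_batch_size=2,  # illustrative; the real run took ~3 hours
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
)
trainer.train()
```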

Notebooks Added/Modified

List any notebooks you've added or modified:

  • Added new example in 2_preference_alignment/student_examples/ShankarChavan/dpo_finetuning_example.ipynb
  • Modified existing notebook with additional examples
  • Added documentation or comments

Checklist

  • I have read the module materials
  • My code runs without errors
  • I have pushed models and datasets to the huggingface hub
  • My PR is based on the december-2024 branch

Questions or Discussion Points

Add any questions you have or points you'd like to discuss:

  1. I found it difficult to decide which DPO dataset format to use, e.g., prompt|chosen|rejected (truthy_dpo) versus chosen|rejected (ultrafeedback); the sketch below shows the two layouts I mean.
    How do we decide between them? Could you point me to a link or blog post that explains the choice based on the base LLM we are using for alignment?
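
To make the question concrete, here are the two row layouts being compared (the values are made-up examples, not rows from either dataset):

```python
# truthy_dpo style: explicit prompt column, plain-string completions
truthy_row = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "The capital of France is Lyon.",
}

# ultrafeedback_binarized style: chosen/rejected are full conversations
# that embed the prompt as the first user turn
ultrafeedback_row = {
    "chosen": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "rejected": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Lyon."},
    ],
}
```

The first layout keeps the prompt explicit, while the second embeds it as the first user turn of each conversation.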

Additional Notes

Any other information that might be helpful for reviewers:
