[question] best way to have my own reward model which is backed by rules #2518
Comments
Hi, I have the same question as you do. I think that there must be some easy way to simply write a reward function as a … But I also think that …
@yananchen1989 Thanks for posting this as I was stuck with a similar issue (but for …).
@yananchen1989 @oliveiraeliel @nityadav @hwhyyds @schmidtj3
Correct me if I am wrong. I understand that recent TRL versions want to unify the pipeline in a neater, more organized manner across these different RL methods, where the Trainer is the pivotal module: you kick off trainer.train() and all is set. However, this can cause excessive encapsulation, since it is hard to modularize the reward module. In my view, there is no need to rigidly force these RL methods into a unified training framework. Please advise.
Ultimately, TRL is a Hugging Face library built on top of Transformers and is part of the Hugging Face ecosystem. If the Trainer does limit flexibility, then Transformers will need to adapt; otherwise, we will have to maintain a much larger and more complex codebase. We'll come up with a way to add these features and prepare a PR soon!
@qgallouedec, do you want to comment?
Maybe having a … Alternatively, releasing the type of … In any case, I believe that the best approach is to discuss around a PR if anyone is willing to propose their approach.
i hear u. thanks
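To make the idea of a rule-backed reward concrete: in its simplest form it is just a plain callable from completion strings to scalar scores. Below is a hedged sketch of that shape; the name `rule_based_reward` and the toy rules are assumptions for illustration, and nothing here is a current TRL interface.

```python
# Hypothetical sketch of a rule-backed reward as a plain callable
# (the kind of object a relaxed reward-model type could accept).
def rule_based_reward(completions: list[str]) -> list[float]:
    """Return one scalar score per completion, computed from rules only."""
    scores = []
    for text in completions:
        score = 0.0
        if "because" in text.lower():  # toy rule: reward explanations
            score += 1.0
        if len(text.split()) > 200:    # toy rule: penalize rambling
            score -= 0.5
        scores.append(score)
    return scores


print(rule_based_reward(["It works because the cache is warm.", "hi"]))
# [1.0, 0.0]
```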
Hi there,

I see that in recent versions from 0.11, especially in PPO, there is a big change: only a `reward_model` of `nn.Module` type is accepted for scoring, in contrast to previous versions where I could use my own reward module to get the scores on the fly. Thus, my question is: if I have my own reward module, which is backed by rules instead of an LM, what is the recommended practice to integrate it into the current version?

What comes into my mind are two manners: … or override `get_reward` (https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L1051) and write my rules within it?

My question also applies to `OnlineDPOTrainer`.

Thanks.
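For what it's worth, here is a minimal, hypothetical sketch of one direction: wrapping rule-based scoring in an `nn.Module` so that it turns a batch of token IDs into one scalar score per sequence. The class name `RuleBasedReward`, the `score_texts` helper, the toy digit rule, and the `context_length` argument are illustrative assumptions, not TRL APIs; to actually use something like this with the current `PPOTrainer`, the call path that goes through `get_reward` would still need to be replaced (for example by subclassing the trainer or patching `trl.trainer.utils.get_reward`), which is exactly the kind of change being discussed above.

```python
# Hypothetical sketch: a rule-backed reward wrapped in an nn.Module.
# Names (RuleBasedReward, score_texts) are illustrative, not TRL APIs.
import torch
import torch.nn as nn
from transformers import AutoTokenizer


class RuleBasedReward(nn.Module):
    """Scores completions with hand-written rules instead of a learned head."""

    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer

    @torch.no_grad()
    def score_texts(self, texts):
        # Toy rule: +1 if the completion contains a digit, -1 otherwise.
        scores = [1.0 if any(c.isdigit() for c in t) else -1.0 for t in texts]
        return torch.tensor(scores, dtype=torch.float32)

    @torch.no_grad()
    def forward(self, query_responses: torch.Tensor, context_length: int = 0):
        # Decode the response part of each sequence and apply the rules.
        responses = query_responses[:, context_length:]
        texts = self.tokenizer.batch_decode(responses, skip_special_tokens=True)
        return self.score_texts(texts)


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
    reward = RuleBasedReward(tok)
    batch = tok(["The answer is 42", "no numbers here"],
                return_tensors="pt", padding=True)
    print(reward(batch["input_ids"]))  # tensor([ 1., -1.])
```

Whether this wrapper or a patched `get_reward` is the better fit depends on how the trainer internals evolve, so it is probably best treated as a starting point for the PR discussion rather than a drop-in solution.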