[question] best way to have my own reward model which is backed by rules #2518

Open
yananchen1989 opened this issue Dec 24, 2024 · 8 comments · May be fixed by #2540

@yananchen1989

yananchen1989 commented Dec 24, 2024

Hi there,
I see that in recent versions, starting from 0.11, there is a big change, especially in PPO: only a reward_model that is an nn.Module is accepted for scoring, in contrast to previous versions where I could use my own reward module to compute scores on the fly.

So my question is: if I have my own reward module, which is backed by rules instead of an LM, what is the recommended practice for integrating it into the current version?

Two approaches come to mind:

My question also applies to OnlineDPOTrainer.

Thanks.

@August-murr August-murr added ❓ question Seeking clarification or more information 🏋 PPO Related to PPO labels Dec 25, 2024
@oliveiraeliel

Hi, I have the same question.

I think there must be an easy way to simply wrap a reward function in an nn.Module so that nothing needs to be refactored, but I haven't tried it yet (a rough sketch of the idea follows below).

I also think PPOTrainer should accept a custom_get_reward_function as an optional parameter. That way anyone could define their own reward function, and it would be a clean solution.
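
For example, a rough sketch of such a wrapper could look like the following. It assumes the trainer simply calls `reward_model(input_ids)` on generated token ids and expects one scalar score per sequence; TRL's PPOTrainer may actually expect a sequence-classification-style model with a score head, so this is only an illustration of the wrapping idea, not a verified drop-in.

```python
import torch
import torch.nn as nn


class RuleBasedReward(nn.Module):
    """Hypothetical wrapper that exposes a rule-based reward as an nn.Module.

    Assumption (not checked against TRL internals): the trainer calls
    reward_model(input_ids) on generated token ids and expects one scalar
    score per sequence.
    """

    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        # Dummy buffer so the module has a device/dtype like a real model.
        self.register_buffer("_dummy", torch.zeros(1))

    def rule_score(self, text: str) -> float:
        # Placeholder rules: reward completions that give a reason and
        # are not too short.
        score = 0.0
        if "because" in text.lower():
            score += 1.0
        score += min(len(text.split()), 50) / 50.0
        return score

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        scores = [self.rule_score(t) for t in texts]
        return torch.tensor(scores, device=input_ids.device)
```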

@nityadav

@yananchen1989 Thanks for posting this, as I was stuck with a similar issue (but for OnlineDPOTrainer). The easiest workaround for me was to subclass the trainer class (OnlineDPOTrainer) and override training_step with my custom get_reward logic, keeping the rest of the implementation the same as the original method.
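
A rough outline of that workaround is below. The exact `training_step` signature and the place where the reward is computed differ between TRL versions, so the body has to be copied from the installed version and adapted; the names `RuleRewardOnlineDPOTrainer` and `compute_rule_rewards` are made up here.

```python
from trl import OnlineDPOTrainer


class RuleRewardOnlineDPOTrainer(OnlineDPOTrainer):
    def compute_rule_rewards(self, prompts, completions):
        # Custom rule-based scoring: one float per completion.
        return [float("because" in c.lower()) for c in completions]

    def training_step(self, model, inputs, *args, **kwargs):
        # Copy the parent's training_step body from your TRL version here and
        # replace the reward-model call with self.compute_rule_rewards(...).
        raise NotImplementedError("copy and adapt the parent implementation")
```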

@August-murr
Collaborator

@yananchen1989 @oliveiraeliel @nityadav @hwhyyds @schmidtj3
This has been a recurring question, so before implementing a solution, I would like to ask you all for examples of when you would need this feature so that we can think of a good solution.

@yananchen1989
Author

yananchen1989 commented Dec 30, 2024

Correct me if I am wrong.
I would like to know the primary motivation for rewriting DPO from the older version to the current unified Trainer version. Maybe for better efficiency?

I understand that recent TRL versions want to unify the pipeline in a neater, more organized way across the different RL methods, with Trainer as the pivotal module: you kick off trainer.train() and everything is handled. For methods like PPO that need a reward module, it is passed directly into the trainer, while for DPO or SFT there is no provision for a reward module.

However, this can cause excessive encapsulation, since it becomes hard to modularize the reward module.
The core reason is that in practical cases the reward module can take any form, not just a single torch.nn module that scores the whole output. It may be a mixture of components, may depend on external parameters or on the prompt, and, most importantly, may not be able to score the PPO trainer's outputs in batch mode.
Either way, flexibility is significantly reduced.
That said, the current unified pipeline works fine for other methods such as DPO, since they have no reward concerns and the reward is implicitly expressed within the algorithm.

In my view, there is no need to rigidly move these RL methods into a unified training framework.

Please advise.

@August-murr
Collaborator

Ultimately, TRL is a Hugging Face library built on top of Transformers and is part of the Hugging Face ecosystem. If the Trainer does limit flexibility, then Transformers will need to adapt; otherwise, we will have to maintain a much larger and more complex codebase.

We'll come up with a way to add these features and prepare a PR soon!

@August-murr
Collaborator

@qgallouedec, do you want to comment?

@qgallouedec
Member

Maybe having a reward_func arg of type Callable is an option.

Alternatively, relaxing the type of reward_model to accept any Callable is also an option. But given that a custom reward func won't return the same type/shape as a proper reward_model, I'm a bit afraid it would require overcomplicated logic.

In any case, I believe the best approach is to discuss this around a PR, if anyone is willing to propose their approach.
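
For illustration, a rule-based `reward_func` under the first option might look like the sketch below. The name, the `(prompts, completions)` signature, and the per-completion tensor return value are all assumptions about a possible API, not something TRL currently exposes.

```python
from typing import List

import torch


# Hypothetical interface: the trainer would call reward_func(prompts, completions)
# and get back one scalar score per completion.
def rule_based_reward_func(prompts: List[str], completions: List[str]) -> torch.Tensor:
    scores = []
    for prompt, completion in zip(prompts, completions):
        score = 0.0
        # Example rules: reward completions that end cleanly and reuse
        # words from the start of the prompt.
        if completion.strip().endswith((".", "!", "?")):
            score += 0.5
        if any(word in completion.lower() for word in prompt.lower().split()[:5]):
            score += 0.5
        scores.append(score)
    return torch.tensor(scores)
```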

@yananchen1989
Author

I hear you. Thanks.
