[question] best way to have my own reward model which is backed by rules #2518

Open
yananchen1989 opened this issue Dec 24, 2024 · 8 comments · May be fixed by #2540

@yananchen1989

yananchen1989 commented Dec 24, 2024

Hi there,
I see that in recent versions, starting from 0.11, there is a big change, especially in PPO: only a reward_model that is an nn.Module is accepted for scoring, in contrast to previous versions where I could use my own reward module to compute scores on the fly.

So my question is: if I have my own reward module, which is backed by rules instead of an LM, what is the recommended practice for integrating it into the current version?

Two approaches come to mind:

My question also applies to OnlineDPOTrainer.

Thanks.

@August-murr August-murr added ❓ question Seeking clarification or more information 🏋 PPO Related to PPO labels Dec 25, 2024
@oliveiraeliel

Hi, I have the same question.

I think there must be an easy way to simply wrap a reward function in an nn.Module so that nothing needs to be refactored, but I haven't tried it yet (a rough sketch of the idea follows below).

I also think PPOTrainer should accept a custom_get_reward_function as an optional parameter. That way anyone could define their own reward function, and it would be a clean solution.
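
For example, a rough sketch of such a wrapper could look like the following. It assumes the trainer simply calls `reward_model(input_ids)` on generated token ids and expects one scalar score per sequence; TRL's PPOTrainer may actually expect a sequence-classification-style model with a score head, so this is only an illustration of the wrapping idea, not a verified drop-in.

```python
import torch
import torch.nn as nn


class RuleBasedReward(nn.Module):
    """Hypothetical wrapper that exposes a rule-based reward as an nn.Module.

    Assumption (not checked against TRL internals): the trainer calls
    reward_model(input_ids) on generated token ids and expects one scalar
    score per sequence.
    """

    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        # Dummy buffer so the module has a device/dtype like a real model.
        self.register_buffer("_dummy", torch.zeros(1))

    def rule_score(self, text: str) -> float:
        # Placeholder rules: reward completions that give a reason and
        # are not too short.
        score = 0.0
        if "because" in text.lower():
            score += 1.0
        score += min(len(text.split()), 50) / 50.0
        return score

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        scores = [self.rule_score(t) for t in texts]
        return torch.tensor(scores, device=input_ids.device)
```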

@nityadav

@yananchen1989 Thanks for posting this, as I was stuck with a similar issue (but for OnlineDPOTrainer). The easiest workaround for me was to subclass the trainer class (OnlineDPOTrainer) and override training_step with my custom get_reward logic, keeping the rest of the implementation the same as the original method.
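
A rough outline of that workaround is below. The exact `training_step` signature and the place where the reward is computed differ between TRL versions, so the body has to be copied from the installed version and adapted; the names `RuleRewardOnlineDPOTrainer` and `compute_rule_rewards` are made up here.

```python
from trl import OnlineDPOTrainer


class RuleRewardOnlineDPOTrainer(OnlineDPOTrainer):
    def compute_rule_rewards(self, prompts, completions):
        # Custom rule-based scoring: one float per completion.
        return [float("because" in c.lower()) for c in completions]

    def training_step(self, model, inputs, *args, **kwargs):
        # Copy the parent's training_step body from your TRL version here and
        # replace the reward-model call with self.compute_rule_rewards(...).
        raise NotImplementedError("copy and adapt the parent implementation")
```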

@August-murr
Collaborator

@yananchen1989 @oliveiraeliel @nityadav @hwhyyds @schmidtj3
This has been a recurring question, so before implementing a solution, I would like to ask you all for examples of when you would need this feature so that we can think of a good solution.

@yananchen1989
Author

yananchen1989 commented Dec 30, 2024

Correct me if I am wrong.
I would like to know the primary motivation for rewriting DPO from the older version to the current unified Trainer version. Maybe for better efficiency?

I understand that recent TRL versions want to unify the pipeline in a neater, more organized way across the different RL methods, with Trainer as the pivotal module: you kick off trainer.train() and everything is handled. For methods like PPO that need a reward module, it is passed directly into the trainer, while for DPO or SFT there is no provision for a reward module.

However, this can cause excessive encapsulation, since it becomes hard to modularize the reward module.
The core reason is that in practical cases the reward module can take any form, not just a single torch.nn module that scores the whole output. It may be a mixture of components, may depend on external parameters or on the prompt, and, most importantly, may not be able to score the PPO trainer's outputs in batch mode.
Either way, flexibility is significantly reduced.
That said, the current unified pipeline works fine for other methods such as DPO, since they have no reward concerns and the reward is implicitly expressed within the algorithm.

In my view, there is no need to rigidly move these RL methods into a unified training framework.

Please advise.

@August-murr
Collaborator

Ultimately, TRL is a Hugging Face library built on top of Transformers and is part of the Hugging Face ecosystem. If the Trainer does limit flexibility, then Transformers will need to adapt; otherwise, we will have to maintain a much larger and more complex codebase.

We'll come up with a way to add these features and prepare a PR soon!

@August-murr
Collaborator

@qgallouedec, do you want to comment?

@qgallouedec
Member

Maybe having a reward_func arg of type Callable is an option.

Alternatively, relaxing the type of reward_model to accept any Callable is also an option. But given that a custom reward func won't return the same type/shape as a proper reward_model, I'm a bit afraid it would require overcomplicated logic.

In any case, I believe the best approach is to discuss this around a PR, if anyone is willing to propose their approach.
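
For illustration, a rule-based `reward_func` under the first option might look like the sketch below. The name, the `(prompts, completions)` signature, and the per-completion tensor return value are all assumptions about a possible API, not something TRL currently exposes.

```python
from typing import List

import torch


# Hypothetical interface: the trainer would call reward_func(prompts, completions)
# and get back one scalar score per completion.
def rule_based_reward_func(prompts: List[str], completions: List[str]) -> torch.Tensor:
    scores = []
    for prompt, completion in zip(prompts, completions):
        score = 0.0
        # Example rules: reward completions that end cleanly and reuse
        # words from the start of the prompt.
        if completion.strip().endswith((".", "!", "?")):
            score += 0.5
        if any(word in completion.lower() for word in prompt.lower().split()[:5]):
            score += 0.5
        scores.append(score)
    return torch.tensor(scores)
```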

@yananchen1989
Author

I hear you. Thanks.
