In the `inverse_learning_error` module, an attribute `true_reward` is passed and set at initialization of the `ILE` class. However, this attribute is never used.
Instead, `ILE`'s `evaluate` method uses the environment passed at initialization when calling `ValueIteration`. If this environment is a `RewardWrapper`, the returned values are based on the wrapped rather than the true reward (via `get_reward_matrix`).
The upshot is that whenever `ILE` is initialized with a `RewardWrapper` as its first positional argument, we get results that aren't based on the true reward. If I recall correctly, in the experiment script that produced our results, `ILE` was in fact initialized with a `RewardWrapper` whose `reward_function` was constant zero.
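To make the suspected failure mode concrete, here is a minimal sketch of the wrapper behavior. All class and method bodies are hypothetical reconstructions (the real `RewardWrapper` and `get_reward_matrix` may differ); it only illustrates why a constant-zero `reward_function` hides the true reward from anything that queries the wrapper:

```python
import numpy as np

class GridEnv:
    """Toy stand-in for a base environment with a true reward matrix."""
    n_states, n_actions = 4, 2

    def get_reward_matrix(self):
        # True reward: some nonzero, state/action-dependent values.
        return np.arange(self.n_states * self.n_actions, dtype=float).reshape(
            self.n_states, self.n_actions
        )

class RewardWrapper:
    """Hypothetical reconstruction: overrides the reward with reward_function."""
    def __init__(self, env, reward_function):
        self.env = env
        self.reward_function = reward_function
        self.n_states, self.n_actions = env.n_states, env.n_actions

    def get_reward_matrix(self):
        # Built from reward_function, NOT from the base env's true reward.
        return np.array([
            [self.reward_function(s, a) for a in range(self.n_actions)]
            for s in range(self.n_states)
        ])

true_reward = GridEnv().get_reward_matrix()
wrapped = RewardWrapper(GridEnv(), reward_function=lambda s, a: 0.0)

# Anything that calls get_reward_matrix on the wrapper (as ValueIteration
# would, if handed the wrapped env) sees all zeros, not the true reward.
print(np.allclose(wrapped.get_reward_matrix(), 0.0))         # True
print(np.allclose(wrapped.get_reward_matrix(), true_reward)) # False
```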
I think we need a more general solution to keep track of when we're using the true vs. a custom reward function. Otherwise I worry that we'll see many similar bugs. Our current way of handling this seems too opaque; e.g., for the above analysis I had to look at three different functions in three different modules, and in the end the crucial part happened in a complicated nested for loop. (I'm still only 80% confident that I identified the bug correctly.)
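One possible direction, sketched under assumed interfaces (the names `value_iteration`, the `(S, A, S)` transition layout, and the `reward=` parameter are all hypothetical, not the library's actual API): make the reward an explicit argument everywhere, so `evaluate` passes `self.true_reward` instead of relying on whatever reward the (possibly wrapped) environment exposes:

```python
import numpy as np

def value_iteration(transitions, reward, gamma=0.9, tol=1e-8):
    """Tabular value iteration over an explicitly supplied reward matrix.

    transitions: (S, A, S) array of transition probabilities.
    reward: (S, A) array -- the caller decides whether this is the true
    or a custom reward, so the choice is visible at every call site.
    """
    v = np.zeros(reward.shape[0])
    while True:
        q = reward + gamma * transitions @ v   # (S, A) state-action values
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# With reward=1 everywhere and gamma=0.9, values converge to 1/(1-0.9)=10;
# a constant-zero reward yields all-zero values, making a mix-up obvious.
S, A = 3, 2
rng = np.random.default_rng(0)
transitions = rng.dirichlet(np.ones(S), size=(S, A))

v_true = value_iteration(transitions, np.ones((S, A)))
v_zero = value_iteration(transitions, np.zeros((S, A)))
print(v_true)  # ~[10. 10. 10.]
print(v_zero)  # [0. 0. 0.]
```

The design point is that no code path silently inherits a reward from a wrapper: whoever calls `value_iteration` must name the reward it wants, which would have surfaced the constant-zero `reward_function` immediately.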