
ILE calculated using initial/estimated rather than true reward #41

Open
maxmdaniel opened this issue Oct 14, 2018 · 1 comment

maxmdaniel (Collaborator) commented Oct 14, 2018

In the inverse_learning_error module:

  • The attribute true_reward is passed and set when the ILE class is initialized, but it is never used afterwards.
  • Instead, ILE's evaluate method uses the environment passed at initialization when calling ValueIteration. If this environment is a RewardWrapper, the returned values are based on the wrapped reward (obtained via get_reward_matrix) rather than the true reward.

The upshot is that whenever ILE is initialized with a RewardWrapper as its first positional argument, the results are not based on the true reward. If I recall correctly, in the experiment script that produced our results, ILE was in fact initialized with a RewardWrapper whose reward_function was constant zero.
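
For concreteness, here is a minimal sketch of the pattern described above. Only the names ILE, evaluate, true_reward, ValueIteration, RewardWrapper, and get_reward_matrix come from the description in this issue; the import path, signatures, and the run() call are assumptions for illustration, not copied from the actual module:

```python
# Sketch only -- import path, signatures, and run() are hypothetical;
# just the class/function names are taken from the issue description.
from value_iteration import ValueIteration  # hypothetical import path


class ILE:
    def __init__(self, env, true_reward):
        self.env = env                  # may be a RewardWrapper around the true env
        self.true_reward = true_reward  # set here but never read again (the bug)

    def evaluate(self):
        # ValueIteration takes its rewards from self.env via get_reward_matrix().
        # If self.env is a RewardWrapper, these are the wrapped/estimated rewards,
        # so the resulting ILE is not computed against self.true_reward at all.
        return ValueIteration(self.env).run()  # hypothetical run() API
```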

I think we need a more general solution for keeping track of whether we're using the true or a custom reward function; otherwise I worry that we'll see many similar bugs. Our current way of handling this seems too opaque: for the above analysis I had to look at three different functions in three different modules, and in the end the crucial part happened in a complicated nested for loop. (I'm still only 80% confident that I identified the bug correctly.)
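
One possible direction for such a solution, again only a sketch with assumed attribute names and interfaces: unwrap any RewardWrapper layers in a single, explicit helper, so that everything downstream of ILE is guaranteed to see the true reward.

```python
# Sketch only -- assumes a RewardWrapper keeps the wrapped environment in .env;
# import paths and the run() call are hypothetical.
from reward_wrapper import RewardWrapper    # hypothetical import path
from value_iteration import ValueIteration  # hypothetical import path


def unwrap_to_true_env(env):
    """Strip all RewardWrapper layers so callers see the true reward."""
    while isinstance(env, RewardWrapper):
        env = env.env  # assumed attribute holding the wrapped environment
    return env


class ILE:
    def __init__(self, env, true_reward):
        self.true_env = unwrap_to_true_env(env)  # environment with the true reward
        self.true_reward = true_reward

    def evaluate(self):
        # Value iteration now always runs on the unwrapped environment,
        # so its reward matrix is the true one by construction.
        return ValueIteration(self.true_env).run()  # hypothetical run() API
```

A helper like this at least makes it greppable where the true reward is required, instead of it depending on which object happens to be passed in as env.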

@JohannesHeidecke (Owner) commented

@dit7ya will fix this and push to the cleanup branch.
