
Handle datasets with no terminating states #394

Open
ikatsov opened this issue Feb 11, 2021 · 3 comments

ikatsov commented Feb 11, 2021

Imagine that we use ReAgent to train a personalization policy, and the workflow is as follows:

  1. We collect a number of user interaction histories (episodes) and train a DQN model in offline (Batch RL) mode.
  2. Some time later, we collect additional transitions for the same users and want to update the DQN model by feeding new transitions into ReAgent (incrementally, without retraining from scratch).

The question is how to handle the boundary conditions correctly: DQN assumes Q(s, a) = 0 for the final state of an episode, but here the episodes are extended with new transitions at each update, so the last observed state is not actually terminal. Does ReAgent handle this correctly?
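For reference, a minimal sketch of the standard 1-step DQN target I have in mind (illustrative PyTorch, not ReAgent's actual API); the `terminal` flag is what zeroes out the bootstrap term on the last step of an episode:

```python
import torch

def dqn_target(reward, next_q_values, terminal, gamma=0.99):
    # Standard 1-step DQN target: Q(s, a) = r + gamma * max_a' Q(s', a').
    # On terminal transitions the bootstrap term is zeroed out, i.e.
    # Q(s, a) = r, which is exactly the assumption that breaks when the
    # "last" observed state is not actually the end of the episode.
    max_next_q = next_q_values.max(dim=1).values
    return reward + gamma * (1.0 - terminal.float()) * max_next_q
```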

@MisterTea (Contributor)

Disregard my last comment, I misunderstood.

This is tricky because you need terminating states to get a reasonable model. I think in this case you need another model that predicts the total value remaining for a user, and then use that second model to augment the data so ReAgent sees complete episodes. The second model could be a rough estimate based on prior users who have abandoned.
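A minimal sketch of what that augmentation could look like, assuming a hypothetical remaining_value_model trained on users who have already abandoned (the names and transition layout here are illustrative, not ReAgent's data format):

```python
def close_open_episode(transitions, remaining_value_model, gamma=0.99):
    """Fold the predicted value-to-go of the last observed next_state into
    the last transition's reward and mark it terminal, so the truncated
    episode can be fed to the trainer as if it were complete."""
    last = transitions[-1]
    # Hypothetical auxiliary model: estimates the total discounted reward
    # the user is still expected to generate after the last observed state.
    value_to_go = remaining_value_model.predict(last["next_state"])
    last["reward"] = last["reward"] + gamma * value_to_go
    last["terminal"] = True
    return transitions
```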

@MisterTea changed the title from "Does ReAgent Support Incremental Training?" to "Handle datasets with no terminating states" on Feb 12, 2021
@vasidzius

I have the same question and a proposal to work around the problem:

Let's imagine that trajectories (episodes) are infinite in general, but information about the steps arrives incrementally (the user interacts with the system and new data is generated). So we want to update the agent's policy incrementally as well, to take better actions in the future.

Imagine we receive new data with three new steps in a particular trajectory. In this case, would it be correct to update the policy (DQN in particular) using these steps as an episode, but without the third (i.e. last) step's Q-value, whose target is calculated incorrectly because the last step is not, in fact, terminal?

In more detail, for this example we would have:

Q(s1, a1) = r1 + gamma * max_a Q(s2, a)
Q(s2, a2) = r2 + gamma * max_a Q(s3, a)
Q(s3, a3) = r3

and update the DQN using all three targets.

Proposal:

Q(s1, a1) = r1 + gamma * max_a Q(s2, a)
Q(s2, a2) = r2 + gamma * max_a Q(s3, a)

and update the DQN without the incorrectly calculated Q(s3, a3) target (see the sketch below).

Does this seem like a correct training flow?
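For concreteness, a minimal sketch of the proposed update (plain PyTorch, illustrative rather than ReAgent's API): the last observed step of the unfinished trajectory is simply dropped from the loss, because its target cannot be bootstrapped correctly:

```python
import torch
import torch.nn.functional as F

def update_on_open_trajectory(q_net, target_net, batch, gamma=0.99):
    """Sketch of the proposed update: the final observed step of an
    unfinished trajectory is excluded, since its target cannot be
    bootstrapped (the state is not really terminal)."""
    # batch["state"], batch["next_state"]: [T, state_dim]
    # batch["action"]: [T] (long), batch["reward"]: [T]
    # Only trust targets for steps whose successor state was observed.
    s, a, r = batch["state"][:-1], batch["action"][:-1], batch["reward"][:-1]
    s_next = batch["next_state"][:-1]

    with torch.no_grad():
        # No (1 - terminal) factor: none of the remaining steps are terminal.
        target = r + gamma * target_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    return loss
```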

MisterTea commented Feb 12, 2021

In this case, there will be high error in the model when you deploy it. The model will think that every state has a lot of future reward, because it has never seen one that terminates, so the Q values will be very high.
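To put a rough number on that (just a back-of-the-envelope illustration): if a user generates a roughly constant per-step reward r and the data never contains a terminal transition, the bootstrapped backup

Q(s, a) = r + gamma * max_a Q(s', a)

has nothing to stop it from growing toward its fixed point of about r / (1 - gamma), i.e. roughly 100 * r at gamma = 0.99, no matter how little value the user actually has left.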
