
Handle datasets with no terminating states #394

Open
ikatsov opened this issue Feb 11, 2021 · 3 comments

ikatsov commented Feb 11, 2021

Imagine that we use ReAgent to train a personalization policy, and the workflow is as follows:

  1. We collect a number of user interaction histories (episodes) and train a DQN model in offline (Batch RL) mode.
  2. Some time later, we collect additional transitions for the same users and want to update the DQN model by feeding new transitions into ReAgent (incrementally, without retraining from scratch).

The question is how to handle the boundary conditions correctly: DQN assumes Q(s, a) = 0 for the final state of an episode, but here the episodes are extended with new transitions at each update, so the last observed state is not actually terminal. Does ReAgent handle this correctly?
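For reference, a minimal sketch of the standard 1-step DQN target I have in mind (illustrative PyTorch, not ReAgent's actual API); the `terminal` flag is what zeroes out the bootstrap term on the last step of an episode:

```python
import torch

def dqn_target(reward, next_q_values, terminal, gamma=0.99):
    # Standard 1-step DQN target: Q(s, a) = r + gamma * max_a' Q(s', a').
    # On terminal transitions the bootstrap term is zeroed out, i.e.
    # Q(s, a) = r, which is exactly the assumption that breaks when the
    # "last" observed state is not actually the end of the episode.
    max_next_q = next_q_values.max(dim=1).values
    return reward + gamma * (1.0 - terminal.float()) * max_next_q
```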

@MisterTea (Contributor)

Disregard my last comment, I misunderstood.

This is tricky because you need terminating states to get a reasonable model. I think in this case you need another model that predicts the total value remaining for a user, and then use that second model to augment the data so ReAgent sees complete episodes. The second model could be a rough estimate based on prior users who have abandoned.
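A minimal sketch of what that augmentation could look like, assuming a hypothetical remaining_value_model trained on users who have already abandoned (the names and transition layout here are illustrative, not ReAgent's data format):

```python
def close_open_episode(transitions, remaining_value_model, gamma=0.99):
    """Fold the predicted value-to-go of the last observed next_state into
    the last transition's reward and mark it terminal, so the truncated
    episode can be fed to the trainer as if it were complete."""
    last = transitions[-1]
    # Hypothetical auxiliary model: estimates the total discounted reward
    # the user is still expected to generate after the last observed state.
    value_to_go = remaining_value_model.predict(last["next_state"])
    last["reward"] = last["reward"] + gamma * value_to_go
    last["terminal"] = True
    return transitions
```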

@MisterTea changed the title from "Does ReAgent Support Incremental Training?" to "Handle datasets with no terminating states" on Feb 12, 2021
@vasidzius

I have the same question and a proposal to work around the problem:

Let's imagine that trajectories (episodes) are infinite in general, but information about the steps arrives incrementally (the user interacts with the system and new data is generated). So we want to update the agent's policy incrementally as well, to take better actions in the future.

Imagine we receive new data with three new steps in a particular trajectory. In this case, would it be correct to update the policy (DQN in particular) using these steps as an episode, but without the third (i.e. last) step's Q-value, whose target is calculated incorrectly because the last step is not, in fact, terminal?

In more detail, for this example we would have:

Q(s1, a1) = r1 + gamma * max_a Q(s2, a)
Q(s2, a2) = r2 + gamma * max_a Q(s3, a)
Q(s3, a3) = r3

and update the DQN using all three targets.

Proposal:

Q(s1, a1) = r1 + gamma * max_a Q(s2, a)
Q(s2, a2) = r2 + gamma * max_a Q(s3, a)

and update the DQN without the incorrectly calculated Q(s3, a3) target (see the sketch below).

Does this seem like a correct training flow?
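For concreteness, a minimal sketch of the proposed update (plain PyTorch, illustrative rather than ReAgent's API): the last observed step of the unfinished trajectory is simply dropped from the loss, because its target cannot be bootstrapped correctly:

```python
import torch
import torch.nn.functional as F

def update_on_open_trajectory(q_net, target_net, batch, gamma=0.99):
    """Sketch of the proposed update: the final observed step of an
    unfinished trajectory is excluded, since its target cannot be
    bootstrapped (the state is not really terminal)."""
    # batch["state"], batch["next_state"]: [T, state_dim]
    # batch["action"]: [T] (long), batch["reward"]: [T]
    # Only trust targets for steps whose successor state was observed.
    s, a, r = batch["state"][:-1], batch["action"][:-1], batch["reward"][:-1]
    s_next = batch["next_state"][:-1]

    with torch.no_grad():
        # No (1 - terminal) factor: none of the remaining steps are terminal.
        target = r + gamma * target_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    return loss
```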

MisterTea commented Feb 12, 2021

In this case, there will be high error in the model when you deploy it. The model will think that every state has a lot of future reward, because it has never seen one that terminates, so the Q values will be very high.
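To put a rough number on that (just a back-of-the-envelope illustration): if a user generates a roughly constant per-step reward r and the data never contains a terminal transition, the bootstrapped backup

Q(s, a) = r + gamma * max_a Q(s', a)

has nothing to stop it from growing toward its fixed point of about r / (1 - gamma), i.e. roughly 100 * r at gamma = 0.99, no matter how little value the user actually has left.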
