(This is the cause of a flake in Complement's `TestRestrictedRoomsLocalJoin` tests when running with workers.)
If room A is a restricted room, restricted to members of room B, then if:

1. a user first attempts to join room A and is rejected;
2. the user then joins room B successfully; and
3. the user then attempts to join room A again

then the user can be erroneously rejected from joining room A (in step 3).
This is because there is a cache, `get_rooms_for_user_with_stream_ordering`, which is populated in step (1) on the event creator.
In step (2), the event persister will invalidate that cache and send a command over replication for other workers to do the same.
This issue then occurs if the event creator doesn't receive that command until after step (3).
This occurs easily in Complement+workers on CI, where it takes ~200 ms for the invalidation to be received on the new worker (CPU contention in CI is probably playing a big part).
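The race above can be modelled with a minimal sketch (all names here are illustrative, not Synapse's actual classes): each worker holds its own copy of the cache, and the invalidation from the persister arrives over a simulated replication channel with a delay.

```python
import threading

# Hypothetical minimal model of the race: each worker keeps its own
# "rooms for user" cache, and invalidations from the event persister
# arrive over a (simulated) replication channel with some delay.
class Worker:
    def __init__(self):
        self.cache = {}  # user -> set of joined rooms

    def rooms_for_user(self, user, db):
        # Populated on first read (step 1); only refreshed on invalidation.
        if user not in self.cache:
            self.cache[user] = set(db[user])
        return self.cache[user]

    def invalidate(self, user):
        self.cache.pop(user, None)

db = {"@alice:hs": set()}  # authoritative store
creator = Worker()         # event creator worker

# Step 1: the join to restricted room A is checked and rejected; the
# creator's cache now holds an empty room set for alice.
assert "room_b" not in creator.rooms_for_user("@alice:hs", db)

# Step 2: another worker persists alice's join to room B and sends the
# invalidation over replication -- but delivery lags (~200 ms in CI).
db["@alice:hs"].add("room_b")
timer = threading.Timer(0.2, creator.invalidate, args=("@alice:hs",))
timer.start()

# Step 3: alice retries joining room A immediately; the creator still
# sees the stale cache entry and erroneously rejects the join.
print("room_b" in creator.rooms_for_user("@alice:hs", db))  # False: the 403

timer.join()
print("room_b" in creator.rooms_for_user("@alice:hs", db))  # True, too late
```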
An obvious workaround is for the client to just sleep & retry, but it seems very ugly that we're issuing 403s to clients when they're making fully serial requests.
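That client-side workaround would look something like the following sketch (`attempt_join` is a hypothetical callable standing in for the actual join request):

```python
import time

# Hypothetical client-side workaround: retry the join a few times with a
# short backoff when the server returns 403, on the theory that the
# rejection comes from a stale cache that will be invalidated shortly.
def join_with_retry(attempt_join, retries=3, delay=0.25):
    for i in range(retries):
        status = attempt_join()
        if status != 403:
            return status
        time.sleep(delay * (2 ** i))  # back off: 0.25s, 0.5s, 1s, ...
    return status

# Simulate a server whose first two responses hit the stale cache.
responses = iter([403, 403, 200])
print(join_with_retry(lambda: next(responses), delay=0.01))  # 200
```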
Another incomplete workaround is for the event creator to invalidate its own cache, but that won't help if the join to room B takes place on a different event creator.
I think a potential solution that would work is to:

1. when persisting the join event and sending the invalidation, ensure this is done before finishing the replication request and the client request; and
2. when starting a new join for a restricted room on an event creator, fetch the latest cache-invalidation position from Redis and then only start processing the join once replication has caught up past that point.
(I'm not exactly sure how you 'fetch the latest position'; I was rather hoping there'd be `SETMAX` in Redis but didn't find it. Alternatively, it's probably possible to just `PING` Redis and treat the `PONG` as the barrier, but it's not a very nice solution.)
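The barrier idea can be sketched with a monotonic invalidation-stream position (all names here are illustrative, not Synapse's actual API): the persister bumps a global position when it broadcasts an invalidation, each worker records the last position it has applied, and the event creator blocks the restricted-room join until it has caught up.

```python
import threading

# Sketch of the proposed barrier, assuming a monotonic "cache
# invalidation stream". Names are illustrative only.
class InvalidationStream:
    def __init__(self):
        self._pos = 0
        self._lock = threading.Lock()

    def advance(self):
        # Persister side: called when an invalidation is broadcast.
        with self._lock:
            self._pos += 1
            return self._pos

    def current_position(self):
        # Stand-in for "fetch the latest position from Redis".
        with self._lock:
            return self._pos

class WorkerReplication:
    def __init__(self):
        self.applied = 0
        self._caught_up = threading.Condition()

    def on_invalidation(self, pos):
        # Replication reader: record that we've applied up to `pos`.
        with self._caught_up:
            self.applied = pos
            self._caught_up.notify_all()

    def wait_for_position(self, pos, timeout=5.0):
        # Event creator: block the join until replication catches up.
        with self._caught_up:
            return self._caught_up.wait_for(
                lambda: self.applied >= pos, timeout=timeout
            )

stream = InvalidationStream()
creator = WorkerReplication()

# Persister: alice joins room B; invalidation sent at position 1.
pos = stream.advance()

# Delivery to the event creator lags, as in CI.
threading.Timer(0.05, creator.on_invalidation, args=(pos,)).start()

# Creator: before re-checking the restricted join, read the latest
# position and wait until replication is past it -- only then is the
# cache trustworthy.
print(creator.wait_for_position(stream.current_position()))  # True
```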
This issue has been migrated from #13185.