Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Open
matrixbot opened this issue Dec 19, 2023 · 0 comments

Comments

@matrixbot
Copy link
Collaborator

matrixbot commented Dec 19, 2023

This issue has been migrated from #13185.


(This is the cause of a flake in Complement's TestRestrictedRoomsLocalJoin tests when running with workers.)

If room A is a restricted room, restricted to members of room B, then if:

  1. a user first attempts to join room A and is rejected;
  2. the user then joins room B successfully; and
  3. the user then attempts to join room A again

the user can be erroneously rejected from joining room A (in step (3)).

This is because there is a cache, get_rooms_for_user_with_stream_ordering, which is populated in step (1) on the event creator.
In step (2), the event persister will invalidate that cache and send a command over replication for other workers to do the same.
This issue then occurs if the event creator doesn't receive that command until after step (3).

2022-07-05-TestRestrictedRoomsLocalJoin_race

This occurs easily in Complement+workers on CI, where it takes ~200 ms for the invalidation to be received on the new worker (CPU contention in CI is probably playing a big part).

An obvious workaround is for the client to just sleep & retry, but it seems very ugly that we're issuing 403s to clients when they're making fully serial requests.

Another incomplete workaround is for the event creator to invalidate its own cache, but that won't help if the join to room B takes place on a different event creator.

I think a potential solution that would work is to:

  1. when persisting the join event and sending the invalidation, ensure this is done before finishing the replication request and the client request
  2. when starting a new join for a restricted room on an event creator, fetch the latest cache invalidation position from Redis and then only start processing the join once replication has caught up past that point.

(I'm not exactly sure how you 'fetch the latest position'; I was rather hoping there'd be SETMAX in Redis but didn't find it. Alternatively, it's probably possible to just PING Redis and treat the PONG as the barrier, but it's not a very nice solution.)

@matrixbot matrixbot changed the title Dummy issue Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room Dec 21, 2023
@matrixbot matrixbot reopened this Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants