Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

matrixbot · 2023-12-19T19:45:33Z

This issue has been migrated from #13185.

(This is the cause of a flake in Complement's TestRestrictedRoomsLocalJoin tests when running with workers.)

If room A is a restricted room, restricted to members of room B, then if:

a user first attempts to join room A and is rejected;
the user then joins room B successfully; and
the user then attempts to join room A again

the user can be erroneously rejected from joining room A (in step (3)).

This is because there is a cache, get_rooms_for_user_with_stream_ordering, which is populated in step (1) on the event creator.
In step (2), the event persister will invalidate that cache and send a command over replication for other workers to do the same.
This issue then occurs if the event creator doesn't receive that command until after step (3).

This occurs easily in Complement+workers on CI, where it takes ~200 ms for the invalidation to be received on the new worker (CPU contention in CI is probably playing a big part).

An obvious workaround is for the client to just sleep & retry, but it seems very ugly that we're issuing 403s to clients when they're making fully serial requests.

Another incomplete workaround is for the event creator to invalidate its own cache, but that won't help if the join to room B takes place on a different event creator.

I think a potential solution that would work is to:

when persisting the join event and sending the invalidation, ensure this is done before finishing the replication request and the client request
when starting a new join for a restricted room on an event creator, fetch the latest cache invalidation position from Redis and then only start processing the join once replication has caught up past that point.

(I'm not exactly sure how you 'fetch the latest position'; I was rather hoping there'd be SETMAX in Redis but didn't find it. Alternatively, it's probably possible to just PING Redis and treat the PONG as the barrier, but it's not a very nice solution.)

The text was updated successfully, but these errors were encountered:

matrixbot closed this as completed Dec 19, 2023

matrixbot changed the title ~~Dummy issue~~ Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room Dec 21, 2023

matrixbot added A-Testing T-Defect Z-Read-After-Write labels Dec 21, 2023

matrixbot reopened this Dec 21, 2023

MadLittleMods added the A-Join label Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

matrixbot commented Dec 19, 2023 •

edited

Loading

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Comments

matrixbot commented Dec 19, 2023 • edited Loading

matrixbot commented Dec 19, 2023 •

edited

Loading