
Locks and Races

Adam Hooper edited this page Jun 16, 2021 · 2 revisions

This page lists all the ways in which state might get out of sync -- and what we do about it.

Workflow updates

Workflow updates are atomic. That covers anything anybody can see that pertains to the workflow: render-cache updates, newly uploaded files, steps, fetches ... anything. Nobody can see a half-completed change, ever.

In other words: a Workflow is one big blob.

We use database locks for this. Any write (even a simple Step update) must SELECT the workflow FOR UPDATE.

Django's ORM has no way to SELECT FOR SHARE, so reads must use SELECT FOR UPDATE too. That way a read blocks until any in-progress write commits.
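The rule that reads and writes take the same exclusive lock can be modeled without a database. The sketch below is a stand-in under stated assumptions, not Workbench code: `Workflow`, `locked`, `add_step` and `read_snapshot` are hypothetical names, and a `threading.Lock` plays the role of the row lock that SELECT FOR UPDATE would take.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, List, Tuple


class Workflow:
    """Hypothetical in-memory stand-in for a workflow row."""

    def __init__(self) -> None:
        self.step_names: List[str] = []
        self.step_count = 0
        # Plays the role of the row lock taken by SELECT ... FOR UPDATE.
        self._lock = threading.Lock()

    @contextmanager
    def locked(self) -> Iterator["Workflow"]:
        # Exclusive for readers and writers alike: there is no shared mode.
        with self._lock:
            yield self


def add_step(workflow: Workflow, name: str) -> None:
    with workflow.locked():
        workflow.step_names.append(name)
        # Between these two statements the workflow is half-updated.
        # Nobody can observe that, because readers take the same lock.
        workflow.step_count += 1


def read_snapshot(workflow: Workflow) -> Tuple[int, List[str]]:
    with workflow.locked():
        return workflow.step_count, list(workflow.step_names)
```

A reader that skipped `locked()` could run between the two statements in `add_step` and see a step name without the matching count. That is the kind of half-completed change the locking rule exists to prevent.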

Many workflow updates (such as "add step") happen via "commands", logged in the "delta" table. Many other updates (such as "write to render cache") don't.

User updates

User updates are also atomic.

Updates to user's usage/limits

Some operations, like "change step's auto-update frequency", modify the user's usage: a property that spans all of the user's workflows. A change like this must be atomic, too: if two workflow updates are concurrent, every observer must see them applied one after the other: 1) one update, in full; followed by 2) the other update. Nobody may see a mix of the two.

Operations that affect usage:

  • Change step update frequency
  • Delete workflow
  • Delete tab (undoing won't schedule any new fetches)
  • Delete step (undoing won't schedule any new fetches)
  • Subscribe/unsubscribe

(This list is short by design: Workbench makes it hard for users to increase their usage accidentally.)

To prevent users from seeing inconsistent states, every usage read or change must lock the User.

Order Workflow/User locks to avoid deadlock!

A workflow change that alters usage needs two locks: the Workflow lock and the User lock. To avoid deadlock, always acquire them in the same order.

The order is:

  1. Acquire Workflow lock
  2. Acquire User lock
  3. Release User lock
  4. Release Workflow lock

(This deviates from alphabetical ordering because it makes for cleaner code. User locks tend to be conditional, so we nest them deeper in our calling code.)
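The ordering above maps naturally onto nested `with` blocks, which release locks in the reverse of the order they were acquired. A minimal sketch, assuming in-process `threading.Lock` objects in place of database row locks; `held` and `change_workflow` are hypothetical names used only for illustration.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def held(lock: threading.Lock, name: str, log: List[str]) -> Iterator[None]:
    """Hold `lock` for the duration of the block, recording what happens."""
    with lock:
        log.append(f"acquire {name}")
        try:
            yield
        finally:
            # Logged just before the lock itself is released.
            log.append(f"release {name}")


def change_workflow(
    workflow_lock: threading.Lock,
    user_lock: threading.Lock,
    alters_usage: bool,
    log: List[str],
) -> None:
    # Outer lock: always the Workflow.
    with held(workflow_lock, "workflow", log):
        # Inner lock: the User, and only when the change touches usage.
        if alters_usage:
            with held(user_lock, "user", log):
                pass  # read or modify the user's usage here
```

Because the User lock is only ever taken inside the Workflow lock, no code path can hold the User lock while waiting for a Workflow lock, which is the cycle that would cause a deadlock.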

WebSocket updates

Every Workflow (or User) update gets sent to the client.

Since Workflow and User updates are ordered, the updates seen by clients should also be ordered.

Unfortunately, our messaging system won't stay in lockstep with the database:

  • We can't send messages until after database commit -- at which point, we've released our locks.
  • Even if we held locks while sending messages, our message queueing system doesn't guarantee ordering. We don't confirm published messages -- nor should we, because we drop messages by design when clients back-pressure. Without confirmations, we can't guarantee ordering. (We could route all messages through an intermediary queue just to make sure they're ordered ... but if we reach this point, our system is too complex and we should abandon RabbitMQ instead.)

All that to say: there is a race, in theory. Clients may receive workflow updates out of order.
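The race can be reproduced on paper by interleaving two writers by hand. The following sketch is a toy model, not Workbench's messaging code: `writer` and `run` are hypothetical, one list stands in for the database's commit order, and another stands in for the order in which messages are published.

```python
from typing import Dict, Iterator, List, Tuple


def writer(name: str, commits: List[str], published: List[str]) -> Iterator[None]:
    # Step 1: lock, write and commit. The lock is released at commit.
    commits.append(name)
    yield  # another writer may run here, while no lock is held
    # Step 2: publish the update, with no lock held.
    published.append(name)


def run(schedule: List[str]) -> Tuple[List[str], List[str]]:
    """Advance each named writer by one step per appearance in `schedule`."""
    commits: List[str] = []
    published: List[str] = []
    writers: Dict[str, Iterator[None]] = {
        name: writer(name, commits, published) for name in set(schedule)
    }
    for name in schedule:
        next(writers[name], None)  # None: ignore a writer that has finished
    return commits, published
```

With the schedule `["A", "B", "B", "A"]`, `run` returns `(["A", "B"], ["B", "A"])`: the database saw A before B, but the messages went out B before A. With `["A", "A", "B", "B"]` both orders agree.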

As of 2021-06-16, we haven't heard of a single bug related to this race. We assume our database locks are far, far slower than message publishes, making this race moot.

Also as of 2021-06-16, the consequences of this race are harmless. Clients may see a glitch or error message, and they can dismiss it by refreshing the page.