What the capture model service does
A capture model is a document that contains structured information about a particular image. When you create a project, a capture model is created from a template defined in the project for each image that is part of the project. The capture model service stores and manages access to these documents.
When a user makes a contribution that has not yet been submitted, only they should see the data they have added to the document. Once that contribution is accepted, it should be shown to all users when they add new annotations. The capture model service performs this filtering and ensures that users only see what they are allowed to see.
A revision (a set of changes to a document) is owned by the user who created it. Typically only this user can make changes to that revision, and it can be safely deleted without affecting other people's work. When a user updates a transcription, for example, a copy of the transcription is taken and then modified by that user. Even when the transcription is accepted, the two distinct transcriptions remain: the original and the revision. Chains of revisions can be created using capture models, with only the latest being shown. The capture model service ensures that these chains are valid, and that users only make changes to revisions and never directly to the document.
Lastly, there are some operations that Administrators or Reviewers can perform on capture models outside of this revision workflow, such as merging two revisions together, marking revisions as complete, editing a revision on behalf of a user, deleting revisions, or directly editing the model - to support features such as pre-segmentation. The capture model service ensures that these operations can only be performed by the correct users, and tries to keep the capture model valid.
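The visibility rules above can be sketched in TypeScript. All the type and function names here are hypothetical, not the real Madoc capture model types - a minimal illustration of the rule under the assumption that each revision carries an author and a status:

```typescript
// Hypothetical sketch of revision visibility: a user sees accepted
// revisions plus their own drafts, but not other users' unsubmitted work.

type RevisionStatus = "draft" | "submitted" | "accepted";

interface Revision {
  id: string;
  authorId: string;
  status: RevisionStatus;
}

interface CaptureModel {
  id: string;
  revisions: Revision[];
}

// Keep accepted revisions for everyone; keep drafts only for their author.
function filterForUser(model: CaptureModel, userId: string): CaptureModel {
  return {
    ...model,
    revisions: model.revisions.filter(
      (r) => r.status === "accepted" || r.authorId === userId
    ),
  };
}

const model: CaptureModel = {
  id: "model-1",
  revisions: [
    { id: "r1", authorId: "alice", status: "accepted" },
    { id: "r2", authorId: "alice", status: "draft" },
    { id: "r3", authorId: "bob", status: "draft" },
  ],
};

// Alice sees the accepted revision and her own draft, but not Bob's draft.
console.log(filterForUser(model, "alice").revisions.map((r) => r.id));
// → [ 'r1', 'r2' ]
```

The key point is that this filtering happens server-side, so an unsubmitted draft never leaves the service for anyone except its author.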
Where the problems are
The most used path for the Model API is the saving of new or existing revisions. The flow looks like this:
```mermaid
sequenceDiagram
    User->>Model API: Get Capture model
    Model API->>Database: Request full model
    Database-->>Model API: Full model
    Model API->>Model API: Filter based on user
    Model API-->>User: Return filtered model
    Note right of User: User now sees only a subset of the model
    User->>User: Create revision based on filtered document
    Note right of User: User makes new changes
    User->>Model API: Create new revision
    Model API->>Database: Request full model
    Database-->>Model API: Full model
    Model API->>Model API: Apply changes to full model
    Model API->>Database: Save new model
    Model API-->>User: Saved revision
```
The bugs that have arisen so far have been from:
- Filtering the model correctly, so users see only what they should see
- Creating a revision from a subset and then saving it back to the original, which sometimes conflicts
- Database integrity issues when removing some fields
- Returning the revision after saving, and keeping the UI consistent
What the fundamental problem is
The current Model API saves each field/entity individually as a row in a SQL database. This makes the data inside the models accessible for querying and allows for good integrity checks to make sure everything remains valid. However, the concept of the capture model is fundamentally document-based.
There is one big document that can be:
- Filtered
- Atomically updated
Over the years, the database driving the Model API has ended up simply reconstructing the full document each time, and then filtering outside of the database (due to query complexity as things evolved). So now we have a complex database that could support filtering and querying - but we are not using it. We get no benefit from breaking the document out into database rows.
For updating, since each field and entity is its own row in the database, we can't simply replace entire slices of the document. Instead we have to go through and "upsert" (insert a new, or update an existing) row for each entity, field and selector. This has made the updating flow in the diagram above very complex. The complexity could be managed and improved if there was a clear benefit, but we are simply not using the individual rows in the database.
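To make the cost concrete, here is a hedged sketch (hypothetical names, not the actual schema) of what flattening a nested document into one row per field looks like - every row produced is a separate upsert on save:

```typescript
// Hypothetical sketch of the row-per-field storage model described above.
// A nested document is flattened into one row per field/entity, and every
// save must upsert each row individually instead of replacing the document.

type Doc = { [key: string]: string | Doc };

interface Row {
  path: string; // e.g. "person.name"
  value: string;
}

function flatten(doc: Doc, prefix = ""): Row[] {
  const rows: Row[] = [];
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "string") {
      rows.push({ path, value });
    } else {
      rows.push(...flatten(value, path));
    }
  }
  return rows;
}

const doc: Doc = {
  transcription: "A page of text",
  person: { name: "Ada", role: "author" },
};

// Each of these rows is a separate upsert on save - miss one and data is lost.
console.log(flatten(doc));
```

A missed or failed upsert on any single row is exactly the class of silent data loss described below.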
The bugs mostly arise from the "upsert" action, either causing an integrity issue or missing insertions completely, where data ends up not being saved. These edge cases stem from the flexibility of the capture model format and limit how we can use it.
Lastly, the code for filtering, saving and modifying the documents lives in a completely different code-base, which makes end-to-end testing - from the code the UI uses to create revisions through to the code that filters and modifies the models - impossible.
How it was fixed
An overview of the changes made:
- Migrated the code back into this repository, instead of it being an external service (for now).
- Replaced the ORM with plain Postgres queries, matching the rest of Madoc.
- Simplified the database to be more document-oriented. There is now a single Postgres field for the FULL model document.
- Added end-to-end tests for the full lifecycle of a model (document -> filter -> save).
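The document-oriented storage can be sketched as follows, using an in-memory map as a stand-in for the single Postgres document field (names are illustrative, not the real schema). Saving becomes one atomic replacement of the whole document instead of many per-row upserts:

```typescript
// Sketch of the document-oriented approach: an in-memory stand-in for the
// single-Postgres-field design, where a save atomically replaces the FULL
// document rather than upserting individual rows.

interface StoredModel {
  id: string;
  document: unknown; // the FULL capture model document, stored as one value
}

class DocumentStore {
  private rows = new Map<string, StoredModel>();

  // Roughly: UPDATE capture_model SET document = $2 WHERE id = $1
  save(id: string, document: unknown): void {
    this.rows.set(id, { id, document });
  }

  load(id: string): unknown {
    return this.rows.get(id)?.document;
  }
}

const store = new DocumentStore();
store.save("model-1", { label: "First draft" });
store.save("model-1", { label: "Revised" }); // whole document replaced
console.log(store.load("model-1")); // → { label: 'Revised' }
```

Because the whole document is written in one statement, there is no partial-write state for the upsert edge cases to hit.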
These changes will enable the following:
- Removing bugs relating to the database saving process
- Testing the whole process together when bugs do arise
- Simplifying complex areas with the safety net of testing
Migration considerations
There are now 2 databases where capture models can exist, and also 2 endpoints:
The existing: `/api/crowdsourcing/...`
The new: `/api/madoc/crowdsourcing/...`
They should be 100% compatible with each other; however, we need to "switch" from one to the other safely.
Migrate on start up
When this version of Madoc starts, it will fetch all the models from the old service and add them to the new one before starting.
- [pro] No code needed to specify which endpoint to hit - all clean
- [con] Will take a LONG time and could result in downtime
- [con] May fail and require a rollback
Migrate from admin
- [pro] Can be done per-site
- [pro] Does not affect existing sites or projects
- [con] Requires 2 steps - migrate, then switch to the new API - which may result in data loss if projects are running
Incremental migration
- [pro] Fully transparent; new endpoints just copy the model if it doesn't exist
- [pro] No migration step or downtime
- [con] Increases request time - slow projects
- [con] Errors are not easily detectable if they arise
- [con] No clear indication of when it's safe to remove the Model API
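A hedged sketch of how the incremental option could work - the new endpoint serves a model if it already holds it, and otherwise copies it from the old service on first read. All names here are hypothetical, not the real Madoc interfaces:

```typescript
// Hypothetical copy-on-read migration: the new store lazily copies each
// model from the old service the first time it is requested.

interface ModelSource {
  get(id: string): string | undefined;
}

class IncrementalStore implements ModelSource {
  constructor(
    private oldService: ModelSource,
    private models = new Map<string, string>()
  ) {}

  get(id: string): string | undefined {
    const existing = this.models.get(id);
    if (existing !== undefined) return existing;
    // Copy-on-read: the first request pays the cost of hitting the old API.
    const legacy = this.oldService.get(id);
    if (legacy !== undefined) this.models.set(id, legacy);
    return legacy;
  }
}

const oldService: ModelSource = {
  get: (id) => (id === "m1" ? '{"label":"legacy model"}' : undefined),
};

const store = new IncrementalStore(oldService);
console.log(store.get("m1")); // fetched from the old service, then cached
console.log(store.get("m1")); // served from the new store
```

The first-read cost and the lack of a "migration finished" signal are exactly the cons listed above.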
Hybrid?
We could also use a hybrid of the above - perhaps Incremental migration combined with Migrate from admin, so you know when it's safe to remove the service.