RFC: Design notes for integration syncing #1189

Closed
joshsmith opened this issue Nov 13, 2017 · 16 comments

Comments

@joshsmith
Contributor

joshsmith commented Nov 13, 2017

Problem

Even with only the GitHub integration, we have been having difficulties syncing up and down with the external service.

  • Writes are coupled to the external API
  • Multiple task and comment records are created
  • Records become easily out of sync

Our basic goals:

  • Eventual consistency for the destination records in our database
  • Eventual consistency with the external APIs
  • Events to/from the external APIs are fully concurrent:
    • enqueued in order
    • dropped when more recent events come in
    • restartable when errored

Each Event can have a:

  • direction - :inbound | :outbound
  • integration_ fields
    • integration_external_id - the id of the integration resource from the external provider
    • integration_updated_at - the last updated at timestamp of the integration resource from the external provider
    • integration_record_id - the id of our cached record for the resource
    • integration_record_type - the type of our cached record for the resource, given as the table name
  • record_ fields
    • record_id - the id of the record for the resource connected to this integration
    • record_type - the type of the record for the resource connected to this integration, given as the table name
  • canceled_by - the id of the Event that canceled this one
  • duplicate_of - the id of the Event that this is a duplicate of
  • ignored_for_id - the id of the record that caused this event to be ignored
  • ignored_for_type - the type of the record (table name) that caused this event to be ignored
  • state - :queued | :processing | :completed | :errored | :canceled | :ignored | :duplicate | :disabled

We may want our own writes to our own records, even without integrations, to also go through this process. Not sure.
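
As a rough sketch in Ecto terms, the fields above might look something like the following. Module, table, and field names here are illustrative only, not a settled design:

```elixir
# Sketch only: a possible Ecto schema for the Event described above.
defmodule CodeCorps.SyncEvent do
  use Ecto.Schema

  schema "sync_events" do
    field :direction, :string                 # "inbound" | "outbound"

    field :integration_external_id, :string   # id of the resource on the external provider
    field :integration_updated_at, :utc_datetime
    field :integration_record_id, :integer    # id of our cached record for the resource
    field :integration_record_type, :string   # table name of that cached record

    field :record_id, :integer                # id of our own record for the resource
    field :record_type, :string               # table name of that record

    field :canceled_by, :integer              # id of the Event that canceled this one
    field :duplicate_of, :integer             # id of the Event this one duplicates
    field :ignored_for_id, :integer           # id of the record that caused the ignore
    field :ignored_for_type, :string          # table name of that record

    # "queued" | "processing" | "completed" | "errored" | "canceled" | "ignored" | "duplicate" | "disabled"
    field :state, :string

    timestamps()
  end
end
```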

When an event comes in we should:

  • check if there is any event for the integration_external_id where:
    • the integration_updated_at is after our event's last updated timestamp (limit 1)
      • if yes, set state to :ignored and stop processing, set ignored_for_id to the id of the event in the limit 1 query and ignored_for_type to this event table's name
    • the integration_updated_at timestamp for the relevant record_ is equal to our event's last updated timestamp (limit 1)
      • if yes, set state to :duplicate and stop processing, set duplicate_of to the id of the event in the limit 1 query
    • the modified_at timestamp for the relevant record_ is after our event's last updated timestamp
      • if yes, set state to :ignored and stop processing, set ignored_for_id to the record_id and ignored_for_type to the record_type
  • check if there are any events for the integration_external_id where:
    • integration_updated_at is before our event's last updated timestamp
      • if yes, set state of those events to :canceled and set canceled_by to the id of this event
  • check if there is any other :queued event or :processing event for the integration_external_id
    • if yes, set state to :queued
  • when :processing, create or update the relevant record matching record_id and record_type through the relationship on the record for integration_record_id and integration_record_type
  • when :completed, kick off a process to look for the next :queued item, i.e. the one with the oldest integration_updated_at timestamp

Within the logic for updating the given record, we would also need to check whether the record's updated timestamp is after the event's timestamp. If it is, then we need to bubble up the changeset validation error and mark the event :ignored as above.
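
To make that flow concrete, here is a rough Elixir sketch of the inbound checks, assuming the SyncEvent sketch above and a standard Ecto Repo. The check against the local record's modified_at is omitted for brevity, and none of these module or function names are settled:

```elixir
defmodule CodeCorps.Sync.EventProcessor do
  @moduledoc """
  Illustrative sketch of the inbound checks described above.
  Module, schema, and field names are assumptions, not a settled design.
  """
  import Ecto.Query
  alias CodeCorps.{Repo, SyncEvent}

  def process(%SyncEvent{} = event) do
    newer = newer_event_for(event)
    duplicate = duplicate_event_for(event)

    cond do
      newer != nil ->
        # A newer event already exists for the same external resource: ignore this one.
        update_event(event, %{state: "ignored", ignored_for_id: newer.id, ignored_for_type: "sync_events"})

      duplicate != nil ->
        # Another event carries the exact same external timestamp: mark this one a duplicate.
        update_event(event, %{state: "duplicate", duplicate_of: duplicate.id})

      true ->
        event
        |> cancel_older_events()
        |> queue_or_process()
    end
  end

  defp newer_event_for(event) do
    same_resource(event)
    |> where([e], e.integration_updated_at > ^event.integration_updated_at)
    |> limit(1)
    |> Repo.one()
  end

  defp duplicate_event_for(event) do
    same_resource(event)
    |> where([e], e.integration_updated_at == ^event.integration_updated_at)
    |> limit(1)
    |> Repo.one()
  end

  defp cancel_older_events(event) do
    # Any older queued work for the same resource is superseded by this event.
    same_resource(event)
    |> where([e], e.integration_updated_at < ^event.integration_updated_at)
    |> Repo.update_all(set: [state: "canceled", canceled_by: event.id])

    event
  end

  defp queue_or_process(event) do
    # If another event for the same resource is queued or processing, wait in line;
    # otherwise move straight to processing.
    busy =
      same_resource(event)
      |> where([e], e.state in ["queued", "processing"])
      |> limit(1)
      |> Repo.one()

    update_event(event, %{state: if(busy, do: "queued", else: "processing")})
  end

  # All checks scope to other events for the same external resource.
  defp same_resource(event) do
    SyncEvent
    |> where([e], e.integration_external_id == ^event.integration_external_id)
    |> where([e], e.id != ^event.id)
  end

  defp update_event(event, attrs) do
    event |> Ecto.Changeset.change(attrs) |> Repo.update()
  end
end
```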

@joshsmith
Contributor Author

joshsmith commented Nov 13, 2017

I'm actually not sure about the polymorphic approach outlined above. The Ecto docs say this is not ideal and suggest instead that we define relationships directly; in this case that would be:

  • github_issue_id
  • github_pull_request_id
  • github_comment_id

Along with:

  • task_id
  • comment_id

We wouldn't be able to do ignored_for without prefixing all of the above with ignored_for_, which seems a little much, but it would be nice to have some sort of audit log of why the event was ignored.

I actually kind of like the above because it's pretty self-documenting that the external_service_x records are the integrations for the internal_x records.
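
For illustration, the earlier sketch rewritten with explicit associations might look like this (the target modules are assumed, not existing code):

```elixir
# Sketch only: explicit belongs_to associations instead of the polymorphic
# record_id/record_type pairs. Target module names are assumptions.
defmodule CodeCorps.SyncEvent do
  use Ecto.Schema

  schema "sync_events" do
    # cached GitHub-side records
    belongs_to :github_issue, CodeCorps.GithubIssue
    belongs_to :github_pull_request, CodeCorps.GithubPullRequest
    belongs_to :github_comment, CodeCorps.GithubComment

    # our internal records
    belongs_to :task, CodeCorps.Task
    belongs_to :comment, CodeCorps.Comment

    timestamps()
  end
end
```

With real foreign keys in place, each of these columns can be indexed and constrained at the database level, which is the main argument for this shape over the polymorphic one.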

@joshsmith
Contributor Author

The :disabled state I added to the list above is intended to serve as a mechanism for events that are created while an integration is no longer syncing, but that can be restarted when syncing starts again. This is not necessarily the best approach to restarting an integration sync, but it's one idea.

@joshsmith
Contributor Author

One thing discovered here is that Task and Comment have certain expectations today that make a many-to-many relationship between them and Project challenging, for example:

  • Task has a task_list relationship that's particular to a project
  • Task has a number scoped to a project
  • Task has a UserTask
  • etc

The design as it stands makes changes relatively challenging.

Perhaps an easier approach is to, for now, remove the possibility of a GithubRepo belonging to many projects and instead restrict connecting a repo to a single project.

This also comes with some design challenges, although they are much less challenging:

  • Our UI will need to be updated to disallow creating a ProjectGithubRepo when one already exists, and to display to the user the project that's currently connected to that repo.
  • We will need to change the project_github_repos has_many on GithubRepo to a has_one.

A GithubAppInstallation should remain connectable to multiple projects. We could also improve the UI by indicating which other projects an installation is connected to, but this might be unnecessary with the UI change above.
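
The has_many → has_one change could be as small as the following sketch (association and module names assumed); a unique index on project_github_repos.github_repo_id would also likely be needed to enforce the one-project restriction at the database level:

```elixir
# Sketch only: restrict a GithubRepo to at most one connected project.
defmodule CodeCorps.GithubRepo do
  use Ecto.Schema

  schema "github_repos" do
    # Previously: has_many :project_github_repos, CodeCorps.ProjectGithubRepo
    # With the restriction above, a repo connects to at most one project:
    has_one :project_github_repo, CodeCorps.ProjectGithubRepo

    timestamps()
  end
end
```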

@joshsmith
Contributor Author

@begedin I've just marked this as a question and assigned it to you so you can give your thoughts here. I don't feel comfortable yet with any specific direction without considering this more fully. My latest comment may be the best place for us to start on the rework, though.

@joshsmith
Contributor Author

I've created #1190 as a first step.

@begedin
Contributor

begedin commented Nov 14, 2017

@joshsmith

One thing I'm concerned about (unless I'm missing something) is that we seem to be making an assumption that, for example, we can safely ignore an event tied to a github_issue_id as long as its timestamp is the same as or older than that of a queued event tied to the same id.

With the example of a GitHub issue, would it not be possible for us to receive edited and labeled events out of order, both of which are valid changes?

I guess we can implicitly discard them, since our sync process takes all the data we get in the payload and uses it for updating regardless of event type. Still, there may be a scenario where we get the two events above out of order and the one we are processing does not have all the latest data.

@begedin
Contributor

begedin commented Nov 14, 2017

Another concern with explicit instead of polymorphic relationships is what happens when we add other integrations: we'll have to expand our schema significantly. I'm not sure how that performs when the expansions are additional relationships to other schemas. I do agree that this is the more correct Ecto approach, though.

@joshsmith
Contributor Author

Yes, I prefer the polymorphism, but José seemed to indicate that it significantly underperforms due to the lack of foreign keys, and this table is likely to have tens of millions of rows.

@joshsmith
Contributor Author

I guess we can implicitly discard them, since our sync process takes all the data we get in the payload and uses it for updating regardless of event type. Still, there may be a scenario where we get the two events above out of order and the one we are processing does not have all the latest data.

This would be overcome by using the webhook event as a trigger to retrieve the information from the external API for the given data.

This will get more complex with events like label and assignment, but for the events that impact the resource directly I think we are safe with that assumption.
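
A sketch of that trigger-then-fetch idea, where the webhook payload is only used to identify the resource and the current state is fetched fresh; the fetch_issue function here is hypothetical, not an existing client call:

```elixir
defmodule CodeCorps.Sync.IssueRefetcher do
  @moduledoc """
  Sketch: treat the webhook only as a trigger and fetch the latest state
  of the resource before syncing. CodeCorps.GitHubAPI.fetch_issue/2 is a
  hypothetical client function, not an existing one.
  """

  alias CodeCorps.GitHubAPI

  # Matches the shape of a GitHub issues webhook payload.
  def handle_webhook(%{"issue" => %{"number" => number}, "repository" => %{"full_name" => repo}}) do
    # Ignore the (possibly stale) payload body; fetch the current state instead.
    case GitHubAPI.fetch_issue(repo, number) do
      {:ok, latest_issue} -> sync(latest_issue)
      {:error, reason} -> {:error, reason}
    end
  end

  # Placeholder for the actual sync logic discussed elsewhere in this issue.
  defp sync(latest_issue), do: {:ok, latest_issue}
end
```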

@joshsmith joshsmith changed the title Design notes for integration syncing RFC: Design notes for integration syncing Jan 1, 2018
@joshsmith joshsmith added the RFC label Jan 1, 2018
@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

NB: For now, the below implementation ties us to GitHub. To do otherwise would result in an explosion of fields and some increased complexity that I currently don't feel is worth the extra effort when we don't even have plans to add other integrations in the near future.

We would have a sync transaction for each type of internal record we want to sync. For example:

  • TaskSyncTransaction
  • CommentSyncTransaction

Every sync transaction record, regardless of type, would have a:

  • direction - :inbound | :outbound
  • github_app_installation_id - the id of the app installation for this sync
  • github_updated_at - the last updated at timestamp for the resource on GitHub
  • canceled_by - the id of the SyncTransaction that canceled this one
  • duplicate_of - the id of the SyncTransaction that this is a duplicate of
  • dropped_for - the id of the SyncTransaction that this was dropped in favor of
  • state
    • :queued - waiting to be processed
    • :processing - currently being processed; limited to one per instance of the synced record, e.g. comment_id
    • :completed - successfully synced
    • :errored - should be paired with a reason for the error
    • :canceled - another transaction supersedes this one, so we should not process it
    • :dropped - this transaction was outdated and was dropped
    • :duplicate - another transaction already existed that matched the timestamp for this one
    • :disabled - we received the transaction but cannot sync it because the repo no longer syncs to a project

Then each type would have type-specific fields. For example, a CommentSyncTransaction (sketched below) would have:

  • comment_id - the id of our comment record
  • github_comment_id - the id of our cached record for the external resource
  • github_comment_external_id - the id of the resource from the external provider (GitHub)
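
Put together, a CommentSyncTransaction sketch might look like this. Module, table, and field names are assumed, and the error_reason field is my addition to carry the reason for :errored:

```elixir
# Sketch only: a possible shape for CommentSyncTransaction.
defmodule CodeCorps.CommentSyncTransaction do
  use Ecto.Schema

  @states ~w(queued processing completed errored canceled dropped duplicate disabled)

  schema "comment_sync_transactions" do
    field :direction, :string            # "inbound" | "outbound"
    field :github_updated_at, :utc_datetime
    field :state, :string                # one of @states
    field :error_reason, :string         # assumed field: only set when state is "errored"

    belongs_to :github_app_installation, CodeCorps.GithubAppInstallation

    # references to other transactions in the same table
    field :canceled_by, :integer         # id of the transaction that canceled this one
    field :duplicate_of, :integer        # id of the transaction this one duplicates
    field :dropped_for, :integer         # id of the transaction this one was dropped in favor of

    # type-specific fields for the comment case
    belongs_to :comment, CodeCorps.Comment               # our comment record
    belongs_to :github_comment, CodeCorps.GithubComment  # our cached copy of the GitHub comment
    field :github_comment_external_id, :integer          # GitHub's own id for the comment

    timestamps()
  end

  def states, do: @states
end
```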

If the event is due to the resource being created, there will not be a conflict. If the resource was created from our own clients, then there is no external GitHub ID yet. The same is true of events coming in from external providers, where there is no internal record yet. I'm not yet clear on whether we should conduct any conflict checking on these event types, but my guess is no; it should likely jump straight to :processing.

When an event comes in from GitHub we should (using a github_comment as our example):

  • delegate to the proper sync transaction table for the particular resource (in our example this would be comment_sync_transactions)
  • check if there are any transactions for the github_comment_external_id where:
    • the github_updated_at is after our transaction's last updated timestamp (limit 1)
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the transaction in the limit 1 query
    • the github_updated_at timestamp for the relevant record_ is equal to our transaction's last updated timestamp (limit 1)
      • if yes, set state to :duplicate and stop processing, set duplicate_of to the id of the transaction in the limit 1 query
    • the modified_at timestamp for the relevant record_ is after our transaction's last updated timestamp
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the transaction in the limit 1 query
  • check if there are any :queued transactions for the github_comment_external_id where:
    • github_updated_at is before our transaction's last updated timestamp
      • if yes, set the state of those transactions to :canceled and set canceled_by to the id of this transaction
  • check if there is any other :queued or :processing transaction for the github_comment_external_id
    • if yes, set state to :queued
  • when :processing, check again to see if we can proceed, then create or update the comment through the relationship on the record for comment_id
  • when :completed, kick off a process to look for the next :queued item, i.e. the one with the oldest github_updated_at timestamp

Within the logic for updating the given record, we would also need to check whether the record's updated timestamp is after the transaction's timestamp. If it is, then we need to bubble up the changeset validation error and mark the transaction as :dropped per the above.
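
And a sketch of that last step, popping the next :queued transaction once one completes, again assuming the sketch schema above:

```elixir
defmodule CodeCorps.Sync.CommentQueue do
  @moduledoc """
  Illustrative sketch: after a transaction completes, find the oldest
  queued transaction for the same GitHub comment and move it to processing.
  Names follow the sketch schema above, not existing code.
  """
  import Ecto.Query
  alias CodeCorps.{CommentSyncTransaction, Repo}

  def pop_next(github_comment_external_id) do
    next =
      CommentSyncTransaction
      |> where([t], t.github_comment_external_id == ^github_comment_external_id)
      |> where([t], t.state == "queued")
      |> order_by([t], asc: t.github_updated_at)
      |> limit(1)
      |> Repo.one()

    case next do
      nil ->
        # Nothing left to sync for this comment.
        :done

      transaction ->
        transaction
        |> Ecto.Changeset.change(%{state: "processing"})
        |> Repo.update()
    end
  end
end
```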

@joshsmith
Contributor Author

I'd like to figure out naming here of the record we're tracking. What is this even a queue of?

My feeling is that it's best described as a SyncTransaction, i.e. it's an inbound or outbound sync transaction.

@joshsmith
Contributor Author

I'm also starting to wonder whether there should be separate queues for the separate record types. This might actually make more sense to me. For example, we'd have a TaskSyncTransaction and a CommentSyncTransaction.

@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

NB: due to my comment below, I'm leaning strongly towards YES here.

Should this be a queue?

  • No. It's possible that the transactions sent to the database will themselves be implicitly time-ordered, avoiding any extra need for us to iterate over our own queue. In other words, we simply reject or process.

  • Yes. We're not just writing to the database; there are some reads that must happen, which may lead to accidental ordering problems and staleness if we don't proceed by first queueing and then processing incrementally across individual synced records. We also want to be able to avoid "thundering herd" problems and allow for rate limiting.

@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

Some upsides of the approaches above that I wanted to document, in no particular order:

  • The tracking above generates some implicit audit trails that will be helpful for debugging.
  • Any unique-per-record queued transactions can be run in parallel without issue, i.e. we can run transactions for %Comment{id: 1} and %Comment{id: 2} without any conflict.
  • We can avoid "thundering herd" problems when the system becomes backed up for any reason by having control over how the queue is popped.
  • We can use this in conjunction with rate limiting to only process the number of events we have in the queue for the given rate limit and defer further processing until after the rate limit has expired (see the sketch below).
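
For example, rate limiting could be as simple as only dequeuing as many transactions as the remaining GitHub budget allows on each pass; the remaining_budget/0 function below is hypothetical and would be backed by GitHub's rate limit headers:

```elixir
defmodule CodeCorps.Sync.RateLimitedBatch do
  @moduledoc """
  Sketch only: dequeue at most the remaining rate-limit budget of queued
  transactions; anything beyond that stays queued until the limit resets.
  remaining_budget/0 is hypothetical.
  """
  import Ecto.Query
  alias CodeCorps.{CommentSyncTransaction, Repo}

  def next_batch do
    budget = remaining_budget()

    CommentSyncTransaction
    |> where([t], t.state == "queued")
    |> order_by([t], asc: t.github_updated_at)
    |> limit(^budget)
    |> Repo.all()
  end

  # Hypothetical placeholder; in practice this would come from GitHub's
  # X-RateLimit-Remaining response header.
  defp remaining_budget, do: 100
end
```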

@begedin
Contributor

begedin commented Jan 12, 2018

Notes from my end

  • I'm ok with tying things to GitHub for now. I'm usually inclined to go with "what we know we need now" instead of "what we might need in the future". That being said, it seems to me that this could easily be generalized by renaming the github and github external fields to "integration" and "integration_external", while also adding an integration type. Of course, again, we can do that when we're sure we need it, since the migration is a matter of renaming a field and adding one new field with the same value for everything in the database. In much the same way, instead of renaming, we could just add more fields for further integrations, so there's no reason to constrain ourselves now.

  • I'm also leaning towards "yes". Having an active queue means we get to load data in batches (X oldest by timestamp, for example) and potentially use other approaches to facilitate dealing with any performance/rate limitation issues.

the modified_at timestamp for the relevant record_ is after our transaction's last updated timestamp

I'm assuming each transaction here will always involve fetching the latest relevant data from GitHub? It seems that way, especially from this rule, but I wanted to make it clear.

Also, it seems to me like we'll have to identify and store the GitHub installation id in the sync transaction record at the time we initially create it. We can't really be sure who to authenticate as otherwise, unless I'm missing something.

@joshsmith
Contributor Author

Closing in favor of #1368
