RFC: Design notes for integration syncing #1189

Closed
joshsmith opened this issue Nov 13, 2017 · 16 comments

Comments

@joshsmith
Contributor

joshsmith commented Nov 13, 2017

Problem

Even with only the GitHub integration, we have been having difficulties syncing up and down with the external service.

  • Writes are coupled to the external API
  • Multiple task and comment records are created
  • Records become easily out of sync

Our basic goals:

  • Eventual consistency for the destination records in our database
  • Eventual consistency with the external APIs
  • Events to/from the external APIs are fully concurrent:
    • enqueued in order
    • dropped when more recent events come in
    • restartable when errored

Each Event can have a:

  • direction - :inbound | :outbound
  • integration_ fields
    • integration_external_id - the id of the integration resource from the external provider
    • integration_updated_at - the last updated at timestamp of the integration resource from the external provider
    • integration_record_id - the id of our cached record for the resource
    • integration_record_type - the type of our cached record for the resource, given as the table name
  • record_ fields
    • record_id - the id of the record for the resource connected to this integration
    • record_type - the type of the record for the resource connected to this integration, given as the table name
  • canceled_by - the id of the Event that canceled this one
  • duplicate_of - the id of the Event that this is a duplicate of
  • ignored_for_id - the id of the record that caused this event to be ignored
  • ignored_for_type - the type of the record (table name) that caused this event to be ignored
  • state - :queued | :processing | :completed | :errored | :canceled | :ignored | :duplicate | :disabled

We may want our own writes to our own records, even without integrations, to also go through this process. Not sure.
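
As a rough sketch in Ecto terms, the fields above might look something like the following. Module, table, and field names here are illustrative only, not a settled design:

```elixir
# Sketch only: a possible Ecto schema for the Event described above.
defmodule CodeCorps.SyncEvent do
  use Ecto.Schema

  schema "sync_events" do
    field :direction, :string                 # "inbound" | "outbound"

    field :integration_external_id, :string   # id of the resource on the external provider
    field :integration_updated_at, :utc_datetime
    field :integration_record_id, :integer    # id of our cached record for the resource
    field :integration_record_type, :string   # table name of that cached record

    field :record_id, :integer                # id of our own record for the resource
    field :record_type, :string               # table name of that record

    field :canceled_by, :integer              # id of the Event that canceled this one
    field :duplicate_of, :integer             # id of the Event this one duplicates
    field :ignored_for_id, :integer           # id of the record that caused the ignore
    field :ignored_for_type, :string          # table name of that record

    # "queued" | "processing" | "completed" | "errored" | "canceled" | "ignored" | "duplicate" | "disabled"
    field :state, :string

    timestamps()
  end
end
```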

When an event comes in we should:

  • check if there is any event for the integration_external_id where:
    • the integration_updated_at is after our event's last updated timestamp (limit 1)
      • if yes, set state to :ignored and stop processing, set ignored_for_id to the id of the event in the limit 1 query and ignored_for_type to this event table's name
    • the integration_updated_at timestamp for the relevant record_ is equal to our event's last updated timestamp (limit 1)
      • if yes, set state to :duplicate and stop processing, set duplicate_of to the id of the event in the limit 1 query
    • the modified_at timestamp for the relevant record_ is after our event's last updated timestamp
      • if yes, set state to :ignored and stop processing, set ignored_for_id to the record_id and ignored_for_type to the record_type
  • check if there are any events for the integration_external_id where:
    • integration_updated_at is before our event's last updated timestamp
      • if yes, set state of those events to :canceled and set canceled_by to the id of this event
  • check if there is any other :queued event or :processing event for the integration_external_id
    • if yes, set state to :queued
  • when :processing, create or update the relevant record matching record_id and record_type through the relationship on the record for integration_record_id and integration_record_type
  • when :completed, kick off a process to look for the next :queued item, i.e. the one with the oldest integration_updated_at timestamp

Within the logic for updating the given record, we would also need to check whether the record's updated timestamp is after the event's timestamp. If it is, then we need to bubble up the changeset validation error and mark the event :ignored as above.
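
To make that flow concrete, here is a rough Elixir sketch of the inbound checks, assuming the SyncEvent sketch above and a standard Ecto Repo. The check against the local record's modified_at is omitted for brevity, and none of these module or function names are settled:

```elixir
defmodule CodeCorps.Sync.EventProcessor do
  @moduledoc """
  Illustrative sketch of the inbound checks described above.
  Module, schema, and field names are assumptions, not a settled design.
  """
  import Ecto.Query
  alias CodeCorps.{Repo, SyncEvent}

  def process(%SyncEvent{} = event) do
    newer = newer_event_for(event)
    duplicate = duplicate_event_for(event)

    cond do
      newer != nil ->
        # A newer event already exists for the same external resource: ignore this one.
        update_event(event, %{state: "ignored", ignored_for_id: newer.id, ignored_for_type: "sync_events"})

      duplicate != nil ->
        # Another event carries the exact same external timestamp: mark this one a duplicate.
        update_event(event, %{state: "duplicate", duplicate_of: duplicate.id})

      true ->
        event
        |> cancel_older_events()
        |> queue_or_process()
    end
  end

  defp newer_event_for(event) do
    same_resource(event)
    |> where([e], e.integration_updated_at > ^event.integration_updated_at)
    |> limit(1)
    |> Repo.one()
  end

  defp duplicate_event_for(event) do
    same_resource(event)
    |> where([e], e.integration_updated_at == ^event.integration_updated_at)
    |> limit(1)
    |> Repo.one()
  end

  defp cancel_older_events(event) do
    # Any older queued work for the same resource is superseded by this event.
    same_resource(event)
    |> where([e], e.integration_updated_at < ^event.integration_updated_at)
    |> Repo.update_all(set: [state: "canceled", canceled_by: event.id])

    event
  end

  defp queue_or_process(event) do
    # If another event for the same resource is queued or processing, wait in line;
    # otherwise move straight to processing.
    busy =
      same_resource(event)
      |> where([e], e.state in ["queued", "processing"])
      |> limit(1)
      |> Repo.one()

    update_event(event, %{state: if(busy, do: "queued", else: "processing")})
  end

  # All checks scope to other events for the same external resource.
  defp same_resource(event) do
    SyncEvent
    |> where([e], e.integration_external_id == ^event.integration_external_id)
    |> where([e], e.id != ^event.id)
  end

  defp update_event(event, attrs) do
    event |> Ecto.Changeset.change(attrs) |> Repo.update()
  end
end
```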

@joshsmith
Contributor Author

joshsmith commented Nov 13, 2017

I'm actually not sure about the polymorphic approach outlined above. The Ecto docs say this is not ideal and suggest instead that we define relationships directly; in this case that would be:

  • github_issue_id
  • github_pull_request_id
  • github_comment_id

Along with:

  • task_id
  • comment_id

We wouldn't be able to do ignored_for without prefixing all of the above with ignored_for_, which seems a little much, but it would be nice to have some sort of audit log of why the event was ignored.

I actually kind of like the above because it's pretty self-documenting that the external_service_x records are the integrations for the internal_x records.
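
For illustration, the earlier sketch rewritten with explicit associations might look like this (the target modules are assumed, not existing code):

```elixir
# Sketch only: explicit belongs_to associations instead of the polymorphic
# record_id/record_type pairs. Target module names are assumptions.
defmodule CodeCorps.SyncEvent do
  use Ecto.Schema

  schema "sync_events" do
    # cached GitHub-side records
    belongs_to :github_issue, CodeCorps.GithubIssue
    belongs_to :github_pull_request, CodeCorps.GithubPullRequest
    belongs_to :github_comment, CodeCorps.GithubComment

    # our internal records
    belongs_to :task, CodeCorps.Task
    belongs_to :comment, CodeCorps.Comment

    timestamps()
  end
end
```

With real foreign keys in place, each of these columns can be indexed and constrained at the database level, which is the main argument for this shape over the polymorphic one.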

@joshsmith
Contributor Author

The :disabled state I added to the list above is intended to serve as a mechanism for events that are created while an integration is no longer syncing, but that can be restarted when syncing starts again. This is not necessarily the best approach to restarting an integration sync, but it's one idea.

@joshsmith
Contributor Author

One thing discovered here is that Task and Comment have certain expectations today that make a many-to-many relationship between them and Project challenging, for example:

  • Task has a task_list relationship that's particular to a project
  • Task has a number scoped to a project
  • Task has a UserTask
  • etc

The design as it stands makes changes relatively challenging.

Perhaps an easier approach is to, for now, remove the possibility of a GithubRepo belonging to many projects and instead restrict connecting a repo to a single project.

This also comes with some design challenges, although they are much less challenging:

  • Our UI will need to be updated to disallow creating a ProjectGithubRepo when one already exists, and to display to the user the project that's currently connected to that repo.
  • We will need to change the project_github_repos has_many on GithubRepo to a has_one.

A GithubAppInstallation should remain connectable to multiple projects. We could also improve the UI by indicating which other projects an installation is connected to, but this might be unnecessary with the UI change above.
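
The has_many → has_one change could be as small as the following sketch (association and module names assumed); a unique index on project_github_repos.github_repo_id would also likely be needed to enforce the one-project restriction at the database level:

```elixir
# Sketch only: restrict a GithubRepo to at most one connected project.
defmodule CodeCorps.GithubRepo do
  use Ecto.Schema

  schema "github_repos" do
    # Previously: has_many :project_github_repos, CodeCorps.ProjectGithubRepo
    # With the restriction above, a repo connects to at most one project:
    has_one :project_github_repo, CodeCorps.ProjectGithubRepo

    timestamps()
  end
end
```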

@joshsmith
Contributor Author

@begedin I've just marked this as a question and assigned it to you so you can give your thoughts here. I don't feel comfortable yet with any specific direction without considering this more fully. My latest comment may be the best place for us to start on the rework, though.

@joshsmith
Contributor Author

I've created #1190 as a first step.

@begedin
Contributor

begedin commented Nov 14, 2017

@joshsmith

One thing I'm concerned about (unless I'm missing something) is that we seem to be making an assumption that, for example, we can safely ignore an event tied to a github_issue_id as long as its timestamp is the same as or older than that of a queued event tied to the same id.

With the example of a GitHub issue, would it not be possible for us to receive edited and labeled events out of order, both of which are valid changes?

I guess we can implicitly discard them, since our sync process takes all the data we get in the payload and uses it for updating regardless of event type. Still, there may be a scenario where we get the two events above out of order and the one we are processing does not have all the latest data.

@begedin
Contributor

begedin commented Nov 14, 2017

Another concern with explicit instead of polymorphic relationships is what happens when we add other integrations: we'll have to expand our schema significantly. I'm not sure how that performs when the expansions are additional relationships to other schemas. I do agree that this is the more correct Ecto approach, though.

@joshsmith
Contributor Author

Yes, I prefer the polymorphism, but José seemed to indicate that it significantly underperforms due to the lack of foreign keys, and this table is likely to have tens of millions of rows.

@joshsmith
Contributor Author

I guess we can implicitly discard them, since our sync process takes all the data we get in the payload and uses it for updating regardless of event type. Still, there may be a scenario where we get the two events above out of order and the one we are processing does not have all the latest data.

This would be overcome by using the webhook event as a trigger to retrieve the information from the external API for the given data.

This will get more complex with events like label and assignment, but for the events that impact the resource directly I think we are safe with that assumption.
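
A sketch of that trigger-then-fetch idea, where the webhook payload is only used to identify the resource and the current state is fetched fresh; the fetch_issue function here is hypothetical, not an existing client call:

```elixir
defmodule CodeCorps.Sync.IssueRefetcher do
  @moduledoc """
  Sketch: treat the webhook only as a trigger and fetch the latest state
  of the resource before syncing. CodeCorps.GitHubAPI.fetch_issue/2 is a
  hypothetical client function, not an existing one.
  """

  alias CodeCorps.GitHubAPI

  # Matches the shape of a GitHub issues webhook payload.
  def handle_webhook(%{"issue" => %{"number" => number}, "repository" => %{"full_name" => repo}}) do
    # Ignore the (possibly stale) payload body; fetch the current state instead.
    case GitHubAPI.fetch_issue(repo, number) do
      {:ok, latest_issue} -> sync(latest_issue)
      {:error, reason} -> {:error, reason}
    end
  end

  # Placeholder for the actual sync logic discussed elsewhere in this issue.
  defp sync(latest_issue), do: {:ok, latest_issue}
end
```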

@joshsmith joshsmith changed the title Design notes for integration syncing RFC: Design notes for integration syncing Jan 1, 2018
@joshsmith joshsmith added the RFC label Jan 1, 2018
@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

NB: For now, the below implementation ties us to GitHub. To do otherwise would result in an explosion of fields and some increased complexity that I currently don't feel is worth the extra effort when we don't even have plans to add other integrations in the near future.

We would have a sync transaction for each type of internal record we want to sync. For example:

  • TaskSyncTransaction
  • CommentSyncTransaction

Every sync transaction record, regardless of type, would have a:

  • direction - :inbound | :outbound
  • github_app_installation_id - the id of the app installation for this sync
  • github_updated_at - the last updated at timestamp for the resource on GitHub
  • canceled_by - the id of the SyncTransaction that canceled this one
  • duplicate_of - the id of the SyncTransaction that this is a duplicate of
  • dropped_for - the id of the SyncTransaction that this was dropped in favor of
  • state
    • :queued - waiting to be processed
    • :processing - currently being processed; limited to one per instance of the synced record, e.g. comment_id
    • :completed - successfully synced
    • :errored - should be paired with a reason for the error
    • :canceled - another transaction supersedes this one, so we should not process it
    • :dropped - this transaction was outdated and was dropped
    • :duplicate - another transaction already existed that matched the timestamp for this one
    • :disabled - we received the transaction but cannot sync it because the repo no longer syncs to a project

Then each type would have type-specific fields. For example, a CommentSyncTransaction (sketched below) would have:

  • comment_id - the id of our comment record
  • github_comment_id - the id of our cached record for the external resource
  • github_comment_external_id - the id of the resource from the external provider (GitHub)
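
Put together, a CommentSyncTransaction sketch might look like this. Module, table, and field names are assumed, and the error_reason field is my addition to carry the reason for :errored:

```elixir
# Sketch only: a possible shape for CommentSyncTransaction.
defmodule CodeCorps.CommentSyncTransaction do
  use Ecto.Schema

  @states ~w(queued processing completed errored canceled dropped duplicate disabled)

  schema "comment_sync_transactions" do
    field :direction, :string            # "inbound" | "outbound"
    field :github_updated_at, :utc_datetime
    field :state, :string                # one of @states
    field :error_reason, :string         # assumed field: only set when state is "errored"

    belongs_to :github_app_installation, CodeCorps.GithubAppInstallation

    # references to other transactions in the same table
    field :canceled_by, :integer         # id of the transaction that canceled this one
    field :duplicate_of, :integer        # id of the transaction this one duplicates
    field :dropped_for, :integer         # id of the transaction this one was dropped in favor of

    # type-specific fields for the comment case
    belongs_to :comment, CodeCorps.Comment               # our comment record
    belongs_to :github_comment, CodeCorps.GithubComment  # our cached copy of the GitHub comment
    field :github_comment_external_id, :integer          # GitHub's own id for the comment

    timestamps()
  end

  def states, do: @states
end
```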

If the event is due to the resource being created, there will not be a conflict. If the resource was created from our own clients, then there is no external GitHub ID yet. The same is true of events coming in from external providers, where there is no internal record yet. I'm not yet clear on whether we should conduct any conflict checking on these event types, but my guess is no; it should likely jump straight to :processing.

When an event comes in from GitHub we should (using a github_comment as our example):

  • delegate to the proper sync transaction table for the particular resource (in our example this would be comment_sync_transactions)
  • check if there are any transactions for the github_comment_external_id where:
    • the github_updated_at is after our transaction's last updated timestamp (limit 1)
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the transaction in the limit 1 query
    • the github_updated_at timestamp for the relevant record_ is equal to our transaction's last updated timestamp (limit 1)
      • if yes, set state to :duplicate and stop processing, set duplicate_of to the id of the transaction in the limit 1 query
    • the modified_at timestamp for the relevant record_ is after our transaction's last updated timestamp
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the transaction in the limit 1 query
  • check if there are any :queued transactions for the github_comment_external_id where:
    • github_updated_at is before our transaction's last updated timestamp
      • if yes, set the state of those transactions to :canceled and set canceled_by to the id of this transaction
  • check if there is any other :queued or :processing transaction for the github_comment_external_id
    • if yes, set state to :queued
  • when :processing, check again to see if we can proceed, then create or update the comment through the relationship on the record for comment_id
  • when :completed, kick off a process to look for the next :queued item, i.e. the one with the oldest github_updated_at timestamp

Within the logic for updating the given record, we would also need to check whether the record's updated timestamp is after the transaction's timestamp. If it is, then we need to bubble up the changeset validation error and mark the transaction as :dropped per the above.
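
And a sketch of that last step, popping the next :queued transaction once one completes, again assuming the sketch schema above:

```elixir
defmodule CodeCorps.Sync.CommentQueue do
  @moduledoc """
  Illustrative sketch: after a transaction completes, find the oldest
  queued transaction for the same GitHub comment and move it to processing.
  Names follow the sketch schema above, not existing code.
  """
  import Ecto.Query
  alias CodeCorps.{CommentSyncTransaction, Repo}

  def pop_next(github_comment_external_id) do
    next =
      CommentSyncTransaction
      |> where([t], t.github_comment_external_id == ^github_comment_external_id)
      |> where([t], t.state == "queued")
      |> order_by([t], asc: t.github_updated_at)
      |> limit(1)
      |> Repo.one()

    case next do
      nil ->
        # Nothing left to sync for this comment.
        :done

      transaction ->
        transaction
        |> Ecto.Changeset.change(%{state: "processing"})
        |> Repo.update()
    end
  end
end
```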

@joshsmith
Contributor Author

I'd like to figure out naming here of the record we're tracking. What is this even a queue of?

My feeling is that it's best described as a SyncTransaction, i.e. it's an inbound or outbound sync transaction.

@joshsmith
Contributor Author

I'm also starting to wonder whether there should be separate queues for the separate record types. This might actually make more sense to me. For example, we'd have a TaskSyncTransaction and a CommentSyncTransaction.

@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

NB: due to my comment below, I'm leaning strongly towards YES here.

Should this be a queue?

  • No. It's possible that the transactions sent to the database will themselves be implicitly time-ordered, avoiding any extra need for us to iterate over our own queue. In other words, we simply reject or process.

  • Yes. We're not just writing to the database; there are some reads that must happen, which may lead to accidental ordering problems and staleness if we don't proceed by first queueing and then processing incrementally across individual synced records. We also want to be able to avoid "thundering herd" problems and allow for rate limiting.

@joshsmith
Contributor Author

joshsmith commented Jan 12, 2018

Some upsides of the approaches above that I wanted to document, in no particular order:

  • The tracking above generates some implicit audit trails that will be helpful for debugging.
  • Any unique-per-record queued transactions can be run in parallel without issue, i.e. we can run transactions for %Comment{id: 1} and %Comment{id: 2} without any conflict.
  • We can avoid "thundering herd" problems when the system becomes backed up for any reason by having control over how the queue is popped.
  • We can use this in conjunction with rate limiting to only process the number of events we have in the queue for the given rate limit and defer further processing until after the rate limit has expired (see the sketch below).
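
For example, rate limiting could be as simple as only dequeuing as many transactions as the remaining GitHub budget allows on each pass; the remaining_budget/0 function below is hypothetical and would be backed by GitHub's rate limit headers:

```elixir
defmodule CodeCorps.Sync.RateLimitedBatch do
  @moduledoc """
  Sketch only: dequeue at most the remaining rate-limit budget of queued
  transactions; anything beyond that stays queued until the limit resets.
  remaining_budget/0 is hypothetical.
  """
  import Ecto.Query
  alias CodeCorps.{CommentSyncTransaction, Repo}

  def next_batch do
    budget = remaining_budget()

    CommentSyncTransaction
    |> where([t], t.state == "queued")
    |> order_by([t], asc: t.github_updated_at)
    |> limit(^budget)
    |> Repo.all()
  end

  # Hypothetical placeholder; in practice this would come from GitHub's
  # X-RateLimit-Remaining response header.
  defp remaining_budget, do: 100
end
```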

@begedin
Contributor

begedin commented Jan 12, 2018

Notes from my end

  • I'm ok with tying things to GitHub for now. I'm usually inclined to go with "what we know we need now" instead of "what we might need in the future". That being said, it seems to me that this could easily be generalized by renaming the github and github external fields to "integration" and "integration_external", while also adding an integration type. Of course, again, we can do that when we're sure we need it, since the migration is a matter of renaming a field and adding one new field with the same value for everything in the database. In much the same way, instead of renaming, we could just add more fields for further integrations, so there's no reason to constrain ourselves now.

  • I'm also leaning towards "yes". Having an active queue means we get to load data in batches (X oldest by timestamp, for example) and potentially use other approaches to facilitate dealing with any performance/rate limitation issues.

the modified_at timestamp for the relevant record_ is after our transaction's last updated timestamp

I'm assuming each transaction here will always involve fetching the latest relevant data from GitHub? It seems that way, especially from this rule, but I wanted to make it clear.

Also, it seems to me like we'll have to identify and store the GitHub installation id in the sync transaction record at the time we initially create it. We can't really be sure who to authenticate as otherwise, unless I'm missing something.

@joshsmith
Contributor Author

Closing in favor of #1368
