Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

Merged
merged 4 commits into from
Dec 4, 2023

Conversation

marctalbott
Copy link
Contributor

@marctalbott marctalbott commented Dec 1, 2023

Ticket: WOR-161
In some cases, the asynchronous file transfer process never succeeds. This clogs up the logs and results in Rawls performing unnecessary work. This PR introduces some guardrails to prevent Rawls from endlessly retrying.

  • Add columns to CLONE_WORKSPACE_FILE_TRANSFER
    • created tracks when a workspace is cloned and the file transfer attempts begin. If a file transfer is still pending after 1 day, it will be marked as failed.
    • finished tracks when a workspace file transfer has completed.
    • outcome indicates whether a workspace file transfer succeeded or failed.
  • Completed file transfers are no longer removed from the DB. This has the added benefit of allowing us to track a workspace's cloning lineage going forward.
  • A couple of small changes to existing tests to clean up their logs and make it easier to identify real failures

PR checklist

  • Include the JIRA issue number in the PR description and title
  • Make sure Swagger is updated if API changes
    • ...and Orchestration's Swagger too!
  • If you changed anything in model/, then you should publish a new official rawls-model and update rawls-model in Orchestration's dependencies.
  • Get two thumbsworth of PR review
  • Verify all tests go green, including CI tests
  • Squash commits and merge to develop (branches are automatically deleted after merging)
  • Inform other teams of any substantial changes via Slack and/or email

@@ -423,7 +427,7 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)
destinationBucketName,
goodObjectToCopy.getName,
Option(destWorkspace.googleProjectId)
)
)(system.dispatchers.defaultGlobalDispatcher)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing a NPE since the call wasn't properly mocked. The test still succeeded because the verify only checks that the method was called 5+ times, not that the method succeeded on those calls.

@@ -213,7 +214,8 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)

val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
val failureMessage = "because I feel like it"
val exception = new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).build
val exception =
new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).setMessage(failureMessage).build
Copy link
Contributor Author

@marctalbott marctalbott Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added .setMessage to this exceptions and others in the tests since they were showing up in the logs with null messages and I was initially concerned something was wrong. The message makes it more clear that these exceptions are expected.

Before:

com.google.api.client.http.HttpResponseException: null
	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	...

After:

com.google.api.client.http.HttpResponseException: because I feel like it
	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make the exception text more meaningful, like "expected test exception"?

@marctalbott marctalbott requested review from a team, cahrens and blakery and removed request for a team December 1, 2023 16:36
Future.successful(List.empty)
}

_ <- markTransferAsComplete(pendingTransfer, transferSucceeded = !transferExpired)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So copyBucketFiles doesn't complete until the transfer is complete? But somehow we get to the point where we can kill off a transfer that is taking too long….

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses cases where a source workspace's files can't be copied to a destination workspace for longer than a day due to persistent errors. When that happens, the monitor currently tries to clone the files nonstop and fills up our logs unnecessarily.

It doesn't attempt to protect against long running copy operations while transferring large buckets. While it would be nice to have those protections, the solution is likely to switch to STS which would be more involved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so copyBucketFIles would throw an exception at some point, and then the monitor kicks off this code again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. And if copyBucketFiles continues to throw exceptions for over a day for a given workspace, this will make it so the monitor stops trying to transfer the files for that workspace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is to stop retries, not to interrupt an in-progress operation?
And then if it fails, this entire function will return a failed future (it won't even get to markTransferAsComplete).

)

val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
val failureMessage = "because I feel like it"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not again!!

Future.successful(List.empty)
}

_ <- markTransferAsComplete(pendingTransfer, transferSucceeded = !transferExpired)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is to stop retries, not to interrupt an in-progress operation?
And then if it fails, this entire function will return a failed future (it won't even get to markTransferAsComplete).

@marctalbott marctalbott merged commit dcd8c2a into develop Dec 4, 2023
12 checks passed
@marctalbott marctalbott deleted the mtalbott-clone-limit branch December 4, 2023 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants