[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644
@@ -423,7 +427,7 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)
 destinationBucketName,
 goodObjectToCopy.getName,
 Option(destWorkspace.googleProjectId)
-)
+)(system.dispatchers.defaultGlobalDispatcher)
This was causing an NPE because the call wasn't properly mocked. The test still succeeded because the verify only checks that the method was called 5+ times, not that the method succeeded on those calls.
@@ -213,7 +214,8 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)

 val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
 val failureMessage = "because I feel like it"
-val exception = new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).build
+val exception =
+new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).setMessage(failureMessage).build
Added .setMessage to this exception and others in the tests, since they were showing up in the logs with null messages and I was initially concerned something was wrong. The message makes it clearer that these exceptions are expected.
Before:
com.google.api.client.http.HttpResponseException: null
at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
...
After:
com.google.api.client.http.HttpResponseException: because I feel like it
at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
...
Perhaps make the exception text more meaningful, like "expected test exception"?
Future.successful(List.empty)
}

_ <- markTransferAsComplete(pendingTransfer, transferSucceeded = !transferExpired)
So copyBucketFiles doesn't complete until the transfer is complete? But somehow we get to the point where we can kill off a transfer that is taking too long...
This addresses cases where a source workspace's files can't be copied to a destination workspace for longer than a day due to persistent errors. When that happens, the monitor currently retries the clone nonstop and fills up our logs unnecessarily.
It doesn't attempt to protect against long-running copy operations while transferring large buckets. While it would be nice to have those protections, the likely solution is to switch to STS, which would be more involved.
Oh, so copyBucketFiles would throw an exception at some point, and then the monitor kicks off this code again?
Yes, that's correct. And if copyBucketFiles continues to throw exceptions for over a day for a given workspace, this change makes the monitor stop trying to transfer the files for that workspace.
I think the idea is to stop retries, not to interrupt an in-progress operation?
And then if it fails, this entire function will return a failed future (it won't even get to markTransferAsComplete).
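The behavior discussed in this thread can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: PendingTransfer, MaxAge, isExpired and attemptTransfer are hypothetical names, and copyBucketFiles and markTransferAsComplete are passed in as plain functions rather than being the real Rawls methods. It assumes only what the diff and description show: an expired transfer skips the copy and is marked as not succeeded, while a failed copy fails the whole future before the mark step runs.

```scala
import java.time.{Duration => JDuration, Instant}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for a pending transfer row; `created` mirrors the column named in the PR description.
final case class PendingTransfer(workspace: String, created: Instant)

object TransferExpirySketch {
  // Assumption: the cutoff is one day, as stated in the PR description.
  val MaxAge: JDuration = JDuration.ofDays(1)

  // A transfer counts as expired once it has been pending for longer than MaxAge.
  def isExpired(transfer: PendingTransfer, now: Instant): Boolean =
    JDuration.between(transfer.created, now).compareTo(MaxAge) > 0

  // One monitor pass over a single pending transfer.
  // - Expired: skip the copy and mark the transfer as finished but not succeeded.
  // - Not expired, copy succeeds: mark the transfer as succeeded.
  // - Not expired, copy fails: the returned Future fails and the mark step never runs,
  //   so the transfer stays pending and is picked up again on a later pass.
  def attemptTransfer(
      transfer: PendingTransfer,
      now: Instant,
      copyBucketFiles: PendingTransfer => Future[List[String]],
      markTransferAsComplete: (PendingTransfer, Boolean) => Future[Unit]
  )(implicit ec: ExecutionContext): Future[List[String]] = {
    val transferExpired = isExpired(transfer, now)
    for {
      copied <- if (transferExpired) Future.successful(List.empty[String])
                else copyBucketFiles(transfer)
      _ <- markTransferAsComplete(transfer, !transferExpired)
    } yield copied
  }
}
```

Nothing in this sketch interrupts a copy that is already running; the expiry check only decides whether another attempt is started, which matches the "stop retries" reading above.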
)

val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
val failureMessage = "because I feel like it"
No, not again!!
Ticket: WOR-161
In some cases, the asynchronous file transfer process never succeeds. This clogs up the logs and results in Rawls performing unnecessary work. This PR introduces some guardrails to prevent Rawls from endlessly retrying.
CLONE_WORKSPACE_FILE_TRANSFER
created tracks when a workspace is cloned and the file transfer attempts begin. If a file transfer is still pending after 1 day, it will be marked as failed.
finished tracks when a workspace file transfer has completed.
outcome indicates whether a workspace file transfer succeeded or failed.
PR checklist
If you changed anything in model/, then you should publish a new official rawls-model and update rawls-model in Orchestration's dependencies.