[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

marctalbott · 2023-12-01T14:48:35Z

Ticket: WOR-161
In some cases, the asynchronous file transfer process never succeeds. This clogs up the logs and results in Rawls performing unnecessary work. This PR introduces some guardrails to prevent Rawls from endlessly retrying.

Add columns to CLONE_WORKSPACE_FILE_TRANSFER
- created tracks when a workspace is cloned and the file transfer attempts begin. If a file transfer is still pending after 1 day, it will be marked as failed.
- finished tracks when a workspace file transfer has completed.
- outcome indicates whether a workspace file transfer succeeded or failed.
Completed file transfers are no longer removed from the DB. This has the added benefit of allowing us to track a workspace's cloning lineage going forward.
A couple of small changes to existing tests to clean up their logs and make it easier to identify real failures

PR checklist

Include the JIRA issue number in the PR description and title
Make sure Swagger is updated if API changes
- ...and Orchestration's Swagger too!
If you changed anything in model/, then you should publish a new official rawls-model and update rawls-model in Orchestration's dependencies.
Get two thumbsworth of PR review
Verify all tests go green, including CI tests
Squash commits and merge to develop (branches are automatically deleted after merging)
Inform other teams of any substantial changes via Slack and/or email

marctalbott · 2023-12-01T14:50:06Z

...test/scala/org/broadinstitute/dsde/rawls/monitor/CloneWorkspaceFileTransferMonitorSpec.scala

@@ -423,7 +427,7 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)
                            destinationBucketName,
                            goodObjectToCopy.getName,
                            Option(destWorkspace.googleProjectId)
-        )
+        )(system.dispatchers.defaultGlobalDispatcher)


This was causing a NPE since the call wasn't properly mocked. The test still succeeded because the verify only checks that the method was called 5+ times, not that the method succeeded on those calls.

marctalbott · 2023-12-01T14:53:15Z

...test/scala/org/broadinstitute/dsde/rawls/monitor/CloneWorkspaceFileTransferMonitorSpec.scala

@@ -213,7 +214,8 @@ class CloneWorkspaceFileTransferMonitorSpec(_system: ActorSystem)

      val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
      val failureMessage = "because I feel like it"
-      val exception = new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).build
+      val exception =
+        new HttpResponseException.Builder(403, failureMessage, new HttpHeaders()).setMessage(failureMessage).build


Added .setMessage to this exceptions and others in the tests since they were showing up in the logs with null messages and I was initially concerned something was wrong. The message makes it more clear that these exceptions are expected.

Before:

com.google.api.client.http.HttpResponseException: null at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) ...

After:

com.google.api.client.http.HttpResponseException: because I feel like it at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) ...

Perhaps make the exception text more meaningful, like "expected test exception"?

cahrens · 2023-12-01T17:09:57Z

...src/main/scala/org/broadinstitute/dsde/rawls/monitor/CloneWorkspaceFileTransferMonitor.scala

+          Future.successful(List.empty)
+        }
+
+      _ <- markTransferAsComplete(pendingTransfer, transferSucceeded = !transferExpired)


So copyBucketFiles doesn't complete until the transfer is complete? But somehow we get to the point where we can kill off a transfer that is taking too long….

This addresses cases where a source workspace's files can't be copied to a destination workspace for longer than a day due to persistent errors. When that happens, the monitor currently tries to clone the files nonstop and fills up our logs unnecessarily.

It doesn't attempt to protect against long running copy operations while transferring large buckets. While it would be nice to have those protections, the solution is likely to switch to STS which would be more involved.

Oh, so copyBucketFIles would throw an exception at some point, and then the monitor kicks off this code again?

Yes, that's correct. And if copyBucketFiles continues to throw exceptions for over a day for a given workspace, this will make it so the monitor stops trying to transfer the files for that workspace

I think the idea is to stop retries, not to interrupt an in-progress operation?
And then if it fails, this entire function will return a failed future (it won't even get to markTransferAsComplete).

cahrens · 2023-12-01T17:13:25Z

...test/scala/org/broadinstitute/dsde/rawls/monitor/CloneWorkspaceFileTransferMonitorSpec.scala

+      )
+
+      val mockGcsDAO = mock[GoogleServicesDAO](RETURNS_SMART_NULLS)
+      val failureMessage = "because I feel like it"


No, not again!!

blakery · 2023-12-01T20:14:59Z

...src/main/scala/org/broadinstitute/dsde/rawls/monitor/CloneWorkspaceFileTransferMonitor.scala

+          Future.successful(List.empty)
+        }
+
+      _ <- markTransferAsComplete(pendingTransfer, transferSucceeded = !transferExpired)


I think the idea is to stop retries, not to interrupt an in-progress operation?
And then if it fails, this entire function will return a failed future (it won't even get to markTransferAsComplete).

marctalbott added 2 commits December 1, 2023 08:52

eventually stop trying to copy files over when cloning workspace

9a91616

refactor, get rid of nulls in test logs

a97e115

marctalbott commented Dec 1, 2023

View reviewed changes

swagger

5efcb04

marctalbott requested review from a team, cahrens and blakery and removed request for a team December 1, 2023 16:36

cahrens reviewed Dec 1, 2023

View reviewed changes

blakery approved these changes Dec 1, 2023

View reviewed changes

make expected exceptions in test clearer

7390146

cahrens approved these changes Dec 4, 2023

View reviewed changes

marctalbott merged commit dcd8c2a into develop Dec 4, 2023
12 checks passed

marctalbott deleted the mtalbott-clone-limit branch December 4, 2023 21:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

marctalbott commented Dec 1, 2023 •

edited

Loading

marctalbott Dec 1, 2023

marctalbott Dec 1, 2023 •

edited

Loading

cahrens Dec 1, 2023

cahrens Dec 1, 2023

marctalbott Dec 1, 2023

cahrens Dec 1, 2023

marctalbott Dec 1, 2023

blakery Dec 1, 2023

cahrens Dec 1, 2023

blakery Dec 1, 2023

[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

[WOR-161] Stop trying to clone workspace files if it hasn't succeeded within a day #2644

Conversation

marctalbott commented Dec 1, 2023 • edited Loading

Choose a reason for hiding this comment

marctalbott Dec 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marctalbott commented Dec 1, 2023 •

edited

Loading

marctalbott Dec 1, 2023 •

edited

Loading