
Add Terraform configuration for AWS Batch #34

Merged: 4 commits merged into develop from feature/rb/batch on May 21, 2022

Conversation

Contributor (@rbreslow) commented May 13, 2022

Overview

I made the following changes to support running the ETL using AWS Batch:

  • Create Batch resources like a job queue and compute environment.
  • Create a job definition for the ETL; thread through requisite environment variables.
  • Draft IAM resources that allow the ETL to read objects from S3.

I also added documentation on how to perform a deployment, and a CLI flag that allows us to start submitting Batch jobs so we can get rolling before #8.

Resolves #6
Resolves #18

Checklist

  • Squashed any fixup! commits
  • Updated README.md to reflect any changes

Testing Instructions

Make sure your AWS SSO session is current:

$ export AWS_PROFILE=chopd3bprod
$ aws sso login

Launch an instance of the included Terraform container image:

$ docker-compose -f docker-compose.ci.yml run --rm terraform
bash-5.1#

Once inside the container, set GIT_COMMIT to 5c27460 (I already published this tag):

bash-5.1# export GIT_COMMIT=5c27460

Use infra to generate and apply a Terraform plan:

bash-5.1# ./scripts/infra plan
bash-5.1# ./scripts/infra apply

Follow the new deployment README's instructions to execute database migrations by submitting a Batch job.

Note: While you are following these instructions, observe Batch adjust the desired vCPUs of the compute environment to match the queued database migration job requirements. The job status will stay in RUNNABLE until the underlying ECS cluster can win a spot bid and spin up a container instance to take the job. To start, I made the maximum spot bid 90% of on-demand pricing, but we can tune this down later on since the workloads aren't time-sensitive.

Next, launch a shell within the Python application container image:

$ ./scripts/console

And finally, use the CLI to get the UUID of one unprocessed Orthanc study and kick off a Batch job:

root@cf845f70ec0f:/usr/local/src# image-deid-etl check --limit 1 --raw | xargs image-deid-etl run --batch
Job started! View here:
https://console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/94de3455-24eb-4bcd-b868-6af5b74286b7

Note: Unfortunately, it's more likely than not that the job will fail. Most of the failures are related to a new batch of studies in Orthanc. @afamiliar resolved this in #29; we just haven't merged the fix yet. However, take a look at the detail view for the Batch job and the logs in CloudWatch. You'll see that the deployment is working 😎.

@rbreslow self-assigned this on May 13, 2022
@rbreslow marked this pull request as draft on May 13, 2022 18:34
@alubneuski self-requested a review on May 13, 2022 18:47
}

resource "aws_batch_compute_environment" "default" {
compute_environment_name_prefix = "batch${local.short}-"

Contributor Author:

The compute_environment_name_prefix combined with the lifecycle meta-argument allows you to change the CE without Terraform getting stuck.

Terraform will detach the deposed CE from the job queue, and queued jobs will hang in the RUNNABLE status. Then, once the new CE is attached to the job queue, jobs will start flowing again.
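
For reference, a minimal sketch of the pattern being described (the lifecycle meta-argument is presumably create_before_destroy; the other arguments are elided here):

resource "aws_batch_compute_environment" "default" {
  compute_environment_name_prefix = "batch${local.short}-"

  # compute_resources, service_role, type, etc. elided

  lifecycle {
    # Create the replacement CE (with a freshly generated name) before
    # destroying the deposed one, so Terraform never has to delete a CE
    # that is still attached to the job queue.
    create_before_destroy = true
  }
}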

Comment on lines +42 to +44
type                = "SPOT"
allocation_strategy = var.batch_spot_fleet_allocation_strategy
bid_percentage      = var.batch_spot_fleet_bid_percentage

Contributor Author:

I've been using the capacity-optimized allocation strategy since it was introduced in 2019.

At a high level, the capacity-optimized strategy picks instance types from the Spot capacity pools with the deepest available capacity. That can mean a larger instance type than the workload strictly needs, but it makes Spot interruptions less likely and keeps jobs from sitting in the queue.

Since we supply this via an input variable, we can make adjustments by modifying the terraform.tfvars file.
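
As a sketch of what that looks like (the variable names come from the diff above; the types and defaults shown here are illustrative):

variable "batch_spot_fleet_allocation_strategy" {
  type    = string
  default = "SPOT_CAPACITY_OPTIMIZED"
}

variable "batch_spot_fleet_bid_percentage" {
  type    = number
  default = 90
}

# terraform.tfvars override, e.g. to lower the maximum Spot bid later:
# batch_spot_fleet_bid_percentage = 60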

Comment on lines +46 to +48
ec2_configuration {
  image_type = "ECS_AL2"
}

Contributor Author:

This tells Batch to use the latest version of the ECS-optimized Amazon Linux 2 AMI.

Comment on lines +1 to +18
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Manually mount unformatted instance store volumes. Mounting in a cloud-boothook
# makes it more likely the drive is mounted before the Docker daemon and ECS agent
# start, which helps mitigate potential race conditions.
#
# See:
# - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html#bootstrap_docker_daemon
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-ami-basics.html#supported-user-data-formats
mkfs.ext4 -E nodiscard /dev/nvme1n1
mkdir -p /media/ephemeral0
mount -t ext4 -o defaults,nofail,discard /dev/nvme1n1 /media/ephemeral0

--==BOUNDARY==

Contributor Author:

We're using the c5d, m5d, and z1d instance types which all have NVMe SSD ephemeral storage. These instances allow for super-fast ephemeral (temporary) I/O, which is great for staging files in the ETL and performing many reads/writes (e.g., the conversion process).

The instance store volumes are unformatted, so you need to initialize them whenever a container instance comes online.

Unfortunately, we can't take advantage of ephemeral storage just yet. As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).

Member:

> As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).

Is there an issue open for this mechanism yet?

Contributor Author:

Thank you for prompting me. I should've opened an issue when I encountered this. I opened #35.

Comment on lines +17 to +25
resource "aws_iam_role" "batch_service_role" {
  name_prefix        = "batch${local.short}ServiceRole-"
  assume_role_policy = data.aws_iam_policy_document.batch_assume_role.json
}

resource "aws_iam_role_policy_attachment" "batch_service_role_policy" {
  role       = aws_iam_role.batch_service_role.name
  policy_arn = var.aws_batch_service_role_policy_arn
}

Contributor Author:

This IAM role allows the Batch service itself to do things like manage ECS clusters and send logs to CloudWatch.
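
For context, the trust policy referenced above is presumably the standard Batch service principal, with var.aws_batch_service_role_policy_arn pointing at the AWSBatchServiceRole managed policy; a sketch:

data "aws_iam_policy_document" "batch_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["batch.amazonaws.com"]
    }
  }
}

# e.g., var.aws_batch_service_role_policy_arn would default to
# "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"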

Comment on lines +43 to +51
resource "aws_iam_role" "spot_fleet_service_role" {
  name_prefix        = "fleet${local.short}ServiceRole-"
  assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume_role.json
}

resource "aws_iam_role_policy_attachment" "spot_fleet_service_role_policy" {
  role       = aws_iam_role.spot_fleet_service_role.name
  policy_arn = var.aws_spot_fleet_service_role_policy_arn
}

Contributor Author:

This IAM role allows Batch to request Spot instances.

Comment on lines +69 to +82
resource "aws_iam_role" "ecs_instance_role" {
  name_prefix        = "ecs${local.short}InstanceRole-"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

resource "aws_iam_role_policy_attachment" "ec2_service_role_policy" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = var.aws_ec2_service_role_policy_arn
}

resource "aws_iam_instance_profile" "ecs_instance_role" {
  name = aws_iam_role.ecs_instance_role.name
  role = aws_iam_role.ecs_instance_role.name
}

Contributor Author:

This IAM role is attached to the Batch container instances. Any policy attached to this role will be accessible to the ETL.

Comment on lines +118 to +122
resource "aws_iam_role_policy" "scoped_etl_read" {
  name_prefix = "S3ScopedEtlReadPolicy-"
  role        = aws_iam_role.ecs_instance_role.id
  policy      = data.aws_iam_policy_document.scoped_etl_read.json
}

Contributor Author:

This inline IAM policy allows the Batch jobs to decrypt/read files in our PHI S3 bucket.
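
A minimal sketch of what the referenced policy document could look like (the bucket and KMS key variables are hypothetical placeholders, not the actual resource names):

data "aws_iam_policy_document" "scoped_etl_read" {
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.phi_bucket_name}"]
  }

  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::${var.phi_bucket_name}/*"]
  }

  statement {
    effect    = "Allow"
    actions   = ["kms:Decrypt"]
    resources = [var.phi_bucket_kms_key_arn]
  }
}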

Comment on lines 3 to 4
"vcpus": 1,
"memory": 1024,

Contributor Author:

I used mprof to benchmark the ETL (including forked child processes) and generate a memory usage report for a single study (memory usage plot omitted).

Since our slice of the pipeline is single-threaded (not sure about the architecture of dcm2niix or the Flywheel CLI), 1 vCPU seems like a sensible starting point. And, since the study peaked at roughly 800 MB of memory usage, I think 1024 MB makes sense. It's worth noting that these values are vCPU/memory reservations; most of the time, the instance types that Batch chooses will be oversized.

If any of the jobs go over their initial reservation, we can identify this via CloudWatch metrics, and adjust resource reservations accordingly.

Member:

I really like how well thought out this analysis is. That being said, I think we should have the vcpus/memory in terraform variables so if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify it without making any code changes.

Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?

Contributor Author (@rbreslow, May 16, 2022):

> That being said, I think we should have the vcpus/memory in terraform variables so if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify it without making any code changes.

Good idea. Resolved in 5d1058d.

> Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?

Those constraints are mainly just for Fargate. With an EC2-backed ECS cluster, you can set pretty much any value for vCPUs and memory. Also, since these values are validated by the AWS API at apply time, I think it's alright to omit type constraints.
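
A sketch of the variable shape this implies (names hypothetical; per the note above, type constraints are omitted and the AWS API validates the values):

variable "batch_etl_job_vcpus" {
  default = 1
}

variable "batch_etl_job_memory" {
  # MiB reserved for the job container
  default = 1024
}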

Member:

Oh, nice. Ok, then this gets the 👍 from me!

Comment on lines +299 to +309
parser_run.add_argument(
    "--batch",
    action="store_true",
    help="skip local processing and submit job(s) to AWS Batch",
)

Contributor Author:

I added this flag as a quick way to process studies on Batch. The more "robust" way that @afamiliar and I settled on is captured in #8 but will take more time to develop.

I realized that running image-deid-etl check on your host compares Orthanc studies against your local database, while invocations of the ETL on Batch mark Orthanc studies as processed on the RDS instance. So, unfortunately, a local image-deid-etl check will always be out of sync with the RDS instance.

I think this is OK for the first batch of studies since both databases would be empty, but we'll need to address this once we've processed the Orthanc backlog and want to process new studies.

Member:

That might be a problem if there's any expectation that this will run on a local machine outside of testing, but it doesn't sound like that's likely.

Contributor Author:

I think there's an expectation that this is temporary. However, since I exposed it as a CLI flag, folks could use it down the road and run into trouble.

My brain is foggy on the best way to fix the problem. If the ETL had an HTTP API, we could talk to that. The only other solution is tunneling through the Bastion to talk to PostgreSQL. I'd like to return to this later (#36).

Member:

Yea, I don't have a clean way around that either, although we do have that bastion in place...

I'll 👍 for now and we can move this to #36

@rbreslow marked this pull request as ready for review on May 15, 2022 22:47
Name = "dbpgDatabaseServer"
Project = var.project
Environment = var.environment
Name = "dbpgDatabaseServer"

Contributor Author:

Formatting tweaks that can be ignored.

Member (@devbyaccident) left a review:

Good stuff! A couple of clarifications for my own sake, but I was able to run through the testing instructions with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.

+ curl https://storage.googleapis.com/flywheel-dist/cli/16.4.0/fw-linux_amd64-16.4.0.zip -o flywheel-cli.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

},
{
"name": "FLYWHEEL_API_KEY",
"value": "${flywheel_api_key}"

Member:

I am a little paranoid about having secrets stored in the environment variables that can be read from the AWS Console, but that might just be a reflex. Do you think it makes sense to pull them from secrets manager/S3 in the container entrypoint?

Contributor Author:

It might be. I've used this pattern for a while. Think about it: if you have console access to see the environment variables for a task, you probably also have console access to swap out the image for something else or do other nefarious things.

I still think the reflex is sound. Using Secrets Manager combined with fine-grained access controls would be a step up. But, it'd also increase complexity. Since I don't have experience with that pattern, I'd want to explore it in an ADR-lite-like setting.
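
If we do go the Secrets Manager route, Batch job definitions support the same secrets mechanism as ECS task definitions; a rough sketch (the resource names, image variable, and execution role are hypothetical, and the execution role would need permission to call secretsmanager:GetSecretValue):

resource "aws_secretsmanager_secret" "flywheel_api_key" {
  name_prefix = "flywheel-api-key-"
}

resource "aws_batch_job_definition" "etl" {
  name = "job${local.short}ImageDeidEtl" # hypothetical name
  type = "container"

  container_properties = jsonencode({
    image            = var.image_deid_etl_image            # hypothetical variable
    vcpus            = 1
    memory           = 1024
    executionRoleArn = aws_iam_role.etl_execution_role.arn # hypothetical role
    secrets = [
      {
        name      = "FLYWHEEL_API_KEY"
        valueFrom = aws_secretsmanager_secret.flywheel_api_key.arn
      }
    ]
  })
}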

Member (@devbyaccident, May 16, 2022):

Yeah, I think it couldn't hurt to look into. We do have contractors from other orgs with roles that would allow reading them in other accounts, but not in the HIPAA account (yet).

Let's take that to a DEVOPS ticket for a proper ADR, though, and note that this application (as well as many others, to be sure) will need to be modified based on the results.

Contributor Author (@rbreslow) left a review:

> [...] with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.

Ugh, yeah, @afamiliar ran into this as well. It's an issue with Netskope MITMing the connection to Google and swapping the TLS certificate out for a self-signed certificate. The Netskope root CA isn't present within the container, though, leading to the issue with curl. The only way to get past this is by disabling the Netskope launchd service.

@rbreslow requested a review from devbyaccident on May 16, 2022 19:56
@rbreslow force-pushed the feature/rb/batch branch from 5d1058d to e44d9cb on May 16, 2022 22:50

Member (@devbyaccident) commented May 16, 2022:

> The only way to get past this is by disabling the Netskope launchd service.

Ugh, looks like I have a reason to get on that exemption list after all now! I'll approve this PR though, ty for all the fixes!

Member (@devbyaccident) left a review:

👍

rbreslow added 3 commits May 17, 2022 12:42
- Create Batch resources like a job queue and compute environment.
- Create a job definition for the ETL; thread through requisite
  environment variables.
- Draft IAM resources that allow the ETL to read objects from S3.
@rbreslow force-pushed the feature/rb/batch branch from e44d9cb to af945d8 on May 17, 2022 16:42
This attempts to fix a "Connection reset by peer" error we're receiving
that may be a result of the call to io.BytesIO. I'd like to simplify
things to use the native Pandas API for reading from S3 to see if the
error goes away. Also, we're now reading these paths in from the
environment.

Contributor (@alubneuski) left a review:

💯

@rbreslow merged commit 27e175a into develop on May 21, 2022
@rbreslow deleted the feature/rb/batch branch on May 21, 2022 01:38
Development

Successfully merging this pull request may close these issues:

  • Change hardcoded var's to command options
  • Add Terraform configuration for a Batch compute environment

3 participants