
Add Terraform configuration for AWS Batch #34

Merged: 4 commits merged into develop from feature/rb/batch on May 21, 2022

Conversation

Contributor (@rbreslow) commented May 13, 2022

Overview

I made the following changes to support running the ETL using AWS Batch:

  • Create Batch resources like a job queue and compute environment.
  • Create a job definition for the ETL; thread through requisite environment variables.
  • Draft IAM resources that allow the ETL to read objects from S3.

I also added documentation on how to perform a deployment, and a CLI flag that allows us to start submitting Batch jobs so we can get rolling before #8.

Resolves #6
Resolves #18

Checklist

  • Squashed any fixup! commits
  • Updated README.md to reflect any changes

Testing Instructions

Make sure your AWS SSO session is current:

$ export AWS_PROFILE=chopd3bprod
$ aws sso login

Launch an instance of the included Terraform container image:

$ docker-compose -f docker-compose.ci.yml run --rm terraform
bash-5.1#

Once inside the container, set GIT_COMMIT to 5c27460 (I already published this tag):

bash-5.1# export GIT_COMMIT=5c27460

Use infra to generate and apply a Terraform plan:

bash-5.1# ./scripts/infra plan
bash-5.1# ./scripts/infra apply

Follow the new deployment README's instructions to execute database migrations by submitting a Batch job.

Note: While you are following these instructions, observe Batch adjust the desired vCPUs of the compute environment to match the queued database migration job requirements. The job status will stay in RUNNABLE until the underlying ECS cluster can win a spot bid and spin up a container instance to take the job. To start, I made the maximum spot bid 90% of on-demand pricing, but we can tune this down later on since the workloads aren't time-sensitive.

Next, launch a shell within the Python application container image:

$ ./scripts/console

And finally, use the CLI to get the UUID of one unprocessed Orthanc study and kick off a Batch job:

root@cf845f70ec0f:/usr/local/src# image-deid-etl check --limit 1 --raw | xargs image-deid-etl run --batch
Job started! View here:
https://console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/94de3455-24eb-4bcd-b868-6af5b74286b7

Note: Unfortunately, it's more likely than not that the job will fail. Most of the failures are related to a new batch of studies in Orthanc. @afamiliar resolved this in #29; we just haven't merged the fix yet. However, take a look at the detail view for the Batch job and the logs in CloudWatch. You'll see that the deployment is working 😎.

@rbreslow self-assigned this on May 13, 2022
@rbreslow marked this pull request as draft on May 13, 2022 18:34
@alubneuski self-requested a review on May 13, 2022 18:47
}

resource "aws_batch_compute_environment" "default" {
compute_environment_name_prefix = "batch${local.short}-"

Contributor Author:

The compute_environment_name_prefix combined with the lifecycle meta-argument allows you to change the CE without Terraform getting stuck.

Terraform will detach the deposed CE from the job queue, and queued jobs will hang in the RUNNABLE status. Then, once the new CE is attached to the job queue, jobs will start flowing again.
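
For reference, a minimal sketch of the pattern being described (the lifecycle meta-argument is presumably create_before_destroy; the other arguments are elided here):

resource "aws_batch_compute_environment" "default" {
  compute_environment_name_prefix = "batch${local.short}-"

  # compute_resources, service_role, type, etc. elided

  lifecycle {
    # Create the replacement CE (with a freshly generated name) before
    # destroying the deposed one, so Terraform never has to delete a CE
    # that is still attached to the job queue.
    create_before_destroy = true
  }
}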

Comment on lines +42 to +44
type                = "SPOT"
allocation_strategy = var.batch_spot_fleet_allocation_strategy
bid_percentage      = var.batch_spot_fleet_bid_percentage

Contributor Author:

I've been using the capacity-optimized allocation strategy since it was introduced in 2019.

At a high level, the capacity-optimized strategy picks instance types from the Spot capacity pools with the deepest available capacity. That can mean a larger instance type than the workload strictly needs, but it makes Spot interruptions less likely and keeps jobs from sitting in the queue.

Since we supply this via an input variable, we can make adjustments by modifying the terraform.tfvars file.
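
As a sketch of what that looks like (the variable names come from the diff above; the types and defaults shown here are illustrative):

variable "batch_spot_fleet_allocation_strategy" {
  type    = string
  default = "SPOT_CAPACITY_OPTIMIZED"
}

variable "batch_spot_fleet_bid_percentage" {
  type    = number
  default = 90
}

# terraform.tfvars override, e.g. to lower the maximum Spot bid later:
# batch_spot_fleet_bid_percentage = 60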

Comment on lines +46 to +48
ec2_configuration {
  image_type = "ECS_AL2"
}

Contributor Author:

This tells Batch to use the latest version of the ECS-optimized Amazon Linux 2 AMI.

Comment on lines +1 to +18
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Manually mount unformatted instance store volumes. Mounting in a cloud-boothook
# makes it more likely the drive is mounted before the Docker daemon and ECS agent
# start, which helps mitigate potential race conditions.
#
# See:
# - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html#bootstrap_docker_daemon
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-ami-basics.html#supported-user-data-formats
mkfs.ext4 -E nodiscard /dev/nvme1n1
mkdir -p /media/ephemeral0
mount -t ext4 -o defaults,nofail,discard /dev/nvme1n1 /media/ephemeral0

--==BOUNDARY==

Contributor Author:

We're using the c5d, m5d, and z1d instance types which all have NVMe SSD ephemeral storage. These instances allow for super-fast ephemeral (temporary) I/O, which is great for staging files in the ETL and performing many reads/writes (e.g., the conversion process).

The instance store volumes are unformatted, so you need to initialize them whenever a container instance comes online.

Unfortunately, we can't take advantage of ephemeral storage just yet. As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).

Member:

> As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).

Is there an issue open for this mechanism yet?

Contributor Author:

Thank you for prompting me. I should've opened an issue when I encountered this. I opened #35.

Comment on lines +17 to +25
resource "aws_iam_role" "batch_service_role" {
  name_prefix        = "batch${local.short}ServiceRole-"
  assume_role_policy = data.aws_iam_policy_document.batch_assume_role.json
}

resource "aws_iam_role_policy_attachment" "batch_service_role_policy" {
  role       = aws_iam_role.batch_service_role.name
  policy_arn = var.aws_batch_service_role_policy_arn
}

Contributor Author:

This IAM role allows the Batch service itself to do things like manage ECS clusters and send logs to CloudWatch.
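
For context, the trust policy referenced above is presumably the standard Batch service principal, with var.aws_batch_service_role_policy_arn pointing at the AWSBatchServiceRole managed policy; a sketch:

data "aws_iam_policy_document" "batch_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["batch.amazonaws.com"]
    }
  }
}

# e.g., var.aws_batch_service_role_policy_arn would default to
# "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"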

Comment on lines +43 to +51
resource "aws_iam_role" "spot_fleet_service_role" {
  name_prefix        = "fleet${local.short}ServiceRole-"
  assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume_role.json
}

resource "aws_iam_role_policy_attachment" "spot_fleet_service_role_policy" {
  role       = aws_iam_role.spot_fleet_service_role.name
  policy_arn = var.aws_spot_fleet_service_role_policy_arn
}

Contributor Author:

This IAM role allows Batch to request Spot instances.

Comment on lines +69 to +82
resource "aws_iam_role" "ecs_instance_role" {
  name_prefix        = "ecs${local.short}InstanceRole-"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

resource "aws_iam_role_policy_attachment" "ec2_service_role_policy" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = var.aws_ec2_service_role_policy_arn
}

resource "aws_iam_instance_profile" "ecs_instance_role" {
  name = aws_iam_role.ecs_instance_role.name
  role = aws_iam_role.ecs_instance_role.name
}

Contributor Author:

This IAM role is attached to the Batch container instances. Any policy attached to this role will be accessible to the ETL.

Comment on lines +118 to +122
resource "aws_iam_role_policy" "scoped_etl_read" {
  name_prefix = "S3ScopedEtlReadPolicy-"
  role        = aws_iam_role.ecs_instance_role.id
  policy      = data.aws_iam_policy_document.scoped_etl_read.json
}

Contributor Author:

This inline IAM policy allows the Batch jobs to decrypt/read files in our PHI S3 bucket.
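
A minimal sketch of what the referenced policy document could look like (the bucket and KMS key variables are hypothetical placeholders, not the actual resource names):

data "aws_iam_policy_document" "scoped_etl_read" {
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.phi_bucket_name}"]
  }

  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::${var.phi_bucket_name}/*"]
  }

  statement {
    effect    = "Allow"
    actions   = ["kms:Decrypt"]
    resources = [var.phi_bucket_kms_key_arn]
  }
}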

Comment on lines 3 to 4
"vcpus": 1,
"memory": 1024,

Contributor Author:

I used mprof to benchmark the ETL (including forked child processes) and generate a memory usage report for a single study (memory usage plot omitted).

Since our slice of the pipeline is single-threaded (not sure about the architecture of dcm2niix or the Flywheel CLI), 1 vCPU seems like a sensible starting point. And, since the study peaked at roughly 800 MB of memory usage, I think 1024 MB makes sense. It's worth noting that these values are vCPU/memory reservations; most of the time, the instance types that Batch chooses will be oversized.

If any of the jobs go over their initial reservation, we can identify this via CloudWatch metrics, and adjust resource reservations accordingly.

Member:

I really like how well thought out this analysis is. That being said, I think we should have the vcpus/memory in terraform variables so if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify it without making any code changes.

Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?

Contributor Author (@rbreslow, May 16, 2022):

> That being said, I think we should have the vcpus/memory in terraform variables so if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify it without making any code changes.

Good idea. Resolved in 5d1058d.

> Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?

Those constraints are mainly just for Fargate. With an EC2-backed ECS cluster, you can set pretty much any value for vCPUs and memory. Also, since these values are validated by the AWS API at apply time, I think it's alright to omit type constraints.
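
A sketch of the variable shape this implies (names hypothetical; per the note above, type constraints are omitted and the AWS API validates the values):

variable "batch_etl_job_vcpus" {
  default = 1
}

variable "batch_etl_job_memory" {
  # MiB reserved for the job container
  default = 1024
}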

Member:

Oh, nice. Ok, then this gets the 👍 from me!

Comment on lines +299 to +309
parser_run.add_argument(
    "--batch",
    action="store_true",
    help="skip local processing and submit job(s) to AWS Batch",
)

Contributor Author:

I added this flag as a quick way to process studies on Batch. The more "robust" way that @afamiliar and I settled on is captured in #8 but will take more time to develop.

I realized that running image-deid-etl check on your host compares Orthanc studies against your local database, while invocations of the ETL on Batch mark Orthanc studies as processed on the RDS instance. So, unfortunately, a local image-deid-etl check will always be out of sync with the RDS instance.

I think this is OK for the first batch of studies since both databases would be empty, but we'll need to address this once we've processed the Orthanc backlog and want to process new studies.

Member:

That might be a problem if there's any expectation that this will run on a local machine outside of testing, but it doesn't sound like that's likely.

Contributor Author:

I think there's an expectation that this is temporary. However, since I exposed it as a CLI flag, folks could use it down the road and run into trouble.

My brain is foggy on the best way to fix the problem. If the ETL had an HTTP API, we could talk to that. The only other solution is tunneling through the Bastion to talk to PostgreSQL. I'd like to return to this later (#36).

Member:

Yea, I don't have a clean way around that either, although we do have that bastion in place...

I'll 👍 for now and we can move this to #36

@rbreslow marked this pull request as ready for review on May 15, 2022 22:47
Name = "dbpgDatabaseServer"
Project = var.project
Environment = var.environment
Name = "dbpgDatabaseServer"

Contributor Author:

Formatting tweaks that can be ignored.

Member (@devbyaccident) left a review:

Good stuff! A couple of clarifications for my own sake, but I was able to run through the testing instructions with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.

+ curl https://storage.googleapis.com/flywheel-dist/cli/16.4.0/fw-linux_amd64-16.4.0.zip -o flywheel-cli.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

},
{
"name": "FLYWHEEL_API_KEY",
"value": "${flywheel_api_key}"

Member:

I am a little paranoid about having secrets stored in the environment variables that can be read from the AWS Console, but that might just be a reflex. Do you think it makes sense to pull them from secrets manager/S3 in the container entrypoint?

Contributor Author:

It might be. I've used this pattern for a while. Think about it: if you have console access to see the environment variables for a task, you probably also have console access to swap out the image for something else or do other nefarious things.

I still think the reflex is sound. Using Secrets Manager combined with fine-grained access controls would be a step up. But, it'd also increase complexity. Since I don't have experience with that pattern, I'd want to explore it in an ADR-lite-like setting.
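
If we do go the Secrets Manager route, Batch job definitions support the same secrets mechanism as ECS task definitions; a rough sketch (the resource names, image variable, and execution role are hypothetical, and the execution role would need permission to call secretsmanager:GetSecretValue):

resource "aws_secretsmanager_secret" "flywheel_api_key" {
  name_prefix = "flywheel-api-key-"
}

resource "aws_batch_job_definition" "etl" {
  name = "job${local.short}ImageDeidEtl" # hypothetical name
  type = "container"

  container_properties = jsonencode({
    image            = var.image_deid_etl_image            # hypothetical variable
    vcpus            = 1
    memory           = 1024
    executionRoleArn = aws_iam_role.etl_execution_role.arn # hypothetical role
    secrets = [
      {
        name      = "FLYWHEEL_API_KEY"
        valueFrom = aws_secretsmanager_secret.flywheel_api_key.arn
      }
    ]
  })
}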

Member (@devbyaccident, May 16, 2022):

Yeah, I think it couldn't hurt to look into. We do have contractors from other orgs with roles that would allow reading them in other accounts, but not in the HIPAA account (yet).

Let's take that to a DEVOPS ticket for a proper ADR, though, and note that this application (as well as many others, to be sure) will need to be modified based on the results.

Contributor Author (@rbreslow) left a review:

> [...] with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.

Ugh, yeah, @afamiliar ran into this as well. It's an issue with Netskope MITMing the connection to Google and swapping the TLS certificate out for a self-signed certificate. The Netskope root CA isn't present within the container, though, leading to the issue with curl. The only way to get past this is by disabling the Netskope launchd service.

@rbreslow requested a review from devbyaccident on May 16, 2022 19:56
@rbreslow force-pushed the feature/rb/batch branch from 5d1058d to e44d9cb on May 16, 2022 22:50

Member (@devbyaccident) commented May 16, 2022:

> The only way to get past this is by disabling the Netskope launchd service.

Ugh, looks like I have a reason to get on that exemption list after all now! I'll approve this PR though, ty for all the fixes!

Member (@devbyaccident) left a review:

👍

rbreslow added 3 commits May 17, 2022 12:42
- Create Batch resources like a job queue and compute environment.
- Create a job definition for the ETL; thread through requisite
  environment variables.
- Draft IAM resources that allow the ETL to read objects from S3.
@rbreslow force-pushed the feature/rb/batch branch from e44d9cb to af945d8 on May 17, 2022 16:42
This attempts to fix a "Connection reset by peer" error we're receiving
that may be a result of the call to io.BytesIO. I'd like to simplify
things to use the native Pandas API for reading from S3 to see if the
error goes away. Also, we're now reading these paths in from the
environment.

Contributor (@alubneuski) left a review:

💯

@rbreslow merged commit 27e175a into develop on May 21, 2022
@rbreslow deleted the feature/rb/batch branch on May 21, 2022 01:38
Development

Successfully merging this pull request may close these issues:

  • Change hardcoded var's to command options
  • Add Terraform configuration for a Batch compute environment

3 participants