Add Terraform configuration for AWS Batch #34
Conversation
}

resource "aws_batch_compute_environment" "default" {
  compute_environment_name_prefix = "batch${local.short}-"
The compute_environment_name_prefix argument, combined with the lifecycle meta-argument, allows you to change the CE without Terraform getting stuck. Terraform will detach the deposed CE from the job queue, and queued jobs will hang in the RUNNABLE status. Then, once the new CE is attached to the job queue, the juices will start flowing.
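For reference, the pairing looks roughly like this (a minimal sketch; the remaining compute environment arguments are elided, and the exact shape in this PR may differ):

resource "aws_batch_compute_environment" "default" {
  compute_environment_name_prefix = "batch${local.short}-"

  # compute_resources, service_role, type, etc. omitted for brevity.

  lifecycle {
    create_before_destroy = true
  }
}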
type = "SPOT" | ||
allocation_strategy = var.batch_spot_fleet_allocation_strategy | ||
bid_percentage = var.batch_spot_fleet_bid_percentage |
I've been using the capacity-optimized allocation strategy since it was introduced in 2019. At a high level, the capacity-optimized strategy will try to get the smallest instance type that fits your workload, but if none are available at your Spot bid percentage, it will grab a larger instance type so that jobs don't sit in the queue.

Since we supply this via an input variable, we can make adjustments by modifying the terraform.tfvars file.
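As a sketch of how that flows through (the default and the tfvars value below are illustrative assumptions, not necessarily what's in the repo):

variable "batch_spot_fleet_allocation_strategy" {
  type    = string
  default = "SPOT_CAPACITY_OPTIMIZED"
}

# terraform.tfvars (override without a code change)
# batch_spot_fleet_allocation_strategy = "BEST_FIT_PROGRESSIVE"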
ec2_configuration {
  image_type = "ECS_AL2"
}
This tells Batch to use the latest version of the ECS-optimized Amazon Linux 2 AMI.
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Manually mount unformatted instance store volumes. Mounting in a cloud-boothook
# makes it more likely the drive is mounted before the Docker daemon and ECS agent
# start, which helps mitigate potential race conditions.
#
# See:
# - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html#bootstrap_docker_daemon
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-ami-basics.html#supported-user-data-formats
mkfs.ext4 -E nodiscard /dev/nvme1n1
mkdir -p /media/ephemeral0
mount -t ext4 -o defaults,nofail,discard /dev/nvme1n1 /media/ephemeral0

--==BOUNDARY==
We're using the c5d, m5d, and z1d instance types, which all have NVMe SSD ephemeral storage. These instances allow for super-fast ephemeral (temporary) I/O, which is great for staging files in the ETL and performing many reads/writes (e.g., the conversion process). The instance store volumes are unformatted, so you need to initialize them whenever a container instance comes online.

Unfortunately, we can't take advantage of ephemeral storage just yet. As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).
As it's currently designed, the ETL will write files to the current working directory (which is /usr/local/src), so we need to add a mechanism that tells the ETL to perform operations in an alternative directory (e.g., /media/ephemeral0).
Is there an issue open for this mechanism yet?
Thank you for prompting me. I should've opened an issue when I encountered this. I opened #35.
resource "aws_iam_role" "batch_service_role" { | ||
name_prefix = "batch${local.short}ServiceRole-" | ||
assume_role_policy = data.aws_iam_policy_document.batch_assume_role.json | ||
} | ||
|
||
resource "aws_iam_role_policy_attachment" "batch_service_role_policy" { | ||
role = aws_iam_role.batch_service_role.name | ||
policy_arn = var.aws_batch_service_role_policy_arn | ||
} |
This IAM role allows the Batch service itself to do things like manage ECS clusters and send logs to CloudWatch.
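The trust policy itself isn't shown in this diff, but it presumably looks something like the standard Batch service principal; a sketch, assuming a data source named batch_assume_role as referenced above:

data "aws_iam_policy_document" "batch_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["batch.amazonaws.com"]
    }
  }
}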
resource "aws_iam_role" "spot_fleet_service_role" { | ||
name_prefix = "fleet${local.short}ServiceRole-" | ||
assume_role_policy = data.aws_iam_policy_document.spot_fleet_assume_role.json | ||
} | ||
|
||
resource "aws_iam_role_policy_attachment" "spot_fleet_service_role_policy" { | ||
role = aws_iam_role.spot_fleet_service_role.name | ||
policy_arn = var.aws_spot_fleet_service_role_policy_arn | ||
} |
This IAM role allows Batch to request Spot instances.
resource "aws_iam_role" "ecs_instance_role" { | ||
name_prefix = "ecs${local.short}InstanceRole-" | ||
assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json | ||
} | ||
|
||
resource "aws_iam_role_policy_attachment" "ec2_service_role_policy" { | ||
role = aws_iam_role.ecs_instance_role.name | ||
policy_arn = var.aws_ec2_service_role_policy_arn | ||
} | ||
|
||
resource "aws_iam_instance_profile" "ecs_instance_role" { | ||
name = aws_iam_role.ecs_instance_role.name | ||
role = aws_iam_role.ecs_instance_role.name | ||
} |
This IAM role is attached to the Batch container instances. Any policy attached to this role will be accessible to the ETL.
resource "aws_iam_role_policy" "scoped_etl_read" { | ||
name_prefix = "S3ScopedEtlReadPolicy-" | ||
role = aws_iam_role.ecs_instance_role.id | ||
policy = data.aws_iam_policy_document.scoped_etl_read.json | ||
} |
This inline IAM policy allows the Batch jobs to decrypt/read files in our PHI S3 bucket.
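The policy document isn't included in this excerpt, but its shape is probably close to the following sketch; the bucket name and KMS key ARN are placeholders, not the actual values in this project:

data "aws_iam_policy_document" "scoped_etl_read" {
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-phi-bucket"]
  }

  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-phi-bucket/*"]
  }

  statement {
    effect    = "Allow"
    actions   = ["kms:Decrypt"]
    resources = ["arn:aws:kms:us-east-1:111111111111:key/example-key-id"]
  }
}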
"vcpus": 1, | ||
"memory": 1024, |
I used mprof to benchmark the ETL (including forked child processes) and generate a memory usage report for a single study.

Since our slice of the pipeline is single-threaded (I'm not sure about the architecture of dcm2niix or the Flywheel CLI), 1 vCPU seems to make initial sense. And, since the study peaked at around 800 MB of memory usage, I think 1024 MB makes sense. It's worth noting that these values are vCPU/memory reservations; most of the time, the instance types that Batch chooses will be oversized.

If any of the jobs go over their initial reservation, we can identify this via CloudWatch metrics and adjust the resource reservations accordingly.
I really like how well thought out this analysis is. That being said, I think we should have the vcpus/memory in Terraform variables so that if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify them without making any code changes.

Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?
That being said, I think we should have the vcpus/memory in Terraform variables so that if they need to be adjusted, such as in the case of a larger study or if the sample data we have here is lighter than the actual data we will be receiving, we can modify them without making any code changes.
Good idea. Resolved in 5d1058d.
Since ECS task CPU/memory has to be configured in specific amounts, maybe we could use a map with some pre-defined combinations?
Those constraints are mainly just for Fargate. With an EC2-backed ECS cluster, you can set pretty much any values for vCPUs and memory that you want. Also, since these values are validated by the AWS API (an invalid combination will fail the apply), I think it's alright to omit type constraints.
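For illustration, the variable-based approach could look something like this; the names are hypothetical, the actual variables added in 5d1058d may differ, and they intentionally carry no type constraints:

variable "batch_etl_job_vcpus" {
  default = 1
}

variable "batch_etl_job_memory" {
  default = 1024
}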
Oh, nice. Ok, then this gets the 👍 from me!
parser_run.add_argument(
    "--batch",
    action="store_true",
    help="skip local processing and submit job(s) to AWS Batch",
)
I added this flag as a quick way to process studies on Batch. The more "robust" way that @afamiliar and I settled on is captured in #8 but will take more time to develop.
I realized that running image-deid-etl check on your host will compare Orthanc studies to your local database. However, invocations of the ETL on Batch will mark Orthanc studies as processed on the RDS instance. So, unfortunately, image-deid-etl check will always be out-of-sync with the RDS instance.
I think this is OK for the first batch of studies since both databases would be empty, but we'll need to address this once we've processed the Orthanc backlog and want to process new studies.
That might be a problem if there's any expectation that this will run on a local machine outside of testing, but it doesn't sound like that's likely.
I think there's an expectation that this is temporary. However, since I exposed it as a flag in the CLI, folks could still use it down the road and run into trouble.
My brain is foggy on the best way to fix the problem. If the ETL had an HTTP API, we could talk to that. The only other solution is tunneling through the Bastion to talk to PostgreSQL. I'd like to return to this later (#36).
Yea, I don't have a clean way around that either, although we do have that bastion in place...
I'll 👍 for now and we can move this to #36
Name        = "dbpgDatabaseServer"
Project     = var.project
Environment = var.environment
Name        = "dbpgDatabaseServer"
Formatting tweaks that can be ignored.
Good stuff! Couple of clarifications for my own sake, but I was able to run through the testing instructions, with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.
+ curl https://storage.googleapis.com/flywheel-dist/cli/16.4.0/fw-linux_amd64-16.4.0.zip -o flywheel-cli.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
},
{
  "name": "FLYWHEEL_API_KEY",
  "value": "${flywheel_api_key}"
I am a little paranoid about having secrets stored in the environment variables that can be read from the AWS Console, but that might just be a reflex. Do you think it makes sense to pull them from secrets manager/S3 in the container entrypoint?
It might be. I've used this pattern for a while, though. Think about it: if you have console access to see the environment variables for a task, you probably also have console access to swap out the image for something else or do other nefarious things.
I still think the reflex is sound. Using Secrets Manager combined with fine-grained access controls would be a step up. But, it'd also increase complexity. Since I don't have experience with that pattern, I'd want to explore it in an ADR-lite-like setting.
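If we do explore it, an untested sketch of the Secrets Manager route might look like the following; every name here (the secret, the job definition, the image, and especially the execution role) is hypothetical, and the secret value would still need to be populated out of band:

resource "aws_secretsmanager_secret" "flywheel_api_key" {
  name_prefix = "flywheel-api-key-"
}

resource "aws_batch_job_definition" "etl" {
  name = "example-etl"
  type = "container"

  container_properties = jsonencode({
    image  = "example/image-deid-etl:latest"
    vcpus  = 1
    memory = 1024

    # Batch needs an execution role with secretsmanager:GetSecretValue to
    # inject the secret at container start.
    executionRoleArn = aws_iam_role.ecs_task_execution_role.arn

    secrets = [
      {
        name      = "FLYWHEEL_API_KEY"
        valueFrom = aws_secretsmanager_secret.flywheel_api_key.arn
      }
    ]
  })
}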
Yeah, I think it couldn't hurt to look into. We do have contractors from other orgs with roles that would allow reading them in other accounts, but not in the HIPAA account (yet).

Let's take that to a DEVOPS ticket, though, for a proper ADR, but note that this application (as well as many others, to be sure) will need to be modified based on the results.
[...] with the exception of connecting to the container instance with ./scripts/console, which threw an error when building the image-deid-etl container.

Ugh, yeah, @afamiliar ran into this as well. It's an issue with Netskope MITMing the connection to Google and swapping the TLS certificate out for a self-signed certificate. The Netskope root CA isn't present within the container, though, leading to the issue with curl. The only way to get past this is by disabling the Netskope launchd service.
Force-pushed from 5d1058d to e44d9cb
Ugh, looks like I have a reason to get on that exemption list after all now! I'll approve this PR though, ty for all the fixes!
👍
- Create Batch resources like a job queue and compute environment.
- Create a job definition for the ETL; thread through requisite environment variables.
- Draft IAM resources that allow the ETL to read objects from S3.
Force-pushed from e44d9cb to af945d8
This attempts to fix a "Connection reset by peer" error we're receiving that may be a result of the call to io.BytesIO. I'd like to simplify things to use the native Pandas API for reading from S3 to see if the error goes away. Also, we're now reading these paths in from the environment.
Force-pushed from af945d8 to 5c27460
💯
Overview
I made the following changes to support running the ETL using AWS Batch:
I also added documentation on how to perform a deployment, and a flag to the CLI that allows us to start submitting Batch jobs so we can get rolling before #8.
Resolves #6
Resolves #18
Checklist
- fixup! commits

Testing Instructions
Make sure your AWS SSO session is current:
Launch an instance of the included Terraform container image:
Once inside the context of the container image, set GIT_COMMIT to 5c27460 (I already published this tag):

bash-5.1# export GIT_COMMIT=5c27460
Use infra to generate and apply a Terraform plan:

Follow the new deployment README's instructions to execute database migrations by submitting a Batch job:
Note: While you are following these instructions, observe Batch adjust the desired vCPUs of the compute environment to match the queued database migration job's requirements. The job status will stay in RUNNABLE until the underlying ECS cluster can win a Spot bid and spin up a container instance to take the job. To start, I made the maximum Spot bid 90% of on-demand pricing, but we can tune this down later on since the workloads aren't time-sensitive.

Next, launch a shell within the Python application container image:
$ ./scripts/console
And finally, use the CLI to get the UUID of one unprocessed Orthanc study and kick off a Batch job:
Note: Unfortunately, it's more likely than not that the job will fail. Most of the failures are related to a new batch of studies in Orthanc. @afamiliar resolved this in #29; we just haven't merged the fix yet. However, take a look at the detail view for the Batch job and the logs in CloudWatch. You'll see that the deployment is working 😎.