Running the HLS PGE on AWS Batch Guide
Description
This tutorial demonstrates the steps necessary to execute the HLS PGE v1 er.1 on AWS Batch. Running the HLS PGE on Batch can be useful for quick profiling and performance evaluation of the PGE, among other uses.
Requirements
Setup
Running the HLS PGE in AWS Batch requires the following setup / architecture:
AWS ECR
You'll need to upload the HLS PGE to AWS ECR. Make sure you have write permission to ECR, and follow this tutorial.
Create Repository
First, you'll want to create an AWS ECR repository to store your PGE images. You can do this by navigating to AWS ECR, clicking the "Create Repository" button on the main page, and following the instructions. Ask your system administrator if you don't have such permissions. Note down the repository name / URI (e.g. *.dkr.ecr.us-west-2.amazonaws.com or equivalent), which you'll need in the steps below.
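If you prefer the command line and have the AWS CLI configured, a minimal sketch of creating the repository is below (the repository name simply mirrors the image name used later in this guide; adjust to your own naming conventions):
aws ecr create-repository --repository-name opera_pge/dswx_hls --region us-west-2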
Tag Your PGE Locally
Locally on your own machine, you'll want to tag your HLS PGE image prior to pushing it to AWS ECR. For example:
docker tag opera_pge/dswx_hls:1.0.0-er.1.0 [YOUR-ECR-HOSTNAME].dkr.ecr.us-west-2.amazonaws.com/opera_pge/dswx_hls:1.0.0-er.1.0
Push PGE up to AWS ECR
Then push it up to ECR:
docker push [YOUR-ECR-HOSTNAME].dkr.ecr.us-west-2.amazonaws.com/opera_pge/dswx_hls:1.0.0-er.1.0
Replace [YOUR-ECR-HOSTNAME] with the ECR registry hostname for your AWS account. You can find this information on AWS ECR or from the earlier steps. Once the PGE image has been pushed / uploaded, you should see it appear in the list of ECR images, under the new repository you created. Below is an example of what you should see in your ECR repository after a successful upload.
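Note: if the docker push step fails with an authentication error, you may need to log Docker in to your ECR registry first. A minimal sketch, assuming the AWS CLI v2 is installed and configured for us-west-2:
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin [YOUR-ECR-HOSTNAME].dkr.ecr.us-west-2.amazonaws.com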
AWS EFS
In order to read sample input datasets for your PGE, as well as to read a configuration file and write out output results, your PGE on AWS Batch will need a storage system to interact with. Batch has integrations with AWS EFS to allow for this. Some setup is required however.
Create an EFS filesystem
The first step is to create an AWS EFS file system that you can read files from and write files to. Ask your administrator, or alternatively, if you have permissions, set up your own EFS file system by navigating to the AWS EFS dashboard and clicking the "Create file system" button.
When creating the file system, make sure to place it in the same Virtual Private Cloud (VPC) that you plan to use for AWS Batch's compute environment (details below)! Otherwise, your PGE containers will not be able to access the AWS EFS file system. Feel free to specify a single Availability Zone to limit costs, but again, ensure your Batch compute environment is located within the same Availability Zone.
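For reference, a minimal CLI sketch of creating the file system and a mount target is below. The file system name is illustrative, and the subnet and security group IDs are placeholders - they must belong to the same VPC / Availability Zone as your Batch compute environment:
aws efs create-file-system --region us-west-2 --tags Key=Name,Value=hls-pge-efs
aws efs create-mount-target --file-system-id fs-XXX --subnet-id subnet-XXXXXXXX --security-groups sg-XXXXXXXX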
Once created, note down the file system's fs-XXX ID, which you'll need in subsequent steps.
Upload PGE Necessary Data to EFS
Before you can upload PGE necessary data and folder structures to EFS, you'll need to have an AWS EC2 instance available that has the AWS EFS filesystem from the previous step mounted as a network data store. Ask your system administrator to set up an EC2 node with the prior EFS filesystem as a mount for you, or follow these steps.
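For reference, a minimal sketch of mounting the file system on an Amazon Linux 2 EC2 instance is below (it assumes the amazon-efs-utils package; /mnt/efs is an arbitrary mount point reused in later examples):
sudo yum install -y amazon-efs-utils
sudo mkdir -p /mnt/efs
sudo mount -t efs fs-XXX:/ /mnt/efs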
Once you've confirmed you have an EC2 instance available with your prior AWS EFS file system mounted, you'll want to open up a terminal to begin the upload process:
You'll want to assemble an output_dir to write results in, a run_config_dir that contains your sample configuration file, and a test_datasets folder that contains sub-folders of sample input data. A sketch is given below.
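Assuming a hypothetical hls_testing parent folder and illustrative file names (the three directory names come from this guide; everything else is a placeholder), the structure could be assembled locally like so:
mkdir -p hls_testing/output_dir hls_testing/run_config_dir hls_testing/test_datasets
cp my_dswx_hls_runconfig.yaml hls_testing/run_config_dir/
cp -r my_sample_granules/ hls_testing/test_datasets/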
Next, use the scp client to transfer the folder structure and contents above to your AWS EFS mount on your EC2 instance.
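For example, assuming the illustrative hls_testing folder and /mnt/efs mount point from above, and that the EC2 user can write to the mount (substitute your own key and hostname):
scp -r -i ~/.ssh/my-key.pem hls_testing/ ec2-user@[YOUR-EC2-HOST]:/mnt/efs/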
Ensure the output_dir on EFS is writable by the group or user your containers are leveraging. If the permissions are not set up correctly, your PGE will fail due to write / permission denied errors when trying to write results to the output_dir.
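One permissive option for a quick test run (not appropriate for shared or production systems; the path is the illustrative one from above):
sudo chmod -R 777 /mnt/efs/hls_testing/output_dir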
Consult your system administrator on the proper setup.
AWS Batch
We'll now work on setting up AWS Batch such that you'll be able to do a test run of your PGE against the sample data that you previously uploaded.
Setting Up AWS Batch Compute Environments
You'll need a compute environment to run your PGE on Batch. The compute environment helps tell Batch the resources (which EC2 instance types) and the type of procurement model (Spot market, on-demand) to leverage. You can have as many compute environments as you like for various scenarios.
Ask your system administrator, or follow the steps below if you have permission to do so.
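If you'd rather use the CLI, a rough sketch of creating a managed, on-demand compute environment is below. All names, the subnet, the security group, and the roles are placeholders; your account's networking and IAM setup will differ:
aws batch create-compute-environment \
  --compute-environment-name aws-batch-dev-myuser-ce \
  --type MANAGED \
  --state ENABLED \
  --service-role AWSBatchServiceRole \
  --compute-resources '{"type":"EC2","minvCpus":0,"maxvCpus":16,"desiredvCpus":0,"instanceTypes":["m5.xlarge"],"subnets":["subnet-XXXXXXXX"],"securityGroupIds":["sg-XXXXXXXX"],"instanceRole":"ecsInstanceRole"}'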
Setting up AWS Batch Job Queues
AWS Batch Job Queues are essential for defining a mapping between submitted jobs and respective compute environments. Since you can't submit an AWS Batch job directly to a compute environment, you use queues as the go-between. Create as many queues as you want, including fewer or more than the number of compute environments, in case you want to establish a precedence ordering of which compute environment to leverage based on job volumes. See this guide for more explanation.
Give your queue a descriptive name, aws-batch-(ops|dev)-(username)-spotonly-queue for example. Leave "Priority" at 1 and leave "Scheduling policy ARN" disabled. In the "Tags" section, feel free to add tags as you wish. Contact your system administrator on tagging norms and policies.
In the "Connected compute environments" section, select your previously created compute environments from the previous step, and set the compute environment ordering if you select multiple. Multiple compute environments can be mapped to a single queue and are leveraged based on the AWS job scheduler. Why have multiple compute environments? To help balance your resources if you need massive compute scale. See step 8 in this guide for more features / limitations.
Setting up an AWS Batch Job Definition
Job definitions help define the type of job you'd like to submit. For example, you may have different configurations for your PGE that require different settings, versions of binaries, or Run Configs. Having different job definitions helps you with these needs.
Note that job definitions are versioned: each revision you save receives a :1 or a :2 etc. postfix. Also note that an "Execution timeout" of 3600 seconds may not be enough, so set it carefully depending on your specific PGE. You can always override this value when submitting a job though!
The first step is to specify your "Image" to point to your AWS ECR PGE image. Use the ECR URI from the earlier ECR setup instructions here. Additionally, you'll want to specify "Bash" for "Command syntax" and use the custom HLS PGE Docker arguments to run the command needed (see the HLS PGE Users Guide for more details). An example is given below:
Next, you'll want to fill out the "vCpus" and "Memory" fields per your job requirements. This will help map your future job submissions to appropriate EC2 instance nodes within the compute environments you set up, based on the allocation strategies you specified. The default values below are probably sufficient for a test run.
Next, you'll specify the "Job role" and "Security configuration". For the job role, use the Batch service role you defined earlier, and for "Security configuration", enable it and set the user as conda, per the HLS PGE Users Guide requirements.
Next, you'll set up your "Mount points configuration". Again, consult the HLS PGE Users Guide on the specific requirements of which mount points are needed to run the PGE, but for the HLS v1 er.1 PGE, the following should be copied verbatim:
Next, in the "Volumes configuration" section, you'll want to specify specific sub-directories within your AWS EFS file system to map against the mount points your HLS PGE Docker container requires. Again, consult the HLS PGE Users Guide for more custom specifics for your use case, but for the HLS v1 ecr1 PGE, the following is sufficient, noting to modify the
fs-XXX
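Purely as an illustration of how an EFS volume pairs with a container mount point in a Batch job definition's container properties (the volume name, rootDirectory, and containerPath below are placeholders; use the values required by the HLS PGE Users Guide):
"volumes": [
  { "name": "run_config_dir",
    "efsVolumeConfiguration": { "fileSystemId": "fs-XXX", "rootDirectory": "/hls_testing/run_config_dir" } }
],
"mountPoints": [
  { "sourceVolume": "run_config_dir", "containerPath": "/path/required/by/pge", "readOnly": true }
]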
Submitting a Job
If you followed the steps in the Setup section successfully, you're ready to submit a job! As with all AWS actions, you can use the Command Line Interface (CLI) to programmatically interact with AWS. Submitting jobs is no different. The coverage of using the CLI is outside the scope of this guide, but see this guide for how to set up the CLI and these resources for more details on the Batch CLI interface. For this guide, we'll use the GUI interface. Your local AWS environment may need additional steps to enable CLI use! Consult your system administrator.
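For reference, a CLI submission might look roughly like the following; the job name is arbitrary, and the queue and job definition names are the placeholders used in the earlier sketches:
aws batch submit-job \
  --job-name dswx-hls-test-run \
  --job-queue aws-batch-dev-myuser-queue \
  --job-definition dswx-hls-job-definition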
Customize the "Job configuration" as you see fit or use the defaults within your job definition.
Modify the "Retry Strategies" if desired, otherwise ignore.
Click "Submit" to submit the job
You can also navigate to the Jobs view in the left hand panel to see more information about your job. You should eventually see your job in the "Running" state
If you click on the job ID, you'll be navigated to a job details page, which includes a link titled "Log stream name" that points to the AWS CloudWatch logging results of your job.
After your job completes, check the AWS CloudWatch logs or your AWS EFS file system's output_dir folder for your results!
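For example, from the EC2 instance with the EFS mount (using the illustrative /mnt/efs and hls_testing paths from earlier):
ls -l /mnt/efs/hls_testing/output_dir/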
Congratulations on finishing the tutorial!
Next Steps
Some recommendations on further steps you might want to consider: