Support for attaching multiple EFA interfaces via `awsprov_templates.json` #76

tiagt · 2025-01-19T12:40:53Z

What type of PR is this?

/kind feature

Which issue(s) this PR fixes:

N/A

DESCRIPTION:

Customers using the AWS cloud provider plug-in are currently unable to leverage the entire available Elastic Fabric Adapter (EFA) bandwidth for clusters provisioned by inline templates. Instead, customers must use EC2 Launch Templates (finer-grained control) or EC2 Fleet (massively parallelized workloads), which require deeper knowledge of AWS services and maintaining separate resources.

Inline provisioning templates are a convenient way of adding capacity just-in-time. They're straightforward to use, repeatable, and give engineers a development environment where they can iterate by focusing on the workload, while the infrastructure is represented by a single file (awsprov_templates.json). A Launch Template, for example, is an AWS resource that must be created ahead of time, referenced by an ARN, and maintained/updated separately.

This pull request proposes to add a new parameter to awsprov_templates.json called efaCount. Where maxNumber controls how many instances to launch, efaCount controls how many EFA interfaces to create and attach to each node. For customers just starting with Resource Connector (RC), or even for seasoned ones, this inline configuration seems to unlock/cover a large number of use-cases with a fraction of the complexity.

Represented below is the typical configuration for an EFA-enabled cluster, with the added efaCount parameter.

{
    "interfaceType": "efa",
    "placementGroupName": "pg",
    "efaCount": 2
}
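
To make the semantics concrete, here is a minimal Python sketch of how an efaCount value could be expanded into per-interface specifications for an EC2 launch request. It is an illustration only, not the code in this pull request: the helper name `build_efa_interfaces` is hypothetical, the field names mirror the `NetworkInterfaces` structure of the EC2 RunInstances API, and the layout (one EFA per network card, primary at device index 0, the rest at device index 1) is an assumption.

```python
from typing import Any


def build_efa_interfaces(
    efa_count: int, subnet_id: str, security_group_ids: list[str]
) -> list[dict[str, Any]]:
    """Expand an efaCount value into one interface spec per network card.

    Hypothetical helper; the card/device layout below is an assumption.
    """
    if efa_count < 1:
        raise ValueError("efaCount must be at least 1")
    interfaces = []
    for card in range(efa_count):
        interfaces.append({
            "NetworkCardIndex": card,
            # Assumed layout: primary interface at device 0, others at device 1.
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": list(security_group_ids),
        })
    return interfaces


specs = build_efa_interfaces(
    2, "subnet-00000000000000000", ["sg-00000000000000000"]
)
for spec in specs:
    print(spec["NetworkCardIndex"], spec["DeviceIndex"], spec["InterfaceType"])
```

With efaCount left at 1 the same helper yields a single interface, which matches the current single-EFA behavior.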

Note

P5 and P5e instances can provide up to 3,200 Gbps of networking bandwidth using 32 EFA interfaces per node. This change makes it possible to provision a distributed AI/ML training cluster by controlling maxNumber and efaCount.

IMPACT:

Consider the following scenario:

A customer wants to run a simulation workload using 2 hpc7a.96xlarge EC2 instances over MPI
The customer installs and configures Resource Connector with the AWS plug-in
The file awsprov_templates.json is modified to include a new template for the infrastructure above, and interfaceType is set to efa, per documentation
A job is submitted and launched successfully, but the measured bandwidth is half the advertised 300 Gbps
After further investigation, the customer notices that only one EFA interface was attached and no part of the documentation adds enough detail to help the customer fix it
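
The arithmetic behind this scenario can be sketched as follows. The function name is hypothetical, and the even split of bandwidth across interfaces is a simplifying assumption that happens to agree with the "half of 300 Gbps" observation above; real throughput depends on the workload and instance type.

```python
def expected_bandwidth_gbps(
    advertised_gbps: float, supported_efas: int, attached_efas: int
) -> float:
    """Estimate usable EFA bandwidth, assuming an even split per interface."""
    if not 1 <= attached_efas <= supported_efas:
        raise ValueError("attached_efas must be between 1 and supported_efas")
    return advertised_gbps * attached_efas / supported_efas


# hpc7a.96xlarge figures from the scenario: 300 Gbps advertised, 2 EFA interfaces.
one_efa = expected_bandwidth_gbps(300, supported_efas=2, attached_efas=1)
two_efas = expected_bandwidth_gbps(300, supported_efas=2, attached_efas=2)
print(one_efa, two_efas)
```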

Warning

The Resource Connector documentation may be missing some important details or pointers that can help customers better navigate a cloud-enabled solution:

Mentioning clearly that inline templates provision only one ENI or EFA interface depending on interfaceType
Some EC2 instances support multiple EFA interfaces as a means to increase/maximize inter-node networking bandwidth
As of now, to take advantage of multiple EFA interfaces, customers must pursue an LT or EC2 Fleet solution

Should this pull request be accepted, of course, the above should change. In case it isn't, I would still ask for a revision of the documentation to help customers.

Overall impact:

Inline templates can now provision clusters that maximize inter-node networking bandwidth
Infrastructure is provisioned dynamically, supporting rapid iteration without maintaining separate AWS resources (such as LTs, EC2 Fleets)
Accelerated lift to AWS, lower barrier to entry (run a high-performance MPI workload soon after RC installation)

HOW TO DETECT THE ERROR:

Start from the implementation in this pull request
Launch an MPI job across two EC2 instances (e.g. hpc7a.*) supporting N EFA interfaces; the job must scale with the available bandwidth
Inspect the deployed infrastructure and confirm only one EFA was provisioned and attached to each instance; benchmark the results and take note
Add the line "efaCount": 2 to awsprov_templates.json, directly under the "interfaceType" entry
Launch the exact same job, but with this minimal change to the provisioning template
Inspect the deployed infrastructure once more, confirming that two EFAs are now provisioned per instance; benchmark again and compare the results
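
The inspection steps above can be partly automated. The sketch below counts EFA interfaces per instance from a dictionary shaped like an EC2 DescribeInstances response (for example, the return value of boto3's `describe_instances`). The function name and the sample data are hypothetical, and it assumes each attached interface reports an `InterfaceType` field.

```python
def count_efa_interfaces(describe_instances_response: dict) -> dict[str, int]:
    """Map each instance ID to the number of attached EFA interfaces."""
    counts: dict[str, int] = {}
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            efas = [
                nic
                for nic in instance.get("NetworkInterfaces", [])
                if nic.get("InterfaceType") == "efa"
            ]
            counts[instance["InstanceId"]] = len(efas)
    return counts


# Hand-written sample; the instance IDs are placeholders.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-aaaaaaaaaaaaaaaaa",
                    "NetworkInterfaces": [{"InterfaceType": "efa"}],
                },
                {
                    "InstanceId": "i-bbbbbbbbbbbbbbbbb",
                    "NetworkInterfaces": [
                        {"InterfaceType": "efa"},
                        {"InterfaceType": "efa"},
                    ],
                },
            ]
        }
    ]
}

efa_counts = count_efa_interfaces(sample_response)
print(efa_counts)
```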

HOW TO WORKAROUND/AVOID THE ERROR:

The only "workaround" is to delegate the infrastructure provisioning to a Launch Template or EC2 Fleet, which can significantly increase the complexity of the solution.

HOW TO RECOVER FROM THE FAILURE:

N/A

ROOT CAUSE OF THE PROBLEM:

Provisioning multiple EFA interfaces is not implemented for inline templates.

UNIT TEST CASES:

TBD after discussing this PR with the team.

TEST SUGGESTIONS TO QA:

TBD after discussing this PR with the team.

POSSIBLE IMPACT FEATURES:

May introduce a regression in the existing node provisioning code. Effort was made to keep the solution backwards compatible: if no efaCount parameter is specified, it defaults to 1, retaining the existing behavior.
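
As an illustration of the default described above, a minimal Python sketch follows. It is not the plug-in's implementation: `resolve_efa_count` is a hypothetical name, and rejecting non-integer or non-positive values is an assumed validation rule.

```python
def resolve_efa_count(template: dict) -> int:
    """Return the template's efaCount, defaulting to 1 when it is absent."""
    value = template.get("efaCount", 1)
    # Assumed rule: only positive integers are valid (bool is excluded explicitly
    # because it is a subclass of int in Python).
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError(f"invalid efaCount: {value!r}")
    return value


legacy_count = resolve_efa_count({"interfaceType": "efa"})
multi_count = resolve_efa_count({"interfaceType": "efa", "efaCount": 2})
print(legacy_count, multi_count)
```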

For your review.

Thank you,

Tiago C T
[email protected]
Sr. Solutions Architect at AWS

tiagt · 2025-01-19T13:32:37Z

For testing, I'm including the following templates file, which also demonstrates backwards compatibility: the first template omits efaCount and the second sets it to 2.

{
  "templates": [
    {
      "templateId": "Template-HPC7A-1xEFA",
      "maxNumber": 2,
      "attributes": {
        "type": [ "String", "X86_64" ],
        "ncores": [ "Numeric", "192" ],
        "ncpus": [ "Numeric", "192" ],
        "mem": [ "Numeric", "786432" ],
        "awshost": [ "Boolean", "1" ]
      },
      "imageId": "ami-00000000000000000",
      "subnetId": "subnet-00000000000000000",
      "vmType": "hpc7a.96xlarge",
      "securityGroupIds": [
        "sg-00000000000000000",
        "sg-00000000000000001"
      ],
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_1x_efa",
      "userData": "zone=us-east-2b",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa"
    },
    {
      "templateId": "Template-HPC7A-2xEFA",
      "maxNumber": 2,
      "attributes": {
        "type": [ "String", "X86_64" ],
        "ncores": [ "Numeric", "192" ],
        "ncpus": [ "Numeric", "192" ],
        "mem": [ "Numeric", "786432" ],
        "awshost": [ "Boolean", "1" ]
      },
      "imageId": "ami-00000000000000000",
      "subnetId": "subnet-00000000000000000",
      "vmType": "hpc7a.96xlarge",
      "securityGroupIds": [
        "sg-00000000000000000",
        "sg-00000000000000001"
      ],
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_2x_efa",
      "userData": "zone=us-east-2b",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa",
      "efaCount": 2
    }
  ]
}
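
To confirm that the two test templates differ only in the intended fields, a short Python check such as the one below could be used. It works on abbreviated copies of the templates (only a subset of keys is reproduced), and `differing_keys` is a hypothetical helper.

```python
import json

# Abbreviated copies of the two templates above; only a subset of keys is kept.
TEMPLATES_JSON = """
{
  "templates": [
    {
      "templateId": "Template-HPC7A-1xEFA",
      "maxNumber": 2,
      "vmType": "hpc7a.96xlarge",
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_1x_efa",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa"
    },
    {
      "templateId": "Template-HPC7A-2xEFA",
      "maxNumber": 2,
      "vmType": "hpc7a.96xlarge",
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_2x_efa",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa",
      "efaCount": 2
    }
  ]
}
"""

_MISSING = object()  # sentinel so an absent key never equals a present one


def differing_keys(a: dict, b: dict) -> list[str]:
    """Return the sorted keys whose values differ or that exist in only one dict."""
    return sorted(
        key
        for key in a.keys() | b.keys()
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    )


single, dual = json.loads(TEMPLATES_JSON)["templates"]
changed = differing_keys(single, dual)
print(changed)
```

Apart from the identifying templateId and instanceTags, efaCount is the only key that changes between the two templates.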

yuch7 · 2025-01-20T18:12:53Z

In Testing

tiagt and others added 2 commits August 9, 2024 10:24

Add multi-EFA support via an 'efaCount' parameter in awsprov_templates.json.

282623b

Merge branch 'IBMSpectrumComputing:main' into main

f050c4d

tiagt requested review from zhchgbj, JishanXing, liyancn, nanguanqi, yuch7, qi622 and IBM-TYan as code owners January 19, 2025 12:40

tiagt changed the title ~~Support for attaching multiple EFA devices via awsprov_templates.json~~ Support for attaching multiple EFA interfaces via awsprov_templates.json Jan 19, 2025

yuch7 added the enhancement New feature or request label Jan 20, 2025
