Support for attaching multiple EFA interfaces via awsprov_templates.json #76

Open: tiagt wants to merge 2 commits into main

@tiagt tiagt commented Jan 19, 2025

What type of PR is this?

/kind feature

Which issue(s) this PR fixes:

N/A

DESCRIPTION:

Customers using the AWS cloud provider plug-in are currently unable to leverage the entire available Elastic Fabric Adapter (EFA) bandwidth for clusters provisioned by inline templates. Instead, customers must use EC2 Launch Templates (finer-grained control) or EC2 Fleet (massively parallelized workloads), which require deeper knowledge of AWS services and maintaining separate resources.

Inline provisioning templates are a convenient way of adding capacity just-in-time. They're straightforward to use, repeatable, and give engineers a development environment where they can iterate by focusing on the workload, while the infrastructure is represented by a single file (awsprov_templates.json). A Launch Template, by contrast, is an AWS resource that must be created ahead of time, referenced by an ARN, and maintained/updated separately.

This pull request proposes a new awsprov_templates.json parameter called efaCount. Where maxNumber controls how many instances to launch, efaCount controls how many EFA interfaces to create and attach to each node. For customers just starting with Resource Connector (RC), and for seasoned ones too, this inline configuration covers a large number of use cases with a fraction of the complexity.

Represented below is the typical configuration for an EFA-enabled cluster, with the added efaCount parameter.

{
    "interfaceType": "efa",
    "placementGroupName": "pg",
    "efaCount": 2
}

Note

P5 and P5e instances can provide up to 3,200 Gbps of networking bandwidth using 32 EFA interfaces per node. This change makes it possible to provision a distributed AI/ML training cluster by controlling maxNumber and efaCount.
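
As a rough illustration of what efaCount implies at the EC2 API level, the sketch below builds the NetworkInterfaces list for a RunInstances request. It is not the plug-in's implementation. The function name build_efa_interfaces is hypothetical, and the layout (one EFA per network card, device index 0 for the primary interface and 1 for the others) follows the pattern AWS documents for instances with multiple network cards; whether this PR maps efaCount the same way is an assumption.

```python
from typing import Any, Dict, List


def build_efa_interfaces(
    efa_count: int, subnet_id: str, security_group_ids: List[str]
) -> List[Dict[str, Any]]:
    """Build a NetworkInterfaces list with one EFA per network card.

    Hypothetical helper: the card/device layout is an assumption, not
    taken from the pull request's code.
    """
    if efa_count < 1:
        raise ValueError("efa_count must be at least 1")
    interfaces = []
    for card in range(efa_count):
        interfaces.append({
            "NetworkCardIndex": card,
            # Primary interface sits at device index 0; the rest use 1.
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": list(security_group_ids),
            "DeleteOnTermination": True,
        })
    return interfaces


if __name__ == "__main__":
    nics = build_efa_interfaces(
        2, "subnet-00000000000000000", ["sg-00000000000000000"]
    )
    for nic in nics:
        print(nic["NetworkCardIndex"], nic["DeviceIndex"], nic["InterfaceType"])
```

The resulting list could be passed as the NetworkInterfaces argument of an EC2 RunInstances call. The instance type must expose at least efa_count network cards for such a request to succeed.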

IMPACT:

Consider the following scenario:

  • A customer wants to run a simulation workload using 2 hpc7a.96xlarge EC2 instances over MPI
  • The customer installs and configures Resource Connector with the AWS plug-in
  • The file awsprov_templates.json is modified to include a new template for the infrastructure above, and interfaceType is set to efa, per documentation
  • A job is submitted and launches successfully, but the measured bandwidth is half the advertised 300 Gbps
  • After further investigation, the customer notices that only one EFA interface was attached, and the documentation does not provide enough detail to help the customer fix it

Warning

The Resource Connector documentation appears to be missing some details that would help customers navigate a cloud-enabled solution. It could state clearly that:

  1. Inline templates provision only one ENI or EFA interface, depending on interfaceType
  2. Some EC2 instances support multiple EFA interfaces as a means to increase or maximize inter-node networking bandwidth
  3. As of now, taking advantage of multiple EFA interfaces requires a Launch Template (LT) or EC2 Fleet solution

Should this pull request be accepted, the points above would need to change accordingly. If it isn't, I would still ask for a revision of the documentation to help customers.

Overall impact:

  • Inline templates can now provision clusters that maximize inter-node networking bandwidth
  • Infrastructure is provisioned dynamically, supporting rapid iteration without maintaining separate AWS resources (such as LTs, EC2 Fleets)
  • Accelerated lift to AWS, lower barrier to entry (run a high-performance MPI workload soon after RC installation)

HOW TO DETECT THE ERROR:

  1. Start from the implementation in this pull request
  2. Launch an MPI job across two EC2 instances (e.g. hpc7a.*) that support N EFA interfaces; the job must scale with the available bandwidth
  3. Inspect the deployed infrastructure and confirm that only one EFA interface was provisioned and attached to each instance; benchmark the results and take note
  4. Add the line "efaCount": 2 to the template in awsprov_templates.json, directly under "interfaceType" (adding a comma after the interfaceType entry if it was previously the last one)
  5. Launch the exact same job, with this minimal change to the provisioning template as the only difference
  6. Inspect the deployed infrastructure once more and confirm that two EFA interfaces are now provisioned per instance; benchmark again and compare
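
The inspection in steps 3 and 6 can be scripted. The sketch below counts EFA interfaces per instance from a response shaped like the output of EC2 DescribeInstances, relying on the InterfaceType field of each attached network interface. The function name count_efa_per_instance and the sample data are illustrative only and not part of this pull request.

```python
from typing import Any, Dict


def count_efa_per_instance(describe_response: Dict[str, Any]) -> Dict[str, int]:
    """Map each instance ID to the number of attached EFA interfaces."""
    counts: Dict[str, int] = {}
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            nics = instance.get("NetworkInterfaces", [])
            counts[instance["InstanceId"]] = sum(
                1 for nic in nics if nic.get("InterfaceType") == "efa"
            )
    return counts


if __name__ == "__main__":
    # Hand-written sample mimicking the DescribeInstances response shape.
    sample = {
        "Reservations": [
            {
                "Instances": [
                    {
                        "InstanceId": "i-00000000000000001",
                        "NetworkInterfaces": [{"InterfaceType": "efa"}],
                    },
                    {
                        "InstanceId": "i-00000000000000002",
                        "NetworkInterfaces": [
                            {"InterfaceType": "efa"},
                            {"InterfaceType": "efa"},
                        ],
                    },
                ]
            }
        ]
    }
    for instance_id, efa_count in count_efa_per_instance(sample).items():
        print(instance_id, efa_count)
```

With boto3 installed and credentials configured, the input could come from boto3.client("ec2").describe_instances(...), filtered to the instances launched by the template under test.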

HOW TO WORKAROUND/AVOID THE ERROR:

The only "workaround" is to delegate the infrastructure provisioning to a Launch Template or EC2 Fleet, which can significantly increase the complexity of the solution.

HOW TO RECOVER FROM THE FAILURE:

N/A

ROOT CAUSE OF THE PROBLEM:

Provisioning multiple EFA interfaces is not implemented for inline templates.

UNIT TEST CASES:

TBD after discussing this PR with the team.

TEST SUGGESTIONS TO QA:

TBD after discussing this PR with the team.

POSSIBLE IMPACT FEATURES:

This change may introduce regressions in the existing node provisioning code. Care was taken to keep the solution backwards compatible: if no efaCount parameter is specified, it defaults to 1, retaining the existing behavior.
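
The defaulting rule can be summarized in a few lines. This is a minimal sketch of the described behavior, not the plug-in's code: the helper name effective_efa_count is hypothetical, and rejecting non-positive or non-integer values is an assumption the pull request does not state.

```python
from typing import Any, Dict


def effective_efa_count(template: Dict[str, Any]) -> int:
    """Return the number of EFA interfaces a template asks for.

    A missing efaCount falls back to 1, which matches the behavior
    before this change. The validation below is an assumption.
    """
    raw = template.get("efaCount", 1)
    if isinstance(raw, bool) or not isinstance(raw, int) or raw < 1:
        raise ValueError(f"efaCount must be a positive integer, got {raw!r}")
    return raw


if __name__ == "__main__":
    print(effective_efa_count({"interfaceType": "efa"}))
    print(effective_efa_count({"interfaceType": "efa", "efaCount": 2}))
```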

For your review.

Thank you,

Tiago C T
Sr. Solutions Architect at AWS

@tiagt changed the title from "Support for attaching multiple EFA devices via awsprov_templates.json" to "Support for attaching multiple EFA interfaces via awsprov_templates.json" on Jan 19, 2025

tiagt (Author) commented Jan 19, 2025

For testing, I'm including the following templates file, which also demonstrates backwards compatibility: the first template omits efaCount, while the second sets it to 2.

{
  "templates": [
    {
      "templateId": "Template-HPC7A-1xEFA",
      "maxNumber": 2,
      "attributes": {
        "type": [ "String", "X86_64" ],
        "ncores": [ "Numeric", "192" ],
        "ncpus": [ "Numeric", "192" ],
        "mem": [ "Numeric", "786432" ],
        "awshost": [ "Boolean", "1" ]
      },
      "imageId": "ami-00000000000000000",
      "subnetId": "subnet-00000000000000000",
      "vmType": "hpc7a.96xlarge",
      "securityGroupIds": [
        "sg-00000000000000000",
        "sg-00000000000000001"
      ],
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_1x_efa",
      "userData": "zone=us-east-2b",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa"
    },
    {
      "templateId": "Template-HPC7A-2xEFA",
      "maxNumber": 2,
      "attributes": {
        "type": [ "String", "X86_64" ],
        "ncores": [ "Numeric", "192" ],
        "ncpus": [ "Numeric", "192" ],
        "mem": [ "Numeric", "786432" ],
        "awshost": [ "Boolean", "1" ]
      },
      "imageId": "ami-00000000000000000",
      "subnetId": "subnet-00000000000000000",
      "vmType": "hpc7a.96xlarge",
      "securityGroupIds": [
        "sg-00000000000000000",
        "sg-00000000000000001"
      ],
      "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_2x_efa",
      "userData": "zone=us-east-2b",
      "placementGroupName": "pg-hpc7a-cluster",
      "interfaceType": "efa",
      "efaCount": 2
    }
  ]
}
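
For a fair before/after benchmark, the two templates should differ only in the fields under test. The sketch below compares two template entries and reports the keys whose values differ. The function name differing_keys is hypothetical, and the inline dictionaries are abbreviated copies of the templates above, used only to keep the example self-contained.

```python
from typing import Any, Dict, Set


def differing_keys(a: Dict[str, Any], b: Dict[str, Any]) -> Set[str]:
    """Return the keys that are missing on one side or hold different values."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


if __name__ == "__main__":
    # Abbreviated copies of the two templates from the file above.
    one_efa = {
        "templateId": "Template-HPC7A-1xEFA",
        "maxNumber": 2,
        "vmType": "hpc7a.96xlarge",
        "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_1x_efa",
        "placementGroupName": "pg-hpc7a-cluster",
        "interfaceType": "efa",
    }
    two_efa = {
        "templateId": "Template-HPC7A-2xEFA",
        "maxNumber": 2,
        "vmType": "hpc7a.96xlarge",
        "instanceTags": "group=hpc7a_tests;Name=lsf_worker_hpc7a_2x_efa",
        "placementGroupName": "pg-hpc7a-cluster",
        "interfaceType": "efa",
        "efaCount": 2,
    }
    print(sorted(differing_keys(one_efa, two_efa)))
```

To run the same check against the real file, the entries under "templates" could be loaded with json.load and passed to the function in place of the inline dictionaries.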

@yuch7 added the enhancement label (New feature or request) on Jan 20, 2025

yuch7 (Collaborator) commented Jan 20, 2025

In Testing
