Support for attaching multiple EFA interfaces via awsprov_templates.json
#76
What type of PR is this?
/kind feature
Which issue(s) this PR fixes:
N/A
DESCRIPTION:
Customers using the AWS cloud provider plug-in are currently unable to leverage the entire available Elastic Fabric Adapter (EFA) bandwidth for clusters provisioned by inline templates. Instead, customers must use EC2 Launch Templates (finer-grained control) or EC2 Fleet (massively parallelized workloads), which require deeper knowledge of AWS services and maintaining separate resources.
Inline provisioning templates are a convenient way of adding capacity just-in-time. They're straightforward to use, repeatable, and give engineers a development environment where they can iterate by focusing on the workload, while the infrastructure is represented by a single file (`awsprov_templates.json`). A Launch Template, for example, is an AWS resource that must be created ahead of time, referenced by an ARN, and maintained/updated separately.
This pull request proposes to add a new parameter to `awsprov_templates.json` called `efaCount`. If `maxNumber` controls how many instances to launch, `efaCount` controls how many EFA interfaces to create and attach at each node. For customers just starting with Resource Connector (RC), or even for seasoned ones, this inline configuration seems to unlock/cover a large number of use-cases with a fraction of the complexity.
Represented below is the typical configuration for an EFA-enabled cluster, with the added `efaCount` parameter.
Note
P5 and P5e instances can provide up to 3,200 Gbps of networking bandwidth using 32 EFA interfaces per node. This change makes it possible to provision a distributed AI/ML training cluster by controlling `maxNumber` and `efaCount`.
IMPACT:
Consider the following scenario:
- `hpc7a.96xlarge` EC2 instances over MPI
- `awsprov_templates.json` is modified to include a new template for the infrastructure above, and `interfaceType` is set to `efa`, per documentation
Warning
The Resource Connector documentation may be missing some important details or pointers that can help customers better navigate a cloud-enabled solution:
- `interfaceType`
Should this pull request be accepted, of course, the above should change. In case it isn't, I would still ask for a revision of the documentation to help customers.
Overall impact:
HOW TO DETECT THE ERROR:
- (`hpc7a.*`) supporting N EFA interfaces; the job must scale with the available bandwidth
- `awsprov_templates.json` under `"interfaceType"`...
  `-"efaCount": 2,`
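For illustration only, here is a minimal sketch of what a template entry carrying the proposed parameter might look like, and how a reader could check which templates request more than one EFA interface. Only `maxNumber`, `interfaceType`, and `efaCount` are named in this PR. The `templates` wrapper, `templateId`, `vmType`, all values, and the helper names `efa_count` and `summarize` are assumptions made for the example, and a real template needs further fields (image, subnet, security groups, and so on) that are omitted here. Returning 0 for non-EFA templates is also an assumption of this sketch, not behavior described by the PR.

```python
import json

# Hypothetical template file content. Field names other than maxNumber,
# interfaceType and efaCount are assumptions; values are placeholders.
TEMPLATES_JSON = """
{
  "templates": [
    {
      "templateId": "efa-example",
      "maxNumber": 4,
      "vmType": "hpc7a.96xlarge",
      "interfaceType": "efa",
      "efaCount": 2
    }
  ]
}
"""


def efa_count(template: dict) -> int:
    """Return how many EFA interfaces a template entry asks for.

    Assumption: templates without interfaceType "efa" request none, and a
    missing efaCount means 1 (the default described in this PR).
    """
    if template.get("interfaceType") != "efa":
        return 0
    count = int(template.get("efaCount", 1))
    if count < 1:
        raise ValueError("efaCount must be at least 1")
    return count


def summarize(doc: dict) -> dict:
    """Map each templateId to (maxNumber, EFA interfaces per node)."""
    return {
        t["templateId"]: (t["maxNumber"], efa_count(t))
        for t in doc["templates"]
    }


if __name__ == "__main__":
    print(summarize(json.loads(TEMPLATES_JSON)))
```

With the sample entry above, the summary pairs 4 instances with 2 EFA interfaces per node, which is the combination the reproduction steps are meant to exercise.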
HOW TO WORKAROUND/AVOID THE ERROR:
The only "workaround" is to delegate the infrastructure provisioning to a Launch Template or EC2 Fleet, which can significantly increase the complexity of the solution.
HOW TO RECOVER FROM THE FAILURE:
N/A
ROOT CAUSE OF THE PROBLEM:
Provisioning multiple EFA interfaces is not implemented for inline templates.
UNIT TEST CASES:
TBD after discussing this PR with the team.
TEST SUGGESTIONS TO QA:
TBD after discussing this PR with the team.
POSSIBLE IMPACT FEATURES:
Possibly introduces a regression in the existing node provisioning code. Effort was placed on making the solution backwards compatible. If no `efaCount` parameter is specified, then it's 1 by default, retaining the existing behavior.
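The backwards-compatible default can be pictured with a small sketch that expands `efaCount` into per-interface specifications. This is not the plug-in's actual code; the PR does not show how interfaces are laid out. The function name `build_network_interfaces`, the placeholder subnet ID, and the layout (one EFA per network card, with the primary interface at device index 0 and each additional one at device index 1 on its own card) are assumptions, modeled on the `NetworkInterfaces` fields of the EC2 RunInstances request.

```python
def build_network_interfaces(efa_count=1, subnet_id="subnet-EXAMPLE",
                             security_group_ids=()):
    """Build one interface spec per requested EFA.

    Assumed layout (not confirmed by the PR): interface i is placed on
    network card i; the first uses DeviceIndex 0, the rest DeviceIndex 1.
    """
    if efa_count < 1:
        raise ValueError("efa_count must be at least 1")
    specs = []
    for card in range(efa_count):
        specs.append({
            "NetworkCardIndex": card,
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "SubnetId": subnet_id,               # placeholder value
            "Groups": list(security_group_ids),  # security group IDs
        })
    return specs


if __name__ == "__main__":
    # No efaCount given: a single EFA interface, as before the change.
    print(len(build_network_interfaces()))
    # efaCount = 2: one interface per network card.
    for spec in build_network_interfaces(2):
        print(spec["NetworkCardIndex"], spec["DeviceIndex"])
```

Calling the function with no argument yields exactly one interface, which mirrors the existing single-EFA behavior that the default is meant to preserve.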