Refactor GPU node spec in radixconfig #1004

Open
nilsgstrabo opened this issue Dec 6, 2023 · 0 comments

nilsgstrabo commented Dec 6, 2023

We should simplify the node spec (https://www.radix.equinor.com/references/reference-radix-config/#node) in radixconfig.yaml.
Today the user sets gpu and gpuCount, and radix-operator translates these into a toleration and a nodeAffinity selector.
The current node spec does not support defining VM types, e.g. ones with more or faster memory and CPUs. It is bound to GPU types only.

I suggest we deprecate/delete the current node spec and replace it with a "simple" string value, e.g. nodeType: nv-v100-2, where nv-v100-2 is in a fixed list of node types (each mapped to an Azure VM size). We can then add new nodepools with new VM types and add each new VM type to the list of supported types.
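To make the "fixed list" idea concrete, here is a minimal sketch of how a nodeType value could be validated and resolved to a VM size. It is written in Python purely for illustration (not as radix-operator code); `SUPPORTED_NODE_TYPES` and `resolve_vm_size` are hypothetical names, and the VM size paired with nv-v100-2 is a placeholder, not a confirmed mapping.

```python
# Hypothetical fixed list of supported node types. Only "nv-v100-2" comes from
# the proposal; the VM size it maps to is a placeholder for illustration.
SUPPORTED_NODE_TYPES = {
    "nv-v100-2": "Standard_NC12s_v3",
}


def resolve_vm_size(node_type: str) -> str:
    """Return the VM size for a node type, rejecting values not in the list."""
    try:
        return SUPPORTED_NODE_TYPES[node_type]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_NODE_TYPES))
        raise ValueError(
            f"unsupported nodeType {node_type!r}; supported: {supported}"
        ) from None
```

Supporting a new nodepool would then only require adding one entry to the list, and an unknown nodeType fails with a message that names the supported values.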

Use labels to direct the pod to the correct nodepool.
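As a rough sketch of the label-based approach, the snippet below builds the scheduling part of a pod spec from a nodeType value. The label key `example.radix/node-type` and the function `scheduling_for` are hypothetical. The toleration is also an assumption: it only applies if the nodepool is tainted with the same key and value, which this issue does not specify.

```python
# Hypothetical label key; the real key would be chosen by the Radix team.
NODE_TYPE_LABEL = "example.radix/node-type"


def scheduling_for(node_type: str) -> dict:
    """Build pod spec scheduling fields that target one node type."""
    return {
        # Only nodes labelled with this node type are eligible for the pod.
        "nodeSelector": {NODE_TYPE_LABEL: node_type},
        # Assumption: the nodepool carries a matching NoSchedule taint.
        "tolerations": [
            {
                "key": NODE_TYPE_LABEL,
                "operator": "Equal",
                "value": node_type,
                "effect": "NoSchedule",
            }
        ],
    }
```

With this shape, `scheduling_for("nv-v100-2")` yields a nodeSelector that matches nodes labelled `example.radix/node-type=nv-v100-2`, so a single label per nodepool would be enough to route pods.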

DoD:
In radixconfig, it must be possible to add extra information to select different node types.

E.g. bronze = VM type, memory, CPU

@emirgens emirgens added the 🤔 refinement needed This needs more details label Jan 9, 2024
@emirgens emirgens removed the 🤔 refinement needed This needs more details label Jan 30, 2024