This module creates a Slurm controller node via the SchedMD/slurm-gcp slurm_controller_instance and slurm_instance_template modules.
More information about Slurm on GCP can be found at the project's GitHub page and in the Slurm on Google Cloud User Guide.
The user guide provides detailed instructions on customizing and enhancing the Slurm on GCP cluster as well as recommendations on configuring the controller for optimal performance at different scales.
WARNING: The variables enable_reconfigure, enable_cleanup_compute and enable_cleanup_subscriptions, if set to true, require additional dependencies to be installed on the system running terraform apply. Python3 (>=3.6.0, <4.0.0) must be installed along with the pip packages listed in the requirements.txt file of SchedMD/slurm-gcp. See the documentation below.
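The Python version constraint above can be checked before running terraform apply. The following is a minimal sketch rather than anything shipped with the module: the function name python_version_supported is hypothetical, and it inspects only the major and minor fields of a version string.

```shell
# Hypothetical helper (not part of slurm-gcp or the HPC Toolkit).
# Succeeds when a "MAJOR.MINOR[.PATCH]" string satisfies >=3.6.0, <4.0.0,
# which reduces to: major version is 3 and minor version is at least 6.
python_version_supported() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}  # second field
  # Reject empty or non-numeric fields.
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  case "$minor" in ''|*[!0-9]*) return 1 ;; esac
  [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]
}

# Possible usage, assuming python3 is on the PATH:
#   python_version_supported "$(python3 -c 'import platform; print(platform.python_version())')" \
#     || echo "Unsupported Python version" >&2
```

The pip packages from requirements.txt still need to be installed separately; this check covers only the interpreter version.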
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    machine_type: c2-standard-8
This creates a controller node with the following attributes:
- connected to the primary subnetwork of network1
- the filesystem with the ID homefs (defined elsewhere in the blueprint) mounted
- one partition with the ID compute_partition (defined elsewhere in the blueprint)
- machine type upgraded from the default c2-standard-4 to c2-standard-8
For a complete example using this module, see slurm-gcp-v5-cluster.yaml.
The schedmd-slurm-gcp-v5-controller module supports the reconfiguration of
partitions and Slurm configuration in a running, active cluster. This option is
activated through the enable_reconfigure setting:
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  settings:
    enable_reconfigure: true
This option has some additional requirements:
- The Pub/Sub API must be activated in the target project:
  gcloud services enable pubsub.googleapis.com --project "<<PROJECT_ID>>"
- The authenticated user in the local development environment (or where terraform apply is called) must have the Pub/Sub Admin (roles/pubsub.admin) IAM role.
- Python and some Python packages need to be installed with pip in the local development environment deploying the cluster. One can use the following commands:
  wget https://raw.githubusercontent.com/SchedMD/slurm-gcp/5.6.2/scripts/requirements.txt
  pip3 install -r requirements.txt --user
  For more information, see the description of this module.
- The project in your gcloud config must match the project the cluster is being deployed onto due to a known issue with the reconfigure scripts. To set your default config project, run the following command:
  gcloud config set core/project "<<PROJECT_ID>>"
  If the gcloud project ID is not properly set, you may see an error during terraform deployment similar to the following:
  google.api_core.exceptions.NotFound: 404 Resource not found Could not find in SpannerConfigStore: TopicByProjectIdAndName(project_id=<incorrect project #>, topic_name=<topic name>)
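The project-matching requirement above can be verified before deploying. The following is a minimal sketch, assuming gcloud is installed and on the PATH; the function name check_gcloud_project is hypothetical and not part of the module.

```shell
# Hypothetical pre-flight check (not part of slurm-gcp or the HPC Toolkit).
# Succeeds only when the active gcloud config project equals the project
# the cluster will be deployed onto, passed as the first argument.
check_gcloud_project() {
  target_project=$1
  # An empty target can never be a valid match.
  [ -n "$target_project" ] || return 1
  # Read the project from the active gcloud configuration.
  configured_project=$(gcloud config get-value core/project 2>/dev/null)
  if [ "$configured_project" = "$target_project" ]; then
    return 0
  fi
  echo "gcloud project is '$configured_project', expected '$target_project'" >&2
  echo "Run: gcloud config set core/project $target_project" >&2
  return 1
}

# Possible usage before terraform apply:
#   check_gcloud_project "my-project-id" || exit 1
```

Running a check like this before terraform apply would surface a mismatch early instead of through the 404 error shown above. It does not verify the Pub/Sub API or IAM role requirements.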
For more information on creating valid custom images for the controller VM instance or for custom instance templates, see our vm-images.md documentation page.
More information on GPU support in Slurm on GCP and other HPC Toolkit modules can be found at docs/gpu-support.md.
For more information on how to configure an on-premises Slurm cluster with hybrid cloud partitions, see the schedmd-slurm-gcp-v5-hybrid module and the extended instructions in our docs.
The HPC Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Name | Version |
---|---|
terraform | >= 0.14.0 |
google | >= 3.83 |
Name | Version |
---|---|
google | >= 3.83 |
Name | Source | Version |
---|---|---|
slurm_controller_instance | github.com/SchedMD/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_controller_instance | 5.6.2 |
slurm_controller_template | github.com/SchedMD/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template | 5.6.2 |
Name | Type |
---|---|
google_compute_default_service_account.default | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
access_config | Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet. | list(object({ | [] | no |
additional_disks | List of maps of disks. | list(object({ | [] | no |
can_ip_forward | Enable IP forwarding, for NAT instances for example. | bool | false | no |
cgroup_conf_tpl | Slurm cgroup.conf template file path. | string | null | no |
cloud_parameters | cloud.conf options. | object({ | { | no |
cloudsql | Use this database instead of the one on the controller. server_ip: Address of the database server. user: The user to access the database as. password: The password, given the user, to access the given database (sensitive). db_name: The database to access. | object({ | null | no |
compute_startup_script | Startup script used by the compute VMs. | string | "" | no |
compute_startup_scripts_timeout | The timeout (seconds) applied to the compute_startup_script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
controller_startup_script | Startup script used by the controller VM. | string | "" | no |
controller_startup_scripts_timeout | The timeout (seconds) applied to the controller_startup_script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
deployment_name | Name of the deployment. | string | n/a | yes |
disable_controller_public_ips | If set to false, the controller will have a random public IP assigned to it. Ignored if access_config is set. | bool | true | no |
disable_default_mounts | Disable default global network storage from the controller: /usr/local/etc/slurm, /etc/munge, /home, /apps. Warning: If these are disabled, the slurm etc and munge dirs must be added manually, or some other mechanism must be used to synchronize the slurm conf files and the munge key across the cluster. | bool | false | no |
disable_smt | Disables Simultaneous Multi-Threading (SMT) on instance. | bool | true | no |
disk_auto_delete | Whether or not the boot disk should be auto-deleted. | bool | true | no |
disk_labels | Labels specific to the boot disk. These will be merged with var.labels. | map(string) | {} | no |
disk_size_gb | Boot disk size in GB. | number | 50 | no |
disk_type | Boot disk type, one of pd-ssd, local-ssd, or pd-standard. | string | "pd-ssd" | no |
enable_bigquery_load | Enable loading of cluster job usage into BigQuery. | bool | false | no |
enable_cleanup_compute | Enables automatic cleanup of compute nodes and resource policies (e.g. placement groups) managed by this module, when cluster is destroyed. NOTE: Requires Python and pip packages listed at the following link: https://github.com/SchedMD/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt WARNING: Toggling this may impact the running workload. Deployed compute nodes may be destroyed and their jobs will be requeued. | bool | false | no |
enable_cleanup_subscriptions | Enables automatic cleanup of Pub/Sub subscriptions managed by this module, when cluster is destroyed. NOTE: Requires Python and pip packages listed at the following link: https://github.com/SchedMD/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt WARNING: Toggling this may temporarily impact var.enable_reconfigure behavior. | bool | false | no |
enable_confidential_vm | Enable the Confidential VM configuration. Note: the instance image must support this option. | bool | false | no |
enable_devel | Enables development mode. Not for production use. | bool | false | no |
enable_oslogin | Enables Google Cloud os-login for user login and authentication for VMs. See https://cloud.google.com/compute/docs/oslogin | bool | true | no |
enable_reconfigure | Enables automatic Slurm reconfiguration when Slurm configuration changes (e.g. slurm.conf.tpl, partition details). Compute instances and resource policies (e.g. placement groups) will be destroyed to align with new configuration. NOTE: Requires Python and Google Pub/Sub API. WARNING: Toggling this will impact the running workload. Deployed compute nodes will be destroyed and their jobs will be requeued. | bool | false | no |
enable_shielded_vm | Enable the Shielded VM configuration. Note: the instance image must support this option. | bool | false | no |
epilog_scripts | List of scripts to be used for Epilog. Programs for the slurmd to execute on every node when a user's job completes. See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog. | list(object({ | [] | no |
gpu | GPU information. Type and count of GPU to attach to the instance template. See https://cloud.google.com/compute/docs/gpus for more details. type: the GPU type, e.g. nvidia-tesla-t4, nvidia-a100-80gb, nvidia-tesla-a100, etc. count: number of GPUs. If both 'var.gpu' and 'var.guest_accelerator' are set, 'var.gpu' will be used. | object({ | null | no |
guest_accelerator | Alternative method of providing 'var.gpu' with a consistent naming scheme to other HPC Toolkit modules. If both 'var.gpu' and 'var.guest_accelerator' are set, 'var.gpu' will be used. | list(object({ | null | no |
instance_image | Defines the image that will be used in the Slurm controller VM instance. This value is overridden if any of source_image, source_image_family or source_image_project are set. Expected fields: name: The name of the image. Mutually exclusive with family. family: The image family to use. Mutually exclusive with name. project: The project where the image is hosted. For more information on creating custom images that comply with Slurm on GCP see the "Slurm on GCP Custom Images" section in docs/vm-images.md. | map(string) | { | no |
instance_template | Self link to a custom instance template. If set, other VM definition variables such as machine_type and instance_image will be ignored in favor of the provided instance template. For more information on creating custom images for the instance template that comply with Slurm on GCP see the "Slurm on GCP Custom Images" section in docs/vm-images.md. | string | null | no |
labels | Labels, provided as a map. | map(string) | {} | no |
login_startup_scripts_timeout | The timeout (seconds) applied to the login startup script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
machine_type | Machine type to create. | string | "c2-standard-4" | no |
metadata | Metadata, provided as a map. | map(string) | {} | no |
min_cpu_platform | Specifies a minimum CPU platform. Applicable values are the friendly names of CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list: https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform | string | null | no |
network_ip | Private IP address to assign to the instance if desired. | string | "" | no |
network_self_link | Network to deploy to. Either network_self_link or subnetwork_self_link must be specified. | string | null | no |
network_storage | An array of network attached storage mounts to be configured on all instances. | list(object({ | [] | no |
on_host_maintenance | Instance availability policy. | string | "MIGRATE" | no |
partition | Cluster partitions as a list. | list(object({ | [] | no |
preemptible | Allow the instance to be preempted. | bool | false | no |
project_id | Project ID to create resources in. | string | n/a | yes |
prolog_scripts | List of scripts to be used for Prolog. Programs for the slurmd to execute whenever it is asked to run a job step from a new job allocation. See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog. | list(object({ | [] | no |
region | Region where the instances should be created. | string | null | no |
service_account | Service account to attach to the controller instance. If not set, the default compute service account for the given project will be used with the "https://www.googleapis.com/auth/cloud-platform" scope. | object({ | null | no |
shielded_instance_config | Shielded VM configuration for the instance. Note: not used unless enable_shielded_vm is 'true'. enable_integrity_monitoring: Compare the most recent boot measurements to the integrity policy baseline and return a pair of pass/fail results depending on whether they match or not. enable_secure_boot: Verify the digital signature of all boot components, and halt the boot process if signature verification fails. enable_vtpm: Use a virtualized trusted platform module, which is a specialized computer chip you can use to encrypt objects like keys and certificates. | object({ | { | no |
slurm_cluster_name | Cluster name, used for resource naming and Slurm accounting. If not provided, it will default to the first 8 characters of the deployment name (removing any invalid characters). | string | null | no |
slurm_conf_tpl | Slurm slurm.conf template file path. | string | null | no |
slurmdbd_conf_tpl | Slurm slurmdbd.conf template file path. | string | null | no |
source_image | The custom VM image. It is recommended to use instance_image instead. | string | "" | no |
source_image_family | The custom VM image family. It is recommended to use instance_image instead. | string | "" | no |
source_image_project | The project hosting the custom VM image. It is recommended to use instance_image instead. | string | "" | no |
static_ips | List of static IPs for VM instances. | list(string) | [] | no |
subnetwork_project | The project that subnetwork belongs to. | string | null | no |
subnetwork_self_link | Subnet to deploy to. Either network_self_link or subnetwork_self_link must be specified. | string | null | no |
tags | Network tag list. | list(string) | [] | no |
zone | Zone where the instances should be created. If not specified, instances will be spread across available zones in the region. | string | null | no |
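The slurm_cluster_name description says the default is derived from the first 8 characters of the deployment name with invalid characters removed. The sketch below illustrates one reading of that rule. It assumes that only lowercase letters and digits count as valid and that invalid characters are stripped before truncation; neither assumption is confirmed by this document, and the function name default_cluster_name is hypothetical.

```shell
# Hypothetical illustration (not the module's actual implementation).
# ASSUMPTION: valid characters are a-z and 0-9 only.
# ASSUMPTION: invalid characters are removed first, then the result is
# truncated to 8 characters.
default_cluster_name() {
  printf '%s' "$1" | tr -cd 'a-z0-9' | cut -c1-8
}

# Possible usage:
#   default_cluster_name "slurm-gcp-v5"   # prints slurmgcp
```

Setting slurm_cluster_name explicitly avoids relying on the derived value when predictable resource names matter.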
Name | Description |
---|---|
controller_instance_id | The server-assigned unique identifier of the controller compute instance. |