This directory contains a set of example blueprint files that can be fed into gHPC to create a deployment.
- Instructions
- Blueprint Descriptions
  - hpc-cluster-small.yaml
  - hpc-cluster-high-io.yaml
  - image-builder.yaml
  - cloud-batch.yaml
  - batch-mpi.yaml
  - lustre.yaml
  - slurm-gcp-v5-hpc-centos7.yaml
  - slurm-gcp-v5-ubuntu2004.yaml
  - slurm-gcp-v5-high-io.yaml
  - hpc-cluster-intel-select.yaml
  - daos-cluster.yaml
  - daos-slurm.yaml
  - hpc-cluster-amd-slurmv5.yaml
  - quantum-circuit-simulator.yaml
  - spack-gromacs.yaml
  - omnia-cluster.yaml
  - hpc-cluster-small-sharedvpc.yaml
  - hpc-cluster-localssd.yaml
  - htcondor-pool.yaml
  - starccm-tutorial.yaml
  - fluent-tutorial.yaml
- Blueprint Schema
- Writing an HPC Blueprint
- Variables
Ensure the project_id, zone, and region deployment variables are set correctly under vars before using an example blueprint.
NOTE: Deployment variables defined under vars are automatically passed to modules if the modules have an input that matches the variable name.
The following block configures Terraform to use an existing GCS bucket to store and manage the Terraform state. Add your own bucket name in place of <<BUCKET_NAME>> and (optionally) a service account in place of <<SERVICE_ACCOUNT>> in the configuration. If not set, the Terraform state will be stored locally within the generated deployment directory.
Add this block to the top-level of your blueprint:
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: <<BUCKET_NAME>>
    impersonate_service_account: <<SERVICE_ACCOUNT>>
You can also set the configuration from the CLI in the create and expand subcommands:
./ghpc create examples/hpc-cluster-small.yaml \
--vars "project_id=${GOOGLE_CLOUD_PROJECT}" \
--backend-config "bucket=${GCS_BUCKET}"
NOTE: The --backend-config argument supports a comma-separated list of name=value variables to set the Terraform backend configuration in blueprints. This feature only supports variables of string type. If you set the configuration in both the blueprint and the CLI, the tool uses the values from the CLI. "gcs" is set as the type by default.
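The precedence rule in the note above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the ghpc implementation: the function names parse_backend_config and merge_backend are hypothetical, and treating a missing type as "gcs" is an assumption drawn from the last sentence of the note.

```python
def parse_backend_config(arg: str) -> dict[str, str]:
    """Parse a comma-separated list of name=value pairs into a dict of strings."""
    config = {}
    for pair in arg.split(","):
        name, sep, value = pair.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected name=value, got {pair!r}")
        config[name.strip()] = value.strip()
    return config


def merge_backend(blueprint_backend: dict, cli_arg: str) -> dict:
    """Combine blueprint and CLI backend settings; CLI values win on conflict."""
    merged = {
        # Assumption: fall back to "gcs" when the blueprint sets no type.
        "type": blueprint_backend.get("type", "gcs"),
        "configuration": dict(blueprint_backend.get("configuration", {})),
    }
    merged["configuration"].update(parse_backend_config(cli_arg))
    return merged


# Hypothetical bucket names, used only for illustration.
blueprint = {"type": "gcs", "configuration": {"bucket": "bucket-from-blueprint"}}
result = merge_backend(blueprint, "bucket=bucket-from-cli,prefix=state")
print(result["configuration"]["bucket"])  # bucket-from-cli
```

All values stay strings, which mirrors the restriction that only string-typed variables are supported.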
The example blueprints listed below labeled with the core badge are located in this folder and are developed and tested by the HPC Toolkit team directly.
The community blueprints are contributed by the community (including the HPC Toolkit team, partners, etc.) and are labeled with the community badge. The community blueprints are located in the community folder.
Blueprints that are still in development and less stable are also labeled with the experimental badge.
Creates a basic auto-scaling Slurm cluster with mostly default settings. The blueprint also creates a new VPC network, and a filestore instance mounted to /home.
There are 2 partitions in this example: debug and compute. The debug partition uses n2-standard-2 VMs, which should work out of the box without needing to request additional quota. The purpose of the debug partition is to make sure that first time users are not immediately blocked by quota limitations.
There is a compute partition that achieves higher performance. Any performance analysis should be done on the compute partition. By default it uses c2-standard-60 VMs with placement groups enabled. You may need to request additional quota for C2 CPUs in the region you are deploying in. You can select the compute partition using the -p compute argument when running srun.
For this example the following is needed in the selected region:
- Cloud Filestore API: Basic HDD (Standard) capacity (GB): 1,024 GB
- Compute Engine API: Persistent Disk SSD (GB): ~50 GB
- Compute Engine API: Persistent Disk Standard (GB): ~20 GB static + 20 GB/node up to 500 GB
- Compute Engine API: N2 CPUs: 10
- Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
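The C2 CPU figure in the list above is simple arithmetic, sketched here in Python. The constants are taken from the list. The node count of 20 is inferred from the stated upper bound of 1,204 and is an assumption, not a value read from the blueprint.

```python
CONTROLLER_C2_CPUS = 4   # controller node, from the quota list
C2_CPUS_PER_NODE = 60    # one c2-standard-60 VM
MAX_C2_CPUS = 1204       # upper bound quoted in the quota list


def c2_cpus_needed(active_compute_nodes: int) -> int:
    """C2 CPU quota consumed with the given number of active compute nodes."""
    return CONTROLLER_C2_CPUS + C2_CPUS_PER_NODE * active_compute_nodes


# Node count implied by the stated bound (inferred, not from the blueprint).
max_nodes = (MAX_C2_CPUS - CONTROLLER_C2_CPUS) // C2_CPUS_PER_NODE
print(max_nodes)                  # 20
print(c2_cpus_needed(max_nodes))  # 1204
```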
Creates a Slurm cluster with tiered file systems for higher performance. It connects to the default VPC of the project and creates two partitions and a login node.
File systems:
- The homefs mounted at /home is a default "BASIC_HDD" tier filestore with 1 TiB of capacity.
- The projectsfs is mounted at /projects and is a high scale SSD filestore instance with 10TiB of capacity.
- The scratchfs is mounted at /scratch and is a DDN Exascaler Lustre file system designed for high IO performance. The capacity is ~10TiB.
There are two partitions in this example: low_cost and compute. The low_cost partition uses n2-standard-4 VMs. This partition can be used for debugging and workloads that do not require high performance.
Similar to the small example, there is a compute partition that should be used for any performance analysis.
For this example the following is needed in the selected region:
- Cloud Filestore API: Basic HDD (Standard) capacity (GB) per region: 1,024 GB
- Cloud Filestore API: High Scale SSD capacity (GB) per region: 10,240 GiB - min quota request is 61,440 GiB
- Compute Engine API: Persistent Disk SSD (GB): ~14,050 GB
- Compute Engine API: Persistent Disk Standard (GB): ~396 GB static + 20 GB/node up to 4596 GB
- Compute Engine API: N2 CPUs: 158
- Compute Engine API: C2 CPUs: 8 for controller node and 60/node active in compute partition up to 12,008
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
This Blueprint uses the Packer template module to create custom VM images by applying software and configurations to existing images.
This example performs the following:
- Creates a network needed to build the image (see Custom Network).
- Sets up a script that will be used to configure the image (see Toolkit Runners).
- Builds a new image by modifying the Slurm image (see Packer Template).
- Deploys a Slurm cluster using the newly built image (see Slurm Cluster Based on Custom Image).
Note: this example relies on the default behavior of the Toolkit to derive the naming convention for networks and other modules from the deployment_name.
The commands needed to run through this example would look like:
# Create a deployment from the blueprint
./ghpc create examples/image-builder.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
# Deploy the network for packer (1) and generate the startup script (2)
terraform -chdir=image-builder-001/builder-env init
terraform -chdir=image-builder-001/builder-env validate
terraform -chdir=image-builder-001/builder-env apply
# Provide startup script to Packer
terraform -chdir=image-builder-001/builder-env output \
-raw startup_script_scripts_for_image > \
image-builder-001/packer/custom-image/startup_script.sh
# Build image (3)
cd image-builder-001/packer/custom-image
packer init .
packer validate -var startup_script_file=startup_script.sh .
packer build -var startup_script_file=startup_script.sh .
# Deploy Slurm cluster (4)
cd -
terraform -chdir=image-builder-001/cluster init
terraform -chdir=image-builder-001/cluster validate
terraform -chdir=image-builder-001/cluster apply
# When you are done you can clean up the resources in reverse order of creation
terraform -chdir=image-builder-001/cluster destroy --auto-approve
terraform -chdir=image-builder-001/builder-env destroy --auto-approve
Using a custom VM image can be more scalable than installing software using boot-time startup scripts because:
- it avoids reliance on continued availability of package repositories
- VMs will join an HPC cluster and execute workloads more rapidly due to reduced boot-time configuration
- machines are guaranteed to boot with the static set of packages that was available when the custom image was created, so no machine can end up upgraded relative to others based on its creation time
A tool called Packer builds custom VM images by creating short-lived VMs, executing scripts on them, and saving the boot disk as an image that can be used by future VMs. The short-lived VM must operate in a network that
- has outbound access to the internet for downloading software
- has SSH access from the machine running Packer so that local files/scripts can be copied to the VM
This deployment group creates such a network, while using Cloud NAT and Identity-Aware Proxy (IAP) to allow outbound traffic and inbound SSH connections without exposing the machine to the internet on a public IP address.
The Toolkit startup-script module supports boot-time configuration of VMs using "runners". Runners are configured as a series of scripts uploaded to Cloud Storage. A simple, standard VM startup script runs at boot-time, downloads the scripts from Cloud Storage and executes them in sequence.
The standard bash startup script is exported as a string by the startup-script module.
The script in this example is performing the trivial task of creating a file in the image's home directory just to demonstrate the capability. You can expand the startup-script module to install more complex dependencies.
The Packer template in this deployment group accepts several methods for executing custom scripts. To pass the exported startup string to it, you must collect it from the Terraform module and provide it to the Packer template. After running terraform -chdir=image-builder-001/builder-env apply as instructed by ghpc, execute the following:
terraform -chdir=image-builder-001/builder-env \
output -raw startup_script_scripts_for_image > \
image-builder-001/packer/custom-image/startup_script.sh
cd image-builder-001/packer/custom-image
packer init .
packer validate -var startup_script_file=startup_script.sh .
packer build -var startup_script_file=startup_script.sh .
For this example the following is needed in the selected region:
- Compute Engine API: Images (global, not regional quota): 1 image per invocation of packer build
- Compute Engine API: Persistent Disk SSD (GB): ~50 GB
- Compute Engine API: Persistent Disk Standard (GB): ~64 GB static + 32 GB/node up to 704 GB
- Compute Engine API: N2 CPUs: 4 (for short-lived Packer VM and Slurm login node)
- Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
Once the Slurm cluster has been deployed we can test that our Slurm compute partition is now using the image we built. It should contain the hello.txt file that was added during image build:
- SSH into the login node slurm-image-builder-001-login0.
- Run a job that prints the contents of the added file:
$ srun -N 2 cat /home/hello.txt
Hello World
Hello World
This example demonstrates how to use the HPC Toolkit to set up a Google Cloud Batch job that mounts a Filestore instance and runs startup scripts.
The blueprint creates a Filestore and uses the startup-script module to mount and load "data" onto the shared storage. The batch-job-template module creates an instance template to be used for the Google Cloud Batch compute VMs and renders a Google Cloud Batch job template. A login node VM is created with instructions on how to SSH to the login node and submit the Google Cloud Batch job.
This blueprint demonstrates how to use Spack to run a real MPI job on Batch.
The blueprint contains the following:
- A shared filestore filesystem.
- A spack-install module that builds a script to install Spack and the WRF application onto the shared filestore.
- A startup-script module which uses the above script and stages job data.
- A builder vm-instance which performs the Spack install and then shuts down.
- A batch-job-template that builds a Batch job to execute the WRF job.
- A batch-login VM that can be used to test and submit the Batch job.
Usage instructions:
- Spack install
  After terraform apply completes, you must wait for Spack installation to finish before running the Batch job. You will observe that a VM named spack-builder-0 has been created. This VM will automatically shut down once Spack installation has completed. When using a Spack cache this takes about 25 minutes. Without a Spack cache this will take 2 hours. To view build progress or debug you can inspect /var/log/messages and /var/log/spack.log on the builder VM.
- Access login node
  After the builder shuts down, you can ssh to the Batch login node named batch-wrf-batch-login. Instructions on how to ssh to the login node are printed to the terminal after a successful terraform apply. You can reprint these instructions by calling the following:
  terraform -chdir=batch-wrf/primary output instructions_batch-login
  Once on the login node you should be able to inspect the Batch job template found in the /home/batch-jobs directory. This Batch job will call a script found at /share/wrfv3/submit_wrfv3.sh. Note that the /share directory is shared between the login node and the Batch job.
- Submit the Batch job
  Use the command provided in the terraform output instructions to submit your Batch job and check its status. The Batch job may take several minutes to start and once running should complete within 5 minutes.
- Inspect results
  The Batch job will create a folder named /share/jobs/<unique id>. Once the job has finished this folder will contain the results of the job. You can inspect the rsl.out.0000 file for a summary of the job.
Creates a DDN EXAScaler Lustre file system that is mounted in two client instances.
The DDN EXAScaler Lustre file system is designed for high IO performance. It has a default capacity of ~10TiB and is mounted at /lustre.
After the creation of the file-system and the client instances, the lustre drivers will be automatically installed and the mount-point configured on the VMs. This may take a few minutes after the VMs are created and can be verified by running:
watch mount -t lustre
For this example the following is needed in the selected region:
- Compute Engine API: Persistent Disk SSD (GB): ~14TB: 3500GB MDT, 3500GB OST[0-2]
- Compute Engine API: Persistent Disk Standard (GB): ~756GB: 20GB MDS, 276GB MGS, 3x20GB OSS, 2x200GB client-vms
- Compute Engine API: N2 CPUs: ~116: 32 MDS, 32 MGS, 3x16 OSS, 2x2 client-vms
This example creates an HPC cluster similar to the one created by hpc-cluster-small.yaml, but uses modules built from version 5 of slurm-gcp.
The cluster will support 2 partitions named debug and compute.
The debug partition is the default partition and runs on smaller n2-standard-2 nodes. The compute partition is not default and must be selected in the srun command via the --partition flag. The compute partition runs on compute optimized nodes of type c2-standard-60. The compute partition may require additional quota before using.
For this example the following is needed in the selected region:
- Cloud Filestore API: Basic HDD (Standard) capacity (GB): 1,024 GB
- Compute Engine API: Persistent Disk SSD (GB): ~50 GB
- Compute Engine API: Persistent Disk Standard (GB): ~50 GB static + 50 GB/node up to 1,250 GB
- Compute Engine API: N2 CPUs: 12
- Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
Similar to the previous example, but using Ubuntu 20.04 instead of CentOS 7. Other operating systems are supported by SchedMD for the Slurm on GCP project and images are listed here. Only the examples listed on this page have been tested by the Cloud HPC Toolkit team.
This example creates an HPC cluster similar to the one created by hpc-cluster-small.yaml, but uses modules built from version 5 of slurm-gcp and Ubuntu.
The cluster will support 2 partitions named debug and compute.
The debug partition is the default partition and runs on smaller n2-standard-2 nodes. The compute partition is not default and must be selected in the srun command via the --partition flag. The compute partition runs on compute optimized nodes of type c2-standard-60. The compute partition may require additional quota before using.
For this example the following is needed in the selected region:
- Cloud Filestore API: Basic HDD (Standard) capacity (GB): 1,024 GB
- Compute Engine API: Persistent Disk SSD (GB): ~50 GB
- Compute Engine API: Persistent Disk Standard (GB): ~50 GB static + 50 GB/node up to 1,250 GB
- Compute Engine API: N2 CPUs: 12
- Compute Engine API: C2 CPUs: 4 for controller node and 60/node active in compute partition up to 1,204
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
This example uses Slurm on GCP version 5.x modules to replicate the hpc-cluster-high-io.yaml core example. With version 5, additional features are available and utilized in this example:
- Node groups are used to allow multiple machine types in a single partition, differentiated by node names.
- Active cluster reconfiguration is on by default. When updating a partition or cluster configuration, the overwrite option (-w) can be used and upon re-applying the deployment, the changes will become active without having to destroy and recreate the cluster.
This blueprint will create a cluster with the following storage tiers:
- The homefs mounted at /home is a default "BASIC_HDD" tier filestore with 1 TiB of capacity.
- The projectsfs is mounted at /projects and is a high scale SSD filestore instance with 10TiB of capacity.
- The scratchfs is mounted at /scratch and is a DDN Exascaler Lustre file system designed for high IO performance. The capacity is ~10TiB.
The cluster will support 2 partitions:
- lowcost
  - Includes two node groups, n2s2 of machine type n2-standard-2 and n2s4 of machine type n2-standard-4.
  - Default partition.
  - Designed to run with lower cost nodes and within a typical project's default quota.
- compute
  - Includes two node groups, c2s60 of machine type c2-standard-60 and c2s30 of machine type c2-standard-30.
  - Can be used by setting the --partition option in srun to compute.
  - Designed for performance, but may require additional quota before using.
This example defines partitions with more than one node group each. For more information on node groups and why they are used, see the schedmd-slurm-gcp-v5-node-group module documentation. The reference commands below show how to specify both the partition and the correct node group when executing a Slurm command on a cluster generated by this blueprint.
Partition: compute; Node Group: c2s30; Machine Type: c2-standard-30
srun -N 4 -p compute -w highioslur-compute-c2s30-[0-3] hostname
Partition: compute; Node Group: c2s60; Machine Type: c2-standard-60
srun -N 4 -p compute --mincpus=30 hostname
Partition: lowcost; Node Group: n2s2; Machine Type: n2-standard-2
srun -N 4 -w highioslur-lowcost-n2s2-[0-3] hostname
Partition: lowcost; Node Group: n2s4; Machine Type: n2-standard-4
srun -N 4 --mincpus=2 hostname
For this example the following is needed in the selected region:
- Cloud Filestore API: Basic HDD (Standard) capacity (GB) per region: 1,024 GB
- Cloud Filestore API: High Scale SSD capacity (GB) per region: 10,240 GiB - min quota request is 61,440 GiB
- Compute Engine API: Persistent Disk SSD (GB): ~14,050 GB
- Compute Engine API: Persistent Disk Standard (GB): ~396 GB static + 20 GB/node up to 4596 GB
- Compute Engine API: N2 CPUs:
  - 4 for the login node
  - 2 per node for active nodes in the n2s2 group, maximum 20.
  - 4 per node for active nodes in the n2s4 group, maximum 40.
  - Maximum possible: 64
- Compute Engine API: C2 CPUs:
  - 8 for controller node
  - 60 per node for active nodes in the c2s60 group, maximum 12,000.
  - 30 per node for active nodes in the c2s30 group, maximum 6,000.
  - Maximum possible: 18,008
- Compute Engine API: Affinity Groups: one for each job in parallel - only needed for compute partition
- Compute Engine API: Resource policies: one for each job in parallel - only needed for compute partition
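The "Maximum possible" CPU totals in the list above can be reproduced with a short Python sketch. The per-node and per-group figures come from the list. The dictionary layout and function names are hypothetical, and the node counts the sketch derives are inferred from those figures rather than read from the blueprint.

```python
# Per-group figures copied from the quota list; the layout is illustrative.
NODE_GROUPS = {
    "n2s2": {"family": "N2", "cpus_per_node": 2, "max_cpus": 20},
    "n2s4": {"family": "N2", "cpus_per_node": 4, "max_cpus": 40},
    "c2s60": {"family": "C2", "cpus_per_node": 60, "max_cpus": 12000},
    "c2s30": {"family": "C2", "cpus_per_node": 30, "max_cpus": 6000},
}
# CPUs used by always-on nodes: N2 login node and C2 controller node.
FIXED_CPUS = {"N2": 4, "C2": 8}


def max_cpus(family: str) -> int:
    """Largest CPU quota the cluster can consume for one machine family."""
    groups = (g for g in NODE_GROUPS.values() if g["family"] == family)
    return FIXED_CPUS[family] + sum(g["max_cpus"] for g in groups)


def implied_max_nodes(group: str) -> int:
    """Node count implied by a group's CPU maximum (inferred, not configured)."""
    g = NODE_GROUPS[group]
    return g["max_cpus"] // g["cpus_per_node"]


print(max_cpus("N2"))  # 64
print(max_cpus("C2"))  # 18008
```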
This example provisions a Slurm cluster, automating the steps needed to comply with the Intel Select Solutions for Simulation & Modeling Criteria. It is more extensively discussed in a dedicated README for Intel examples.
This example provisions a DAOS cluster with managed instance groups for the servers and for clients. It is more extensively discussed in a dedicated README for Intel examples.
This example provisions DAOS servers and a Slurm cluster. It is more extensively discussed in a dedicated README for Intel examples.
This example provisions a Slurm cluster using AMD VM machine types. It automates the initial setup of Spack, including a script that can be used to install the AMD Optimizing C/C++ Compiler (AOCC) and compile OpenMPI with AOCC. It is more extensively discussed in a dedicated README for AMD examples.
This blueprint provisions an N1 series VM with NVIDIA T4 GPU accelerator and compiles qsim, a Google Quantum AI-developed tool that simulates quantum circuits using CPUs and GPUs. The installation of qsim, the CUDA Toolkit, and the cuQuantum SDK is fully automated but takes a significant time (approx. 20 minutes). Once complete, a qsim example can be run by connecting to the VM by SSH and running:
conda activate qsim
python /var/tmp/qsim-example.py
Spack is an HPC software package manager. This example creates a small Slurm cluster with software installed using the spack-install module. The controller will install and configure spack, and install gromacs using spack. Spack is installed in a shared location (/sw) via filestore. This build leverages the startup-script module and can be applied in any cluster by using the output of the spack-install or startup-script modules.
The installation will occur as part of the Slurm startup-script. A warning message will be displayed upon SSHing to the login node indicating that configuration is still active. To track the status of the overall startup script, run the following command on the login node:
sudo tail -f /var/log/messages
Spack-specific installation logs will be sent to the spack_log as configured in your blueprint, by default /var/log/spack.log on the login node:
sudo tail -f /var/log/spack.log
Once the Slurm and Spack configuration is complete, spack will be available on the login node. To use spack in the controller or compute nodes, the following command must be run first:
source /sw/spack/share/spack/setup-env.sh
To load the gromacs module, use spack:
spack load gromacs
NOTE: Installing spack compilers and libraries in this example can take hours to run on startup. To decrease this time in future deployments, consider including a spack build cache as described in the comments of the example.
Creates a simple Dell Omnia provisioned cluster with an omnia-manager node that acts as the slurm manager and 2 omnia-compute nodes on the pre-existing default network. Omnia will be automatically installed after the nodes are provisioned. All nodes mount a filestore instance on /home.
NOTE: The omnia-cluster.yaml example uses vm-instance modules to create the cluster. For these instances, Simultaneous Multithreading (SMT) is turned off by default, meaning that only the physical cores are visible. For the compute nodes, this means that 30 physical cores are visible on the c2-standard-60 VMs. To activate all 60 virtual cores, include threads_per_core=2 under settings for the compute vm-instance module.
This blueprint demonstrates the use of the Slurm and Filestore modules in the service project of an existing Shared VPC. Before attempting to deploy the blueprint, one must first complete initial setup for provisioning Filestore in a Shared VPC service project.
This blueprint demonstrates the use of Slurm and Filestore, with the definition of a partition which deploys compute nodes that have local SSD drives. Before deploying this blueprint, you must have an existing VPC that is properly configured: it must allow Internet access and communication between virtual machines, both for NFS and for traffic between the Slurm nodes.
This blueprint provisions an auto-scaling HTCondor pool based upon the HPC VM Image.
Also see the tutorial, which walks through the use of this blueprint.
This blueprint provisions a simple cluster for use with a Simcenter StarCCM+ tutorial.
The main tutorial is described on the HPC Toolkit website.
This blueprint provisions a simple cluster for use with an Ansys Fluent tutorial.
The main tutorial is described on the HPC Toolkit website.
Similar documentation can be found on Google Cloud Docs.
A user-defined blueprint should follow this schema:
# Required: Name your blueprint.
blueprint_name: my-blueprint-name
# Top-level variables, these will be pulled from if a required variable is not
# provided as part of a module. Any variables can be set here by the user,
# labels will be treated differently as they will be applied to all created
# GCP resources.
vars:
  # Required: This will also be the name of the created deployment directory.
  deployment_name: first_deployment
  project_id: GCP_PROJECT_ID
  # https://cloud.google.com/compute/docs/regions-zones
  region: us-central1
  zone: us-central1-a
  # https://cloud.google.com/resource-manager/docs/creating-managing-labels
  labels:
    global_label: label_value
# Many modules can be added from local and remote directories.
deployment_groups:
- group: groupName
  modules:
  # Local source, prefixed with ./ (/ and ../ also accepted)
  - id: <a unique id> # Required: Name of this module used to uniquely identify it.
    source: ./modules/role/module-name # Required: Points to the module directory.
    kind: < terraform | packer > # Optional: Type of module, currently choose from terraform or packer. If not specified, `kind` will default to `terraform`
    # Optional: All configured settings for the module. For terraform, each
    # variable listed in variables.tf can be set here, and are mandatory if no
    # default was provided and are not defined elsewhere (like the top-level vars)
    settings:
      setting1: value1
      setting2:
      - value2a
      - value2b
      setting3:
        key3a: value3a
        key3b: value3b
  # Embedded module (part of the toolkit), prefixed with modules/
  - source: modules/role/module-name
  # GitHub module over SSH, prefixed with git@github.com
  - source: git@github.com:org/repo.git//modules/role/module-name
  # GitHub module over HTTPS, prefixed with github.com
  - source: github.com/org/repo//modules/role/module-name
The blueprint file is composed of 3 primary parts: top-level parameters, deployment variables and deployment groups. These are described in more detail below.
The following is a template that can be used to start writing a blueprint from scratch.
---
blueprint_name: # boilerplate-blueprint
vars:
  project_id: # my-project-id
  deployment_name: # boilerplate-001
  region: us-central1
  zone: us-central1-a
deployment_groups:
- group: primary
  modules:
  - id: # network1
    source: # modules/network/vpc
- blueprint_name (required): This name can be used to track resources and usage across multiple deployments that come from the same blueprint. blueprint_name is used as a value for the ghpc_blueprint label key, and must abide by label value naming constraints: blueprint_name must be at most 63 characters long, and can only contain lowercase letters, numeric characters, underscores and dashes.
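The naming constraint on blueprint_name is easy to check before running ghpc. The Python sketch below is a hypothetical helper, not part of the Toolkit. It assumes ASCII lowercase letters only and, because the field is required, rejects the empty string.

```python
import re

# At most 63 characters: lowercase letters, digits, underscores and dashes.
# Assumptions: ASCII letters only, and at least one character.
_BLUEPRINT_NAME_RE = re.compile(r"[a-z0-9_-]{1,63}")


def is_valid_blueprint_name(name: str) -> bool:
    """Return True if name satisfies the label value constraints stated above."""
    return _BLUEPRINT_NAME_RE.fullmatch(name) is not None


print(is_valid_blueprint_name("my-blueprint-name"))  # True
print(is_valid_blueprint_name("My-Blueprint"))       # False
```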
vars:
  region: "us-west1"
  labels:
    "user-defined-deployment-label": "slurm-cluster"
  ...
Deployment variables are set under the vars field at the top level of the blueprint file. These variables can be explicitly referenced in modules as Blueprint Variables. Any module setting (inputs) not explicitly provided and matching exactly a deployment variable name will automatically be set to these values.
Deployment variables should be used with care. Module default settings with the same name as a deployment variable and not explicitly set will be overwritten by the deployment variable.
The “labels” deployment variable is a special case as it will be appended to labels found in module settings, whereas normally an explicit module setting would be left unchanged. This ensures that deployment-wide labels can be set alongside module specific labels. Precedence is given to the module specific labels if a collision occurs. Default module labels will still be overwritten by deployment labels.
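The rules above for deployment variables and labels can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the ghpc implementation: resolve_settings is a hypothetical name, module inputs are modeled as a plain set of names, and module defaults are not modeled at all.

```python
def resolve_settings(module_inputs: set, explicit_settings: dict,
                     deployment_vars: dict) -> dict:
    """Fill unset module inputs from deployment variables and merge labels."""
    resolved = dict(explicit_settings)
    for name, value in deployment_vars.items():
        # Only inputs the module declares, and only when not set explicitly.
        if name in module_inputs and name not in resolved:
            resolved[name] = value
    if "labels" in module_inputs:
        # Deployment labels are added alongside module labels;
        # module-specific labels win when keys collide.
        resolved["labels"] = {
            **deployment_vars.get("labels", {}),
            **explicit_settings.get("labels", {}),
        }
    return resolved


# Hypothetical values, used only for illustration.
deployment_vars = {
    "region": "us-central1",
    "zone": "us-central1-a",
    "labels": {"env": "dev", "team": "hpc"},
}
settings = resolve_settings(
    module_inputs={"zone", "labels", "machine_type"},
    explicit_settings={"labels": {"env": "test"}},
    deployment_vars=deployment_vars,
)
print(settings["zone"])    # us-central1-a
print(settings["labels"])  # {'env': 'test', 'team': 'hpc'}
```

In this example region is not passed through because the module does not declare an input with that name.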
The HPC Toolkit uses special reserved labels for monitoring each deployment. These are set automatically, but can be overridden in vars or module settings. They include:
- ghpc_blueprint: The name of the blueprint the deployment was created from
- ghpc_deployment: The name of the specific deployment
- ghpc_role: See below
A module role is a default label applied to modules (ghpc_role), which conveys what role that module plays within a larger HPC environment.
The modules provided with the HPC toolkit have been divided into roles matching the names of folders in the modules/ and community/modules directories (compute, file-system etc.).
When possible, custom modules should use these roles so that they match other modules defined by the toolkit. If a custom module does not fit into these roles, a new role can be defined.
A module's parent folder will define the module’s role if possible. Therefore, regardless of where the module is located, the module directory should be explicitly referenced at least 2 layers deep, where the top layer refers to the “role” of that module.
If a module is not defined at least 2 layers deep and the ghpc_role label has not been explicitly set in settings, ghpc_role will default to undefined.
Below we show some of the core modules and their roles (as parent folders).
modules/
└── <<ROLE>>
    └── <<MODULE_NAME>>
modules/
├── compute
│   └── vm-instance
├── file-system
│   ├── pre-existing-network-storage
│   └── filestore
├── monitoring
│   └── dashboard
├── network
│   ├── pre-existing-vpc
│   └── vpc
├── packer
│   └── custom-image
└── scripts
    └── startup-script
Deployment groups allow distinct sets of modules to be defined and deployed as a group. A deployment group can only contain modules of a single kind, for example a deployment group may not mix packer and terraform modules.
For terraform modules, a top-level main.tf will be created for each deployment group so different groups can be created or destroyed independently.
A deployment group is made of 2 fields, group and modules. They are described in more detail below.
Defines the name of the group. Each group must have a unique name. The name will be used to create the subdirectory in the deployment directory.
Modules are the building blocks of an HPC environment. They can be composed in a blueprint file to create complex deployments. Several modules are provided by default in the modules folder.
To learn more about how to refer to a module in a blueprint file, please consult the modules README file.
Variables can be used to refer both to values defined elsewhere in the blueprint and to the output and structure of other modules.
Variables in a blueprint file can refer to deployment variables or the outputs of other modules. For deployment and module variables, the syntax is as follows:
vars:
  zone: us-central1-a
deployment_groups:
- group: primary
  modules:
  - id: resource1
    source: path/to/module/1
    ...
  - id: resource2
    source: path/to/module/2
    ...
    settings:
      key1: $(vars.zone)
      key2: $(resource1.name)
The variable is referred to by the source, either vars for deployment variables or the module ID for module variables, followed by the name of the value being referenced. The entire variable is then wrapped in “$()”.
Currently, references to variable attributes and string operations with variables are not supported.
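As a conceptual illustration of the syntax above, the Python sketch below looks up a $(source.name) reference in plain dictionaries. It is not how ghpc works internally. The function name and the lookup tables are hypothetical, and the allowed identifier characters are an assumption. Only whole-value references are handled, in line with the note that string operations are not supported.

```python
import re

# A whole-value reference of the form $(source.name).
_REFERENCE = re.compile(r"\$\(([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)\)")


def resolve(value: str, deployment_vars: dict, module_outputs: dict):
    """Return the referenced value, or the input unchanged if not a reference."""
    match = _REFERENCE.fullmatch(value)
    if match is None:
        return value
    source, name = match.groups()
    # "vars" refers to deployment variables; anything else is a module ID.
    table = deployment_vars if source == "vars" else module_outputs[source]
    return table[name]  # raises KeyError if the reference is undefined


deployment_vars = {"zone": "us-central1-a"}
module_outputs = {"resource1": {"name": "example-name"}}  # illustrative output
print(resolve("$(vars.zone)", deployment_vars, module_outputs))       # us-central1-a
print(resolve("$(resource1.name)", deployment_vars, module_outputs))  # example-name
```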
Literal variables are not interpreted by ghpc directly, but rather embedded in the underlying module. Literal variables should only be used by those familiar with the underlying module technology (Terraform or Packer); no validation will be done before deployment to ensure that they are referencing something that exists.
Literal variables are occasionally needed when referring to the data structure of the underlying module. For example, to refer to the subnetwork self link from a vpc module through terraform itself:
subnetwork_self_link: ((module.network1.primary_subnetwork.self_link))
Here the network1 module is referenced; the terraform module name is the same as the ID in the blueprint file. From the module we can refer to its underlying variables as deep as we need, in this case the self_link for its primary_subnetwork.
The entire text of the variable is wrapped in double parentheses indicating that everything inside will be provided as is to the module.
Whenever possible, blueprint variables are preferred over literal variables.
ghpc will perform basic validation making sure all blueprint variables are defined before creating a deployment, making debugging quicker and easier.
Under circumstances where the variable notation conflicts with the content of a setting or string, for instance when defining a startup-script runner that uses a subshell like in the example below, a non-quoted backslash (\) can be used as an escape character. It preserves the literal value of the next character that follows:
- \$(not.bp_var) evaluates to $(not.bp_var).
- \((not.literal_var)) evaluates to ((not.literal_var)).
deployment_groups:
- group: primary
  modules:
  - id: resource1
    source: path/to/module/1
    settings:
      key1: \((not.literal_var)) ## Evaluates to "((not.literal_var))".
    ...
  - id: resource2
    source: path/to/module/2
    ...
    settings:
      key1: |
        #!/bin/bash
        echo \$(cat /tmp/file1) ## Evaluates to "echo $(cat /tmp/file1)"
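The effect of the escape character can be modeled with a short Python sketch. It only mirrors the two documented cases, a backslash before "$(" and before "((", and is not the parser ghpc uses. The function name unescape is hypothetical.

```python
import re

# A backslash directly before "$(" or "((" marks the sequence as literal text.
_ESCAPED = re.compile(r"\\(\$\(|\(\()")


def unescape(text: str) -> str:
    """Drop the escaping backslash, keeping the characters it protected."""
    return _ESCAPED.sub(r"\1", text)


print(unescape(r"\$(not.bp_var)"))           # $(not.bp_var)
print(unescape(r"\((not.literal_var))"))     # ((not.literal_var))
print(unescape(r"echo \$(cat /tmp/file1)"))  # echo $(cat /tmp/file1)
```

Unescaped text passes through unchanged, so a real blueprint variable such as $(vars.zone) is left for normal variable handling.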