Adding a node to the Slurm cluster
If slurm and munge are already installed, you might need to remove the users, groups and packages before moving forward (see here)
Set up Munge and Slurm users and groups (the UIDs and GIDs must match the ones used on the server) :
export MUNGEUSER=991
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
Install Munge and Slurm :
sudo apt install slurm-wlm slurm-client munge
Add new node info to server's slurm.conf
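The node entry added to slurm.conf could look like the line printed by the sketch below. This is only an illustration under assumptions: the helper name node_line and the values (worker1, 16 CPUs, 64000 MB of memory, 1 GPU) are hypothetical and not taken from the actual cluster.

```shell
#!/bin/bash
# Hypothetical helper: print a slurm.conf node entry.
# Arguments: node name, CPU count, memory in MB, GPU count.
node_line() {
    printf 'NodeName=%s CPUs=%s RealMemory=%s Gres=gpu:%s State=UNKNOWN\n' \
        "$1" "$2" "$3" "$4"
}

# Example values are placeholders, not the real hardware of any node.
node_line worker1 16 64000 1
# prints: NodeName=worker1 CPUs=16 RealMemory=64000 Gres=gpu:1 State=UNKNOWN
```

Running slurmd -C on the worker prints the detected hardware in the same NodeName=... format, which is usually a safer source for the real values. When GPUs are declared this way, slurm.conf also needs GresTypes=gpu.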
Connect server to worker through SSH
(Optional) Create a passphrase-less ssh key for Norlab purposes :
ssh-keygen -q -t rsa -b 4096 -N '' -f ~/.ssh/norlab
Share a passphrase-less ssh key :
ssh-copy-id -i ~/.ssh/norlab.pub user@ip
From server to worker :
- Copy /etc/munge/munge.key
- Copy the slurm.conf and cgroup.conf files from /etc/slurm-llnl
Add GPU info to /etc/slurm-llnl/gres.conf on worker
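As a rough sketch of what the GPU info can look like, the helper below prints one gres.conf line per GPU. The helper name gres_lines, the node name worker1 and the device paths /dev/nvidia0, /dev/nvidia1, ... are assumptions (typical for NVIDIA GPUs); check the actual device files on the worker before using them.

```shell
#!/bin/bash
# Hypothetical helper: print one gres.conf line per GPU of a node.
# Assumes NVIDIA device files named /dev/nvidia0, /dev/nvidia1, ...
gres_lines() {
    local node="$1"
    local gpu_count="$2"
    local i=0
    while [ "$i" -lt "$gpu_count" ]; do
        printf 'NodeName=%s Name=gpu File=/dev/nvidia%s\n' "$node" "$i"
        i=$((i + 1))
    done
}

# Placeholder node name and GPU count.
gres_lines worker1 2
```

The printed lines can then be appended to /etc/slurm-llnl/gres.conf on the worker, for example with sudo tee -a.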
Enable and start correct services on worker :
sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl enable slurmd
sudo systemctl start slurmd
Restart slurmctld service on server :
sudo systemctl restart slurmctld
To enable a node for computing on the cluster (kills jobs that are running) :
sudo scontrol update nodename=<nodename> state=idle
To re-enable a node for computing on the cluster (keeps jobs running) :
sudo scontrol update nodename=<nodename> state=resume
To disable access to a node on the cluster (lets jobs finish) :
sudo scontrol update nodename=<nodename> state=drain reason=<reason>
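After changing a node's state, it can help to check what Slurm actually reports. The sketch below is one possible way to do that, assuming the usual scontrol show node output that contains a State=... field; the helper name node_state is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the State field of a node (e.g. IDLE, DRAINED).
# Relies on `scontrol show node` printing a whitespace-separated "State=..." field.
node_state() {
    scontrol show node "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^State=/) {
                sub(/^State=/, "", $i)   # strip the key, keep the value
                print $i
                exit
            }
        }
    }'
}
```

For example, node_state followed by the node name prints the current state of that node, and sinfo -N -l gives a similar overview for every node.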
All IO data should be situated on the server
Each node needs directories consistent with the server's for slurm deployment
Use rsync (over ssh) to synchronize files: it only sends what changed, which makes repeated transfers faster than with scp
The slurm script should contain :
- Initial data transfer for input data
- Execution of code through a Docker container (with docker run)
  - The Docker container should have the option --rm, to avoid cluttering the worker node
    - Thus, volumes should be used to ensure IO data is external to the container
  - trap and docker wait commands can be used to wait until the end of the container's life, to subsequently transfer the output data back to the server node
- Final data transfer for output data
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=2-00:00
#SBATCH --job-name=ubt-m2f
#SBATCH --output=%x-%j.out

# Placeholders: replace with the server's user, its address and your directories
SERVER_USER=user-norlab
SERVER_IP=123.456.789.123
INPUT_DIR=dataset
OUTPUT_DIR=outputs

# Initial data transfer: fetch the input data from the server
rsync -vPr --rsh=ssh "$SERVER_USER@$SERVER_IP:$(pwd)/$INPUT_DIR" .

# A bind mount fails if its source directory does not exist yet
mkdir -p "$OUTPUT_DIR"

docker build -t image_name .

# Start the container in the background and keep its ID
container_id=$(
    docker run --gpus all -e CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" --rm --ipc host --detach \
        --mount type=bind,source="$(pwd)",target=/app \
        --mount type=bind,source="$(pwd)/$INPUT_DIR",target="/app/$INPUT_DIR" \
        --mount type=bind,source="$(pwd)/$OUTPUT_DIR",target="/app/$OUTPUT_DIR" \
        --mount type=bind,source=/dev/shm,target=/dev/shm \
        image_name python train_net.py
)

# Final data transfer: send the outputs back to the server when the script exits
transfer_outputs() {
    rsync -vPr --rsh=ssh "$OUTPUT_DIR" "$SERVER_USER@$SERVER_IP:$(pwd)"
}
trap transfer_outputs EXIT

# Block until the container stops
docker wait "$container_id"
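The trap ... EXIT pattern used in the script above can be tried without Docker or rsync. The sketch below is a stand-in, not part of the real script: run_job is a hypothetical function, the echo lines replace docker wait and the rsync transfer, and the exit code 3 is an arbitrary simulated failure.

```shell
#!/bin/bash
# Stand-in for the job script: no Docker or rsync involved.
run_job() {
    (
        # Runs when this subshell exits, whether the job succeeded or not.
        trap 'echo "transfer outputs"' EXIT
        echo "wait for container"
        exit 3   # simulate a failure after the wait
    )
}

run_job || echo "job exit code: $?"
# prints:
#   wait for container
#   transfer outputs
#   job exit code: 3
```

The transfer step still runs even though the job failed, which is why the real script registers transfer_outputs with trap instead of calling it on the last line.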
To choose which GPU you get with slurm, exclude the nodes that lack the resources you need
- An include-style list (sbatch's --nodelist) will not work, as it implies that every listed node should be used
sbatch --exclude=node1,node3 my-script.sh
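Building the exclude list by hand gets tedious as the cluster grows. The sketch below derives it from the list of all nodes and the nodes you want; the helper name exclude_list and the node names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a comma-separated list of every node that is
# NOT in the wanted list, ready to pass to sbatch --exclude.
# Arguments: whitespace-separated list of all nodes, space-separated wanted nodes.
exclude_list() {
    local all_nodes="$1"
    local wanted="$2"
    local result=""
    local node
    for node in $all_nodes; do
        case " $wanted " in
            *" $node "*) ;;                          # wanted node: leave it available
            *) result="${result:+$result,}$node" ;;  # any other node: exclude it
        esac
    done
    printf '%s\n' "$result"
}

exclude_list "node1 node2 node3" "node2"   # prints node1,node3
```

On a real cluster, the first argument could come from sinfo, for instance sbatch --exclude="$(exclude_list "$(sinfo -h -N -o '%N' | sort -u)" "node2")" my-script.sh, assuming node2 is the node with the GPU you need.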
https://nekodaemon.com/2022/09/02/Slurm-Quick-Installation-for-Cluster-on-Ubuntu-20-04/
https://github.com/norlab-ulaval/mask_bev/blob/main/slurm_train.sh