Mamba Server Setup
This guide's purpose is to give a quick overview of how to install everything required for the Mamba Server.
-
Install Fedora Server
-
Install NVidia drivers
# https://www.reddit.com/r/Fedora/comments/12ju2sg/i_need_help_with_installing_nvidia_drivers_to/
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo dnf update -y
sudo dnf install akmod-nvidia -y
sudo dnf install xorg-x11-drv-nvidia-cuda -y
sudo reboot now
-
Alternative: install NVidia drivers via dnf module
Links to official install procedure supported by NVidia:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#fedora
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/x86_64/cuda-$distro.repo
Replace $distro with the latest available version matching the server, currently fedora39.
Remark: NVidia is often late in updating the version number of its repository. For example, the current version of Fedora is 40 and the latest repo version is fedora39. Although the repository targets an older release, installing from it causes no issues until a newer one is made available.
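The $distro substitution described above can be sketched as a small shell helper. The function name cuda_repo_url is hypothetical. The URL pattern is the one given above, and whether a repository actually exists for a given release still has to be checked on NVidia's server.

```shell
# Hypothetical helper: build the CUDA repo URL for a given distro tag.
cuda_repo_url() {
  distro="$1"   # for example fedora39
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-%s.repo\n' "$distro" "$distro"
}

# Example: print the URL for the fedora39 repository.
cuda_repo_url fedora39
```

The printed URL is what gets passed to sudo dnf config-manager --add-repo.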
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
sudo dnf makecache
cuda-fedora39-x86_64 should appear in the list of enabled repositories, along with an entry in /etc/yum.repos.d/cuda-fedora39.repo.
Next, we install the NVidia driver and CUDA toolkit via the now available module. Choose the appropriate version, and choose between the closed-source and open-source variants of the module. This installs a dkms driver that is automatically rebuilt with every kernel update.
Choose between the latest version of the driver, latest-dkms, which will be updated with dnf update, or pin a specific version, for example 555-dkms. Do the same for the CUDA toolkit, either via the meta-package cuda-toolkit or by targeting a specific version of CUDA.
If a driver previously installed via rpmfusion is present, remove everything first.
sudo dnf autoremove akmod-nvidia xorg-x11-drv-nvidia-*
Then install the new drivers via its module.
sudo dnf module list
sudo dnf module install nvidia-driver:latest-dkms
Check that the dkms module is built successfully for all installed kernels.
$ sudo dkms status
nvidia/555.42.02, 6.8.10-300.fc40.x86_64, x86_64: installed
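The check above can also be scripted. The following is a minimal sketch, assuming the dkms status output keeps the "module/version, kernel, arch: state" layout shown above. The helper name nvidia_dkms_ok is hypothetical.

```shell
# Hypothetical helper: read `dkms status` output on stdin and succeed only if
# at least one nvidia entry exists and every nvidia entry is "installed".
nvidia_dkms_ok() {
  awk -F': ' '
    /^nvidia\// {
      seen = 1
      if ($NF !~ /^installed/) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage on the server (requires dkms):
#   sudo dkms status | nvidia_dkms_ok && echo "nvidia dkms module OK"
```

An entry in any other state, for example built or added, makes the helper fail, which usually means the module was not installed for that kernel.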
Finally, proceed with the installation of the cuda toolkit
sudo dnf install cuda-toolkit
Select the default cuda version.
sudo update-alternatives --display cuda
sudo update-alternatives --config cuda
Set the PATH for nvcc and the other utilities in .bashrc or .bash_profile per user, or in /etc/environment or /etc/profile.d/cuda.sh for all users.
export CUDACXX=/usr/local/cuda/bin/nvcc
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
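When the exports live in a file that may be sourced more than once, PATH can accumulate duplicate entries. The following is a minimal sketch of an /etc/profile.d/cuda.sh that avoids this. The prepend_path helper is hypothetical and not part of the CUDA packages.

```shell
# Sketch of /etc/profile.d/cuda.sh. prepend_path is a hypothetical helper.
prepend_path() {
  case ":${PATH}:" in
    *":$1:"*) ;;                        # already present: leave PATH unchanged
    *) PATH="$1${PATH:+:${PATH}}" ;;    # otherwise put the directory first
  esac
}

prepend_path /usr/local/cuda/bin
export PATH
export CUDACXX=/usr/local/cuda/bin/nvcc
```

After opening a new login shell, nvcc --version should resolve to the toolkit selected with update-alternatives.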
Make the driver persistent
sudo systemctl enable nvidia-persistenced.service
sudo systemctl start nvidia-persistenced.service
Before rebooting, update the initial ramdisk.
sudo dracut -f
sudo reboot now
-
Install Munge
export MUNGEUSER=1111
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1121
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
sudo dnf install munge munge-devel munge-libs -y
sudo dnf install rng-tools -y
rngd -r /dev/urandom
sudo mungekey
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/ /var/log/munge/
sudo systemctl enable munge.service
sudo systemctl start munge.service
-
Install Slurm
sudo dnf install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
sudo dnf install libcgroup libcgroup-tools libcgroup-devel mariadb mariadb-devel mariadb-server -y
sudo dnf install autoconf automake perl -y
sudo dnf install dbus-devel -y
# Build Slurm RPM
sudo su
cd
wget https://download.schedmd.com/slurm/slurm-24.05.0-0rc1.tar.bz2
rpmbuild -ta slurm-24.05.0-0rc1.tar.bz2
cd rpmbuild/RPMS/x86_64/
dnf --nogpgcheck localinstall *.rpm -y
# To reinstall if you recompile
dnf --nogpgcheck reinstall *.rpm -y
-
Configure Slurm: copy the configs from norlab-ulaval/dotfiles-mamba-server to /etc/slurm
-
Ensure the permissions are correct
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
-
Check that slurm is correctly configured:
slurmd -C
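slurmd -C prints the detected hardware as space-separated key=value pairs in the same form as a NodeName line of slurm.conf, so individual values can be compared against the copied configuration. The following is a minimal sketch. The helper name slurmd_value is hypothetical.

```shell
# Hypothetical helper: read `slurmd -C` output on stdin and print the value
# of the first occurrence of the given key (for example CPUs or RealMemory).
slurmd_value() {
  tr ' ' '\n' | awk -F= -v key="$1" '$1 == key && !found { print $2; found = 1 }'
}

# Usage on the server:
#   slurmd -C | slurmd_value CPUs
#   grep '^NodeName=' /etc/slurm/slurm.conf | slurmd_value CPUs
```

If the two values differ, the node definition in /etc/slurm/slurm.conf likely needs to be adjusted.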
-
Start the services
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
-
Setup accounting
systemctl enable mariadb.service
systemctl start mariadb.service
# Inspired by: https://github.com/Artlands/Install-Slurm/blob/master/README.md#setting-up-mariadb-database-master
mysql
# Change the password in the following line
> CREATE USER 'slurm'@'localhost' IDENTIFIED BY '${DB_USER_PASSWORD}';
> CREATE DATABASE slurm_acct_db;
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' WITH GRANT OPTION;
> SHOW VARIABLES LIKE 'have_innodb';
> FLUSH PRIVILEGES;
> quit;
# Verify you can login
mysql -p -u slurm
-
Copy /etc/my.cnf.d/innodb.cnf from norlab-ulaval/dotfiles-mamba-server
-
Restart mariadb
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
mv /var/lib/mysql/* /tmp/
systemctl start mariadb
-
Check the ownership of some files
chown slurm slurmdbd.conf
touch /var/log/slurmctld.log
chown slurm /var/log/slurmctld.log
chown slurm slurm*
-
Check if slurmdbd can start correctly using
slurmdbd -D -vvv
-
Start the services
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
-
Add accounts
sudo sacctmgr add account norlab Description="Norlab mamba-server" Organization=norlab
sacctmgr add user wigum Account=norlab
-
If the Slurm services crash at startup, add the following lines to each Slurm service (slurmctld, slurmdbd and slurmd) using systemctl edit slurmXXX.service
[Service]
Restart=always
RestartSec=5s
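The same drop-in can be written for the three services without opening an editor. The following is a minimal sketch, assuming the standard location /etc/systemd/system/<unit>.d/override.conf, which is where systemctl edit stores its overrides. The helper name and its optional root-prefix argument are hypothetical.

```shell
# Hypothetical helper: write the restart drop-in for the three Slurm services.
# $1 is an optional root prefix: empty on the real system, a scratch
# directory for a dry run.
write_restart_overrides() {
  root="${1:-}"
  for svc in slurmctld slurmdbd slurmd; do
    dir="${root}/etc/systemd/system/${svc}.service.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nRestart=always\nRestartSec=5s\n' > "${dir}/override.conf" || return 1
  done
}

# Usage on the server, as root:
#   write_restart_overrides && systemctl daemon-reload
```

systemctl cat slurmd.service then shows the unit together with the override, which is a quick way to confirm it was picked up.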
-
Setup the LVM: Resize LVM.
lvextend -l +100%FREE /dev/fedora/root
xfs_growfs /dev/fedora/root
# Verify that the filesystem now uses all the available space
lsblk -f
-
Install nvidia-container-toolkit (see nvidia-container-toolkit) and set it up for container use (see cdi-support):
sudo dnf install dkms
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf config-manager --enable nvidia-container-toolkit-experimental
sudo dnf install nvidia-container-toolkit -y
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
# Create a systemd service to run the following line at startup
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo reboot now
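The systemd service mentioned in the comment above is not spelled out in this guide. The following is a minimal sketch of one way to create it, not necessarily the unit used on the Mamba Server. The unit name nvidia-cdi-generate.service, the helper name, the /usr/bin/nvidia-ctk path and the ordering after nvidia-persistenced.service are all assumptions.

```shell
# Hypothetical helper: write a oneshot unit that regenerates the CDI
# specification at boot. $1 is the destination file.
write_cdi_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Generate the NVIDIA CDI specification
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

[Install]
WantedBy=multi-user.target
EOF
}

# Usage on the server, as root (the unit name is an assumption):
#   write_cdi_unit /etc/systemd/system/nvidia-cdi-generate.service
#   systemctl daemon-reload
#   systemctl enable nvidia-cdi-generate.service
```

Check the actual location of the binary with command -v nvidia-ctk before enabling the unit, since ExecStart needs an absolute path.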
-
Verify that the network interface is correctly configured. BE SURE TO USE THE enp68s0 interface.
sudo dnf install speedtest-cli -y
speedtest-cli --secure
# Should print approximately 1000Mb/s
If not, change the ethernet port and be sure the config /etc/NetworkManager/system-connections/enp68s0.nmconnection looks like this
# /etc/NetworkManager/system-connections/enp68s0.nmconnection
[connection]
id=enp68s0
uuid=493e6911-a544-4f84-a708-dd54a2fe1aef
type=ethernet
autoconnect=true
interface-name=enp68s0

[ethernet]
auto-negotiate=true
duplex=full
speed=2500

[ipv4]
method=auto

[ipv6]
addr-gen-mode=eui64
method=auto

[proxy]
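Besides the speed test, the negotiated link speed can be read directly from sysfs after the connection has been reloaded. The following is a minimal sketch. The helper name link_speed and its optional sysfs-root argument are hypothetical.

```shell
# Hypothetical helper: print the negotiated link speed in Mb/s as reported
# by sysfs. $1 is the interface name; $2 is an optional sysfs root that is
# only useful for dry runs.
link_speed() {
  cat "${2:-/sys}/class/net/$1/speed"
}

# Usage on the server, after editing the connection file:
#   sudo nmcli connection reload
#   sudo nmcli connection up enp68s0
#   link_speed enp68s0
```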
-
Install a docker version that supports buildx plugin: Install docker on fedora
sudo useradd -c 'Full name' -m <username> -G docker
sudo passwd <username>
We also recommend setting the following env variable in their .bashrc:
export SQUEUE_FORMAT="%.18i %.9P %.25j %.8u %.2t %.10M %.6D %.20e %b %.8c"
- Cronjob to clean podman cache
Add the following cronjob to sudo crontab -u root -e:
cat /etc/passwd | grep /bin/bash | awk -F: '{ print $1}' | while read user; do echo "Processing user $user..." && sudo -u $user -H bash -c "cd && podman system prune -af"; done
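The command above has no schedule, which a crontab entry needs, and its grep matches /bin/bash anywhere on the line. The following is a minimal sketch of a stricter user filter that compares only the login-shell field. The helper name bash_users and the weekly schedule in the comment are assumptions, not the values used on the Mamba Server.

```shell
# Hypothetical helper: list users whose login shell is exactly /bin/bash.
# $1 is an optional passwd-format file, defaulting to /etc/passwd.
bash_users() {
  awk -F: '$7 == "/bin/bash" { print $1 }' "${1:-/etc/passwd}"
}

# Example crontab entry. The schedule "0 3 * * 0" (Sundays at 03:00) is an
# assumption; the command is the one shown above:
#   0 3 * * 0 cat /etc/passwd | grep /bin/bash | awk -F: '{ print $1}' | while read user; do sudo -u $user -H bash -c "cd && podman system prune -af"; done
```

Running bash_users on the server first shows which accounts the prune would touch.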
As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.
First, on your host machine, write a Dockerfile to run your project inside a container.
Then, build and test that everything works on your machine before testing it on the server.
We recommend putting your data in a directory and symlinking it to the data folder of your project.
We describe here how to add volumes to avoid copying the data into the container.
# Build the image
buildah build --layers -t myproject .
# Run docker image
export CONFIG=<path/to/config> # for example `config/segsdet.yaml`
export CUDA_VISIBLE_DEVICES=0 # or `0,1` for specific GPUs, will be automatically set by SLURM
podman run --gpus all --rm -it --ipc host \
-v .:/app/ \
-v /app/data \
-v ./data/coco/:/app/data/coco \
-v /dev/shm:/dev/shm \
myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
After you have verified that everything works on your machine, copy the code to the server and write a Slurm job script.
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=$NAME
#SBATCH --output=%x-%j.out
cd ~/myproject || exit
buildah build --layers -t myproject .
export CONFIG=<path/to/config> # for example `config/segsdet.yaml`
# Notice there is no -it option
podman run --gpus all --rm --ipc host \
-v .:/app/ \
-v /app/data \
-v ./data/coco/:/app/data/coco \
-v /dev/shm:/dev/shm \
myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
Then, you can queue the job using sbatch job.sh and see the queued jobs using squeue.
For an easier experience, you can use willGuimont/sjm.
After you've verified this works, use the following code to kill the container when the slurm job stops.
# Notice the -d option to detach the process, and no -it option
container_id=$(
podman run --gpus all --rm -d --ipc host \
-v .:/app/ \
-v /app/data \
-v ./data/coco/:/app/data/coco \
-v /dev/shm:/dev/shm \
myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
)
stop_container() {
podman logs $container_id
podman container stop $container_id
}
trap stop_container EXIT
echo "Container ID: $container_id"
podman wait $container_id
You can then run the job using:
sbatch job.sh
And see the running jobs using:
squeue
-
SSH and X11 forwarding with GLX support
First make sure X11 forwarding is enabled server side. Check these two lines in /etc/ssh/sshd_config
X11Forwarding yes
X11DisplayOffset 10
If they are not set, enable them and restart sshd
sudo systemctl restart sshd
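To confirm the setting actually took effect, the effective sshd configuration can be inspected with sshd -T, which also accounts for any included drop-in files. The following is a minimal sketch. The helper name x11_forwarding_enabled is hypothetical.

```shell
# Hypothetical helper: read sshd configuration lines on stdin and succeed if
# an uncommented "X11Forwarding yes" is present. The match is case
# insensitive because `sshd -T` prints keywords in lowercase.
x11_forwarding_enabled() {
  grep -Eiq '^[[:space:]]*X11Forwarding[[:space:]]+yes([[:space:]]|$)'
}

# Usage on the server:
#   sudo sshd -T | x11_forwarding_enabled && echo "X11 forwarding is enabled"
```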
Make sure xauth is available on the server or install it
sudo dnf install xorg-x11-xauth
Next install basic utilities to test GLX and Vulkan capabilities on the server. We'll need them to benchmark the remote connection's performance.
sudo dnf install glx-utils vulkan-tools
If you encounter problems, make sure on the client-side the server is allowed to display.
Make sure the ip of the server is valid. Prefix the address with + to add it to the trusted list and with - to remove it.
xhost +132.203.26.231
Connect to the server from your client using ssh.
Use the options -X or -Y to redirect X11 via the ssh tunnel. The redirection works even though the server is headless, but Xorg must be installed.
The -X option will automatically update the DISPLAY env variable. Note: the ip of the server is subject to change, make sure to use the latest one.
ssh -X <username>@<server-ip>
Test that X redirection is working by executing a simple X graphical application.
$ xterm
Test GLX support with glxinfo
glxinfo
Test what GLX implementation is used by default
$ glxinfo | grep -i vendor
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa (0xffffffff)
OpenGL vendor string: Mesa
Check both NVidia and Mesa implementations work for GLX passthrough.
__GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
__GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
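To compare the two implementations at a glance, the OpenGL vendor line can be pulled out of the glxinfo output. The following is a minimal sketch, assuming the "OpenGL vendor string:" line format shown in the sample output above. The helper name opengl_vendor is hypothetical.

```shell
# Hypothetical helper: read glxinfo output on stdin and print only the
# OpenGL vendor name.
opengl_vendor() {
  sed -n 's/^OpenGL vendor string: //p'
}

# Usage on the server, inside an ssh -X session:
#   __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | opengl_vendor
#   __GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | opengl_vendor
```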
Choose the best implementation between NVidia and Mesa.
On NVidia GPUs, NVidia's implementation gives the best results.
export __GLX_VENDOR_LIBRARY_NAME=nvidia
glxgears
For Vulkan applications, the process is similar
vulkaninfo
VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube