
Compute


Chameleon Cloud

The Chameleon Cloud platform provides useful features for systems research, such as FPGAs, networking (including layers 2 and 3), OpenStack, Experiment Precis, hypervisors (OpenStack, KVM), and fine-grained measurements down to low-level bare-metal performance events (e.g., energy and power consumption on bare metal). Check this link for details.

Login

Chameleon Cloud

Navigate to the closest site for your project (project-site page)

Select it from the pull-down menu 'Experiment' at the top of the Chameleon Cloud home page. Currently, there are three sites available: CHI@TACC (UT Austin), CHI@UC (University of Chicago), and KVM (virtualized cloud). Check here for more information.

Open the Lease page

It is under the "Reservations" tab on the left side of the project-site page

Check resource availability

Click Host Calendar to check the availability of the hardware resources you need. (Note: in general, a hardware lease term cannot exceed 7 days.)

How to create a VM

Detailed illustrative instructions can be found at the link. The important notes for each step are summarized below.

Request resources

Back on the Leases page, click "Create Lease" to request hardware. You will need to enter a lease name, start and end times, and a node (machine) type.

Configuration and Launching

When the hardware is ready, it needs to be configured before launching. Launching may take several minutes. A few important configurations are:

  • Image source: operating system + drivers + packages (you can use public images or create one by yourself.)
  • Create a pair of SSH keys and save the private key for accessing the instance later (a sketch follows this list)
  • Open port 22 to allow SSH connections
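
If you prefer to generate the key pair locally and then import the public key through the Chameleon dashboard, a minimal sketch (the key name chameleon_key is an arbitrary choice):

    # Generate a 4096-bit RSA key pair; the file name chameleon_key is arbitrary
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/chameleon_key

    # Restrict permissions on the private key
    chmod 600 ~/.ssh/chameleon_key

    # Print the public key so it can be pasted into the dashboard's key-pair page
    cat ~/.ssh/chameleon_key.pub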

Public IP

Go to the "Floating IPs" page under the "Network" tab on the left to create a Floating IP and associate it with the leased machine. The floating IP is a public IP that can be used for accessing the leased machine.

Access to the leased machine

  • Change permission of the private key: chmod 600 privateKey.pem
  • Add the key to your SSH identity: ssh-add privateKey.pem
  • Log in: ssh cc@<the floating IP>. Log in as cc, not as your Chameleon Cloud username (an optional ssh config sketch follows this list).
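
As an optional convenience, you can add an entry to ~/.ssh/config so the username and key are picked up automatically (the host alias chameleon-node is an arbitrary name; substitute your floating IP and key path):

    Host chameleon-node
        HostName <the floating IP>
        User cc
        IdentityFile ~/.ssh/privateKey.pem

After that, ssh chameleon-node is enough to connect.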

Cancer Genetics Lab

Below are two powerful desktops in the Cancer Genetics Lab. Access to them is provided by Professor Phillip Buckhaults, the director of the Cancer Genetics Lab. To access these machines, you will need to use the VPN, since they are located inside the campus network. The credential for both machines is the same, as shown below. A directory called "AISys" has been created inside the home directory. Please use AISys as the root directory for holding your projects.

  1. Macintosh Machine 1 : IP 172.21.129.245
    • CPU: 12 physical cores/24 logical cores
    • RAM: 64 GB
    • GPU: ATI Radeon HD 5770, VRAM 1GB
    • OS: MacOS 12
  2. Macintosh Machine 2 : IP 172.21.129.246
    • CPU: 28 physical cores/56 logical cores
    • RAM: 192 GB
    • GPU: AMD Radeon Pro 580X, VRAM 8GB
    • OS: MacOS 12

Credential:

  • Username: cancergeneticslab
  • Password: allons-y

To access the above machines, you first need to be inside the USC campus network. You can use the Cisco VPN to enter the campus network.
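
Once connected to the VPN, the machines can be reached over SSH with the credential above (assuming Remote Login is enabled on the Macs):

    # Connect to Macintosh Machine 1 (use 172.21.129.246 for Machine 2)
    ssh cancergeneticslab@172.21.129.245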

GPU Server

Access to GPU server

  • Step 1: log in to the SSH proxy server using the provided account. 54.197.93.250 is the IP of the proxy server and may change. Please ensure that the IP and the private key are up to date before connecting.

    ssh -i <private_key_to_proxy_server> [email protected]
    USC users: you can connect to the GPU server directly over the USC VPN: ssh [email protected]

  • Step 2: Use the existing reverse ssh tunnel to connect to the GPU server using the user account, aisys

    ssh -p 22122 aisys@localhost
    It will ask for a password; use lab2212

  • Step 3: Switch to your user account in the GPU server

    su - <your username>
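
If you connect often, a ProxyJump entry in ~/.ssh/config on your local machine can collapse the two hops into a single command; a minimal sketch (the aliases and the proxy username are assumptions, so substitute the current proxy account, key, and IP):

    Host aisys-proxy
        HostName 54.197.93.250
        User <proxy username>
        IdentityFile ~/.ssh/<private_key_to_proxy_server>

    Host gpu-server
        HostName localhost
        Port 22122
        User aisys
        ProxyJump aisys-proxy

With this in place, ssh gpu-server prompts only for the aisys password.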

To check which GPUs are currently used

nvidia-smi

  • Check the Memory-Usage column to see which GPUs are busy (a filtered query sketch is shown below).
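
nvidia-smi can also print just the fields of interest; for example:

    # Show per-GPU memory use and utilization in CSV form
    nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv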

To reserve GPUs resources

Please update the Google sheet, Resource_Reservation_on_GPU_Server, and send a message in the Slack channel, gpu-reservation, if your reservation will last more than one week or will use all GPUs.

Mounting the 3.5TB storage (only use if not mounted)

To mount the 3.5TB storage, run from the root directory: sudo mount /dev/sda1 /home/aisysStorage
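
Before mounting, you can check whether the volume is already mounted (the device name /dev/sda1 is taken from the command above):

    # The MOUNTPOINT column shows where (if anywhere) the device is mounted
    lsblk /dev/sda1

    # Mount it only if it is not already mounted
    sudo mount /dev/sda1 /home/aisysStorage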

Data Transfer

Inside /nfs/general/ in the GPU server, there is a folder under your username. This folder can be used to automatically transfer your files between the GPU workstation and the data-archiving server, which can be accessed through a web portal (https://pjamshid-nas.us6.quickconnect.to/).

When transferring files into the data-archiving server, copy/move your files to your folder inside /nfs/general/, say the folder, /nfs/general/suj. You will observe that the ownership of the files inside /nfs/general/suj will change to “1028 users” where 1028 is the user id of ‘suj’ in the data-archiving server.

When transferring files to the GPU server, if you can access the campus network, you can use scp with your GPU-server credentials. If you cannot access the campus network, log in to the web portal of the data-archiving server, open 'File Station', and then drag files into your folder under the shared directory 'DataTransfer'.
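
A minimal scp sketch (the username suj, the local file name, and the GPU server address are placeholders; use your own account and the server's current address):

    # Copy a local archive into your transfer folder on the GPU server
    scp ./results.tar.gz suj@<GPU server address>:/nfs/general/suj/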

For Admin

Allow a user to run specific sudo commands:

First run

  • sudo visudo

Next, add the line below:

  • [user-name] ALL=(ALL) NOPASSWD: /path/to/command, /path/to/command2, /path/to/command3

For example, to allow a user john to run the mount command with sudo:

  • john ALL=(ALL) NOPASSWD: /usr/bin/mount

Giving a user group the same access:

  • %[group-name] ALL=(ALL) NOPASSWD: /usr/bin/mount

To find the command path:

  • which [command]
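
After editing, it is worth validating the sudoers file and confirming what the user may now run; for example:

    # Check the sudoers file for syntax errors
    sudo visudo -c

    # List the sudo rules that apply to the user john
    sudo -l -U john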

RCI

Apply for an account

Fill out the application form at the link to request an account.

Create an environment for experiments

The research computing center provides a mechanism for users to create their own experimental environments by loading pre-installed modules. You may ask the research computing center to install applications/tools that are not available but need to be installed system-wide. Basic subcommands of the module command are listed below.

  • module avail # display all available modules
  • module load/unload <modulefile> # load/unload a module
  • module list # list currently loaded modules
  • module --help

If all you need is some additional Python packages, you can load an anaconda module, create a Python virtual environment, and install the necessary packages in that environment. When you then submit a job from the virtual environment, the installed packages will be carried over. An example of loading anaconda and installing Python packages is given below.

  • module load python3/anaconda/5.2.0
  • conda create --name <environment_name>
  • conda activate <environment_name>
  • pip install <A Python package>
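
To make a job use the environment, load the same anaconda module and activate the environment inside the job script; a minimal sketch (the environment name myenv and the script name are assumptions):

    #!/bin/sh
    #SBATCH --job-name=conda_job
    #SBATCH -n 1
    #SBATCH -N 1
    #SBATCH --output job_%j.out
    #SBATCH --error job_%j.err

    # Load the same anaconda module used to create the environment
    module load python3/anaconda/5.2.0

    # Activate the virtual environment so its packages are visible to the job
    source activate myenv

    python yourScript.py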

CPU and GPU job scripts

An example of a CPU job script

#!/bin/sh
#SBATCH --job-name=test
#SBATCH -n 28          # This is the number of compute cores, a lower number will queue faster (28 cores per machine)
#SBATCH -N 1           # this is the number of compute nodes requested, usually 1 unless MPI job
#SBATCH --output job_%j.out
#SBATCH --error job_%j.err

## Load any necessary modules
module load <modulename>

## run your experiment
yourExecutableScript.bash <argument1> ... <argumentN>
  • The standard output of the job: job_<jobID>.out
  • The standard error of the job: job_<jobID>.err

An example of a job script that requests a GPU resource

#!/usr/bin/bash
#SBATCH --job-name=GPUTest
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --output job%j.out
#SBATCH --error job%j.err
#SBATCH -p gpu
#SBATCH --gres=gpu:2

module load cuda/9.2 # match Tensorflow 1.12

yourExecutableScript.bash <argument1> ... <argumentN>

Specify a working partition

If you want to deploy your task to a specific partition, say jamshidi-lab, use #SBATCH -p jamshidi-lab. A sample job script that will be deployed to the jamshidi-lab partition looks like:

#!/usr/bin/bash
#SBATCH --job-name=specify_working_node
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --output job%j.out
#SBATCH --error job%j.err
#SBATCH -p jamshidi-lab

module load <pre-installed module>

yourExecutableScript.bash <argument1> ... <argumentN>

Basic Slurm commands
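
The most commonly used Slurm commands are summarized below (job.sh and <jobID> are placeholders):

    sbatch job.sh              # submit a job script
    squeue -u $USER            # show your pending and running jobs
    scancel <jobID>            # cancel a job
    sinfo                      # show partitions and node states
    scontrol show job <jobID>  # show detailed information about a job
    sacct -j <jobID>           # show accounting info for a completed job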

Globus for transferring data at high speeds

Research Computing provides a new tool, Globus, to assist you in transferring data at very high speeds to storage endpoints. The Globus transfer tool allows you to use a much faster parallel transfer method to move data to and from resources like Hyperion and external universities. Other internal archive and general-purpose storage endpoints will be available in the future. For most users, a VPN connection is not required. The basic instructions are shown below. For any further questions, please contact the RCI center at [email protected].

One caveat: AT&T broadband customers need to connect to Hyperion using Globus via the VPN.

  • Create a globus account here: https://www.globusid.org/create
  • Download the globus transfer tool here: https://www.globus.org/globus-connect-personal
  • Once you log into the Globus web portal, search for uofsc#Hyperion in the collection field of the File Manager; you will be required to log in with your Hyperion account and approve a DUO push. In the other (blank) pane, search for the name of your personal endpoint. Browse to the desired locations in the path fields, then select a file or folder and click Start in the Hyperion pane to begin the transfer. The transfer job will be submitted and will run in the background, and you will receive an email at your associated Globus email address when it completes. Note that you DO NOT need to be connected to the VPN to log into Globus or transfer files. A command-line sketch follows this list.
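
If you prefer the command line, the optional globus-cli tool can drive the same transfers; a rough sketch, assuming you look up the endpoint UUIDs for uofsc#Hyperion and your personal endpoint first:

    # Install and authenticate the Globus CLI
    pip install globus-cli
    globus login

    # Find the Hyperion collection and note its endpoint UUID
    globus endpoint search "uofsc#Hyperion"

    # Submit a background transfer from your personal endpoint to Hyperion
    globus transfer --recursive --label "data to Hyperion" \
        <personal_endpoint_UUID>:/path/to/data <hyperion_endpoint_UUID>:/destination/path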

Useful Links

Devices

NVidia TK1, TX1, TX2 and Xavier AGX

  • Name: tk1.cse.sc.edu, Address: 10.173.131.119

  • Name: tx1-1.cse.sc.edu, Address: 10.173.131.120

  • Name: tx1-2.cse.sc.edu, Address: 10.173.131.122

  • Name: tx2-1.cse.sc.edu, Address: 10.173.131.121

  • Name: xavier1.cse.sc.edu, Address: 10.173.131.123

  • Name: nano1.cse.sc.edu, Address: 10.173.131.124

  • Name: nano2.cse.sc.edu, Address: 10.173.131.125

  • Name: nano3.cse.sc.edu, Address: 10.173.131.126

  • Name: nano4.cse.sc.edu, Address: 10.173.131.127

  • Name: coraldev1.cse.sc.edu, Address: 10.173.131.128

  • Name: coraldev2.cse.sc.edu, Address: 10.173.131.129

These are resources for running experiments; more will be added later. Please let everybody in our lab know (via the mailing list) when you are running experiments, so others do not run things at the same time.

Un-boxing and Bringing up the Desktop GUI:

Once you open the Jetson TX1/TX2 box please perform the following steps to load the GUI.

  1. Connect a monitor with Jetson TX1/TX2 using a HDMI cable.
  2. Use the USB 3.0 ports on Jetson TX1/TX2 board to connect your keyboard/mouse/pointing devices.
  3. Use an ethernet cable to connect Jetson TX1/TX2 to the network.
  4. Power on the Jetson TX1/TX2 board using the supplied AC Adapter and press the Power button.

This will bring up a command terminal and prompt for password.
-- Password for user nvidia: nvidia
-- Password for user ubuntu: ubuntu

Then execute the following commands on your terminal.

  1. command: cd NVIDIA-INSTALLER

  2. command: sudo ./installer.sh
    This will install some dependencies to load the GUI and, once finished, will ask the system to reboot. Use:

  3. command: sudo reboot
    Now you will have a desktop GUI, which will make navigation easier.

Jetpack 3.3 Installation:

Currently, all the Jetson TX1/TX2 boards use JetPack 3.3. To configure a Jetson TX1/TX2, you need a host OS in addition to the Jetson itself. The installation is performed remotely from the host OS because the board cannot flash and configure its own system.

Please make sure the host OS is connected to the same network as the Jetson TX1/TX2.

The instructions for flashing the OS and installing the necessary software are listed below.

  1. Download Jetpack 3.3 Installer from https://developer.nvidia.com/embedded/downloads#?search=jetpack%203.3 (You might need to create your own nvidia developer account to download the binary)

  2. Extract the installer and copy it to a new directory.
    -- command: mkdir ~/TX1 (or ~/TX2, whichever you are using)
    -- command: cp JetPack-L4T-3.3-linux-x64_b39.run ~/TX1 (or ~/TX2)

  3. Change the permission to make it executable.
    -- command: chmod +x JetPack-L4T-3.3-linux-x64_b39.run (run inside ~/TX1 or ~/TX2)

  4. Install ssh-askpass. This is very important: once the flashing is done, JetPack will ask you for the remote system's (Jetson TX1/TX2) IP, username, and password. Without this step it will get stuck and will not install correctly.
    -- command: sudo apt-get install ssh-askpass-gnome ssh-askpass

  5. Run the installer
    -- command: ./JetPack-L4T-3.3-linux-x64_b39.run (Do not use sudo)

This will start installing JetPack on your host and show the progress in the NVIDIA Component Manager. In the Component Manager, select the Full installation (flash the OS and install other necessary software, e.g., CUDA, cuDNN, OpenCV, TensorRT) and choose to resolve all dependencies. It will also prompt you to accept all the software license agreements; make sure you accept them (unless you have discovered patches to choose rebellion).

Once the JetPack installation is complete on your host OS, it will show you some additional steps to perform, since it requires the Jetson TX1/TX2 to be in force recovery mode. Please perform the following steps:

  1. Disconnect the ac adapter from Jetson TX1/TX2.
  2. Connect the developer cable between Jetson and Host machine.
  3. Power on your Jetson TX1/TX2 using the power button after connecting the power cable.
  4. Keep pressing the Force Recovery button and, while holding it, press and release the Reset button.
  5. Wait for 2 seconds after releasing the Reset button and then release the Force Recovery button.

To confirm that the Jetson is ready to be flashed in force recovery mode, open a terminal on your host OS and use
-- command: lsusb

You should see a list of USB devices, and one of them should be NVIDIA Corp, which indicates the Jetson is ready to be configured. Then press Enter in the terminal on your host OS from which force recovery mode was initiated. This will start flashing the OS and installing JetPack 3.3, and will create the filesystems on your Jetson TX1/TX2.

Currently, there is an issue with JetPack 3.3: it only flashes the OS but does not install all the necessary software. To install the software, run the JetPack run file again; this time, rather than selecting the full installation, select Custom, right-click on the target system, and select Install. Before doing so, make sure you unplug the developer cable. The installer will ask you for the Jetson TX1/TX2 IP, username, and password. To get the IP from the Jetson TX1/TX2, use:
-- command: ifconfig
Use the following command to make sure your Jetson TX1/TX2 is reachable from your host.
-- command: ping jetson_ip_address
This time it will install all the necessary software. Once the software is installed, you may be interested in using TensorFlow/Caffe/PyTorch, etc.

Use the following
-- command: sudo apt-get install python-setuptools (for python 2.7)
-- command: sudo apt-get install python-pip

Tensorflow Installation:

For Tensorflow:
-- command: sudo pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp33 tensorflow-gpu

You should be able to open a Python interpreter to confirm TensorFlow is running; a quick check is shown below.
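
A minimal sanity check from the shell (TensorFlow 1.x API):

    # Print the TensorFlow version and whether a GPU is visible
    python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"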

Tensorflow & Keras install for jetson Xavier & Nano devices

-- command: sudo apt-get install python3-venv

-- command: python3 -m venv your_env

-- command: source your_env/bin/activate

-- command: pip3 install Cython pandas

-- command: sudo apt-get install libhdf5-serial-dev libhdf5-dev

-- command: sudo apt-get install libblas3 liblapack3 liblapack-dev libblas-dev

-- command: sudo apt-get install gfortran

-- command: wget https://developer.download.nvidia.com/compute/redist/jp/v42/tensorflow-gpu/tensorflow_gpu-1.13.1+nv19.3-cp36-cp36m-linux_aarch64.whl

-- command: pip3 install tensorflow_gpu-1.13.1+nv19.3-cp36-cp36m-linux_aarch64.whl

-- command: pip3 install keras
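
After the install finishes, a quick check inside the virtual environment confirms that TensorFlow sees the GPU and that Keras imports cleanly (TensorFlow 1.13 API):

    # Run inside the activated your_env virtual environment
    python3 -c "import tensorflow as tf, keras; print(tf.__version__, keras.__version__); print(tf.test.is_gpu_available())"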

How to set up a reverse ssh tunnel

Step 1: Set up the GPU server

On the GPU server (running Ubuntu), create a sudo user, aisys, along with a pair of private/public keys associated with the user aisys. That is, inside the campus network, one can access the GPU server by using the private key for the user aisys.

Step 2: Create a new VPS in a cloud service provider, e.g. AWS

  • Create a VPS (running Ubuntu) with a public IP, e.g., 54.197.93.250
  • Create a sudo user, aisys, with a password, in the VPS
  • Enable ssh service in the VPS
    sudo apt update
    sudo apt install openssh-server
    sudo systemctl status ssh
    
  • Create a pair of ssh keys for the user aisys in the VPS
  • Enable access to the VPS using the aisys private key by appending the public key to /home/aisys/.ssh/authorized_keys on the VPS (a sketch follows this list)
  • Copy the private key to the GPU server and distribute it to users who need to access the VPS
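
A minimal sketch of the key setup, assuming the key pair is generated as the aisys user on the VPS itself and named id_rsa_vps to match the systemd template below:

    # On the VPS, as aisys: generate the key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_vps

    # Allow key-based logins to the VPS by appending the public key to authorized_keys
    cat ~/.ssh/id_rsa_vps.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys

    # On the GPU server, as aisys: fetch the private key (password authentication)
    scp aisys@54.197.93.250:.ssh/id_rsa_vps /home/aisys/.ssh/id_rsa_vps
    chmod 600 /home/aisys/.ssh/id_rsa_vps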

Step 3: Set up the reverse ssh tunnel

  • Initiate the reverse ssh tunnel from the GPU server by setting up a systemd service so that the tunnel is automatically restarted if it goes down.
  • To set up the systemd service, aisys_tunnel.service, a template is shown below. Whenever a new VPS is used, replace, in /etc/systemd/system/aisys_tunnel.service, the private key for accessing the new VPS (the id_rsa_vps file in the template) and the public IP of the VPS (54.197.93.250 in the template) with the new values. After updating aisys_tunnel.service, disable the current service and then enable the updated one (example commands follow the template).
 [Unit]
 Description=Maintain Tunnel
 After=network.target
 
 [Service]
 User=aisys
 ExecStart=/usr/bin/ssh -i /home/aisys/.ssh/id_rsa_vps -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes -gnNT -R 22122:localhost:22 aisys@54.197.93.250
 RestartSec=15
 Restart=always
 KillMode=mixed
 
 [Install]
 WantedBy=multi-user.target
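
To apply the template (or an updated copy of it), reload systemd and re-enable the service; for example:

    # Reload unit files after editing /etc/systemd/system/aisys_tunnel.service
    sudo systemctl daemon-reload

    # Disable any previously enabled copy, then enable and start the updated service
    sudo systemctl disable aisys_tunnel.service
    sudo systemctl enable aisys_tunnel.service
    sudo systemctl restart aisys_tunnel.service

    # Verify that the tunnel service is running
    systemctl status aisys_tunnel.service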

Step 4: Connect to the reverse ssh tunnel without explicitly specifying the private key

  • Copy the private key used to access the GPU server to the VPS at ~/.ssh/id_rsa. Using the default path (default key name, id_rsa) lets ssh pick the key up automatically when connecting from the VPS through the reverse tunnel.

Step 5: Testing

  • On a local machine, open a terminal and connect to the VPS:

    ssh -i <private key to access the VPS> aisys@<IP of the VPS>

  • On the VPS, connect to the reverse ssh tunnel:

    ssh -p 22122 aisys@localhost

  • On the GPU server, switch to your own user instead of using the login user, aisys

    su - <your user account on the GPU server>

Reference:

  1. README-setup-tunnel-as-systemd-service.md
  2. Self healing reverse SSH setup with system
  3. How to Set Up SSH Keys on Ubuntu 20.04
  4. Set up SSH public key authentication to connect to a remote system
  5. SSH reverse tunnel - disable request for password when using key authentication
  6. How to Fix SSH Failed Permission Denied