
Compute


Chameleon Cloud

The Chameleon Cloud platform provides useful features for systems research, such as FPGAs, networking (including layers 2 and 3), OpenStack, Experiment Precis, hypervisors (OpenStack, KVM), and fine-grained measurements down to low-level bare-metal performance events (e.g., energy and power consumption on bare metal). Check this link for details.

Login

Chameleon Cloud

Navigate to the closest site for your project (project-site page)

Select it from the pull-down menu 'Experiment' at the top of the Chameleon Cloud home page. Currently, there are three sites available: CHI@TACC (UT Austin), CHI@UC (University of Chicago), and KVM (virtualized cloud). Check here for more information.

Open the Lease page

It is under the "Reservations" tab on the left side of the project-site page

Check resource availability

Click Host Calendar to check the availability of the hardware resources you need. (Note: in general, a hardware lease term cannot exceed 7 days.)

How to create a VM

Detailed illustrative instructions can be found at the link. The important notes for each step are summarized below.

Request resources

Back on the Leases page, click "Create Lease" to request hardware. You will need to enter a lease name, start and end times, and a node (machine) type.

Configuration and Launching

When the hardware is ready, it needs to be configured before launching. Launching may take several minutes. A few important configurations are:

  • Image source: operating system + drivers + packages (you can use public images or create one by yourself.)
  • Create a pair of SSH keys and save the private key for accessing the instance later (a sketch follows this list)
  • Open port 22 to allow SSH connections
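
If you prefer to generate the key pair locally and then import the public key through the Chameleon dashboard, a minimal sketch (the key name chameleon_key is an arbitrary choice):

    # Generate a 4096-bit RSA key pair; the file name chameleon_key is arbitrary
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/chameleon_key

    # Restrict permissions on the private key
    chmod 600 ~/.ssh/chameleon_key

    # Print the public key so it can be pasted into the dashboard's key-pair page
    cat ~/.ssh/chameleon_key.pub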

Public IP

Go to the "Floating IPs" page under the "Network" tab on the left to create a Floating IP and associate it with the leased machine. The floating IP is a public IP that can be used for accessing the leased machine.

Access to the leased machine

  • Change permission of the private key: chmod 600 privateKey.pem
  • Add the key to your SSH identity: ssh-add privateKey.pem
  • Log in: ssh cc@<the floating IP>. Log in as cc, not as your Chameleon Cloud username (an optional ssh config sketch follows this list).
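
As an optional convenience, you can add an entry to ~/.ssh/config so the username and key are picked up automatically (the host alias chameleon-node is an arbitrary name; substitute your floating IP and key path):

    Host chameleon-node
        HostName <the floating IP>
        User cc
        IdentityFile ~/.ssh/privateKey.pem

After that, ssh chameleon-node is enough to connect.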

Cancer Genetics Lab

Below are two powerful desktops in the Cancer Genetics Lab. Access to them is provided by Professor Phillip Buckhaults, the director of the Cancer Genetics Lab. To access these machines, you will need to use the VPN, since they are located inside the campus network. The credential for both machines is the same, as shown below. A directory called "AISys" has been created inside the home directory. Please use AISys as the root directory for holding your projects.

  1. Macintosh Machine 1 : IP 172.21.129.245
    • CPU: 12 physical cores/24 logical cores
    • RAM: 64 GB
    • GPU: ATI Radeon HD 5770, VRAM 1GB
    • OS: MacOS 12
  2. Macintosh Machine 2 : IP 172.21.129.246
    • CPU: 28 physical cores/56 logical cores
    • RAM: 192 GB
    • GPU: AMD Radeon Pro 580X, VRAM 8GB
    • OS: MacOS 12

Credential:

  • Username: cancergeneticslab
  • Password: allons-y

To access the above machines, you first need to be inside the USC campus network. You can use the Cisco VPN to enter the campus network.
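
Once connected to the VPN, the machines can be reached over SSH with the credential above (assuming Remote Login is enabled on the Macs):

    # Connect to Macintosh Machine 1 (use 172.21.129.246 for Machine 2)
    ssh cancergeneticslab@172.21.129.245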

GPU Server

Access to GPU server

  • Step 1: log in to the SSH proxy server using the provided account. 54.197.93.250 is the IP of the proxy server and may change. Please ensure that the IP and the private key are up to date before connecting.

    ssh -i <private_key_to_proxy_server> [email protected]
    USC users: you can connect to the GPU server directly over the USC VPN: ssh [email protected]

  • Step 2: Use the existing reverse ssh tunnel to connect to the GPU server using the user account, aisys

    ssh -p 22122 aisys@localhost
    It will ask for a password; use lab2212

  • Step 3: Switch to your user account in the GPU server

    su - <your username>
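
If you connect often, a ProxyJump entry in ~/.ssh/config on your local machine can collapse the two hops into a single command; a minimal sketch (the aliases and the proxy username are assumptions, so substitute the current proxy account, key, and IP):

    Host aisys-proxy
        HostName 54.197.93.250
        User <proxy username>
        IdentityFile ~/.ssh/<private_key_to_proxy_server>

    Host gpu-server
        HostName localhost
        Port 22122
        User aisys
        ProxyJump aisys-proxy

With this in place, ssh gpu-server prompts only for the aisys password.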

To check which GPUs are currently used

nvidia-smi

  • Check the Memory-Usage column to see which GPUs are busy (a filtered query sketch is shown below).
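
nvidia-smi can also print just the fields of interest; for example:

    # Show per-GPU memory use and utilization in CSV form
    nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv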

To reserve GPUs resources

Please update the Google sheet, Resource_Reservation_on_GPU_Server, and send a message in the Slack channel, gpu-reservation, if your reservation will last more than one week or will use all GPUs.

Mounting the 3.5TB storage (only use if not mounted)

To mount the 3.5TB storage, run from the root directory: sudo mount /dev/sda1 /home/aisysStorage
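
Before mounting, you can check whether the volume is already mounted (the device name /dev/sda1 is taken from the command above):

    # The MOUNTPOINT column shows where (if anywhere) the device is mounted
    lsblk /dev/sda1

    # Mount it only if it is not already mounted
    sudo mount /dev/sda1 /home/aisysStorage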

Data Transfer

Inside /nfs/general/ in the GPU server, there is a folder under your username. This folder can be used to automatically transfer your files between the GPU workstation and the data-archiving server, which can be accessed through a web portal (https://pjamshid-nas.us6.quickconnect.to/).

When transferring files into the data-archiving server, copy/move your files to your folder inside /nfs/general/, say the folder, /nfs/general/suj. You will observe that the ownership of the files inside /nfs/general/suj will change to “1028 users” where 1028 is the user id of ‘suj’ in the data-archiving server.

When transferring files to the GPU server, if you can access the campus network, you can use scp with your GPU-server credentials. If you cannot access the campus network, log in to the web portal of the data-archiving server, open 'File Station', and then drag files into your folder under the shared directory 'DataTransfer'.
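
A minimal scp sketch (the username suj, the local file name, and the GPU server address are placeholders; use your own account and the server's current address):

    # Copy a local archive into your transfer folder on the GPU server
    scp ./results.tar.gz suj@<GPU server address>:/nfs/general/suj/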

For Admin

Allow a user to run specific sudo commands:

First run

  • sudo visudo

Next, add the line below:

  • [user-name] ALL=(ALL) NOPASSWD: /path/to/command, /path/to/command2, /path/to/command3

For example, to allow a user john to run the mount command with sudo:

  • john ALL=(ALL) NOPASSWD: /usr/bin/mount

Giving a user group the same access:

  • %[group-name] ALL=(ALL) NOPASSWD: /usr/bin/mount

To find the command path:

  • which [command]
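
After editing, it is worth validating the sudoers file and confirming what the user may now run; for example:

    # Check the sudoers file for syntax errors
    sudo visudo -c

    # List the sudo rules that apply to the user john
    sudo -l -U john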

RCI

Apply for an account

Fill out the application form at the link to request an account.

Create an environment for experiments

The research computing center provides a mechanism for users to create their own experimental environments by loading pre-installed modules. You may ask the research computing center to install applications/tools that are not available but need to be installed system-wide. Basic subcommands of the module command are listed below.

  • module avail # display all available modules
  • module load/unload <modulefile> # load/unload a module
  • module list # list currently loaded modules
  • module --help

If all you need is some additional Python packages, you can load an anaconda module, create a Python virtual environment, and install the necessary packages in that environment. When you then submit a job from the virtual environment, the installed packages will be carried over. An example of loading anaconda and installing Python packages is given below.

  • module load python3/anaconda/5.2.0
  • conda create --name <environment_name>
  • conda activate <environment_name>
  • pip install <A Python package>
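
To make a job use the environment, load the same anaconda module and activate the environment inside the job script; a minimal sketch (the environment name myenv and the script name are assumptions):

    #!/bin/sh
    #SBATCH --job-name=conda_job
    #SBATCH -n 1
    #SBATCH -N 1
    #SBATCH --output job_%j.out
    #SBATCH --error job_%j.err

    # Load the same anaconda module used to create the environment
    module load python3/anaconda/5.2.0

    # Activate the virtual environment so its packages are visible to the job
    source activate myenv

    python yourScript.py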

CPU and GPU job scripts

An example of a CPU job script

#!/bin/sh
#SBATCH --job-name=test
#SBATCH -n 28          # This is the number of compute cores, a lower number will queue faster (28 cores per machine)
#SBATCH -N 1           # this is the number of compute nodes requested, usually 1 unless MPI job
#SBATCH --output job_%j.out
#SBATCH --error job_%j.err

## Load any necessary modules
module load <modulename>

## run your experiment
yourExecutableScript.bash <argument1> ... <argumentN>
  • The standard output of the job: job_<jobID>.out
  • The standard error of the job: job_<jobID>.err

An example of a job script that requests a GPU resource

#!/usr/bin/bash
#SBATCH --job-name=GPUTest
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --output job%j.out
#SBATCH --error job%j.err
#SBATCH -p gpu
#SBATCH --gres=gpu:2

module load cuda/9.2 # match Tensorflow 1.12

yourExecutableScript.bash <argument1> ... <argumentN>

Specify a working partition

If you want to deploy your task to a specific partition, say jamshidi-lab, use #SBATCH -p jamshidi-lab. A sample job script that will be deployed to the jamshidi-lab partition looks like:

#!/usr/bin/bash
#SBATCH --job-name=specify_working_node
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --output job%j.out
#SBATCH --error job%j.err
#SBATCH -p jamshidi-lab

module load <pre-installed module>

yourExecutableScript.bash <argument1> ... <argumentN>

Basic Slurm commands
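
The most commonly used Slurm commands are summarized below (job.sh and <jobID> are placeholders):

    sbatch job.sh              # submit a job script
    squeue -u $USER            # show your pending and running jobs
    scancel <jobID>            # cancel a job
    sinfo                      # show partitions and node states
    scontrol show job <jobID>  # show detailed information about a job
    sacct -j <jobID>           # show accounting info for a completed job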

Globus for transferring data at high speeds

Research Computing provides a new tool, Globus, to assist you in transferring data at very high speeds to storage endpoints. The Globus transfer tool allows you to use a much faster parallel transfer method to move data to and from resources like Hyperion and external universities. Other internal archive and general-purpose storage endpoints will be available in the future. For most users, a VPN connection is not required. The basic instructions are shown below. For any further questions, please contact the RCI center at [email protected].

One caveat: AT&T broadband customers need to connect to Hyperion using Globus via the VPN.

  • Create a globus account here: https://www.globusid.org/create
  • Download the globus transfer tool here: https://www.globus.org/globus-connect-personal
  • Once you log into the Globus web portal, search for uofsc#Hyperion in the collection field of the File Manager; you will be required to log in with your Hyperion account and approve a DUO push. In the other (blank) pane, search for the name of your personal endpoint. Browse to the desired locations in the path fields, then select a file or folder and click Start in the Hyperion pane to begin the transfer. The transfer job will be submitted and will run in the background, and you will receive an email at your associated Globus email address when it completes. Note that you DO NOT need to be connected to the VPN to log into Globus or transfer files. A command-line sketch follows this list.
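
If you prefer the command line, the optional globus-cli tool can drive the same transfers; a rough sketch, assuming you look up the endpoint UUIDs for uofsc#Hyperion and your personal endpoint first:

    # Install and authenticate the Globus CLI
    pip install globus-cli
    globus login

    # Find the Hyperion collection and note its endpoint UUID
    globus endpoint search "uofsc#Hyperion"

    # Submit a background transfer from your personal endpoint to Hyperion
    globus transfer --recursive --label "data to Hyperion" \
        <personal_endpoint_UUID>:/path/to/data <hyperion_endpoint_UUID>:/destination/path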

Useful Links

Devices

NVidia TK1, TX1, TX2 and Xavier AGX

  • Name: tk1.cse.sc.edu, Address: 10.173.131.119

  • Name: tx1-1.cse.sc.edu, Address: 10.173.131.120

  • Name: tx1-2.cse.sc.edu, Address: 10.173.131.122

  • Name: tx2-1.cse.sc.edu, Address: 10.173.131.121

  • Name: xavier1.cse.sc.edu, Address: 10.173.131.123

  • Name: nano1.cse.sc.edu, Address: 10.173.131.124

  • Name: nano2.cse.sc.edu, Address: 10.173.131.125

  • Name: nano3.cse.sc.edu, Address: 10.173.131.126

  • Name: nano4.cse.sc.edu, Address: 10.173.131.127

  • Name: coraldev1.cse.sc.edu, Address: 10.173.131.128

  • Name: coraldev2.cse.sc.edu, Address: 10.173.131.129

These are resources for running experiments; more will be added later. Please let everybody in our lab know (via the mailing list) when you are running experiments, so others do not run things at the same time.

Un-boxing and Bringing up the Desktop GUI:

Once you open the Jetson TX1/TX2 box please perform the following steps to load the GUI.

  1. Connect a monitor with Jetson TX1/TX2 using a HDMI cable.
  2. Use the USB 3.0 ports on Jetson TX1/TX2 board to connect your keyboard/mouse/pointing devices.
  3. Use an ethernet cable to connect Jetson TX1/TX2 to the network.
  4. Power on the Jetson TX1/TX2 board using the supplied AC Adapter and press the Power button.

This will bring up a command terminal and prompt for password.
-- Password for user nvidia: nvidia
-- Password for user ubuntu: ubuntu

Then execute the following commands on your terminal.

  1. command: cd NVIDIA-INSTALLER

  2. command: sudo ./installer.sh
    This will install some dependencies to load the GUI and, once finished, will ask the system to reboot. Use:

  3. command: sudo reboot
    Now you will have a desktop GUI, which will make navigation easier.

Jetpack 3.3 Installation:

Currently, all the Jetson TX1/TX2 boards use JetPack 3.3. To configure a Jetson TX1/TX2, you need a host OS in addition to the Jetson itself. The installation is performed remotely from the host OS because the board cannot flash and configure its own system.

Please make sure the host OS is connected to the same network as the Jetson TX1/TX2.

The instructions for flashing the OS and installing the necessary software are listed below.

  1. Download Jetpack 3.3 Installer from https://developer.nvidia.com/embedded/downloads#?search=jetpack%203.3 (You might need to create your own nvidia developer account to download the binary)

  2. Extract the installer and copy it to a new directory.
    -- command: mkdir ~/TX1 (or ~/TX2, whichever you are using)
    -- command: cp JetPack-L4T-3.3-linux-x64_b39.run ~/TX1 (or ~/TX2)

  3. Change the permission to make it executable.
    -- command: chmod +x JetPack-L4T-3.3-linux-x64_b39.run (run inside ~/TX1 or ~/TX2)

  4. Install ssh-askpass. This is very important: once the flashing is done, JetPack will ask you for the remote system's (Jetson TX1/TX2) IP, username, and password. Without this step it will get stuck and will not install correctly.
    -- command: sudo apt-get install ssh-askpass-gnome ssh-askpass

  5. Run the installer
    -- command: ./JetPack-L4T-3.3-linux-x64_b39.run (Do not use sudo)

This will start installing JetPack on your host and show the progress in the NVIDIA Component Manager. In the Component Manager, select the Full installation (flash the OS and install other necessary software, e.g., CUDA, cuDNN, OpenCV, TensorRT) and choose to resolve all dependencies. It will also prompt you to accept all the software license agreements; make sure you accept them (unless you have discovered patches to choose rebellion).

Once the JetPack installation is complete on your host OS, it will show you some additional steps to perform, since it requires the Jetson TX1/TX2 to be in force recovery mode. Please perform the following steps:

  1. Disconnect the ac adapter from Jetson TX1/TX2.
  2. Connect the developer cable between Jetson and Host machine.
  3. Power on your Jetson TX1/TX2 using the power button after connecting the power cable.
  4. Keep pressing the Force Recovery button and, while holding it, press and release the Reset button.
  5. Wait for 2 seconds after releasing the Reset button and then release the Force Recovery button.

To confirm that the Jetson is ready to be flashed in force recovery mode, open a terminal on your host OS and use
-- command: lsusb

You should see a list of USB devices, and one of them should be NVIDIA Corp, which indicates the Jetson is ready to be configured. Then press Enter in the terminal on your host OS from which force recovery mode was initiated. This will start flashing the OS and installing JetPack 3.3, and will create the filesystems on your Jetson TX1/TX2.

Currently, there is an issue with JetPack 3.3: it only flashes the OS but does not install all the necessary software. To install the software, run the JetPack run file again; this time, rather than selecting the full installation, select Custom, right-click on the target system, and select Install. Before doing so, make sure you unplug the developer cable. The installer will ask you for the Jetson TX1/TX2 IP, username, and password. To get the IP from the Jetson TX1/TX2, use:
-- command: ifconfig
Use the following command to make sure your Jetson TX1/TX2 is reachable from your host.
-- command: ping jetson_ip_address
This time it will install all the necessary software. Once the software is installed, you may be interested in using TensorFlow/Caffe/PyTorch, etc.

Use the following
-- command: sudo apt-get install python-setuptools (for python 2.7)
-- command: sudo apt-get install python-pip

Tensorflow Installation:

For Tensorflow:
-- command: sudo pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp33 tensorflow-gpu

You should be able to open a Python interpreter to confirm TensorFlow is running; a quick check is shown below.
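
A minimal sanity check from the shell (TensorFlow 1.x API):

    # Print the TensorFlow version and whether a GPU is visible
    python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"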

Tensorflow & Keras install for jetson Xavier & Nano devices

-- command: sudo apt-get install python3-venv

-- command: python3 -m venv your_env

-- command: source your_env/bin/activate

-- command: pip3 install Cython pandas

-- command: sudo apt-get install libhdf5-serial-dev libhdf5-dev

-- command: sudo apt-get install libblas3 liblapack3 liblapack-dev libblas-dev

-- command: sudo apt-get install gfortran

-- command: wget https://developer.download.nvidia.com/compute/redist/jp/v42/tensorflow-gpu/tensorflow_gpu-1.13.1+nv19.3-cp36-cp36m-linux_aarch64.whl

-- command: pip3 install tensorflow_gpu-1.13.1+nv19.3-cp36-cp36m-linux_aarch64.whl

-- command: pip3 install keras
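
After the install finishes, a quick check inside the virtual environment confirms that TensorFlow sees the GPU and that Keras imports cleanly (TensorFlow 1.13 API):

    # Run inside the activated your_env virtual environment
    python3 -c "import tensorflow as tf, keras; print(tf.__version__, keras.__version__); print(tf.test.is_gpu_available())"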

How to set up a reverse ssh tunnel

Step 1: Set up the GPU server

On the GPU server (running Ubuntu), create a sudo user, aisys, along with a pair of private/public keys associated with the user aisys. That is, inside the campus network, one can access the GPU server by using the private key for the user aisys.

Step 2: Create a new VPS in a cloud service provider, e.g. AWS

  • Create a VPS (running Ubuntu) with a public IP, e.g., 54.197.93.250
  • Create a sudo user, aisys, with a password, in the VPS
  • Enable ssh service in the VPS
    sudo apt update
    sudo apt install openssh-server
    sudo systemctl status ssh
    
  • Create a pair of ssh keys for the user aisys in the VPS
  • Enable access to the VPS using the aisys private key by appending the public key to /home/aisys/.ssh/authorized_keys on the VPS (a sketch follows this list)
  • Copy the private key to the GPU server and distribute it to users who need to access the VPS
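
A minimal sketch of the key setup, assuming the key pair is generated as the aisys user on the VPS itself and named id_rsa_vps to match the systemd template below:

    # On the VPS, as aisys: generate the key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_vps

    # Allow key-based logins to the VPS by appending the public key to authorized_keys
    cat ~/.ssh/id_rsa_vps.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys

    # On the GPU server, as aisys: fetch the private key (password authentication)
    scp aisys@54.197.93.250:.ssh/id_rsa_vps /home/aisys/.ssh/id_rsa_vps
    chmod 600 /home/aisys/.ssh/id_rsa_vps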

Step 3: Set up the reverse ssh tunnel

  • Initiate the reverse ssh tunnel from the GPU server by setting up a systemd service so that the tunnel is automatically restarted if it goes down.
  • To set up the systemd service, aisys_tunnel.service, a template is shown below. Whenever a new VPS is used, replace, in /etc/systemd/system/aisys_tunnel.service, the private key for accessing the new VPS (the id_rsa_vps file in the template) and the public IP of the VPS (54.197.93.250 in the template) with the new values. After updating aisys_tunnel.service, disable the current service and then enable the updated one (example commands follow the template).
 [Unit]
 Description=Maintain Tunnel
 After=network.target
 
 [Service]
 User=aisys
 ExecStart=/usr/bin/ssh -i /home/aisys/.ssh/id_rsa_vps -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes -gnNT -R 22122:localhost:22 aisys@54.197.93.250
 RestartSec=15
 Restart=always
 KillMode=mixed
 
 [Install]
 WantedBy=multi-user.target
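
To apply the template (or an updated copy of it), reload systemd and re-enable the service; for example:

    # Reload unit files after editing /etc/systemd/system/aisys_tunnel.service
    sudo systemctl daemon-reload

    # Disable any previously enabled copy, then enable and start the updated service
    sudo systemctl disable aisys_tunnel.service
    sudo systemctl enable aisys_tunnel.service
    sudo systemctl restart aisys_tunnel.service

    # Verify that the tunnel service is running
    systemctl status aisys_tunnel.service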

Step 4: Connect to the reverse ssh tunnel without explicitly specifying the private key

  • Copy the private key used to access the GPU server to the VPS at ~/.ssh/id_rsa. Using the default path (default key name, id_rsa) lets ssh pick the key up automatically when connecting from the VPS through the reverse tunnel.

Step 5: Testing

  • On a local machine, open a terminal and connect to the VPS:

    ssh -i <private key to access the VPS> aisys@<IP of the VPS>

  • On the VPS, connect to the reverse ssh tunnel:

    ssh -p 22122 aisys@localhost

  • On the GPU server, switch to your own user instead of using the login user, aisys

    su - <your user account on the GPU server>

Reference:

  1. README-setup-tunnel-as-systemd-service.md
  2. Self healing reverse SSH setup with system
  3. How to Set Up SSH Keys on Ubuntu 20.04
  4. Set up SSH public key authentication to connect to a remote system
  5. SSH reverse tunnel - disable request for password when using key authentication
  6. How to Fix SSH Failed Permission Denied