# Feature/add vllm deploy #82

Merged 23 commits on Nov 6, 2023. Changes from all commits to `llm_deploy/README.md` (68 additions, 24 deletions).
Go to the vllm dir and run `pip install -e .`
Note https://github.com/vllm-project/vllm/issues/1283: if you hit a CUDA version error, pin the PyTorch requirement to "== 2.0.1" in the config file.
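For example, one way to apply the pin before building (a sketch; the exact requirements file name inside the vLLM repo may differ across versions):
```bash
pip install "torch==2.0.1"
# or pin it in vLLM's requirements file before running `pip install -e .`
sed -i 's/^torch.*/torch == 2.0.1/' requirements.txt
```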



## Setting up the Babel server
### Login with SSH key
Add your public ed25519 key to the server
```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@<mycluster>
```
Configure your SSH config file (`~/.ssh/config`)
```bash
Host <mycluster>
    HostName <mycluster>
    User <username>
    IdentityFile ~/.ssh/id_ed25519
```
Log in to Babel with the SSH key
```bash
ssh <mycluster>
```

### Connecting to a compute node
Jump from the login node to a compute node
```bash
srun --pty bash
```
Check that you can access the `/data/` folder
```bash
cd /data/datasets/
```

### Configure the environment on the compute node
Install Miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Expand All @@ -46,47 +45,92 @@ conda create --name myenv
conda activate myenv
# conda deactivate
```
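vLLM requires a reasonably recent Python, so it can help to pin the version when creating the environment (an assumption; check the Python range your vLLM release supports):
```bash
conda create --name myenv python=3.9
```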
Install the vLLM packages
```bash
conda install pip
pip install vllm
```
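A quick sanity check that the install picked up a working build (a minimal sketch; the version strings will differ):
```bash
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"
```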
Install the FastChat packages
```bash
conda install pip
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip
pip3 install "fschat[model_worker,webui]"
```
Submit a GPU request and open an interactive terminal
```bash
srun --gres=gpu:1 --time=1-00:00:00 --mem=80G --pty $SHELL
conda activate myenv
```
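Once the interactive shell opens, note which node you landed on, since the compute-node hostname (babel-x-xx) is needed later for direct SSH and the tunnels:
```bash
# print the compute node's hostname (e.g. babel-x-xx)
hostname
# list the GPU(s) allocated to this job
nvidia-smi -L
```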
Some useful commands for checking GPU jobs
```bash
# check slurm status
squeue -l
# check gpu status
nvidia-smi
# check gpu usage
pip install gpustat
watch -n 1 gpustat
# cancel a slurm job
scancel <job_id>
# connect to compute node directly
ssh -J babel babel-x-xx
```

### Install cuda-toolkit (optional)
Due to the vLLM issue https://github.com/vllm-project/vllm/issues/1283, we need cuda-toolkit=11.7.0, which is compatible with PyTorch 2.0.1.
Install cuda-toolkit=11.7.0 in the conda environment
```bash
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
```
Check the cuda-toolkit version
```bash
nvcc -V
```
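If the toolkit installed correctly, the version line should report release 11.7 (the exact build suffix will vary):
```bash
nvcc -V | grep release
# expected, approximately: Cuda compilation tools, release 11.7, V11.7.xx
```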

## Deploy models on Babel via the FastChat API server
Run the following commands in three separate interactive terminal windows (one per process):
```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path model-checkpoint
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
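If the checkpoint lives at an arbitrary path, the worker can expose it under a cleaner name; a sketch, assuming your FastChat version supports the `--model-names` flag (the path below is hypothetical):
```bash
python3 -m fastchat.serve.model_worker \
--model-path /path/to/model-checkpoint \
--model-names model-checkpoint
```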
Call the model checkpoint API
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model-checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-GGvKBiZFdFLzPq2HdtuxbC","object":"text_completion","created":1698692212,"model":"checkpoint-4525","choices":[{"index":0,"text":"city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":11,"completion_tokens":6}}
```
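The same FastChat server also exposes the chat endpoint, which is what most OpenAI clients call; a minimal sketch against the same hypothetical checkpoint name:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model-checkpoint",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
```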

## Deploy models on Babel via the vLLM API server
Start the vLLM server with a model checkpoint
```bash
python -m vllm.entrypoints.openai.api_server --model model_checkpoint/
```
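The server accepts the usual vLLM engine flags; for example, to pick a different port or shard a large checkpoint across the GPUs you requested (a sketch, assuming two GPUs are allocated):
```bash
python -m vllm.entrypoints.openai.api_server \
--model model_checkpoint/ \
--port 8000 \
--tensor-parallel-size 2
```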
Call the model checkpoint API
```bash
curl http://localhost:8000/v1/models
```
*Sample output:*
```JSON
{"object":"list","data":[{"id":"Mistral-7B-Instruct-v0.1/","object":"model","created":1697599903,"owned_by":"vllm","root":"Mistral-7B-Instruct-v0.1/","parent":null,"permission":[{"id":"modelperm-d415ecf6362a4f818090eb6428e0cac9","object":"model_permission","created":1697599903,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```
Run inference against the model checkpoint API
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model_checkpoint/",
"model": "model_checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-bf7552957a8a4bd89186051c40c52de4","object":"text_completion","created":3600699,"model":"Mistral-7B-Instruct-v0.1/","choices":[{"index":0,"text":" city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
```

## Access the deployed Babel server from a local machine
Construct an SSH tunnel between the Babel login node and the Babel compute node hosting the model
```bash
ssh -N -L 7662:localhost:8000 username@babel-x-xx
```
The above command creates a localhost:7662 server on the Babel login node which connects to localhost:8000 on the compute node.

Construct an SSH tunnel between your local machine and the Babel login node
```bash
ssh -N -L 8001:localhost:7662 username@<mycluster>
```
The above command creates a localhost:8001 server on your local machine which connects to localhost:7662 on the Babel login node.
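Alternatively, the two hops can be combined into one command run from your local machine using SSH's jump-host option (assuming the `<mycluster>` alias from your SSH config):
```bash
ssh -N -L 8001:localhost:8000 -J <mycluster> username@babel-x-xx
```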

Call the hosted model from your local machine
```bash
curl http://localhost:8001/v1/models
```
If the above command runs successfully, you can use the REST API from your local machine.
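Because both servers implement the OpenAI REST API, most OpenAI-compatible clients will work through the tunnel once pointed at it; a sketch (the key is a placeholder, since the server ignores it):
```bash
export OPENAI_API_BASE="http://localhost:8001/v1"
export OPENAI_API_KEY="EMPTY"
```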

(Optional) If you fail to build the SSH tunnel, add `-v` to the ssh command to see what went wrong.




## Useful resource links for Babel
1. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=BABEL#Cluster_Architecture
2. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=VSCode
3. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Training_Material