Update user docs for running llm server + upgrade gguf to 0.11.0 (#676)

# Description

Did a pass through the user docs for `e2e_llama8b_mi300x.md` and made the
following updates and fixes:

1. Update install instructions for `shark-ai`
2. Update nightly install instructions for `shortfin` and `sharktank`
3. Update paths for model artifacts to ensure they work with
`llama3.1-8b-fp16-instruct`
4. Remove the `write edited config` steps, which are no longer needed after #487

Added `sentencepiece` back as a requirement for `sharktank`. Without it,
`export_paged_llm_v1` broke when installing the nightly packages:

```text
ModuleNotFoundError: No module named 'sentencepiece'
```

This was masked when building from source, because `shortfin`
includes `sentencepiece` in `requirements-tests.txt`.
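
One quick way to confirm the dependency is available before running the export (assuming the same virtual environment is active):

```bash
# Should print the installed sentencepiece version instead of raising
# ModuleNotFoundError.
python -c "import sentencepiece; print(sentencepiece.__version__)"
```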
stbaione authored and eagarvey-amd committed Jan 8, 2025
1 parent 214ce10 commit 2996ecf
Showing 2 changed files with 23 additions and 64 deletions.
82 changes: 22 additions & 60 deletions docs/shortfin/llm/user/e2e_llama8b_mi300x.md
@@ -22,32 +22,28 @@ python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```

### Install `shark-ai`
## Install stable shark-ai packages

You can install either the `latest stable` version of `shark-ai`
or the `nightly` version:

#### Stable
<!-- TODO: Add `sharktank` to `shark-ai` meta package -->

```bash
pip install shark-ai
pip install shark-ai[apps] sharktank
```
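
One way to verify that the expected packages resolved after installing:

```bash
# List the installed shark-ai family of packages and their versions.
pip list | grep -E "shark-ai|sharktank|shortfin"
```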

#### Nightly

```bash
pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```
### Nightly packages

#### Install dataclasses-json
To install nightly packages:

<!-- TODO: This should be included in release: -->
<!-- TODO: Add `sharktank` to `shark-ai` meta package -->

```bash
pip install dataclasses-json
pip install shark-ai[apps] sharktank \
--pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```

See also the
[instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).
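
A minimal import check can confirm the nightly wheels installed correctly (this assumes both packages expose top-level modules under these names):

```bash
# Both imports should succeed without errors.
python -c "import sharktank, shortfin; print('shark-ai nightly packages import OK')"
```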

### Define a directory for export files

Create a new directory for us to export files like
@@ -78,8 +74,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step.

```bash
export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```
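
Before exporting, it can be worth confirming that both paths resolve (assuming the artifacts were downloaded into `$EXPORT_DIR` as above):

```bash
# Both files should be listed; a "No such file" error means the
# download step or the env vars need revisiting.
ls -lh "$MODEL_PARAMS_PATH" "$TOKENIZER_PATH"
```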

#### General env vars
@@ -91,8 +87,6 @@ The following env vars can be copy + pasted directly:
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
# Path to export edited_config.json file
export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
@@ -108,7 +102,7 @@ to export our model to `.mlir` format.

```bash
python -m sharktank.examples.export_paged_llm_v1 \
--irpa-file=$MODEL_PARAMS_PATH \
--gguf-file=$MODEL_PARAMS_PATH \
--output-mlir=$MLIR_PATH \
--output-config=$OUTPUT_CONFIG_PATH \
--bs=$BS
```
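
Since #487, the exported `config.json` is consumed directly by the server, so it can be useful to inspect it once the export finishes, for example:

```bash
# Pretty-print the generated server config.
python -m json.tool "$OUTPUT_CONFIG_PATH"
```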
@@ -137,37 +131,6 @@ iree-compile $MLIR_PATH \
-o $VMFB_PATH
```

## Write an edited config

We need to write a config for our model with a slightly edited structure
to run with shortfin. This will work for the example in our docs.
You may need to modify some of the parameters for a specific model.

### Write edited config

```bash
cat > $EDITED_CONFIG_PATH << EOF
{
"module_name": "module",
"module_abi_version": 1,
"max_seq_len": 131072,
"attn_head_count": 8,
"attn_head_dim": 128,
"prefill_batch_sizes": [
$BS
],
"decode_batch_sizes": [
$BS
],
"transformer_block_count": 32,
"paged_kv_cache": {
"block_seq_stride": 16,
"device_block_count": 256
}
}
EOF
```

## Running the `shortfin` LLM server

We should now have all of the files that we need to run the shortfin LLM server.
@@ -178,15 +141,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
ls $EXPORT_DIR
```

- edited_config.json
- config.json
- meta-llama-3.1-8b-instruct.f16.gguf
- model.mlir
- model.vmfb
- tokenizer_config.json
- tokenizer.json

### Launch server:

<!-- #### Set the target device
TODO: Add instructions on targeting different devices,
when `--device=hip://$DEVICE` is supported -->
### Launch server

#### Run the shortfin server

@@ -209,7 +171,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
```bash
python -m shortfin_apps.llm.server \
--tokenizer_json=$TOKENIZER_PATH \
--model_config=$EDITED_CONFIG_PATH \
--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip > shortfin_llm_server.log 2>&1 &
```
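
Because the server starts in the background, it can help to wait until it is accepting connections before sending requests. One sketch, assuming the default port of 8000:

```bash
# Poll until the server socket accepts connections (assumes port 8000).
until curl -s -o /dev/null http://localhost:8000; do
  sleep 1
done
echo "shortfin LLM server is up"
# Peek at the last few log lines to confirm a clean startup.
tail -n 5 shortfin_llm_server.log
```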
@@ -252,7 +214,7 @@ port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
try:
resp = requests.post(generate_url, json=payload)
resp.raise_for_status() # Raises an HTTPError for bad responses
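
The diff cuts the client off mid-function. For reference, a complete version of this snippet might look like the following sketch (the exception handling and entry point are reconstructed, not taken verbatim from the file):

```python
import requests

port = 8000  # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
    payload = {
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    }
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")

generation_request()
```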
5 changes: 1 addition & 4 deletions sharktank/requirements.txt
@@ -1,12 +1,9 @@
iree-turbine

# Runtime deps.
gguf==0.10.0
gguf>=0.11.0
numpy<2.0

# Needed for newer gguf versions (TODO: remove when gguf package includes this)
# sentencepiece>=0.1.98,<=0.2.0

# Model deps.
huggingface-hub==0.22.2
transformers==4.40.0
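
After upgrading, one way to confirm which `gguf` version actually resolved:

```bash
# Expect a version of 0.11.0 or newer.
pip show gguf | grep -i '^version'
```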
