deepspeed mii fastgen example #2779

Merged: 21 commits, Dec 14, 2023
Changes from 9 commits
87 changes: 87 additions & 0 deletions examples/large_models/deepspeed_mii/LLM/DeepSpeed_mii_handler.py
@@ -0,0 +1,87 @@
import logging
import os
from abc import ABC

import mii

from ts.context import Context
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("DeepSpeed MII version %s", mii.__version__)


class DeepSpeedMIIHandler(BaseHandler, ABC):
    """
    DeepSpeed-MII handler class for text generation with large language models.
    """

    def __init__(self):
        self.device = int(os.getenv("LOCAL_RANK", 0))
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the DeepSpeed-MII pipeline for the
        configured model is loaded and initialized.
        Args:
            ctx (Context): Object containing information pertaining to
            the model artifacts and model configuration parameters.
        """
        model_dir = ctx.system_properties.get("model_dir")
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])

        model_config = {
            "tensor_parallel": int(ctx.model_yaml_config["handler"]["tensor_parallel"]),
            "max_length": int(ctx.model_yaml_config["handler"]["max_length"]),
        }
        self.pipe = mii.pipeline(
            model_name_or_path=model_path,
            model_config=model_config,
        )
        logger.info("Model %s loaded successfully", model_name)
        self.initialized = True

    def preprocess(self, requests):
        """Basic text preprocessing of the user's prompt.
        Args:
            requests (list): A list of requests, each carrying the prompt text
            in its "data" or "body" field.
        Returns:
            list: A list of prompt strings.
        """
        inputs = []
        for _, data in enumerate(requests):
            input_text = data.get("data")
            if input_text is None:
                input_text = data.get("body")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            logger.info("Received text: '%s'", input_text)
            inputs.append(input_text)
        return inputs

    def inference(self, inputs):
        """Generates text for the received prompts.
        Args:
            inputs (list): List of prompts from the preprocess function.
        Returns:
            list: A list of generated texts, one per input prompt.
        """
        inferences = self.pipe(
            inputs, max_new_tokens=self.max_new_tokens
        ).generated_texts

        logger.info("Generated text: %s", inferences)
        return inferences

    def postprocess(self, inference_output):
        """Post-process function that returns the generated text in a
        TorchServe-readable format.
        Args:
            inference_output (list): The generated texts for the input prompts.
        Returns:
            list: A list of the generated texts.
        """

        return inference_output
5 changes: 5 additions & 0 deletions examples/large_models/deepspeed_mii/LLM/Readme.md
@@ -0,0 +1,5 @@
# Running an LLM using Microsoft DeepSpeed-MII in TorchServe

This example demonstrates serving a Hugging Face LLM with Microsoft DeepSpeed-MII in TorchServe. DeepSpeed-MII brings significant system optimizations for deep learning model inference, drastically reducing both latency and cost.

The notebook example can be found in `mii-deepspeed-fastgen.ipynb`.
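For orientation, the handler in this example wraps a DeepSpeed-MII FastGen pipeline. Below is a minimal standalone sketch of the same call pattern; the model name is a placeholder, and the `.generated_texts` attribute mirrors the handler code in this example (the exact response type may differ across `deepspeed-mii` releases):

```python
# Minimal sketch of the DeepSpeed-MII FastGen pipeline that the handler wraps.
# Assumes `pip install deepspeed-mii`, a CUDA GPU, and access to the model
# weights; the model name below is a placeholder.
import mii

pipe = mii.pipeline("meta-llama/Llama-2-13b-hf")
result = pipe(
    ["The museum format went through significant transformations"],
    max_new_tokens=128,
)
# The handler in this example reads `result.generated_texts`; adjust for the
# response type of your installed deepspeed-mii version.
print(result)
```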
162 changes: 162 additions & 0 deletions examples/large_models/deepspeed_mii/LLM/mii-deepspeed-fastgen.ipynb
@@ -0,0 +1,162 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Running LLM model using Microsoft DeepSpeed-MII in Torchserve.\n",
"This notebook briefs on serving HF LLM model with Microsoft DeepSpeed-MII in Torchserve. With DeepSpeed-MII there has been significant progress in system optimizations for DL model inference, drastically reducing both latency and cost."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Step 1: Download model\n",
"Login into huggingface hub with token by running the below command"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"huggingface-cli login"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!python ../../utils/Download_model.py --model_name meta-llama/Llama-2-13b-hf"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Step 2: Generate model artifacts"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2045.86s - pydevd: Sending message related to process being replaced timed-out after 5 seconds\n"
]
}
],
"source": [
"!torch-model-archiver --model-name mii-llama--Llama-2-13b-hf --version 1.0 --handler DeepSpeed_mii_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!mv model mii-llama--Llama-2-13b-hf\n",
"!cd ../../../../ && mkdir model_store && mv mii-llama--Llama-2-13b-hf model_store"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Step 3: Start torchserve"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!torchserve --ncs --start --model-store model_store --models mii-llama--Llama-2-13b-hf --ts-config benchmarks/config.properties"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"### Step 4: Run inference\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!curl \"http://localhost:8080/predictions/mii-Llama-2-13b-hf\" -T examples/large_models/deepspeed_mii/LLM/sample.txt"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
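As an alternative to the curl call in Step 4 of the notebook, the same inference request can be sent from Python. A minimal sketch, assuming the `requests` package is installed, TorchServe is running on the default inference port 8080, and the model was registered under the name used above:

```python
# Minimal sketch: send the same inference request as the curl call in Step 4
# from Python. Assumes the `requests` package, TorchServe on the default
# inference port 8080, and the model name registered above.
import requests

prompt = "The museum format went through significant transformations in the 20th century."

resp = requests.post(
    "http://localhost:8080/predictions/mii-llama--Llama-2-13b-hf",
    data=prompt.encode("utf-8"),
)
resp.raise_for_status()
print(resp.text)  # generated continuation of the prompt
```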
20 changes: 20 additions & 0 deletions examples/large_models/deepspeed_mii/LLM/model-config.yaml
@@ -0,0 +1,20 @@
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # setting CUDA_VISIBLE_DEVICES

torchrun:
nproc-per-node: 4

# TorchServe Backend parameters
handler:
model_name: "meta-llama/Llama-2-13b-hf"
model_path: "model/models--meta-llama--Llama-2-13b-hf/snapshots/99afe33d7eaa87c7fc6ea2594a0e4e7e588ee0a4"
tensor_parallel: 4
max_length: 4096
max_new_tokens: 256
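For reference, the `handler:` block above is what `DeepSpeed_mii_handler.py` reads through `ctx.model_yaml_config["handler"]`. A minimal sketch that simulates that lookup by loading the YAML directly (assumes PyYAML is installed and the file is in the current directory):

```python
# Minimal sketch: simulate how the handler consumes the `handler:` block of
# model-config.yaml (the handler itself receives it via ctx.model_yaml_config).
# Assumes PyYAML is installed and model-config.yaml is in the current directory.
import yaml

with open("model-config.yaml") as f:
    cfg = yaml.safe_load(f)

handler_cfg = cfg["handler"]
model_config = {
    "tensor_parallel": int(handler_cfg["tensor_parallel"]),  # matches torchrun nproc-per-node
    "max_length": int(handler_cfg["max_length"]),
}
print(handler_cfg["model_name"], handler_cfg["model_path"], model_config)
```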
1 change: 1 addition & 0 deletions examples/large_models/deepspeed_mii/LLM/requirements.txt
@@ -0,0 +1 @@
deepspeed-mii
1 change: 1 addition & 0 deletions examples/large_models/deepspeed_mii/LLM/sample.txt
@@ -0,0 +1 @@
The museum format went through significant transformations in the 20th century. For a long time, museums collected the art of previous generations. The demonstration of contemporary art required new approaches and fresh ideas. Modernization attempts appeared most often in the design of the outer parts of buildings; museums received attractive exterior decoration, such as the glass pyramids of the Louvre. The museum was supposed to evoke a respectful attitude towards what was stored within its walls. That is why museums were arranged in palaces or in specially built buildings, the appearance of which was supposed to inspire respect. However, it gradually became clear that this approach did not attract modern visitors. It became apparent that contemporary art needed a contemporary place of expression.
@@ -227,6 +227,9 @@ public void run() {
long begin = System.currentTimeMillis();
for (int i = 0; i < repeats; i++) {
reply = replies.poll(responseTimeout, TimeUnit.SECONDS);
if (req.getCommand() != WorkerCommands.LOAD) {
break;
}
}

long duration = System.currentTimeMillis() - begin;
5 changes: 4 additions & 1 deletion ts/model_service_worker.py
@@ -180,7 +180,10 @@ def handle_connection(self, cl_socket):
if cmd == b"I":
if service is not None:
resp = service.predict(msg)
cl_socket.sendall(resp)
if LOCAL_RANK == 0:
cl_socket.sendall(resp)
else:
logging.info("skip sending response at rank %d", LOCAL_RANK)
else:
raise RuntimeError(
"Received command: {}, but service is not loaded".format(cmd)
3 changes: 3 additions & 0 deletions ts/protocol/otf_message_handler.py
@@ -64,6 +64,9 @@ def create_predict_response(
:param code:
:return:
"""
if str(os.getenv("LOCAL_RANK", 0)) != "0":
Collaborator:

This low-level method should not be concerned with checking the LOCAL_RANK environment variable. We should just not call it when it's not "0".

Collaborator Author:

The check is a central control that avoids repeating it in every handler implementation.

Collaborator:

I see the intention here. I still think this is not the preferred place to put this check. A function that creates a response from its arguments should not need to know about the concept of ranks. In fact, we already have a central point in this PR where the rank is checked. The only point where that check is missing is here. Looking at this, I think we should move send_intermediate_predict_response out of ts.protocol.otf_message_handler (users should not need to deal with the modules containing our comms between frontend and backend — what if we drop the OTF protocol?). My first instinct is to move it under ts.handler_utils. What do you think?

Collaborator Author:

I'm fine with moving it to handler_utils. The only concern is backward compatibility; it will break existing cx.

Collaborator:

AFAIK we do not guarantee backward compatibility as of now. It's probably time to plan out our BC strategy for future releases, as we're running out of 0.X version numbers. Anyway, if your concern is very high, you can leave a wrapper in otf_message_handler that fires a deprecation warning and then calls the one in handler_utils until the next release. cx will only get worse if we let the code rot.

return None

msg = bytearray()
msg += struct.pack("!i", code)

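For illustration, the backward-compatible option raised in the thread above — keeping the old entry point in `otf_message_handler` while the implementation moves — could be sketched roughly as below. The target module path follows the reviewer's `ts.handler_utils` suggestion and is hypothetical, not something finalized in this PR:

```python
# Sketch of the deprecation-wrapper idea from the review thread: keep the old
# name importable from ts.protocol.otf_message_handler, warn callers, and
# delegate to a relocated implementation. The target module path below is
# hypothetical, following the reviewer's ts.handler_utils suggestion.
import warnings


def send_intermediate_predict_response(*args, **kwargs):
    warnings.warn(
        "send_intermediate_predict_response now lives in ts.handler_utils; "
        "importing it from ts.protocol.otf_message_handler is deprecated "
        "and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    from ts.handler_utils.utils import send_intermediate_predict_response as _impl

    return _impl(*args, **kwargs)
```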