Added Distributed (Tensor Parallel) Inference Recipe #2245

Open · wants to merge 7 commits into main
recipes/configs/generation.yaml (6 additions, 1 deletion)
@@ -1,4 +1,9 @@
# Config for running the InferenceRecipe in generate.py to generate output from an LLM
# Config for running the InferenceRecipe in generate.py to generate output
# from Llama2 7B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --ignore-patterns "*.safetensors" --hf-token <HF_TOKEN>
#
# To launch, run the following command from root torchtune directory:
# tune run generate --config generation
recipes/configs/llama3/70B_generation_distributed.yaml (45 additions, 0 deletions)
@@ -0,0 +1,45 @@
# Config for running the InferenceRecipe in dev/generate_v2_distributed.py to generate output
# using a Llama3 70B Instruct model
#
# This config assumes that you've run the following command before launching:
# tune download meta-llama/Meta-Llama-3-70B-Instruct --output-dir /tmp/Meta-Llama-3-70B-Instruct --ignore-patterns "original/consolidated*" --hf-token <HF_TOKEN>
#
# To launch, run the following command from root torchtune directory:
# tune run --nproc_per_node 8 dev/generate_v2_distributed --config llama3/70B_generation_distributed

# Model arguments
model:
_component_: torchtune.models.llama3.llama3_70b

# Transform arguments
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /tmp/Meta-Llama-3-70B-Instruct/original/tokenizer.model
prompt_template: null
max_seq_len: 8192

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Meta-Llama-3-70B-Instruct
checkpoint_files:
filename_format: model-{}-of-{}.safetensors
max_filename: "00030"
recipe_checkpoint: null
output_dir: ./
model_type: LLAMA3

# Device
device: cuda
dtype: bf16
seed: 1234
log_level: INFO

# Generation arguments
prompt:
system: null
user:
text: Tell a joke.
max_new_tokens: 200
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
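
A note on the checkpointer block above: the filename_format/max_filename pair is a compact way to refer to all 30 safetensors shards. The snippet below is a minimal, illustrative Python expansion of that pattern; the variable names and the expansion logic are assumptions for illustration, not torchtune's internal code.

# Illustrative expansion of filename_format / max_filename into shard names.
# This mirrors the pattern in the config above; it is not torchtune's own code.
filename_format = "model-{}-of-{}.safetensors"
max_filename = "00030"

num_shards = int(max_filename)
checkpoint_files = [
    filename_format.format(f"{i:0{len(max_filename)}d}", max_filename)
    for i in range(1, num_shards + 1)
]
# checkpoint_files[0]  -> "model-00001-of-00030.safetensors"
# checkpoint_files[-1] -> "model-00030-of-00030.safetensors"
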
recipes/configs/llama3_1/70B_generation_distributed.yaml (45 additions, 0 deletions)
@@ -0,0 +1,45 @@
# Config for running the InferenceRecipe in dev/generate_v2_distributed.py to generate output
# using a Llama3.1 70B Instruct model
#
# This config assumes that you've run the following command before launching:
# tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*" --hf-token <HF_TOKEN>
#
# To launch, run the following command from root torchtune directory:
# tune run --nproc_per_node 8 dev/generate_v2_distributed --config llama3_1/70B_generation_distributed

# Model arguments
model:
_component_: torchtune.models.llama3_1.llama3_1_70b

# Transform arguments
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /tmp/Meta-Llama-3.1-70B-Instruct/original/tokenizer.model
prompt_template: null
max_seq_len: 8192

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Meta-Llama-3.1-70B-Instruct/
checkpoint_files:
filename_format: model-{}-of-{}.safetensors
max_filename: "00030"
recipe_checkpoint: null
output_dir: ./
model_type: LLAMA3

# Device
device: cuda
dtype: bf16
seed: 1234
log_level: INFO

# Generation arguments
prompt:
system: null
user:
text: Tell a joke.
max_new_tokens: 200
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
recipes/configs/llama3_2_vision/11B_generation_v2.yaml (1 addition, 3 deletions)
@@ -7,8 +7,6 @@
# To launch, run the following command from root torchtune directory:
# tune run dev/generate_v2 --config llama3_2_vision/generation_v2

output_dir: ./ # Not needed

# Model arguments
model:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
@@ -27,7 +25,7 @@ checkpointer:
checkpoint_files:
filename_format: model-{}-of-{}.safetensors
max_filename: "00005"
output_dir: ${output_dir}
output_dir: ./
model_type: LLAMA3_VISION

# Device
recipes/dev/generate_v2.py (9 additions, 5 deletions)
@@ -39,18 +39,22 @@ def __call__(self, prompt: Dict[str, Any]) -> List[Message]:

# Iterate through roles and add content
for role, content in prompt.items():
if isinstance(content, str):
if content is None:
continue
elif isinstance(content, str):
new_content = [{"type": "text", "content": content}]
else:
assert (
"image" in content.keys()
), "Multiple entries per role expect an image key"
elif "image" in content.keys():
image_loc = content["image"]
image = load_image(image_loc)
new_content = [
{"type": "image", "content": image},
{"type": "text", "content": content["text"]},
]
else:
assert (
"text" in content.keys()
), "Multiple entries per role expect at least a text key"
new_content = [{"type": "text", "content": content["text"]}]
messages.append(Message(role=role, content=new_content))

# Finally, add an empty assistant message to kick-start generation
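
For context on the branching above, here is a rough sketch of the prompt shapes the updated parsing is meant to accept, matching the prompt: section of the YAML configs in this PR. The dict literals and the image URL are illustrative placeholders, not part of the recipe.

# Illustrative prompt dicts for the updated parsing (examples only).
prompt_text_only = {"system": None, "user": "Tell a joke."}   # None content: role is skipped
prompt_structured = {"user": {"text": "Tell a joke."}}        # dict with only a "text" key
prompt_multimodal = {
    "user": {
        "image": "https://example.com/photo.png",  # placeholder; would be loaded via load_image()
        "text": "What is in this picture?",
    }
}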