
DJL running without speculative decoding #2678

Open
eduardzl opened this issue Jan 24, 2025 · 0 comments
eduardzl commented Jan 24, 2025

Hello.
I am using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 container to run inference for Llama-3.3-70B-Instruct. The container is launched with Docker.
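For reference, the launch command looks roughly like this (the mount path, port, and shm size are illustrative, not our exact values; the /opt/ml/model mount point follows the LMI container convention):

docker run -it --gpus all --shm-size 12g \
  -v /path/to/model-repo:/opt/ml/model:ro \
  -p 8080:8080 \
  763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124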
I have created a repository directory with two models: the 70B target model and an 8B draft model (model IDs: mymodel and mymodeldraft).
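The directory is laid out along these lines (a sketch; weight and tokenizer files omitted):

model-repo/
    mymodel/           <- Llama-3.3-70B-Instruct weights + serving.properties
    mymodeldraft/      <- 8B draft model weights + serving.properties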
Here are the serving.properties files for both models:

For the 70B model (mymodel):

engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.5
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false
option.speculative_draft_model=mymodeldraft
option.draft_model_tp_size=8
option.speculative_length=5

For the 8B model (mymodeldraft):

engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.4
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false

When the container launches, DJL starts and both models load successfully. However, the log shows this message:

INFO PyProcess W-749-mymodel-stdout: [1,0]<stdout>:WARNING 01-24 08:07:00 arg_utils.py:66] Speculative decoding feature is only available on SageMaker. Running without speculative decoding...
When running inference, speculative decoding does not appear to be active; the draft model is never called.
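An inference request along these lines reproduces the behavior (the /invocations path follows the standard DJL Serving API; the prompt and parameters are illustrative):

curl -X POST http://localhost:8080/invocations \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}'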

We are running the DJL container on SageMaker endpoints. Can you please explain how we can make this feature work?
Thank you.
