Hello.
I am using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 container to run inference for Llama3.3-70B-Instruct. The container is being launched using Docker.
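For reference, the container is launched roughly like this (the host path and port are illustrative, not our exact command):

```
docker run -it --runtime=nvidia --gpus all --shm-size 12g \
  -v /path/to/model/repo:/opt/ml/model:ro \
  -p 8080:8080 \
  763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
```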
I have created a repo dir with two models: the 70B target model and an 8B draft model (model ids: mymodel and mymodeldraft).
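The repo dir is laid out roughly like this (the top-level directory name is illustrative):

```
model-repo/
├── mymodel/          # Llama3.3-70B-Instruct weights + serving.properties
└── mymodeldraft/     # 8B draft model weights + serving.properties
```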
Here are the serving.properties for both models.
For the 70B model (mymodel):

```
engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.5
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false
option.speculative_draft_model=mymodeldraft
option.draft_model_tp_size=8
option.speculative_length=5
```
For the 8B draft model (mymodeldraft):

```
engine=Python
option.mpi_mode=True
option.tensor_parallel_degree=8
option.trust_remote_code=true
option.rolling_batch=lmi-dist
option.max_input_len=32768
option.max_output_len=32768
option.max_model_len=32768
option.gpu_memory_utilization=0.4
option.max_rolling_batch_size=32
option.enable_prefix_caching=true
option.enable_streaming=false
```
When I launch the container, DJL starts and both models are loaded.
But in the log I see this message:

```
INFO PyProcess W-749-mymodel-stdout: [1,0]<stdout>:WARNING 01-24 08:07:00 arg_utils.py:66] Speculative decoding feature is only available on SageMaker. Running without speculative decoding...
```
When running inference, it seems that speculative decoding is not active and the draft model is not being called.
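For reference, this is the kind of request we send when testing locally (standard LMI /invocations schema; the prompt and parameters are illustrative):

```
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}'
```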
We are running the DJL container on SageMaker endpoints.
Can you please explain how we can make this feature work?
Thank you.