-
Is that a customized model? If the GPU is fully occupied, it could be that the generation is simply too long. You could try the streaming mode if you must use …
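For reference, streaming mode with transformers might look like this (a minimal sketch; the model ID is just an example, substitute your own):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example model; substitute your own
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
# TextStreamer prints tokens to stdout as they are generated, so a long
# generation shows progress immediately instead of blocking until the end
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=256)
```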
-
I don't have any information about the model except its name. I'm looking for a small chat model. This time I tested using pipeline and TextStreamer (from transformers), and the response took 318 s. Sometimes I waited 10-15 s for a single word to appear. The problem is that I'm completely new to this, so I don't know whether the problem is with the model or with my computer.
-
Try deleting those lines as well. We do have documentation at https://qwen.readthedocs.io, and it could help if you are new to this. Simply following the steps there should be mostly fine.
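(Presumably the lines in question are the config overrides from the snippet further down; disabling the KV cache in particular makes generation much slower:)

```python
# candidates to delete: use_cache=False turns off the KV cache, forcing the
# attention over all previous tokens to be recomputed at every generation step
model.config.use_cache = False
model.config.pretraining_tp = 1
```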
-
I downloaded the model from https://huggingface.co/Qwen/Qwen2.5-3B and had exactly the same problems. I've read about my graphics card and these models, and it turns out that the RTX 3080 Ti may have problems with BF16. After changing the torch_dtype = torch.bfloat16 setting in AutoModelForCausalLM.from_pretrained and taking a different approach to generating the response, the code executes in 5 seconds.
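If the fix was moving away from BF16, the loading call might have looked like this (a sketch; torch.float16 is an assumption, since the comment doesn't say which dtype was used instead):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,                 # same local path as before
    torch_dtype=torch.float16,  # assumption: explicit FP16 instead of BF16
    device_map={"": 0},
)
```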
I still have a problem with the model's responses (it replies in a different language, echoes the user prompt, and sometimes talks to itself); I probably need to write a good system_prompt.
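Those symptoms (wrong language, echoed prompts, self-talk) are typical when a chat model is prompted without its chat template. Assuming model and tokenizer are loaded as above, a system prompt applied through the template might look like this (a minimal sketch; the prompt text is only an example):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant. Always answer in English."},
    {"role": "user", "content": "Hello, who are you?"},
]
# apply_chat_template wraps the turns in the special tokens the model was
# trained with; without it the model often just continues the raw text
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```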
-
I downloaded the model from Hugging Face (https://huggingface.co/wanghaikuan/qwen2.5-3b-chat-1104-r3), and it runs very slowly on an RTX 3080 Ti.
The first response appears after 9-20 seconds, and the next one after 80-120 s, or never (after waiting 5 minutes I close the program). Sometimes even the first response never appears (I kill the program after 5 minutes).
The model uses 100% of the GPU.
I don't know whether the problem is in my system or with the model itself (I'm just starting to play with AI).
Below is the code I use to test the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import textwrap
import time

model_path = "E:/Moje_Projekty/AL_ML/qwen2.5-3b-chat-1104-r3/Model_START"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             torch_dtype="auto",
                                             device_map={"": 0})
# note: use_cache=False disables the KV cache, which slows generation badly
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_path)

while True:
    user_input = input("User: ")
    if user_input.lower() == "quit":
        print("End of conversation.")
        break
    # the posted snippet ended here; a minimal generation step might look like:
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=256)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                             skip_special_tokens=True)
    print(textwrap.fill("Assistant: " + reply, width=100))
    print(f"({time.time() - start:.1f} s)")
```