-
Is that a customized model? If the GPU is fully occupied, it could be that the generation is simply too long. You could try the streaming mode if you must use …
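For reference, streaming mode with transformers might look like this (a minimal sketch; the model ID is just an example, substitute your own):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example model; substitute your own
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
# TextStreamer prints tokens to stdout as they are generated, so a long
# generation shows progress immediately instead of blocking until the end
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=256)
```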
-
I don't have any information about the model except its name. I'm looking for a small chat model. This time I tested using pipeline and TextStreamer (from transformers), and the response took 318 s. Sometimes I waited 10-15 s for a single word to appear. The problem is that I'm completely new to this, so I don't know whether the problem is with the model or with my computer.
-
Try deleting those lines as well. We do have documentation at https://qwen.readthedocs.io, and it could help if you are new to this. Simply following the steps there should be mostly fine.
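(Presumably the lines in question are the config overrides from the snippet further down; disabling the KV cache in particular makes generation much slower:)

```python
# candidates to delete: use_cache=False turns off the KV cache, forcing the
# attention over all previous tokens to be recomputed at every generation step
model.config.use_cache = False
model.config.pretraining_tp = 1
```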
-
I downloaded the model from https://huggingface.co/Qwen/Qwen2.5-3B and had exactly the same problems. I've read about my graphics card and these models, and it turns out that the RTX 3080 Ti may have problems with BF16. After changing the torch_dtype = torch.bfloat16 setting in AutoModelForCausalLM.from_pretrained and taking a different approach to generating the response, the code executes in 5 seconds.
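If the fix was moving away from BF16, the loading call might have looked like this (a sketch; torch.float16 is an assumption, since the comment doesn't say which dtype was used instead):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,                 # same local path as before
    torch_dtype=torch.float16,  # assumption: explicit FP16 instead of BF16
    device_map={"": 0},
)
```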
I still have a problem with the model's responses (it replies in a different language, echoes the user prompt, and sometimes talks to itself); I probably need to write a good system_prompt.
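Those symptoms (wrong language, echoed prompts, self-talk) are typical when a chat model is prompted without its chat template. Assuming model and tokenizer are loaded as above, a system prompt applied through the template might look like this (a minimal sketch; the prompt text is only an example):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant. Always answer in English."},
    {"role": "user", "content": "Hello, who are you?"},
]
# apply_chat_template wraps the turns in the special tokens the model was
# trained with; without it the model often just continues the raw text
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```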
-
I downloaded the model from Hugging Face (https://huggingface.co/wanghaikuan/qwen2.5-3b-chat-1104-r3), and it runs very slowly on an RTX 3080 Ti.
The first response appears after 9-20 seconds, and the next one after 80-120 s, or never (after waiting 5 minutes I close the program). Sometimes even the first response never appears (I kill the program after 5 minutes).
The model uses 100% of the GPU.
I don't know whether the problem is in my system or with the model itself (I'm just starting to play with AI).
Below is the code I use to test the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import textwrap
import time

model_path = "E:/Moje_Projekty/AL_ML/qwen2.5-3b-chat-1104-r3/Model_START"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             torch_dtype="auto",
                                             device_map={"": 0})
# note: use_cache=False disables the KV cache, which slows generation badly
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_path)

while True:
    user_input = input("User: ")
    if user_input.lower() == "quit":
        print("End of conversation.")
        break
    # the posted snippet ended here; a minimal generation step might look like:
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=256)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                             skip_special_tokens=True)
    print(textwrap.fill("Assistant: " + reply, width=100))
    print(f"({time.time() - start:.1f} s)")
```