Output Results are not the same for the same model [lmms-lab/LLaVA-Video-7B-Qwen2] #2775

Closed
Noctis-SC opened this issue Jan 7, 2025 · 0 comments

Noctis-SC commented Jan 7, 2025

Hello, I don't know whether this is a bug or not. Maybe I'm missing something important, but I'm wondering why, when I serve the model lmms-lab/LLaVA-Video-7B-Qwen2 with sglang using this command:

```bash
python3 -m sglang.launch_server --model-path /mnt/datadisk0/sanya/LLaVA-NeXT-SC/lmms-lab/LLaVA-Video-7B-Qwen2/ --port=30000 --chat-template=chatml-llava --load-format safetensors --trust-remote-code
```
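(As a quick sanity check, the checkpoint the server actually loaded can be listed; this is a sketch that assumes sglang exposes the standard OpenAI-compatible /v1/models route, which may differ by version.)

```python
# Sanity check sketch: list the models the running server reports.
# Assumes the standard OpenAI-compatible /v1/models route; host/port match the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")
for m in client.models.list():
    print(m.id)  # should match the --model-path the server was launched with
```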

the output I get for the example video, after running this client code:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="/mnt/datadisk0/sanya/LLaVA-NeXT-SC/lmms-lab/LLaVA-Video-7B-Qwen2",
    messages=[
        {
            "role": "system",  # Changed from 'user' to 'system'
            "content": "Please describe what is happening in this video: http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4",
        },
    ],
    temperature=0,
    max_tokens=300,
)

print(response.choices[0].message.content)
```
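Note that in this request the video URL is only embedded in the plain-text message, so no frames are attached to the request at all, which may explain the difference below. If the endpoint accepts OpenAI-style multimodal content parts for this model (this is an assumption about the supported request shape, so treat the following as a sketch), the request would look roughly like:

```python
# Sketch only: assumes the sglang OpenAI-compatible endpoint accepts structured
# multimodal content parts for this model; whether it expects image_url or a
# video-specific content type for video input is an assumption here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="/mnt/datadisk0/sanya/LLaVA-NeXT-SC/lmms-lab/LLaVA-Video-7B-Qwen2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe what is happening in this video."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
                    },
                },
            ],
        },
    ],
    temperature=0,
    max_tokens=300,
)

print(response.choices[0].message.content)
```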

is this:

The video is a promotional clip for a product called "For Bigger Fun." It features a group of people, including children and adults, engaging in various activities that highlight the product's features. The video opens with a shot of a child playing with a toy, followed by a scene of a family playing a board game together. The next scene shows a group of friends playing a video game, with one person holding a controller and the others watching on a large screen. The video then cuts to a shot of a child playing with a toy car, followed by a scene of a family playing a game of soccer in a backyard. The video wraps up with a shot of a child playing with a toy robot, followed by a scene of a family playing a game of basketball in a park. Throughout the video, the product is shown to be versatile and fun for people of all ages.

But when I run it locally:

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from transformers import Qwen2Tokenizer
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
import os

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Uniformly sample up to max_frames_num frames and return them with their timestamps.
    if not isinstance(video_path, str):
        raise TypeError(f"Expected video_path to be a string, but got {type(video_path)}")
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    # import pdb;pdb.set_trace()
    return spare_frames, frame_time, video_time


pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

assert os.path.exists(pretrained), f"Model path {pretrained} does not exist"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

video_path = "/mnt/datadisk0/ForBiggerFun.mp4"
max_frames_num = 16
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
the output is more precise:

The video begins with a close-up of a hand plugging a Chromecast device into an HDMI port on the back of a television. The scene transitions to a cozy living room where two people are sitting on a couch, watching TV. The TV screen displays an animated movie featuring penguins in a snowy landscape. The next scene shows a person using a laptop to stream content from YouTube, followed by a group of people playing a video game together. A person is then seen holding a smartphone, possibly controlling the game or streaming content. The video continues with a person watching a car race on TV, followed by a family sitting on a couch, watching a movie together. The scene shifts to a baby in a high chair surrounded by toys and a mobile, indicating a child-friendly environment. The video wraps up with a dramatic scene of a building collapsing, followed by a white screen displaying the text 'Everything you love, now on your TV.' This is followed by the Google Chrome logo and the URL 'google.com/chromecast,' promoting the Chromecast device.
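For reference, the server request above only refers to the clip by URL, while the local script reads /mnt/datadisk0/ForBiggerFun.mp4 from disk. To make sure both runs refer to the same file, the sample clip can be downloaded to that path first (a minimal sketch using the same URL and path as above):

```python
# Fetch the sample clip to the path that the local script reads from, so the
# server request and the local run refer to the same file.
import requests

url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
local_path = "/mnt/datadisk0/ForBiggerFun.mp4"

with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```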

This is my pip list:

```
Package Version Editable project location
accelerate 1.2.1
aiofiles 23.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.11
aiosignal 1.3.2
annotated-types 0.7.0
anyio 4.7.0
async-timeout 5.0.1
attrs 24.3.0
av 14.0.1
bitsandbytes 0.41.0
certifi 2024.12.14
charset-normalizer 3.4.1
click 8.1.8
datasets 2.16.1
decord 0.6.0
deepspeed 0.14.4
dill 0.3.7
distro 1.9.0
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.6.1
einops-exts 0.0.4
exceptiongroup 1.2.2
fastapi 0.115.6
ffmpy 0.5.0
filelock 3.16.1
flash_attn 2.7.2.post1
frozenlist 1.5.0
fsspec 2023.10.0
ftfy 6.3.1
gitdb 4.0.12
GitPython 3.1.44
gradio 5.9.1
gradio_client 1.5.2
h11 0.14.0
hf_transfer 0.1.8
hjson 3.1.0
httpcore 1.0.7
httpx 0.28.1
huggingface-hub 0.27.0
idna 3.10
Jinja2 3.1.5
jiter 0.8.2
joblib 1.4.2
latex2mathml 3.77.0
llava 1.7.0.dev0 /mnt/datadisk0/sanya/LLaVA-NeXT-SC
markdown-it-py 3.0.0
markdown2 2.5.2
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.15
networkx 3.4.2
ninja 1.11.1.3
numpy 1.26.1
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.1.105
open_clip_torch 2.29.0
openai 1.59.3
opencv-python 4.10.0.84
orjson 3.10.13
packaging 24.2
pandas 2.2.3
peft 0.4.0
pillow 11.1.0
pip 24.3.1
platformdirs 4.3.6
propcache 0.2.1
protobuf 5.29.2
psutil 6.1.1
py-cpuinfo 9.0.0
pyarrow 18.1.0
pyarrow-hotfix 0.6
pydantic 2.10.4
pydantic_core 2.27.2
pydub 0.25.1
Pygments 2.18.0
python-dateutil 2.9.0.post0
python-multipart 0.0.20
pytz 2024.2
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
rich 13.9.4
ruff 0.8.6
safehttpx 0.1.6
safetensors 0.5.0
scikit-learn 1.2.2
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 2.19.2
setproctitle 1.3.4
setuptools 75.1.0
shellingham 1.5.4
shortuuid 1.0.13
shtab 1.7.1
six 1.17.0
smmap 5.0.2
sniffio 1.3.1
starlette 0.41.3
svgwrite 1.4.3
sympy 1.13.3
threadpoolctl 3.5.0
timm 1.0.12
tokenizers 0.15.2
tomlkit 0.13.2
torch 2.1.2
torchvision 0.16.2
tqdm 4.67.1
transformers 4.40.0.dev0
triton 2.1.0
typeguard 4.4.1
typer 0.15.1
typing_extensions 4.12.2
tyro 0.9.5
tzdata 2024.2
urllib3 1.26.20
uvicorn 0.34.0
wandb 0.18.7
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 14.1
wheel 0.44.0
xxhash 3.5.0
yarl 1.18.3
```