Stack error #177

Closed
Charlesliu77 opened this issue Jan 3, 2025 · 5 comments

@Charlesliu77

The input images have different sizes, and I get the error below when using the dynamic_s2 preprocessing method:
RuntimeError: stack expects each tensor to be equal size, but got [2560, 3584] at entry 0 and [3072, 3584] at entry 1.

@bfshi
Collaborator

bfshi commented Jan 7, 2025

Hi @Charlesliu77, could you point to which line of the code this issue happens at?

@Charlesliu77
Author

> Hi @Charlesliu77, could you point to which line of the code this issue happens at?

llava_arch.py, line 378:
image_features = torch.stack(image_features, dim=0)
After dynamic_s2 and token processing, input images of different sizes produce features of different sizes, which can't be stacked together.
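The failure can be reproduced in isolation. The sketch below is a minimal illustration, not code from the repository: the helper name `try_stack` and the small tensor shapes are made up for the example, standing in for the `[2560, 3584]` and `[3072, 3584]` features reported above.

```python
import torch


def try_stack(features):
    """Attempt torch.stack and return either the stacked tensor or the error.

    Hypothetical helper for illustration only.
    """
    try:
        return torch.stack(features, dim=0)
    except RuntimeError as err:
        return err


# Two "images" with different token counts (5 vs. 6) but the same hidden size (8).
a = torch.zeros(5, 8)
b = torch.zeros(6, 8)

mismatched = try_stack([a, b])  # a RuntimeError: sizes differ along dim 0
matched = try_stack([a, a])     # a tensor of shape (2, 5, 8)
```

`torch.stack` requires every input tensor to have exactly the same shape, so a differing token count in any one entry is enough to trigger the error.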

@bfshi
Collaborator

bfshi commented Jan 8, 2025

Hi, can you try replacing this line with

if all([feature.shape[0] == image_features[0].shape[0] for feature in image_features]):
    image_features = torch.stack(image_features, dim=0)
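To make the effect of the suggested guard concrete, here is a minimal standalone sketch. The function name `stack_if_uniform` is hypothetical and not part of the repository. It mirrors the check above, which only compares the first dimension, so it assumes all features share the same hidden size.

```python
import torch


def stack_if_uniform(image_features):
    """Stack features into one tensor only when their token counts match.

    Hypothetical helper: returns a stacked tensor when every entry has the
    same first dimension, otherwise returns the original list unchanged.
    """
    first_len = image_features[0].shape[0]
    if all(f.shape[0] == first_len for f in image_features):
        return torch.stack(image_features, dim=0)
    return image_features


uniform = stack_if_uniform([torch.zeros(4, 8), torch.zeros(4, 8)])
mixed = stack_if_uniform([torch.zeros(4, 8), torch.zeros(6, 8)])
```

With this guard, the result is a single tensor in the uniform case and a plain list of per-image tensors otherwise, so any code that consumes `image_features` afterwards would need to accept both forms.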

@Charlesliu77
Author

> Hi, can you try replacing this line with
>
> if all([feature.shape[0] == image_features[0].shape[0] for feature in image_features]):
>     image_features = torch.stack(image_features, dim=0)

Thanks a lot, it works. I have another question about the model versions: what's the difference between NVILA and NVILA-Lite?

@bfshi
Collaborator

bfshi commented Jan 9, 2025

NVILA-Lite is designed to improve efficiency over NVILA while maintaining competitive performance. The main differences are that NVILA-Lite uses 3x3 downsampling instead of 2x2 in the mm projector, and that it uses dynamic res instead of dynamic s2. We will add more details about NVILA-Lite in the next version of our preprint. Stay tuned!
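As a rough illustration of why the downsample factor matters for efficiency, the sketch below counts visual tokens before and after a k x k spatial downsample. It assumes non-overlapping k x k merging with the grid padded up to a multiple of k. That assumption, the function name `tokens_after_downsample`, and the 36 x 36 example grid are all hypothetical and not taken from the NVILA code.

```python
import math


def tokens_after_downsample(grid_h, grid_w, k):
    """Token count after merging non-overlapping k x k patches.

    Assumes the grid is padded up to a multiple of k (an assumption made
    for this sketch, not a statement about the actual projector).
    """
    return math.ceil(grid_h / k) * math.ceil(grid_w / k)


grid_h, grid_w = 36, 36                                   # hypothetical patch grid
tokens_2x2 = tokens_after_downsample(grid_h, grid_w, 2)   # 18 * 18 tokens
tokens_3x3 = tokens_after_downsample(grid_h, grid_w, 3)   # 12 * 12 tokens
```

Under these assumptions, a 3x3 downsample leaves 4/9 as many tokens as a 2x2 downsample on the same grid.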

bfshi closed this as completed Jan 11, 2025