| Model / Paper | Code | LLM | Vision Encoder | Connector | Resolution | Date |
|---|---|---|---|---|---|---|
| Flamingo: a Visual Language Model for Few-Shot Learning | Github | Chinchilla | NFNet | Perceiver | 480 | 2022/04 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Github | Flan-T5 / OPT | CLIP ViT-L/14 / Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/01 |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/03 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Github | Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/04 |
| Visual Instruction Tuning | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/04 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Github | LLaMA | CLIP ViT-L/14 | Abstractor | 224 | 2023/04 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/04 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Github | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/05 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Github | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/05 |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Github | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Github | Stable-Vicuna | CLIP ViT-L/14 | Temporal Module + Linear | 224 | 2023/06 |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Github | Vicuna | EVA-1B | Resampler | 420 | 2023/07 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Github | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 448 | 2023/08 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Github | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former + MLP | 224 | 2023/08 |
| IDEFICS | HF | LLaMA | OpenCLIP ViT-H/14 | Perceiver | 224 | 2023/08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Github | LLaMA, MPT | CLIP ViT-L/14 | Perceiver | 224 | 2023/08 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Github | InternLM | Eva-CLIP ViT-G/14 | Perceiver | 224 | 2023/09 |
| Improved Baselines with Visual Instruction Tuning | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/10 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Github | LLaMA-2 | EVA | Linear | 448 | 2023/10 |
| Fuyu-8B: A Multimodal Architecture for AI Agents | HF | Persimmon | - | Linear | unlimited | 2023/10 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Github | LLaMA | CLIP ViT-L/14 | Abstractor | 224*20 | 2023/10 |
| CogVLM: Visual Expert for Pretrained Language Models | Github | Vicuna-1.5 | EVA2-CLIP-E | MLP | 490 | 2023/11 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Github | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 896 | 2023/11 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/11 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Github | LLaMA-2 | CLIP ViT-L/14 | Abstractor | 448 | 2023/11 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Github | LLaMA-2 | CLIP ViT-L/14 + CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear + Q-Former | 672 | 2023/11 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Github | Vicuna | InternViT | QLLaMA / MLP | 336 | 2023/12 |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Github | MobileLLaMA | CLIP ViT-L/14 | LDP (conv-based) | 336 | 2023/12 |
| VILA: On Pre-training for Visual Language Models | Github | LLaMA-2 | CLIP ViT-L | Linear | 336 | 2023/12 |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Github | Vicuna | CLIP ConvNeXt-L | MLP | 512 | 2023/12 |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | Github | Vicuna-1.5 | CLIP ViT-L/14 | C-Abstractor / D-Abstractor | 336 | 2023/12 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - | UL2 | SigLIP ViT-G/14 | Linear | 1064 | 2023/12 |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Github | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP | 672 | 2024/01 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Github | InternLM-2 | CLIP ViT-L/14 | MLP | 490 | 2024/01 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Github | Vicuna-1.5 | CLIP ViT-L/14 + MAE + LayoutLMv3 + ConvNeXt + SAM + DINOv2 ViT-G | Poly-Expert Fusion | 1024 | 2024/01 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Github | StableLM / Qwen / Phi-2 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | Github | MobileLLaMA | CLIP ViT-L/14 | LDP v2 | 336 | 2024/02 |
| Bunny: Efficient Multimodal Learning from Data-centric Perspective | Github | Phi-1.5 / LLaMA-3 / StableLM-2 / Phi-2 | SigLIP, EVA-CLIP | MLP | 1152 | 2024/02 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Github | TinyLLaMA / Phi-2 / StableLM-2 | SigLIP-L, CLIP ViT-L | MLP | 336/384 | 2024/02 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Github | TinyLLaMA / InternLM2 / LLaMA2 / Mixtral | CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear | 672 | 2024/02 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Github | Gemma / Vicuna / Mixtral / Hermes-2-Yi | CLIP ViT-L + ConvNeXt-L | Cross-Attention + MLP | 1536 | 2024/03 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Github | DeepSeek LLM | SigLIP-L, SAM-B | MLP | 1024 | 2024/03 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Github | Vicuna | CLIP ViT-L/14 | Perceiver | 336*6 | 2024/03 |
| [Yi-VL] Yi: Open Foundation Models by 01.AI | Github | Yi | CLIP ViT-H/14 | MLP | 448 | 2024/03 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Github | in-house LLM | CLIP ViT-H* | C-Abstractor | 1792 | 2024/03 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Github | Mamba LLM | CLIP ViT-L / SigLIP-SO400M | VSS + MLP | 384 | 2024/03 |
| Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference | Github | Mamba-Zephyr | DINOv2 + SigLIP | MLP | 384 | 2024/03 |
| [InternVL 1.5] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Github | InternLM2 | InternViT-6B | MLP | 448*40 | 2024/04 |
| [Phi-3-Vision] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | Github | Phi-3 | CLIP ViT-L/14 | MLP | 336*16 | 2024/04 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Github | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/04 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Github | InternLM-1 | SigLIP-SO400M/14 | Resampler + MLP | unlimited | 2024/04 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Github | Phi-2 | SigLIP | MLP | 384 | 2024/05 |
| [IDEFICS2] What matters when building vision-language models? | HF | Mistral-v0.1 | SigLIP-SO400M/14 | Perceiver + MLP | 384*4 | 2024/05 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | Github | Vicuna | CLIP ConvNeXt-L* | MLP | 1536 | 2024/05 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Github | LLaMA3 / Qwen1.5 | CLIP ViT-L + Visual Embedding | - | 336 | 2024/05 |
| DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/05 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Github | Mistral / Mixtral | CLIP ViT-L/14 | MLP | 336 | 2024/05 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Github | Vicuna-1.5 / LLaMA-3 / Hermes-2-Yi | CLIP ViT-L/14 + DINOv2 ViT-L/14 + SigLIP ViT-SO400M + OpenCLIP ConvNeXt-XXL | Spatial Vision Aggregator | 1024 | 2024/06 |
| GLM-4V | Github | GLM4 | EVA-CLIP-E | Conv + SwiGLU | 1120 | 2024/06 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Github | InternLM-2 | CLIP ViT-L/14 | MLP | 560*24 | 2024/07 |
| [IDEFICS3] Building and better understanding vision-language models: insights and future directions | HF | LLaMA 3.1 | SigLIP-SO400M/14 | Perceiver + MLP | 1820 | 2024/08 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Github | Qwen2 | SigLIP-SO400M/14 | Linear | 384*6 | 2024/08 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Github | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 1344 | 2024/08 |
| CogVLM2-Video: Visual Language Models for Image and Video Understanding | Github | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 224 | 2024/08 |
| LLaVA-OneVision: Easy Visual Task Transfer | Github | Qwen-2 | SigLIP-SO400M/14 | MLP | 384*36 | 2024/09 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Github | Qwen-2 | ViT-675M | MLP | unlimited | 2024/09 |
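
Most rows above share the same high-level recipe: a (usually frozen) vision encoder turns the image into patch features, a connector (Linear/MLP, Q-Former, Perceiver resampler, abstractor, or cross-attention) maps those features into the LLM's embedding space, and the resulting visual tokens are fed to the LLM alongside the text tokens. The sketch below illustrates the simplest variant in the table, the LLaVA-style MLP connector; it is a minimal illustration with made-up module names and dimensions, not the code of any listed project.

```python
# Minimal sketch (PyTorch) of the "vision encoder -> MLP connector -> LLM" pattern
# shared by many rows above. All names and dimensions here are illustrative, not
# the API of any specific repository.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP with GELU, in the style of LLaVA-1.5's projector.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim), e.g. from a CLIP ViT-L/14
        # returns:     (batch, num_patches, llm_dim) visual tokens for the LLM
        return self.proj(patch_feats)

if __name__ == "__main__":
    # Toy usage: 576 patches from a 336px CLIP ViT-L/14 (24x24 grid), projected to a 4096-d LLM.
    connector = MLPConnector(vision_dim=1024, llm_dim=4096)
    fake_patches = torch.randn(1, 576, 1024)                      # stand-in for encoder output
    visual_tokens = connector(fake_patches)                       # (1, 576, 4096)
    text_embeds = torch.randn(1, 32, 4096)                        # stand-in for embedded prompt tokens
    llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # prepend image tokens to the prompt
    print(llm_inputs.shape)                                       # torch.Size([1, 608, 4096])
```

Q-Former, Perceiver, and abstractor connectors in the table play the same role but compress the patch sequence into a fixed number of query tokens before it reaches the LLM, which is why those models can afford larger input resolutions.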