| Model / Paper | Code | LLM | Vision Encoder | Connector | Resolution | Date |
|---|---|---|---|---|---|---|
| Flamingo: a Visual Language Model for Few-Shot Learning | Github | Chinchilla | NFNet | Perceiver | 480 | 2022/04 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Github | Flan-T5 / OPT | CLIP ViT-L/14 / Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/01 |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/03 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Github | Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/04 |
| Visual Instruction Tuning | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/04 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Github | LLaMA | CLIP ViT-L/14 | Abstractor | 224 | 2023/04 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/04 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Github | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/05 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Github | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | Github | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/05 |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Github | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Github | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Github | Stable-Vicuna | CLIP ViT-L/14 | Temporal Module + Linear | 224 | 2023/06 |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Github | Vicuna | EVA-1B | Resampler | 420 | 2023/07 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Github | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 448 | 2023/08 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Github | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former + MLP | 224 | 2023/08 |
| IDEFICS | HF | LLaMA | OpenCLIP ViT-H/14 | Perceiver | 224 | 2023/08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Github | LLaMA, MPT | CLIP ViT-L/14 | Perceiver | 224 | 2023/08 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Github | InternLM | Eva-CLIP ViT-G/14 | Perceiver | 224 | 2023/09 |
| Improved Baselines with Visual Instruction Tuning | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/10 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Github | LLaMA-2 | EVA | Linear | 448 | 2023/10 |
| Fuyu-8B: A Multimodal Architecture for AI Agents | HF | Persimmon | - | Linear | unlimited | 2023/10 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Github | LLaMA | CLIP ViT-L/14 | Abstractor | 224*20 | 2023/10 |
| CogVLM: Visual Expert for Pretrained Language Models | Github | Vicuna-1.5 | EVA2-CLIP-E | MLP | 490 | 2023/11 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Github | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 896 | 2023/11 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/11 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Github | LLaMA-2 | CLIP ViT-L/14 | Abstractor | 448 | 2023/11 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Github | LLaMA-2 | CLIP ViT-L/14 + CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear + Q-Former | 672 | 2023/11 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Github | Vicuna | InternViT | QLLaMA / MLP | 336 | 2023/12 |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Github | MobileLLaMA | CLIP ViT-L/14 | LDP (conv-based) | 336 | 2023/12 |
| VILA: On Pre-training for Visual Language Models | Github | LLaMA-2 | CLIP ViT-L | Linear | 336 | 2023/12 |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Github | Vicuna | CLIP ConvNeXt-L | MLP | 512 | 2023/12 |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | Github | Vicuna-1.5 | CLIP ViT-L/14 | C-Abstractor / D-Abstractor | 336 | 2023/12 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - | UL2 | SigLIP ViT-G/14 | Linear | 1064 | 2023/12 |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Github | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP | 672 | 2024/01 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Github | InternLM-2 | CLIP ViT-L/14 | MLP | 490 | 2024/01 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Github | Vicuna-1.5 | CLIP ViT-L/14 + MAE + LayoutLMv3 + ConvNeXt + SAM + DINOv2 ViT-G | Poly-Expert Fusion | 1024 | 2024/01 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Github | StableLM / Qwen / Phi-2 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | Github | MobileLLaMA | CLIP ViT-L/14 | LDP v2 | 336 | 2024/02 |
| Bunny: Efficient Multimodal Learning from Data-centric Perspective | Github | Phi-1.5 / LLaMA-3 / StableLM-2 / Phi-2 | SigLIP, EVA-CLIP | MLP | 1152 | 2024/02 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Github | TinyLLaMA / Phi-2 / StableLM-2 | SigLIP-L, CLIP ViT-L | MLP | 336/384 | 2024/02 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Github | TinyLLaMA / InternLM2 / LLaMA2 / Mixtral | CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear | 672 | 2024/02 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Github | Gemma / Vicuna / Mixtral / Hermes-2-Yi | CLIP ViT-L + ConvNeXt-L | Cross-Attention + MLP | 1536 | 2024/03 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Github | DeepSeek LLM | SigLIP-L, SAM-B | MLP | 1024 | 2024/03 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Github | Vicuna | CLIP ViT-L/14 | Perceiver | 336*6 | 2024/03 |
| [Yi-VL] Yi: Open Foundation Models by 01.AI | Github | Yi | CLIP ViT-H/14 | MLP | 448 | 2024/03 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Github | in-house LLM | CLIP ViT-H* | C-Abstractor | 1792 | 2024/03 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Github | Mamba LLM | CLIP ViT-L / SigLIP-SO400M | VSS + MLP | 384 | 2024/03 |
| Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference | Github | Mamba-Zephyr | DINOv2 + SigLIP | MLP | 384 | 2024/03 |
| [InternVL 1.5] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Github | InternLM2 | InternViT-6B | MLP | 448*40 | 2024/04 |
| [Phi-3-Vision] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | Github | Phi-3 | CLIP ViT-L/14 | MLP | 336*16 | 2024/04 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Github | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/04 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Github | InternLM-1 | SigLIP-SO400M/14 | Resampler + MLP | unlimited | 2024/04 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Github | Phi-2 | SigLIP | MLP | 384 | 2024/05 |
| [IDEFICS2] What matters when building vision-language models? | HF | Mistral-v0.1 | SigLIP-SO400M/14 | Perceiver + MLP | 384*4 | 2024/05 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | Github | Vicuna | CLIP ConvNeXt-L* | MLP | 1536 | 2024/05 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Github | LLaMA3 / Qwen1.5 | CLIP ViT-L + Visual Embedding | - | 336 | 2024/05 |
| DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | Github | Vicuna-1.5 | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/05 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Github | Mistral / Mixtral | CLIP ViT-L/14 | MLP | 336 | 2024/05 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Github | Vicuna-1.5 / LLaMA-3 / Hermes-2-Yi | CLIP ViT-L/14 + DINOv2 ViT-L/14 + SigLIP ViT-SO400M + OpenCLIP ConvNeXt-XXL | Spatial Vision Aggregator | 1024 | 2024/06 |
| GLM-4V | Github | GLM4 | EVA-CLIP-E | Conv + SwiGLU | 1120 | 2024/06 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Github | InternLM-2 | CLIP ViT-L/14 | MLP | 560*24 | 2024/07 |
| [IDEFICS3] Building and better understanding vision-language models: insights and future directions | HF | LLaMA 3.1 | SigLIP-SO400M/14 | Perceiver + MLP | 1820 | 2024/08 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Github | Qwen2 | SigLIP-SO400M/14 | Linear | 384*6 | 2024/08 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Github | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 1344 | 2024/08 |
| CogVLM2-Video: Visual Language Models for Image and Video Understanding | Github | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 224 | 2024/08 |
| LLaVA-OneVision: Easy Visual Task Transfer | Github | Qwen-2 | SigLIP-SO400M/14 | MLP | 384*36 | 2024/09 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Github | Qwen-2 | ViT-675M | MLP | unlimited | 2024/09 |
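
Most rows above share the same high-level recipe: a (usually frozen) vision encoder turns the image into patch features, a connector (Linear/MLP, Q-Former, Perceiver resampler, abstractor, or cross-attention) maps those features into the LLM's embedding space, and the resulting visual tokens are fed to the LLM alongside the text tokens. The sketch below illustrates the simplest variant in the table, the LLaVA-style MLP connector; it is a minimal illustration with made-up module names and dimensions, not the code of any listed project.

```python
# Minimal sketch (PyTorch) of the "vision encoder -> MLP connector -> LLM" pattern
# shared by many rows above. All names and dimensions here are illustrative, not
# the API of any specific repository.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP with GELU, in the style of LLaVA-1.5's projector.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim), e.g. from a CLIP ViT-L/14
        # returns:     (batch, num_patches, llm_dim) visual tokens for the LLM
        return self.proj(patch_feats)

if __name__ == "__main__":
    # Toy usage: 576 patches from a 336px CLIP ViT-L/14 (24x24 grid), projected to a 4096-d LLM.
    connector = MLPConnector(vision_dim=1024, llm_dim=4096)
    fake_patches = torch.randn(1, 576, 1024)                      # stand-in for encoder output
    visual_tokens = connector(fake_patches)                       # (1, 576, 4096)
    text_embeds = torch.randn(1, 32, 4096)                        # stand-in for embedded prompt tokens
    llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # prepend image tokens to the prompt
    print(llm_inputs.shape)                                       # torch.Size([1, 608, 4096])
```

Q-Former, Perceiver, and abstractor connectors in the table play the same role but compress the patch sequence into a fixed number of query tokens before it reaches the LLM, which is why those models can afford larger input resolutions.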