A Collection of Papers and Code for CVPR2024 AIGC
A curated collection of this year's CVPR AIGC-related papers and code, organized as follows.
Please feel free to star, fork or PR if helpful~
- Awesome-ECCV2024-AIGC
- Awesome-AIGC-Research-Groups
- Awesome-Low-Level-Vision-Research-Groups
- Awesome-CVPR2024-CVPR2021-CVPR2020-Low-Level-Vision
- Awesome-ECCV2020-Low-Level-Vision
CVPR 2024 official website: https://cvpr.thecvf.com/Conferences/2024
CVPR 2024 accepted paper list: https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers
CVPR 2024 open access proceedings: https://openaccess.thecvf.com/CVPR2024
Conference dates: June 17-21, 2024
Acceptance notification: February 27, 2024
Contents
- 1. Image Generation / Image Synthesis
- 2. Image Editing
- 3. Video Generation / Video Synthesis
- 4. Video Editing
- 5. 3D Generation / 3D Synthesis
- 6. 3D Editing
- 7. Multi-Modal Large Language Model
- 8. Others
1. Image Generation / Image Synthesis
Adversarial Score Distillation: When score distillation meets GAN
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wei_Adversarial_Score_Distillation_When_score_distillation_meets_GAN_CVPR_2024_paper.html
- Code: https://github.com/2y7c3/ASD
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
- Paper: https://arxiv.org/abs/2405.05252
- Code:
CapHuman: Capture Your Moments in Parallel Universes
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liang_CapHuman_Capture_Your_Moments_in_Parallel_Universes_CVPR_2024_paper.html
- Code: https://github.com/VamosC/CapHuman
CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- Paper: https://arxiv.org/abs/2404.00521
- Code:
- Paper: https://arxiv.org/abs/2311.15773
- Code:
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Tang_CoDi-2_In-Context_Interleaved_and_Interactive_Any-to-Any_Generation_CVPR_2024_paper.html
- Code: https://github.com/microsoft/i-Code/tree/main/CoDi-2
- Paper: https://arxiv.org/abs/2404.01143v1
- Code:
- Paper: https://arxiv.org/abs/2312.03045
- Code:
- Paper: https://arxiv.org/abs/2405.04356v1
- Code:
- Paper: https://arxiv.org/abs/2311.18257
- Code:
Domain Gap Embeddings for Generative Dataset Augmentation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wang_Domain_Gap_Embeddings_for_Generative_Dataset_Augmentation_CVPR_2024_paper.html
- Code: https://github.com/humansensinglab/DoGE
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
ElasticDiffusion: Training-free Arbitrary Size Image Generation
- Paper: https://arxiv.org/abs/2311.18822
- Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official
- Paper: https://arxiv.org/abs/2404.03913v1
- Code:
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
- Paper: https://arxiv.org/abs/2403.06775
- Code:
FreeU: Free Lunch in Diffusion U-Net
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Si_FreeU_Free_Lunch_in_Diffusion_U-Net_CVPR_2024_paper.html
- Code: https://github.com/ChenyangSi/FreeU
Towards Generalizable Tumor Synthesis
- Paper: https://www.cs.jhu.edu/~alanlab/Pubs24/chen2024towards.pdf
- Code: https://github.com/MrGiovanni/DiffTumor
Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Fu_Generate_Like_Experts_Multi-Stage_Font_Generation_by_Incorporating_Font_Transfer_CVPR_2024_paper.html
- Code: https://github.com/fubinfb/MSD-Font
High-fidelity Person-centric Subject-to-Image Synthesis
- Paper: https://arxiv.org/abs/2311.10329
- Code: https://github.com/CodeGoat24/Face-diffuser
- Paper: https://arxiv.org/abs/2304.03411
- Code:
- Paper: https://arxiv.org/abs/2401.01952
- Code:
Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- Paper: https://arxiv.org/abs/2308.15692
- Code:
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
Learned Representation-Guided Diffusion Models for Large-Image Generation
- Paper: https://arxiv.org/abs/2312.07330
- Code: https://github.com/cvlab-stonybrook/Large-Image-Diffusion
Learning Continuous 3D Words for Text-to-Image Generation
- Paper: https://arxiv.org/abs/2402.08654
- Code: https://github.com/ttchengab/continuous_3d_words_code/
- Paper: https://arxiv.org/abs/2311.15841
- Code:
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
- Paper: https://arxiv.org/abs/2308.10997
- Code:
- Paper: https://arxiv.org/abs/2403.04290
- Code:
- Paper: https://arxiv.org/abs/2404.02883
- Code:
- Paper: https://arxiv.org/abs/2405.12978
- Code:
Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models
- Paper: https://arxiv.org/abs/2404.15081
- Code:
- Paper: https://arxiv.org/abs/2406.01954
- Code:
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Paper: https://arxiv.org/abs/2401.09603
- Code: https://github.com/google-research/google-research/tree/master/cmmd
- Paper: https://arxiv.org/abs/2312.10240
- Code:
Shadow Generation for Composite Image Using Diffusion Model
- Paper: https://arxiv.org/abs/2308.09972
- Code: https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2
- Paper: https://arxiv.org/abs/2402.17563
- Code:
- Paper: https://arxiv.org/abs/2403.18978
- Code:
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- Paper: https://arxiv.org/abs/2403.05239
- Code:
Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Xia_Towards_More_Accurate_Diffusion_Model_Acceleration_with_A_Timestep_Tuner_CVPR_2024_paper.html
- Code: https://github.com/THU-LYJ-Lab/time-tuner
- Paper: https://arxiv.org/abs/2311.09257
- Code:
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Starodubcev_Your_Student_is_Better_Than_Expected_Adaptive_Teacher-Student_Collaboration_for_CVPR_2024_paper.html
- Code: https://github.com/yandex-research/adaptive-diffusion
2. Image Editing
3D-Aware Face Editing via Warping-Guided Latent Direction Learning
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Cheng_3D-Aware_Face_Editing_via_Warping-Guided_Latent_Direction_Learning_CVPR_2024_paper.html
- Code: https://github.com/cyh-sj/FaceEdit3D
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Yin_Benchmarking_Segmentation_Models_with_Mask-Preserved_Attribute_Editing_CVPR_2024_paper.html
- Code: https://github.com/PRIS-CV/Pascal-EA
Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
- Paper: https://arxiv.org/abs/2405.04377
- Code:
Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Sun_Content-Style_Decoupling_for_Unsupervised_Makeup_Transfer_without_Generating_Pseudo_Ground_CVPR_2024_paper.html
- Code: https://github.com/Snowfallingplum/CSD-MT
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- Paper: https://arxiv.org/abs/2311.18608
- Code: https://github.com/HyelinNAM/ContrastiveDenoisingScore
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Sun_DiffAM_Diffusion-based_Adversarial_Makeup_Transfer_for_Facial_Privacy_Protection_CVPR_2024_paper.html
- Code: https://github.com/HansSunY/DiffAM
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Mou_DiffEditor_Boosting_Accuracy_and_Flexibility_on_Diffusion-based_Image_Editing_CVPR_2024_paper.html
- Code: https://github.com/MC-E/DragonDiffusion
Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- Paper: https://arxiv.org/abs/2312.10113
- Code: https://github.com/guoqincode/Focus-on-Your-Instruction
HIVE: Harnessing Human Feedback for Instructional Visual Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zhang_HIVE_Harnessing_Human_Feedback_for_Instructional_Visual_Editing_CVPR_2024_paper.html
- Code: https://github.com/salesforce/HIVE
- Paper: https://arxiv.org/abs/2403.09632
- Code: https://github.com/guoqincode/Focus-on-Your-Instruction
In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing
- Paper: https://arxiv.org/abs/2312.04965
- Code: https://github.com/Twizwei/in-n-out
Inversion-Free Image Editing with Natural Language
- Paper: https://arxiv.org/abs/2312.04965
- Code: https://github.com/sled-group/InfEdit
LEDITS++: Limitless Image Editing using Text-to-Image Models
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Brack_LEDITS_Limitless_Image_Editing_using_Text-to-Image_Models_CVPR_2024_paper.html
- Code: https://github.com/ml-research/ledits_pp
Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
- Paper: https://arxiv.org/abs/2303.17546
- Code: https://github.com/YangChangHee/CVPR2024_Person-In-Place_RELEASE
- Paper: https://arxiv.org/abs/2405.19775
- Code:
- Paper: https://arxiv.org/abs/2403.00483
- Code:
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Jiang_SCEdit_Efficient_and_Controllable_Image_Diffusion_Generation_via_Skip_Connection_CVPR_2024_paper.html
- Code: https://github.com/ali-vilab/SCEdit
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
- Paper: https://arxiv.org/abs/2402.18848
- Code:
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Bobkov_The_Devil_is_in_the_Details_StyleFeatureEditor_for_Detail-Rich_StyleGAN_CVPR_2024_paper.html
- Code: https://github.com/FusionBrainLab/StyleFeatureEditor
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Towards_Understanding_Cross_and_Self-Attention_in_Stable_Diffusion_for_Text-Guided_CVPR_2024_paper.html
- Code: https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing
Z*: Zero-shot Style Transfer via Attention Reweighting
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Deng_Z_Zero-shot_Style_Transfer_via_Attention_Reweighting_CVPR_2024_paper.html
- Code: https://github.com/HolmesShuan/Zero-shot-Style-Transfer-via-Attention-Rearrangement
3. Video Generation / Video Synthesis
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- Paper:
- Code:
- Paper: https://arxiv.org/abs/2403.01901
- Code:
Grid Diffusion Models for Text-to-Video Generation
- Paper: https://arxiv.org/abs/2404.00234
- Code: https://github.com/taegyeong-lee/Grid-Diffusion-Models-for-Text-to-Video-Generation
LAMP: Learn A Motion Pattern for Few-Shot Video Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wu_LAMP_Learn_A_Motion_Pattern_for_Few-Shot_Video_Generation_CVPR_2024_paper.html
- Code: https://github.com/RQ-Wu/LAMP
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives
4. Video Editing
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Zhang_CAMEL_CAusal_Motion_Enhancement_Tailored_for_Lifting_Text-driven_Video_Editing_CVPR_2024_paper.html
- Code: https://github.com/zhangguiwei610/CAMEL
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liu_DynVideo-E_Harnessing_Dynamic_NeRF_for_Large-Scale_Motion-_and_View-Change_Human-Centric_CVPR_2024_paper.html
- Code: https://github.com/qiuyu96/CoDeF
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
5. 3D Generation / 3D Synthesis
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- Paper: https://arxiv.org/abs/2405.09546
- Code: https://github.com/behavior-vision-suite/behavior-vision-suite.github.io
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
- Paper: https://arxiv.org/abs/2312.05208
- Code:
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
Diffusion Time-step Curriculum for One Image to 3D Generation
- Paper: https://paperswithcode.com/paper/diffusion-time-step-curriculum-for-one-image
- Code: https://github.com/yxymessi/DTC123
- Paper: https://arxiv.org/abs/2312.03050
- Code:
- Paper: https://arxiv.org/abs/2310.01406
- Code:
Intrinsic Image Diffusion for Indoor Single-view Material Estimation
- Paper: https://arxiv.org/abs/2312.12274
- Code: https://github.com/Peter-Kocsis/IntrinsicImageDiffusion
MotionEditor: Editing Video Motion via Content-Aware Diffusion
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Tu_MotionEditor_Editing_Video_Motion_via_Content-Aware_Diffusion_CVPR_2024_paper.html
- Code: https://github.com/Francis-Rings/MotionEditor
Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
- Paper: https://arxiv.org/abs/2402.05746
- Code: https://github.com/yifanlu0227/ChatSim
- Paper: https://arxiv.org/abs/2405.16925
- Code:
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
SemCity: Semantic Scene Generation with Triplane Diffusion
- Paper: https://arxiv.org/abs/2403.07773
- Code: https://github.com/zoomin-lee/SemCity
Single Mesh Diffusion Models with Field Latents
- Paper: https://arxiv.org/abs/2312.09250
- Code: https://github.com/google-research/google-research/tree/master/mesh_diffusion
TIGER: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process
- Paper: https://cvlab.cse.msu.edu/pdfs/Ren_Kim_Liu_Liu_TIGER_supp.pdf
- Code: https://github.com/Zhiyuan-R/Tiger-Diffusion
6. 3D Editing
Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Song_Arbitrary_Motion_Style_Transfer_with_Multi-condition_Motion_Latent_Diffusion_Model_CVPR_2024_paper.html
- Code: https://github.com/XingliangJin/MCM-LDM
Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/He_Customize_your_NeRF_Adaptive_Source_Driven_3D_Scene_Editing_via_CVPR_2024_paper.html
- Code: https://github.com/hrz2000/CustomNeRF
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Bao_GeneAvatar_Generic_Expression-Aware_Volumetric_Head_Avatar_Editing_from_a_Single_CVPR_2024_paper.html
- Code: https://github.com/zju3dv/GeneAvatar
Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Mou_Instruct_4D-to-4D_Editing_4D_Scenes_as_Pseudo-3D_Scenes_Using_2D_CVPR_2024_paper.html
- Code: https://github.com/Friedrich-M/Instruct-4D-to-4D
LAENeRF: Local Appearance Editing for Neural Radiance Fields
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Radl_LAENeRF_Local_Appearance_Editing_for_Neural_Radiance_Fields_CVPR_2024_paper.html
- Code: https://github.com/r4dl/LAENeRF
SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Chen_SHAP-EDITOR_Instruction-Guided_Latent_3D_Editing_in_Seconds_CVPR_2024_paper.html
- Code: https://github.com/silent-chen/Shap-Editor
Text-Guided 3D Face Synthesis - From Generation to Editing
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Wu_Text-Guided_3D_Face_Synthesis_-_From_Generation_to_Editing_CVPR_2024_paper.html
- Code: https://github.com/JiejiangWu/FaceG2E
7. Multi-Modal Large Language Model
BioCLIP: A Vision Foundation Model for the Tree of Life
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Stevens_BioCLIP_A_Vision_Foundation_Model_for_the_Tree_of_Life_CVPR_2024_paper.html
- Code: https://github.com/Imageomics/bioclip
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
- Paper: https://arxiv.org/abs/2405.20305
- Code:
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Describing Differences in Image Sets with Natural Language
- Paper: https://arxiv.org/abs/2312.02974
- Code: https://github.com/Understanding-Visual-Datasets/VisDiff
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- Paper: https://arxiv.org/abs/2404.11207
- Code: https://github.com/zycheiheihei/transferable-visual-prompting
FairCLIP: Harnessing Fairness in Vision-Language Learning
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
- Paper: https://arxiv.org/abs/2404.16123
- Code:
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
- Paper: https://arxiv.org/abs/2404.16123
- Code:
Improved Baselines with Visual Instruction Tuning
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.html
- Code: https://github.com/haotian-liu/LLaVA
- Paper: https://arxiv.org/abs/2404.00909
- Code:
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Paper: https://arxiv.org/abs/2403.07839
- Code:
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Paper: https://arxiv.org/abs/2404.09011
- Code:
- Paper: https://arxiv.org/abs/2404.01156
- Code:
- Paper: https://arxiv.org/abs/2403.12532
- Code:
8. Others
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
Diff-BGM: A Diffusion Model for Video Background Music Generation
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Li_Diff-BGM_A_Diffusion_Model_for_Video_Background_Music_Generation_CVPR_2024_paper.html
- Code: https://github.com/sizhelee/Diff-BGM
InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liang_InfLoRA_Interference-Free_Low-Rank_Adaptation_for_Continual_Learning_CVPR_2024_paper.html
- Code: https://github.com/liangyanshuo/InfLoRA
Shadows Don't Lie and Lines Can't Bend! Generative Models Don't Know Projective Geometry... For Now
- Paper: https://openaccess.thecvf.com/content/CVPR2024/html/Sarkar_Shadows_Dont_Lie_and_Lines_Cant_Bend_Generative_Models_dont_CVPR_2024_paper.html
- Code: https://github.com/hanlinm2/projective-geometry
Continuously updated~