Areas for research:

  1. CLIP in point-cloud/3D.
  2. Open-Vocabulary Object Detection (OVD).
  3. Efficient CLIP training (better use of computation or data).
  4. Applying CLIP models in narrow fields, such as Human-Object Interaction detection, crowd counting, etc.

Papers from CVPR 2023:

(some papers may be missing)

Pretraining CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | Reduces memory consumption by decomposing the gradient computation of the contrastive loss. | code |
| Scaling Language-Image Pre-training via Masking | Randomly masks out (drops) a large portion of the image patches in the image branch of CLIP, which improves training speed, memory use, and performance. | code |
| Non-Contrastive Learning Meets Language-Image Pre-Training | Adds the loss introduced in SwAV (based on cluster-assignment agreement) to the contrastive loss of CLIP. Interestingly, the non-contrastive loss used alone gives poor zero-shot performance, but combined with the contrastive loss (0.7 SwAV + 0.3 contrastive) it outperforms the contrastive loss alone. It also reduces the need for data (trained on only 35 million) and works with a smaller batch size (4096 compared to 32K). | code |
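
The 0.7/0.3 weighting noted in the last row can be illustrated with a minimal NumPy sketch. It assumes the standard symmetric CLIP contrastive loss and treats the SwAV-style term as an already-computed scalar. The function names, the temperature of 0.07, and the toy inputs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive loss; row i of img matches row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) cosine similarities
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def combined_loss(contrastive, non_contrastive, w_non=0.7, w_con=0.3):
    # Weights follow the note above; non_contrastive is a placeholder scalar
    # standing in for a SwAV-style loss computed elsewhere.
    return w_non * non_contrastive + w_con * contrastive
```

With matched embeddings the contrastive term is small; shuffling the text rows relative to the image rows makes it grow, which is the signal the weighted sum mixes with the non-contrastive term.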

Finetuning CLIP models:

| Title | Description | Code |
| --- | --- | --- |
| Learning to Name Classes for Vision and Language Models | Learns a token embedding for each class name in an otherwise frozen CLIP model, reducing the need for prompt engineering. | NA |
| Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | When fine-tuning the model with a linear classifier, it is useful to train the classifier on examples from multiple modalities. | NA |
| MaPLe: Multi-modal Prompt Learning | Learnable prompts on both the image and text branches; the image prompts are derived from a linear layer that takes the text prompts as input. | code |
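
The coupling described in the MaPLe row (image prompts produced by a linear layer applied to the text prompts) can be sketched in a few lines of NumPy. The prompt count and the widths 512 and 768 are assumed values chosen for illustration, and the random initialisation stands in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CTX = 2      # number of prompt tokens (hypothetical)
D_TEXT = 512   # text-branch embedding width (assumed)
D_IMAGE = 768  # image-branch embedding width (assumed)

# Learnable text prompts; random values stand in for trained parameters.
text_prompts = rng.normal(scale=0.02, size=(N_CTX, D_TEXT))

# Learnable linear coupling layer from text-prompt space to image-prompt space.
W = rng.normal(scale=0.02, size=(D_TEXT, D_IMAGE))
b = np.zeros(D_IMAGE)

def image_prompts_from_text(text_prompts, W, b):
    # The image prompts are a function of the text prompts rather than
    # independent parameters, so both branches are tuned jointly.
    return text_prompts @ W + b

image_prompts = image_prompts_from_text(text_prompts, W, b)
print(image_prompts.shape)  # (2, 768)
```

Because the image prompts depend on the text prompts, a gradient step on the image side also updates the text prompts, which is the point of coupling the two branches.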

CLIP in video:

| Title | Description | Code |
| --- | --- | --- |
| Fine-Tuned CLIP Models Are Efficient Video Learners | Adapts CLIP to video. Claims that frame-level CLIP embeddings, although processed independently, can still capture temporal dependencies. Claims that, instead of devising dedicated modules for temporal dependency, simply fine-tuning CLIP on video (ViFi-CLIP) generalises well. Embeddings from T frames are temporally pooled, and the pooled embedding is used in the contrastive learning process. This is probably why the embeddings stay consistent with image-based CLIP. | code |
| Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting | Performs prompt learning on video data to better fine-tune an image-based CLIP model for videos. Same authors as ViFi-CLIP (above). Need to look into how the prompts are actually learned. | code |
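
The temporal pooling mentioned in the first row can be sketched as follows. This assumes plain average pooling over the T frame embeddings followed by L2 normalisation and cosine similarity against text embeddings; the function names and toy vectors are illustrative, not taken from the paper's code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length; eps guards against division by zero.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def video_embedding(frame_embeddings):
    """Average-pool (T, D) frame embeddings into one unit-norm (D,) vector."""
    return l2_normalize(frame_embeddings.mean(axis=0))

def video_text_similarity(frame_embeddings, text_embeddings):
    """Cosine similarity between the pooled video and each of C text embeddings."""
    v = video_embedding(frame_embeddings)   # (D,)
    t = l2_normalize(text_embeddings)       # (C, D)
    return t @ v                            # (C,)

# Toy example: 4 frames that all point roughly along the first axis.
frames = np.array([[1.0, 0.1, 0.0],
                   [0.9, 0.0, 0.1],
                   [1.0, 0.0, 0.0],
                   [0.8, 0.1, 0.1]])
texts = np.eye(3)[:2]  # two hypothetical class embeddings
sims = video_text_similarity(frames, texts)
```

Since the pooled vector lives in the same space as a single-frame embedding, it can be compared with text embeddings exactly as in image-based CLIP.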

Crowd Counting:

| Title | Description | Code |
| --- | --- | --- |
| CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | Crowd counting with CLIP. Fine-tunes CLIP for the counting task with a ranking loss; no ground-truth people counts are used for training. Uses a sequential prompting setup to keep only the patches that contain human heads before counting. | code |
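
To make the idea of a label-free ranking loss concrete, here is a generic pairwise hinge ranking loss in NumPy. It is a sketch under the assumption that patches can be ordered (for example by size) so that later patches should receive higher count scores; it is not the exact loss used in CrowdCLIP, and the function name and margin are hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(scores, margin=0.1):
    """Hinge ranking loss over scores that are expected to be increasing.

    scores[i] is the model's count score for the i-th patch, with patches
    ordered from the one expected to hold the fewest people to the one
    expected to hold the most. No ground-truth counts are needed, only
    the ordering.
    """
    scores = np.asarray(scores, dtype=float)
    gaps = scores[1:] - scores[:-1]               # should each be >= margin
    return float(np.maximum(0.0, margin - gaps).mean())

print(pairwise_ranking_loss([0.0, 0.5, 1.0]))  # correctly ordered: 0.0
```

Scores in the wrong order are penalised in proportion to how far they violate the expected ordering, so the encoder is pushed to rank patches consistently without ever seeing a count label.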

Generative:

| Title | Description | Code |
| --- | --- | --- |
| ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency | ... | ... |
| CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language | ... | ... |
| CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search | ... | ... |
| Local 3D Editing via 3D Distillation of CLIP Knowledge | ... | ... |

Continual learning:

| Title | Description | Code |
| --- | --- | --- |
| AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | Uses prompt tuning with CLIP to address continual learning; heavily inspired by CoOp. | code |

3D and Point-cloud:

| Title | Description | Code |
| --- | --- | --- |
| CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data | ... | ... |

Detection:

| Title | Description | Code |
| --- | --- | --- |
| DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment |  |  |
| CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation |  |  |
| WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation |  |  |
| HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models |  |  |