Research Ideas

Introduction:

The research area should include multi-modality, such as CLIP-based methods. On this page we collect multiple research ideas and explain each one.

Grounding CrowdCLIP (using SAM or detection methods?):

CrowdCLIP is used for crowd counting. However, it lacks localisation. The model achieves decent performance, but compared to supervised methods (such as CLTR) it underperforms significantly (MSE: 283.3 vs. 85.3). The idea behind CrowdCLIP was based on multiple text prompts and a fine-tuning phase. Can we use methods such as WinCLIP or SAM to achieve localisation? Additionally, can we use it to create a large noisy dataset to train on, and then fine-tune on downstream datasets? A sketch of the SAM-based grounding idea follows.
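As a minimal sketch of the grounding direction (not CrowdCLIP's actual pipeline): use SAM's automatic mask generator for region proposals, then score each region with CLIP against crowd-related prompts. The checkpoint path, prompts, and threshold are illustrative assumptions.

```python
import clip
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
# Checkpoint path is an assumption; use whatever SAM weights are available.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

image = Image.open("crowd.jpg").convert("RGB")
masks = mask_generator.generate(np.array(image))  # class-agnostic region proposals

# Score each proposal against a person prompt vs. a background prompt.
prompts = ["a photo of a person's head", "a photo of the background"]
text = clip.tokenize(prompts).to(device)

person_boxes = []
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    for m in masks:
        x, y, w, h = m["bbox"]  # SAM returns XYWH boxes per mask
        crop = preprocess(image.crop((x, y, x + w, y + h))).unsqueeze(0).to(device)
        img_feat = clip_model.encode_image(crop)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
        if probs[0, 0] > 0.5:  # 0.5 is an arbitrary threshold for the sketch
            person_boxes.append((x, y, w, h))

print(f"localised count: {len(person_boxes)}")
```

The resulting boxes could also serve as pseudo-labels for the large noisy dataset mentioned above.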

A study of prompt tuning compatibility:

Many prompt-tuning strategies were introduced at CVPR this year and in previous years. These methods aren't mutually exclusive, meaning that, if we want, we could apply all the prompt-tuning strategies together. However, this might (or might not) be a naive combination. Can we study the compatibility between them? A conceptual sketch of stacking two strategies is given below.
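A conceptual sketch of combining two families on a frozen CLIP-like backbone: CoOp-style learnable text context vectors plus VPT-style learnable visual prompt tokens. The shapes and `prepend_*` hooks are assumptions for illustration, not any specific paper's interface.

```python
import torch
import torch.nn as nn

class CombinedPromptTuner(nn.Module):
    def __init__(self, embed_dim=512, n_text_ctx=16, n_visual_prompts=8):
        super().__init__()
        # CoOp-style: learnable context vectors prepended to class-name tokens.
        self.text_ctx = nn.Parameter(torch.randn(n_text_ctx, embed_dim) * 0.02)
        # VPT-style: learnable tokens prepended to the image patch tokens.
        self.visual_prompts = nn.Parameter(torch.randn(n_visual_prompts, embed_dim) * 0.02)

    def prepend_text_ctx(self, class_token_embeds):
        # class_token_embeds: (n_classes, n_tokens, dim)
        ctx = self.text_ctx.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)

    def prepend_visual_prompts(self, patch_tokens):
        # patch_tokens: (batch, n_patches, dim)
        vp = self.visual_prompts.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        return torch.cat([vp, patch_tokens], dim=1)

# Only the prompt parameters train; the backbone stays frozen, so a
# compatibility study amounts to toggling which parameter sets are enabled.
tuner = CombinedPromptTuner()
optimizer = torch.optim.AdamW(tuner.parameters(), lr=2e-3)
```

A study could then ablate each strategy alone versus the combination, measuring whether their gains are additive, redundant, or conflicting.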

Aligning multi-modality through dependency graphs:

DepGraph introduced a method to find and model dependencies between weights. We can extend this to multi-modality, either for pruning or to find dependent weights and enforce similarity inside the CLIP-like model itself rather than at the prediction level. A hedged sketch of such a weight-level alignment loss follows.
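A rough sketch of the "enforce similarity at the weight level" idea, assuming a DepGraph-style analysis has already produced pairs of coupled weights across the two encoders (the grouping step itself is not shown, and the overlapping-prefix comparison is a naive placeholder):

```python
import torch

def dependency_alignment_loss(weight_groups):
    """weight_groups: list of (text_weight, image_weight) tensor pairs that
    a dependency analysis flagged as coupled. Each pair is flattened and
    compared with cosine similarity over the overlapping prefix."""
    loss = 0.0
    for w_text, w_img in weight_groups:
        a, b = w_text.flatten(), w_img.flatten()
        n = min(a.numel(), b.numel())  # naive: shapes may differ across encoders
        loss = loss + (1 - torch.cosine_similarity(a[:n], b[:n], dim=0))
    return loss / max(len(weight_groups), 1)
```

This term would be added to the usual contrastive objective, pushing alignment into the weights rather than only the output embeddings.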

Can LLMs See?

We can use a frozen LLM to interact with images by embedding the image into the text-embedding space. What would the performance be? CLIP-like models embed text and images into the same space with a dedicated image encoder and text encoder, but can we use LLMs on the text side instead? A minimal sketch of one bridging setup follows.
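A minimal sketch of this setup, in the spirit of Frozen/LLaVA-style bridging: image features are linearly projected into the LLM's token-embedding space and prepended to the text embeddings. The model choice (GPT-2 as a stand-in), the vision dimension, and the patch count are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

llm = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in frozen LLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for p in llm.parameters():
    p.requires_grad = False  # the LLM stays frozen; only the bridge trains

vision_dim, llm_dim = 512, llm.config.hidden_size
bridge = nn.Linear(vision_dim, llm_dim)  # the only trainable piece

def forward_with_image(image_features, prompt):
    # image_features: (batch, n_patches, vision_dim) from any image encoder
    visual_tokens = bridge(image_features)           # -> LLM embedding space
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(ids)
    text_embeds = text_embeds.expand(visual_tokens.size(0), -1, -1)
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs)

out = forward_with_image(torch.randn(1, 49, vision_dim), "Describe the image:")
```

Measuring how well the frozen LLM handles these projected "visual tokens" would answer whether it can, in this sense, see.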

Adding a positive?

Generate new data for both modalities (text and image), then use the generated examples as additional positives for each text and image. One way to fold the extra positives into the contrastive loss is sketched below.
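As an illustrative sketch (not from any specific paper), a multi-positive variant of the InfoNCE loss where each image matches both its original caption and any generated paraphrases; the `pos_mask` marking those extra positives is an assumed input:

```python
import torch

def multi_positive_infonce(img_emb, txt_emb, pos_mask, temperature=0.07):
    """img_emb: (N, d), txt_emb: (M, d), both L2-normalised.
    pos_mask: (N, M) bool, True wherever text j is a positive for image i
    (the original caption plus any generated positives)."""
    logits = img_emb @ txt_emb.T / temperature            # (N, M) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over all positives of each image.
    per_image = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -per_image.mean()
```

The symmetric text-to-image term would be computed the same way with the mask transposed.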

Using CV tricks:

Convert the textual input into an RGB image, then train on it with the latest feature-extraction methods, and maybe change the loss function? A toy rendering sketch follows.
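A toy sketch of the rendering step: draw the caption onto an RGB canvas with PIL and feed it to a standard vision backbone. The font, canvas size, and backbone choice are arbitrary assumptions.

```python
import torch
from PIL import Image, ImageDraw
from torchvision import models, transforms

def text_to_rgb(text, size=(224, 224)):
    # Render the text onto a white canvas using PIL's default font.
    canvas = Image.new("RGB", size, color="white")
    ImageDraw.Draw(canvas).text((8, 8), text, fill="black")
    return canvas

backbone = models.resnet18(weights=None)  # any recent feature extractor
to_tensor = transforms.ToTensor()

img = text_to_rgb("a crowd of people in a train station")
features = backbone(to_tensor(img).unsqueeze(0))  # (1, 1000) here; swap the
# head for a projection layer if training with a contrastive loss instead
```

The open question is whether image-space augmentations and losses transfer any benefit to text treated this way.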