There are use cases for being able to do client-side manipulation of the various intermediate results of the CLIP interrogation process.
To compare an image to text via CLIP, the following happens:
1. The text is encoded into features. `open_clip` uses `clip_model.encode_text(text_tokens)`. This returns a tensor.
2. The image features are extracted using the CLIP model. `open_clip` uses `clip_model.encode_image(...)`. This returns a tensor.
3. The tensors are normalized.
4. The image features and the text features are compared.
5. A similarity score is assigned and returned.
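For reference, a minimal sketch of those five steps using `open_clip` directly. The model/weights names are just the `open_clip` README defaults and the image path is a placeholder, not what the Horde necessarily runs:

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model, its preprocessing transform, and its tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

text_tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
image = preprocess(Image.open("example.png")).unsqueeze(0)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)    # step 1
    image_features = model.encode_image(image)        # step 2
    # step 3: normalize both tensors
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # steps 4-5: compare and produce a similarity score per prompt
    similarity = (image_features @ text_features.T).squeeze(0)
```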
This feature request would allow the results of steps 1 and 2 to be returned independently, either as part of a regular interrogate request or on their own. Clients could then perform the math pertinent to their use case without needing to load a CLIP model locally, even in slow or RAM-limited environments; certain kinds of image-search and database schemes could benefit from this. A sketch of that client-side math follows below.
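The client-side half is just normalization plus a dot product and needs no CLIP model at all. The file and key names below are hypothetical, assuming the features arrive as `.safetensors` files:

```python
import numpy as np
from safetensors.numpy import load_file

# Load precomputed features returned by the API (hypothetical filenames/keys).
image_features = load_file("image_features.safetensors")["features"]
text_features = load_file("text_features.safetensors")["features"]

# Normalize, then a dot product gives the similarity score.
image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
similarity = image_features @ text_features.T
```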
I propose the following forms be added:
`encode_text`

Accepts a list of strings and the name of a supported CLIP model. For each string, returns a `.safetensors` file containing the encoded text tensor and a record of which model was used to encode it.
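One way a worker could produce such a file is to store the tensor alongside the model name in the safetensors metadata. This is only a sketch of the proposal, not an existing API; the helper, tensor key, and metadata key names are all made up for illustration:

```python
import torch
from safetensors.torch import save_file

def encode_text_to_safetensors(clip_model, tokenizer, prompt, model_name, out_path):
    """Encode one prompt and write it to a .safetensors file (illustrative only)."""
    tokens = tokenizer([prompt])
    with torch.no_grad():
        features = clip_model.encode_text(tokens)
    # "features" and "clip_model" are assumed key names, not a defined schema.
    save_file({"features": features.contiguous()}, out_path,
              metadata={"clip_model": model_name})
```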
`encode_image`

Accepts a `source_image` and the name of a supported CLIP model. Returns a `.safetensors` file containing the encoded image features and a record of which model was used to encode it.
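On the consuming side, the same file format lets a client recover both the features and the recorded model name. Again, the key names here are assumptions:

```python
from safetensors import safe_open

def load_encoded_features(path):
    """Read features and the recorded CLIP model name back out (illustrative only)."""
    with safe_open(path, framework="pt") as f:
        features = f.get_tensor("features")              # assumed tensor key
        model_name = (f.metadata() or {}).get("clip_model")  # assumed metadata key
    return features, model_name
```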
This proposal has the obvious wrinkle of needing to support the upload of `.safetensors` files. The size of these files is on the order of single-digit kilobytes (for example, a 768-dimensional ViT-L/14 embedding stored as float32 is roughly 3 KB of tensor data plus a small header).

Related to Haidra-Org/horde-worker-reGen#9.
I think we might avoid using R2 here and just b64 the safetensors in the DB. A couple of KB of data per file shouldn't be a terrible amount, and if bandwidth starts being choked due to these I can always switch to R2 later.
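A minimal sketch of that storage scheme, assuming the in-memory `save`/`load` helpers from the `safetensors` package; base64 adds roughly 33% overhead, so a ~3 KB file stays comfortably small as a text column:

```python
import base64
import torch
from safetensors.torch import save, load

features = torch.randn(1, 768)                      # stand-in for encoded CLIP features
blob = save({"features": features})                 # serialize to bytes in memory
db_value = base64.b64encode(blob).decode("ascii")   # string suitable for a DB column

# Later, on read: decode and deserialize; the round-trip is bit-exact.
restored = load(base64.b64decode(db_value))["features"]
assert torch.equal(features, restored)
```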