Thank you for putting up such a complete spec/proposal with all the details! Much for us to learn from :)
Is there a reason why these endpoints should be separate, though? Internally, CLIP does have separate encoders, but from an API perspective a client is really embedding a mixture of various kinds of content together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like:
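(Sketched from memory -- the exact llama-stack definition may differ, but the shape is a tagged union of text and image items:)

from typing import List, Literal, Union

from pydantic import BaseModel


class TextContentItem(BaseModel):
    type: Literal["text"] = "text"
    text: str


class ImageContentItem(BaseModel):
    type: Literal["image"] = "image"
    data: str  # e.g. a URL or base64-encoded bytes


InterleavedContentItem = Union[TextContentItem, ImageContentItem]

# A bare string, a single item, or a list of interleaved items.
InterleavedContent = Union[str, InterleavedContentItem, List[InterleavedContentItem]]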
With your proposal, embedding this content means that (a) I have to make two API calls, but (b) more importantly, the client is now expected to figure out what to do with these embedding values -- should they be indexed separately? We can certainly make the call to do an "addition" within llama-stack (since the semantic is that the client wants to embed this stuff as one piece). Curious to know what you think.
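For instance, the "addition" could be as simple as summing the per-part vectors and renormalizing -- a sketch of one possible semantic, not a committed design:

import numpy as np


def combine_embeddings(part_embeddings: list[np.ndarray]) -> np.ndarray:
    """Fuse per-part embeddings into a single vector by adding and renormalizing."""
    combined = np.sum(part_embeddings, axis=0)
    norm = np.linalg.norm(combined)
    return combined / norm if norm > 0 else combined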
Multimodal Embeddings Endpoint
We can design one endpoint that accepts an array of interleaved content. That way, the client only makes one API call regardless of whether the content is text, image, or both. (Note that for images we only support base64 encoding rather than URLs.)

Endpoint Details

URL: POST /v1/embeddings

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:

model (string, required):
The name of the model to use. For multimodal capabilities with aligned embeddings, use a model like "siglip-1".

input (array of objects, required):
A list of content items. Each item should have:
type (string, required): Must be either "text" or "image".
For "text" items:
text (string, required): The textual content to embed.
For "image" items:
base64 (string, required): A base64-encoded representation of the image (PNG or JPEG). URLs are not supported.

options (object, optional):
normalize (boolean, default: true): Whether to normalize the embedding vector.
return_dims (boolean, default: false): Whether to include the dimensionality of the resulting embedding in the response.

Example Request:

{
  "model": "siglip-1",
  "input": [
    {
      "type": "text",
      "text": "A photo of a white cat sitting on a chair."
    },
    {
      "type": "image",
      "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
    }
  ],
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response

A sample response for the above unified request might look like this:
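(Assuming one embedding per input item; field names data, index, and embedding are illustrative rather than specified, and the vectors are abridged.)

{
  "model": "siglip-1",
  "data": [
    {
      "index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    },
    {
      "index": 1,
      "embedding": [0.0241, 0.0317, -0.0098, ...]
    }
  ]
}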
If the option return_dims is enabled, the response will also include the dimensionality of each resulting embedding.

Error Handling
400 Bad Request:
Returned for issues like:
Missing required fields (model or input).
Missing text (for type "text") or base64 (for type "image") in an input item.

401 Unauthorized:
If the Authorization header is missing or invalid.

415 Unsupported Media Type:
If the Content-Type is not set to application/json.

500 Internal Server Error:
For unexpected server issues.

Example Error Response:
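(The exact fields of the error object are illustrative:)

{
  "error": {
    "type": "invalid_request_error",
    "message": "Missing required field: input"
  }
}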
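Finally, for illustration, here is a minimal client-side sketch of calling the unified endpoint with mixed content (the host, API key, and file path are placeholders, not part of the spec):

import base64

import requests

API_URL = "https://api.example.com/v1/embeddings"  # placeholder host
API_KEY = "sk-..."  # placeholder key

# Read and base64-encode the image, since the endpoint accepts
# base64 payloads rather than URLs.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "siglip-1",
    "input": [
        {"type": "text", "text": "A photo of a white cat sitting on a chair."},
        {"type": "image", "base64": image_b64},
    ],
    "options": {"normalize": True, "return_dims": False},
}

resp = requests.post(
    API_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json())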