Thank you for putting up such a complete spec/proposal with all the details! Much for us to learn from :)
Is there a reason why these endpoints should be separate, though? Internally, CLIP does have separate encoders, but from an API perspective a client is really embedding a mixture of various kinds of content together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like:
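(Sketched from memory -- the exact llama-stack definition may differ, but the shape is a tagged union of text and image items:)

from typing import List, Literal, Union

from pydantic import BaseModel


class TextContentItem(BaseModel):
    type: Literal["text"] = "text"
    text: str


class ImageContentItem(BaseModel):
    type: Literal["image"] = "image"
    data: str  # e.g. a URL or base64-encoded bytes


InterleavedContentItem = Union[TextContentItem, ImageContentItem]

# A bare string, a single item, or a list of interleaved items.
InterleavedContent = Union[str, InterleavedContentItem, List[InterleavedContentItem]]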
With your proposal, embedding this content means that (a) I have to make two API calls, but (b) more importantly, the client is now expected to figure out what to do with these embedding values -- should they be indexed separately? We can certainly make the call to do an "addition" within llama-stack (since the semantic is that the client wants to embed this stuff as one piece). Curious to know what you think.
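For instance, the "addition" could be as simple as summing the per-part vectors and renormalizing -- a sketch of one possible semantic, not a committed design:

import numpy as np


def combine_embeddings(part_embeddings: list[np.ndarray]) -> np.ndarray:
    """Fuse per-part embeddings into a single vector by adding and renormalizing."""
    combined = np.sum(part_embeddings, axis=0)
    norm = np.linalg.norm(combined)
    return combined / norm if norm > 0 else combined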
Multimodal Embeddings Endpoint
We can design one endpoint that accepts an array of interleaved content. That way, the client only makes one API call regardless of whether the content is text, image, or both. (Note that for images we only support base64 encoding rather than URLs.)

Endpoint Details

URL: POST /v1/embeddings

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:

model (string, required):
The name of the model to use. For multimodal capabilities with aligned embeddings, use a model like "siglip-1".

input (array of objects, required):
A list of content items. Each item should have:
type (string, required): Must be either "text" or "image".
For "text" items:
text (string, required): The textual content to embed.
For "image" items:
base64 (string, required): A base64-encoded representation of the image (PNG or JPEG). URLs are not supported.

options (object, optional):
normalize (boolean, default: true): Whether to normalize the embedding vector.
return_dims (boolean, default: false): Whether to include the dimensionality of the resulting embedding in the response.

Example Request:

{
  "model": "siglip-1",
  "input": [
    {
      "type": "text",
      "text": "A photo of a white cat sitting on a chair."
    },
    {
      "type": "image",
      "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
    }
  ],
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response

A sample response for the above unified request might look like this:
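(Assuming one embedding per input item; field names data, index, and embedding are illustrative rather than specified, and the vectors are abridged.)

{
  "model": "siglip-1",
  "data": [
    {
      "index": 0,
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    },
    {
      "index": 1,
      "embedding": [0.0241, 0.0317, -0.0098, ...]
    }
  ]
}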
If the option return_dims is enabled, the response will also include the dimensionality of each resulting embedding.

Error Handling
400 Bad Request:
Returned for issues like:
Missing required fields (model or input).
Missing text (for type "text") or base64 (for type "image") in an input item.

401 Unauthorized:
If the Authorization header is missing or invalid.

415 Unsupported Media Type:
If the Content-Type is not set to application/json.

500 Internal Server Error:
For unexpected server issues.

Example Error Response:
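(The exact fields of the error object are illustrative:)

{
  "error": {
    "type": "invalid_request_error",
    "message": "Missing required field: input"
  }
}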
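Finally, for illustration, here is a minimal client-side sketch of calling the unified endpoint with mixed content (the host, API key, and file path are placeholders, not part of the spec):

import base64

import requests

API_URL = "https://api.example.com/v1/embeddings"  # placeholder host
API_KEY = "sk-..."  # placeholder key

# Read and base64-encode the image, since the endpoint accepts
# base64 payloads rather than URLs.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "siglip-1",
    "input": [
        {"type": "text", "text": "A photo of a white cat sitting on a chair."},
        {"type": "image", "base64": image_b64},
    ],
    "options": {"normalize": True, "return_dims": False},
}

resp = requests.post(
    API_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json())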