
[RFC] Support multi-modal retrieval on top of llama stack, inference provider side #667

Open
benjibc opened this issue Dec 20, 2024 · 1 comment


benjibc commented Dec 20, 2024

Multimodal Embeddings Endpoint

We can design a single endpoint that accepts an array of interleaved content. That way, the client makes only one API call regardless of whether the content is text, an image, or both. (Note that for images we support only base64 encoding, not URLs.)

Endpoint Details

URL: POST /v1/embeddings

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required):
    The name of the model to use. For multimodal capabilities with aligned embeddings, use a model like "siglip-1".

  • input (array of objects, required):
    A list of content items. Each item should have:

    • type (string, required): Must be either "text" or "image".
    • For "text" items:
      • text (string, required): The textual content to embed.
    • For "image" items:
      • base64 (string, required): A base64-encoded representation of the image (PNG or JPEG). URLs are not supported.
  • options (object, optional):

    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to include the dimensionality of the resulting embedding in the response.

Example Request:

{
  "model": "siglip-1",
  "input": [
    {
      "type": "text",
      "text": "A photo of a white cat sitting on a chair."
    },
    {
      "type": "image",
      "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
    }
  ],
  "options": {
    "normalize": true,
    "return_dims": false
  }
}
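For illustration, here is a minimal Python sketch of a client issuing this request. The base URL, API key, and image path are placeholders, not part of the spec:

import base64

import requests

URL = "http://localhost:8321/v1/embeddings"  # placeholder base URL
API_KEY = "YOUR_API_KEY"                     # placeholder credential

# Read and base64-encode a local image; URLs are not supported.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "siglip-1",
    "input": [
        {"type": "text", "text": "A photo of a white cat sitting on a chair."},
        {"type": "image", "base64": image_b64},
    ],
    "options": {"normalize": True, "return_dims": False},
}

# requests sets Content-Type: application/json automatically for json=...
resp = requests.post(URL, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
print(resp.json()["embedding"][:5])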

Example Response:

A sample response for the above unified request might look like this:

{
  "model": "siglip-1",
  "embedding": [0.015, -0.060, 0.110, ...],
  "usage": {
    "embedding_compute_time_ms": 45
  }
}

If the option return_dims is enabled, the response will also include:

{
  "model": "siglip-1",
  "embedding": [0.015, -0.060, 0.110, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
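The exact meaning of normalize is left implicit above; assuming it is plain L2 normalization, a small Python sketch of the invariant a client can expect:

import math

def l2_norm(vec: list[float]) -> float:
    """Euclidean length of an embedding vector."""
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length, as "normalize": true presumably does."""
    n = l2_norm(vec)
    return [x / n for x in vec]

vec = l2_normalize([0.015, -0.060, 0.110])
assert abs(l2_norm(vec) - 1.0) < 1e-9  # normalized embeddings have unit norm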

Error Handling

  • 400 Bad Request:
    Returned for issues like:

    • Missing required fields (model or input).
    • Any content item missing the necessary text (for type "text") or base64 (for type "image").
    • Invalid base64 encoding.
  • 401 Unauthorized:
    If the Authorization header is missing or invalid.

  • 415 Unsupported Media Type:
    If the Content-Type is not set to application/json.

  • 500 Internal Server Error:
    For unexpected server issues.

Example Error Response:

{
  "error": {
    "message": "Invalid base64 image encoding",
    "type": "invalid_request_error"
  }
}
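As a sketch, a server could produce the 400-level errors above with validation along these lines (the helper names are illustrative, not part of the spec):

import base64
import binascii

def validate_item(item: dict) -> str | None:
    """Return an error message for one input item, or None if it is valid."""
    if item.get("type") == "text":
        if not isinstance(item.get("text"), str):
            return "Text item is missing the 'text' field"
    elif item.get("type") == "image":
        b64 = item.get("base64")
        if not isinstance(b64, str):
            return "Image item is missing the 'base64' field"
        try:
            # validate=True rejects non-alphabet characters outright
            base64.b64decode(b64, validate=True)
        except binascii.Error:
            return "Invalid base64 image encoding"
    else:
        return "Item 'type' must be either 'text' or 'image'"
    return None

def error_body(message: str) -> dict:
    """Build the error payload shown in the example above."""
    return {"error": {"message": message, "type": "invalid_request_error"}}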

ashwinb commented Dec 21, 2024

Thank you for putting up such a complete spec / proposal with all the details! Much for us to learn from :)

Is there a reason to separate these endpoints, though? Internally, CLIP does have separate encoders, but from an API perspective a client is really embedding a mixture of various contents together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like

[
   { type: "text", text: "hello" },
   { type: "image", url: "http://foo.bar.baz/lol.png" },
]

With your proposal, embedding this content means that (a) I have to make two API calls, but (b) more importantly, the client is now expected to figure out what to do with the resulting embedding values -- should they be indexed separately? We could certainly make the call to do an "addition" within llama-stack (since the semantics are that the client wants to embed this content as one piece). Curious to know what you think.
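For concreteness, a rough sketch of what that "addition" could look like inside llama-stack. Mean pooling is just one illustrative choice here, not a decided strategy:

def pool_embeddings(per_item: list[list[float]]) -> list[float]:
    """Combine per-item embeddings (text, image, ...) into a single vector
    by element-wise mean; the actual combination rule is an open question."""
    dims = len(per_item[0])
    count = len(per_item)
    return [sum(vec[i] for vec in per_item) / count for i in range(dims)]

# e.g. one text embedding and one image embedding for the same content
print(pool_embeddings([[0.1, 0.2, 0.3], [0.3, 0.0, 0.1]]))  # [0.2, 0.1, 0.2]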
