Design annotations abstraction for responses that are not just a stream of plain text #716

Open
simonw opened this issue Jan 24, 2025 · 17 comments

Comments


simonw commented Jan 24, 2025

LLM currently assumes that all responses from a model come in the form of a stream of text.

This assumption no longer holds!

  • Anthropic's new Citations API (API docs) returns responses that add citation details to some spans of text, like this.
  • DeepSeek Reasoner streams back two types of text - reasoning text and regular text - as seen here.

And that's just variants of text - multi-modal models need consideration as well. OpenAI have a model that can return snippets of audio already, and models that return images (from OpenAI and Gemini) are becoming available very soon too.

simonw added the design label on Jan 24, 2025

simonw commented Jan 24, 2025

I had thought that attachments would be the way to handle this, but they only work for audio/image outputs - the thing where Claude and DeepSeek can return annotated spans of text feels different.


simonw commented Jan 24, 2025

Here's an extract from that Claude citations example:

{
  "id": "msg_01P3zs4aYz2Baebumm4Fejoi",
  "content": [
    {
      "text": "Based on the document, here are the key trends in AI/LLMs from 2024:\n\n1. Breaking the GPT-4 Barrier:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "I’m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 531,
          "start_char_index": 288,
          "type": "char_location"
        }
      ],
      "text": "The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.",
      "type": "text"
    },
    {
      "text": "\n\n2. Increased Context Lengths:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google’s Gemini series accepts up to 2 million.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 1680,
          "start_char_index": 1361,
          "type": "char_location"
        }
      ],
      "text": "A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.",
      "type": "text"
    },

And here are two chunks from the DeepSeek reasoner streamed response (pretty-printed here). First, a reasoning content chunk:

{
    "id": "2cf23b27-2ba6-41dd-b484-358c486a1405",
    "object": "chat.completion.chunk",
    "created": 1737480272,
    "model": "deepseek-reasoner",
    "system_fingerprint": "fp_1c5d8833bc",
    "choices": [
        {
            "index": 0,
            "delta": {
                "content": null,
                "reasoning_content": "Okay"
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}

Text content chunk:

{
    "id": "2cf23b27-2ba6-41dd-b484-358c486a1405",
    "object": "chat.completion.chunk",
    "created": 1737480272,
    "model": "deepseek-reasoner",
    "system_fingerprint": "fp_1c5d8833bc",
    "choices": [
        {
            "index": 0,
            "delta": {
                "content": " waves",
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}
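
A minimal sketch of how a client might accumulate those two streams separately - the chunks list here is just the two parsed chat.completion.chunk objects shown above:

chunks = [
    {"choices": [{"delta": {"content": None, "reasoning_content": "Okay"}}]},
    {"choices": [{"delta": {"content": " waves", "reasoning_content": None}}]},
]

reasoning_parts, answer_parts = [], []
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    # In the examples above, at most one of these two keys is non-null per chunk
    if delta.get("reasoning_content"):
        reasoning_parts.append(delta["reasoning_content"])
    if delta.get("content"):
        answer_parts.append(delta["content"])

reasoning_text = "".join(reasoning_parts)  # the reasoning/thinking text
answer_text = "".join(answer_parts)        # the regular response text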


simonw commented Jan 24, 2025

Meanwhile OpenAI audio responses look like this (truncated). I'm not sure if these can mix in text output as well, but in this case the audio does at least include a "transcript" key:

{
  "id": "chatcmpl-At42uKzhIMJfzGOwypiS9mMH3oaFG",
  "object": "chat.completion",
  "created": 1737686956,
  "model": "gpt-4o-audio-preview-2024-12-17",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "audio": {
          "id": "audio_6792ffad12f48190abab9d6b7d1a1bf7",
          "data": "UklGRkZLAABXQVZFZ...",
          "expires_at": 1737690557,
          "transcript": "Hi"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 13,
    "total_tokens": 35,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0,
      "text_tokens": 22,
      "image_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 8,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0,
      "text_tokens": 5
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_58887f9c5a"
}
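
If the goal were just to save that audio to disk, something like this would probably work - a sketch that assumes response is the parsed JSON above and that the data field is base64-encoded WAV, which the UklGR... prefix suggests:

import base64

audio = response["choices"][0]["message"]["audio"]  # response = the parsed JSON shown above
with open(audio["id"] + ".wav", "wb") as f:
    f.write(base64.b64decode(audio["data"]))  # "UklGR..." decodes to a RIFF/WAVE header
print(audio["transcript"])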


simonw commented Jan 24, 2025

@ericfeunekes

I think a combination of a pydantic object with some sort of templating language would work. E.g. for the Claude example you could have this object:

from pydantic import BaseModel, Field
from typing import List, Optional, Literal

class TextRange(BaseModel):
    start: int
    end: int

class Citation(BaseModel):
    sourceDocument: str = Field(alias="document_title")
    documentIndex: int = Field(alias="document_index")
    textRange: TextRange = Field(...)
    citedText: str = Field(alias="cited_text")
    type: Literal["char_location"]

class ContentBlock(BaseModel):
    blockType: Literal["text", "heading"] = Field(alias="type")
    content: str = Field(alias="text")
    hasCitation: bool = Field(default=False)
    citation: Optional[Citation] = None
    headingLevel: Optional[int] = None

class Message(BaseModel):
    messageId: str = Field(alias="id")
    contentBlocks: List[ContentBlock] = Field(alias="content")

and then you define a message template:

{% for block in contentBlocks if block.blockType == "text" %}
  {{ block.content }}
  {% if block.hasCitation %}
    > {{ block.citation.citedText }}
  {% endif %}
{% endfor %}

You could then create similar objects and templates for different model types. These could also be exposed to users to customize how data is shown for any model. Also, pydantic now supports partial validation, so to the extent that any of the JSON responses are streamed, this model should still work.
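
A rough sketch of how those models and that template could be wired together, assuming jinja2 and building the objects by hand (parsing the raw Claude payload directly would need a couple more aliases, since Claude returns a citations list rather than a single citation):

from jinja2 import Template

# Reuses the pydantic models defined above; all field values are illustrative.
TEMPLATE = Template("""\
{% for block in contentBlocks if block.blockType == "text" %}
{{ block.content }}
{% if block.hasCitation %}> {{ block.citation.citedText }}
{% endif %}
{% endfor %}""")

message = Message(
    id="msg_01P3zs4aYz2Baebumm4Fejoi",
    content=[
        ContentBlock(type="text", text="1. Breaking the GPT-4 Barrier:"),
        ContentBlock(
            type="text",
            text="The GPT-4 barrier was completely broken...",
            hasCitation=True,
            citation=Citation(
                document_title="My Document",
                document_index=0,
                textRange=TextRange(start=288, end=531),
                cited_text="18 organizations now have models that rank higher...",
                type="char_location",
            ),
        ),
    ],
)

print(TEMPLATE.render(contentBlocks=message.contentBlocks))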


ericfeunekes commented Jan 24, 2025

> I think a combination of a pydantic object with some sort of templating language would work. [...]

Thinking about this a bit more: if you want to go down this road or something similar, it would be great to have it as a separate package. That would let it be a plugin to this library, but also usable in others. I could definitely use something like this in some of my projects where I use LiteLLM, which lets me switch models easily, so it would be great to have output templates that I could define like this.

Not sure how hard this would be but I could probably contribute.

@banahogg

Cohere is a bit outside the top-tier models, but it's probably worth considering their citation format as well when designing this: https://docs.cohere.com/docs/documents-and-citations


simonw commented Jan 25, 2025

That Cohere example is really interesting. It looks like they decided to have citations as a separate top-level key and then reference which bits of text the citations correspond to using start/end indexes:

# response.message.content
[AssistantMessageResponseContentItem_Text(text='The tallest penguins are the Emperor penguins. They only live in Antarctica.', type='text')]

# response.message.citations
[Citation(start=29, 
          end=46, 
          text='Emperor penguins.', 
          sources=[Source_Document(id='doc:0:0', 
                                   document={'id': 'doc:0:0', 
                                             'snippet': 'Emperor penguins are the tallest.', 
                                             'title': 'Tall penguins'}, 
                                   type='document')]), 
 Citation(start=65, 
          end=76, 
          text='Antarctica.', 
          sources=[Source_Document(id='doc:0:1', 
                                   document={'id': 'doc:0:1', 
                                             'snippet': 'Emperor penguins only live in Antarctica.', 
                                             'title': 'Penguin habitats'}, 
                                   type='document')])]

Note how that first citation is in a separate data structure and flags 29-46 - the text "Emperor penguins." - as the attachment point.

This might actually be a way to solve the general problem: I could take the Claude citations format and turn that into a separate stored piece of information, referring back to the original text using those indexes.

That way I could still store a string of text in the database / output that in the API, but additional annotations against that stream of text could be stored elsewhere.

For the DeepSeek reasoner case this would mean having a start-end indexed chunk of text that is labelled as coming from the <think> block.

I don't think this approach works for returning audio though - there's no text segment to attach that audio to, though I guess I could say "index 55:55 is where the audio chunk came in".
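
As a sketch, converting Claude's content blocks into that Cohere-style shape might look something like this (hypothetical helper, operating on the content list from the earlier example):

def claude_content_to_annotations(content_blocks):
    # Flatten the text blocks into one string, recording index-based
    # annotations for any block that carried citations.
    text = ""
    annotations = []
    for block in content_blocks:
        if block["type"] != "text":
            continue
        start = len(text)
        text += block["text"]
        if block.get("citations"):
            annotations.append(
                {"start": start, "end": len(text), "citations": block["citations"]}
            )
    return text, annotations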


simonw commented Jan 25, 2025

I'm going to call this annotations for the moment - where an annotation is additional metadata attached to a portion of the text returned by an LLM.

The three things to consider are:

  • How are annotations represented in the LLM Python API? Presumably on the Response class?
  • How are they represented in the CLI tool? (Really a question about how they are rendered to a terminal.)
  • How are they stored in the SQLite database tables, such that they can be re-hydrated into Response objects from the database?


simonw commented Jan 25, 2025

I think I'll treat audio/image responses separately from annotations - I'll use an expanded version of the existing attachments mechanism for that - including the existing attachments database table:

llm/docs/logging.md, lines 181 to 194 in 656d8fa:

CREATE TABLE [attachments] (
    [id] TEXT PRIMARY KEY,
    [type] TEXT,
    [path] TEXT,
    [url] TEXT,
    [content] BLOB
);
CREATE TABLE [prompt_attachments] (
    [response_id] TEXT REFERENCES [responses]([id]),
    [attachment_id] TEXT REFERENCES [attachments]([id]),
    [order] INTEGER,
    PRIMARY KEY ([response_id], [attachment_id])
);

I'll probably add a response_attachments many-to-many table to track attachments returned BY a response (as opposed to being attached to the prompt as input).
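
A possible shape for that table - just a sketch, mirroring prompt_attachments:

-- Hypothetical response_attachments table (not yet in the schema)
CREATE TABLE [response_attachments] (
    [response_id] TEXT REFERENCES [responses]([id]),
    [attachment_id] TEXT REFERENCES [attachments]([id]),
    [order] INTEGER,
    PRIMARY KEY ([response_id], [attachment_id])
);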

simonw changed the title from "Design an abstraction for responses that are not just a stream of text" to "Design annotations abstraction for responses that are not just a stream of plain text" on Jan 25, 2025

simonw commented Jan 25, 2025

After brainstorming with Claude I think a solution to the terminal representation challenge could be to add markers around the annotated spans of text and then display those annotations below.

One neat option here is corner brackets - 「 and 」- for example:

Based on the document, here are the key trends in AI/LLMs from 2024:

1. Breaking the GPT-4 Barrier: 「The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.」

2. Increased Context Lengths: 「A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.」

Annotations:

「The GPT-4 barrier was completely broken...」:

  {
    "citations": [
      {
        "cited_text": "I’m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)—70 models in total.\n\n",
        "document_index": 0,
        "document_title": "My Document",
        "end_char_index": 531,
        "start_char_index": 288,
        "type": "char_location"
      }
    ]
  }

「A major theme was increased context lengths...」:

  {
    "citations": [
      {
        "cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google’s Gemini series accepts up to 2 million.\n\n",
        "document_index": 0,
        "document_title": "My Document",
        "end_char_index": 1680,
        "start_char_index": 1361,
        "type": "char_location"
      }
    ]
  }

So the spans of text that have annotations are wrapped in 「 and 」 and the annotations themselves are then displayed below.

Here's what that looks like in a macOS terminal window:

[screenshot of the bracketed output rendered in a terminal]
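
A rough sketch of a helper that could produce that rendering (hypothetical names, assuming non-overlapping annotations given as start/end/data dictionaries):

import json

def render_with_annotations(text, annotations):
    # annotations: list of {"start": int, "end": int, "data": dict}
    parts = []
    footnotes = []
    position = 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        parts.append(text[position:ann["start"]])
        span = text[ann["start"]:ann["end"]]
        parts.append("「" + span + "」")
        footnotes.append((span, ann["data"]))
        position = ann["end"]
    parts.append(text[position:])
    lines = ["".join(parts), "", "Annotations:"]
    for span, data in footnotes:
        lines.append("")
        lines.append("「" + span[:40] + "...」:")
        lines.append(json.dumps(data, indent=2))
    return "\n".join(lines)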


simonw commented Jan 25, 2025

For DeepSeek reasoner that might look like this:

「Okay, so I need to come up with a joke about a pelican and a
walrus running a tea room together. Hmm, that's an
interesting combination. Let me think about how these two
characters might interact in a humorous situation.

First, let's consider their characteristics. Pelicans are
known for their long beaks and Webbed feet, often seen near
the beach or water. Walruses have big teeth, thick fur, and
they're generally found in colder climates, like icebergs or
snowy areas. So, combining these two into a tea room setting
is already a funny image.」

**The Joke:**

A pelican and a walrus decide to open a quaint little tea
room together. The walrus, with its big size, struggles to
find comfortable chairs, so it sits on the table by accident,
knocking over the teapot. Meanwhile, the pelican, trying to
help, uses its beak to place saucers on the table, causing a
few spills.

After a series of comical mishaps, the walrus looks up and
says with a grin, "This isn't so fishy anymore." The pelican
smirks and remarks, "Maybe not, but we do have a lot of krill
in our tea!"

Annotations:

「Okay, so I need to come up with a joke... 」:

  {
    "thinking": true
  }

In this case I'd have to do some extra post-processing to combine all of those short token snippets into a single annotation, de-duping the "thinking": true annotation - otherwise I would end up with dozens of annotations for every word in the thinking section.
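
That merging step might look something like this (a sketch, again assuming annotations as start/end/data dictionaries):

def merge_adjacent_annotations(annotations):
    # Merge adjacent annotations that carry identical data, e.g. {"thinking": True}
    merged = []
    for ann in sorted(annotations, key=lambda a: a["start"]):
        if merged and merged[-1]["data"] == ann["data"] and merged[-1]["end"] == ann["start"]:
            merged[-1]["end"] = ann["end"]
        else:
            merged.append(dict(ann))
    return merged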


simonw commented Jan 25, 2025

For the Python layer this might look like so:

response = llm.prompt("prompt goes here")
print(response.text()) # outputs the plain text
print(response.annotations)
# Outputs annotations, see below
for annotated in response.text_with_annotations():
    print(annotated.text, annotated.annotations)

That text_with_annotations() method is a utility that uses the start/end indexes to break up the text and return each segment with its annotations.

The response.annotations list would look something like this:

[
  Annotation(start=0, end=5, data={"this": "is a dictionary of stuff"}),
  Annotation(start=55, end=58, data={"this": "is more stuff"}),
]

(data= is an ugly name for a property, but annotation= didn't look great either.)
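
One possible implementation of text_with_annotations() - a sketch, not a committed design - would split the text at every annotation boundary and pair each segment with the annotations that cover it:

from dataclasses import dataclass, field

@dataclass
class AnnotatedSegment:
    text: str
    annotations: list = field(default_factory=list)

def text_with_annotations(text, annotations):
    # annotations are the hypothetical Annotation(start, end, data) objects above
    boundaries = sorted(
        {0, len(text)} | {a.start for a in annotations} | {a.end for a in annotations}
    )
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        covering = [a for a in annotations if a.start <= start and a.end >= end]
        segments.append(AnnotatedSegment(text[start:end], covering))
    return segments

Overlapping annotations fall out of this naturally, since a segment just ends up with more than one entry in its annotations list.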


simonw commented Jan 25, 2025

Then the SQL table design is pretty simple:

CREATE TABLE [response_annotations] (
    [id] INTEGER PRIMARY KEY,
    [response_id] TEXT REFERENCES [responses]([id]),
    [start_index] INTEGER,
    [end_index] INTEGER,
    [annotation] TEXT -- JSON
);
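
Re-hydrating those rows back into the hypothetical Annotation objects from the Python sketch above could then be as simple as this (sketch):

import json
import sqlite3

def load_annotations(db_path, response_id):
    # Annotation here is the hypothetical class from the Python API sketch above
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "select start_index, end_index, annotation from response_annotations"
        " where response_id = ?",
        (response_id,),
    ).fetchall()
    return [
        Annotation(start=start, end=end, data=json.loads(annotation))
        for start, end, annotation in rows
    ]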


simonw commented Jan 25, 2025

It bothers me very slightly that this design allows for exact positioning of annotations in a text stream response (with a start and end index) but doesn't support that for recording the position at which an image or audio clip was returned.

I think the fix for that is to have an optional single text_index integer on the response_attachments many-to-many table, to optionally record the exact point at which an image/audio-clip was included in the response.
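
For example (a sketch, assuming the response_attachments table sketched earlier):

-- Optional position of the attachment within the response text
ALTER TABLE [response_attachments] ADD COLUMN [text_index] INTEGER;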

@Quantisan

> After brainstorming with Claude I think a solution to the terminal representation challenge could be to add markers around the annotated spans of text and then display those annotations below. [...]

Asking the obvious question: why not use the academic-paper style of [<number>] references instead of quoting the beginning of the text as an anchor? I guess one reason is that it would add even more characters to the text block.
