Improved text token counting #463

Open
tazlin opened this issue Oct 19, 2024 · 0 comments
tazlin commented Oct 19, 2024

The central issue revolves around the following function:

    def get_things_count(self, generation=None):
        if generation is None:
            if self.generation is None:
                return 0
            generation = self.generation
        quick_token_count = math.ceil(len(generation) / 4)
        if quick_token_count < 20:
            quick_token_count = 20
        if self.wp.things > quick_token_count:
            # logger.debug([self.wp.things,quick_token_count])
            return quick_token_count
        return self.wp.things

Some tokenizers produce more tokens per character than the fixed factor of "4" assumes. This leaves the horde believing that the worker is generating tokens faster than is possible, when in reality its tokenizer simply produces more tokens than that on average. You can compare how different tokenizers affect the ratio characters_input / tokens_generated here:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
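
To make the mismatch concrete, here is a minimal standalone sketch of the same cap. The function name `capped_token_count` and the sample numbers are hypothetical and chosen only for illustration; this is not the AI-Horde code itself.

```python
import math


def capped_token_count(generation: str, reported_tokens: int, chars_per_token: float = 4.0) -> int:
    """Standalone mirror of the cap applied in get_things_count (illustrative only)."""
    # Estimate from character length, with the same floor of 20 tokens.
    quick_token_count = max(math.ceil(len(generation) / chars_per_token), 20)
    # The worker-reported count is never allowed to exceed the estimate.
    return min(reported_tokens, quick_token_count)


# Hypothetical worker: 1200 characters generated by a tokenizer
# that averages 3 characters per token, so 400 tokens is an honest report.
text = "x" * 1200
honest_report = 400

print(capped_token_count(text, honest_report))  # 300: the estimate assumes 4 chars/token
```

With these made-up numbers, the fixed factor's estimate is a quarter lower than what the worker actually produced.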

The horde text model reference could have a tokenizer_efficiency field added, and the AI-Horde could be updated to use it, to reduce this problem. The text reference uses the Hugging Face names as its canonical names, so the Hugging Face client library could be used to retrieve the tokenizers.
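
A minimal sketch of how the counting logic might consume such a field, assuming the efficiency is looked up per model and falls back to the current factor of 4 when it is missing. The names `estimate_token_cap` and `DEFAULT_CHARS_PER_TOKEN`, and the standalone `get_things_count` signature, are hypothetical and not existing AI-Horde code.

```python
import math
from typing import Optional

DEFAULT_CHARS_PER_TOKEN = 4.0  # the fixed factor used today
MIN_TOKEN_COUNT = 20           # same floor as the current function


def estimate_token_cap(generation: str, tokenizer_efficiency: Optional[float] = None) -> int:
    """Upper bound on plausible tokens for this text, given characters per token."""
    chars_per_token = DEFAULT_CHARS_PER_TOKEN
    # Only trust a positive, present value; otherwise keep the old behaviour.
    if tokenizer_efficiency is not None and tokenizer_efficiency > 0:
        chars_per_token = tokenizer_efficiency
    return max(math.ceil(len(generation) / chars_per_token), MIN_TOKEN_COUNT)


def get_things_count(
    reported_things: int,
    generation: Optional[str],
    tokenizer_efficiency: Optional[float] = None,
) -> int:
    """Cap the worker-reported count using the per-model estimate (assumed design)."""
    if generation is None:
        return 0
    return min(reported_things, estimate_token_cap(generation, tokenizer_efficiency))
```

Falling back to 4 keeps models without a measured value on the existing behaviour, so the field could be rolled out incrementally.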

I propose the following process:

  • A tokenizer_efficiency and tokenizer_vocab_size (for posterity) field be added to the model reference
  • A script is written to download all of the tokenizers and their configurations (each is on the order of megabytes; de-duplicating may also be possible). The tokenizer_vocab_size is scraped from the tokenizer.json (it is the array length of the vocab list in the model object).
  • A large amount of random (read: representative) text is generated and saved as a fixed dataset that will be used against all tokenizers. The count of characters of this dataset is also saved.
  • Each tokenizer tokenizes the fixed dataset and the resulting number of tokens is counted.
    tokenizer_efficiency = characters_input / tokens_generated
  • The existing text model reference is updated with these fields.
  • A CI workflow to automate the collection and enforcement of these new fields for https://github.com/Haidra-Org/AI-Horde-text-model-reference.
  • The relevant AI-Horde code is updated to utilize it.
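
The measurement steps above could be sketched roughly as follows. This is an offline illustration under stated assumptions: the function names are hypothetical, the `tokenize` argument is any callable returning a sequence of tokens, and the toy dataset and toy tokenizer.json content are made up. In practice the callable could be something like the `encode` method of a tokenizer loaded through the Hugging Face libraries, where special tokens added by default would need to be accounted for.

```python
import json
from typing import Any, Callable, Sequence


def measure_tokenizer_efficiency(dataset: str, tokenize: Callable[[str], Sequence[Any]]) -> float:
    """tokenizer_efficiency = characters_input / tokens_generated."""
    tokens_generated = len(tokenize(dataset))
    if tokens_generated == 0:
        raise ValueError("tokenizer produced no tokens for the dataset")
    return len(dataset) / tokens_generated


def vocab_size_from_tokenizer_json(raw_json: str) -> int:
    """Length of the vocab stored under the model object of a tokenizer.json."""
    config = json.loads(raw_json)
    return len(config["model"]["vocab"])


# Toy stand-ins so the sketch runs without downloading anything.
toy_dataset = "the quick brown fox jumps over the lazy dog"
toy_tokenizer_json = json.dumps({"model": {"vocab": {"the": 0, "quick": 1, "fox": 2}}})

# Whitespace splitting plays the role of a real tokenizer here.
entry = {
    "tokenizer_efficiency": measure_tokenizer_efficiency(toy_dataset, str.split),
    "tokenizer_vocab_size": vocab_size_from_tokenizer_json(toy_tokenizer_json),
}
print(entry)
```

Because the same fixed dataset is used for every tokenizer, the resulting values are directly comparable across models.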

A potential alternative approach would involve downloading and using the tokenizers API-side somehow (a microservice?), but I suspect this would introduce enormous and unnecessary complications and add an unacceptable delay to generations.

@tazlin tazlin added the enhancement New feature or request label Oct 19, 2024
@tazlin tazlin self-assigned this Oct 19, 2024
tazlin added a commit that referenced this issue Jan 17, 2025
Once upon a time, before batching and other optimizations, these were the speeds we considered unreasonable, but new paradigms, backends and breakthroughs have made these numbers increasingly inaccurate or irrelevant.

While I do think there has to be some sort of longer-term solution (such as for the problem detailed in #463), there have been virtually *only* false positives, and the few true positives boiled down to innocent misconfigurations.

Further, it appears that certain types of worker-reported failures can artificially inflate token count, which may be its own issue.

For the time being, I am advocating that the number is increased to 100t/s, as recommended by henky, and that we respond to possible abuse of this relaxation with other, more complete and sound, measures.
db0 pushed a commit that referenced this issue Jan 21, 2025