Community Resource: AutoTikTokenizer - A Bridge Between TikToken and HuggingFace Tokenizers #358

Open
bhavnicksm opened this issue Nov 7, 2024 · 2 comments

@bhavnicksm

Hi TikToken team! 👋

I wanted to share a community resource that might be helpful for TikToken users who also work with HuggingFace tokenizers. I've created AutoTikTokenizer, a lightweight library that allows loading any HuggingFace tokenizer as a TikToken-compatible encoder.

What it does:

  • Enables using TikToken's fast tokenization with any HuggingFace tokenizer
  • Preserves exact encoding/decoding compatibility with original tokenizers
  • Simple drop-in usage similar to HuggingFace's AutoTokenizer

Quick example:

from autotiktokenizer import AutoTikTokenizer

# Load any HF tokenizer as a TikToken encoder
encoder = AutoTikTokenizer.from_pretrained('gpt2')
tokens = encoder.encode("Hello world!")
text = encoder.decode(tokens)

The library is available on PyPI (pip install autotiktokenizer) and is fully open source at: https://github.com/bhavnicksm/autotiktokenizer
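
The compatibility claim above can be spot-checked by comparing token IDs from two encoders on the same inputs. The following is only a minimal sketch: `find_mismatches`, `round_trips`, and the stand-in encoders are hypothetical helpers written for illustration and are not part of autotiktokenizer, TikToken, or HuggingFace.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Encode = Callable[[str], Sequence[int]]
Decode = Callable[[Sequence[int]], str]


def find_mismatches(
    encode_a: Encode, encode_b: Encode, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for every text the two encoders disagree on."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


def round_trips(encode: Encode, decode: Decode, text: str) -> bool:
    """True if decoding the encoded text gives back the original string."""
    return decode(encode(text)) == text


# Stand-in encoders (assumptions for the demo, not real tokenizers).
def utf8_ids(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def utf8_text(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


def codepoint_ids(text: str) -> List[int]:
    return [ord(ch) for ch in text]


samples = ["Hello world!", "naïve café"]
same = find_mismatches(utf8_ids, utf8_ids, samples)        # identical encoders agree
diff = find_mismatches(utf8_ids, codepoint_ids, samples)   # differ on non-ASCII text
ok = all(round_trips(utf8_ids, utf8_text, s) for s in samples)
```

With the real libraries installed, the same helpers could presumably be given `encoder.encode` from the quick example and the `encode` method of the corresponding HuggingFace tokenizer. Note that HuggingFace's `encode` adds special tokens by default for some models, so `add_special_tokens=False` may be needed for a fair comparison.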

I've tested it with several popular models including GPT-2, LLaMA, Mistral, and others. I hope this helps TikToken users who want to work with a broader range of tokenizers while keeping TikToken's performance benefits!

Feel free to check it out if you think it would be useful for the community. Happy to hear any feedback or suggestions!

[Note: This is purely a community contribution - I'm not affiliated with the TikToken team]

@idruker-cerence

idruker-cerence commented Dec 15, 2024

Dear Bhavnick Minhas!

For months I've been searching for any documentation describing the format of the "vocab" section of tokenizer.json, or any sane code showing how to interpret it. Your code is a perfect example. Where have you been all this time? I am so thankful to you for your work!
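
For readers with the same question, here is a rough sketch of one way the "vocab" section can be interpreted for GPT-2-style byte-level BPE tokenizers. It is not taken from AutoTikTokenizer's source. It assumes the vocab lives under `model.vocab` as a token-string-to-id map and that token strings use GPT-2's byte-to-printable-character table; the toy vocabulary and the name `vocab_to_mergeable_ranks` are made up for illustration. Tokenizers that are not byte-level (for example SentencePiece-style vocabularies) would need different handling and would raise a `KeyError` here.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2's reversible table mapping each byte (0-255) to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def vocab_to_mergeable_ranks(tokenizer_json: dict) -> Dict[bytes, int]:
    """Convert a byte-level BPE vocab (token string -> id) to raw bytes -> id."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    vocab = tokenizer_json["model"]["vocab"]  # assumed location of the vocab
    return {
        bytes(char_to_byte[ch] for ch in token): token_id
        for token, token_id in vocab.items()
    }


# Toy stand-in for a parsed tokenizer.json; "\u0120" is how a leading space appears.
toy = {
    "model": {
        "type": "BPE",
        "vocab": {"H": 0, "i": 1, "\u0120": 2, "Hi": 3, "\u0120there": 4},
    }
}
ranks = vocab_to_mergeable_ranks(toy)
```

A `bytes`-to-id mapping of this shape is what TikToken's `Encoding` takes as `mergeable_ranks`. Special tokens and the pre-tokenization pattern would still have to be supplied separately.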

@bhavnicksm
Author

Hey @idruker-cerence!

I'm glad to hear that~ 😊

Please let me know if you have any questions about the implementation details as well. I'm happy to clarify and share resources.

And, always open to feedback!

Thanks! ☺️
