I wanted to share a community resource that might be helpful for TikToken users who also work with HuggingFace tokenizers. I've created AutoTikTokenizer, a lightweight library that allows loading any HuggingFace tokenizer as a TikToken-compatible encoder.
What it does:
- Enables using TikToken's fast tokenization with any HuggingFace tokenizer
- Preserves exact encoding/decoding compatibility with original tokenizers
- Simple drop-in usage similar to HuggingFace's AutoTokenizer
Quick example:
from autotiktokenizer import AutoTikTokenizer

# Load any HF tokenizer as a TikToken encoder
encoder = AutoTikTokenizer.from_pretrained('gpt2')
tokens = encoder.encode("Hello world!")
text = encoder.decode(tokens)
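One way to spot-check the compatibility claim on your own inputs is to compare token ids from two encoders over a handful of strings. The sketch below is only an illustration: `find_mismatches` and `roundtrips` are hypothetical helper names, not part of AutoTikTokenizer, and a plain UTF-8 byte encoder stands in for the real tokenizers so the snippet runs without any downloads.

```python
from typing import Callable, Iterable, List, Sequence

Encode = Callable[[str], Sequence[int]]
Decode = Callable[[Sequence[int]], str]


def find_mismatches(reference: Encode, candidate: Encode,
                    texts: Iterable[str]) -> List[str]:
    """Return the texts on which the two encoders disagree (hypothetical helper)."""
    return [t for t in texts if list(reference(t)) != list(candidate(t))]


def roundtrips(encode: Encode, decode: Decode, texts: Iterable[str]) -> bool:
    """True if decode(encode(t)) gives back t for every text (hypothetical helper)."""
    return all(decode(encode(t)) == t for t in texts)


# Stand-in encoder/decoder (assumption: used here only so the sketch is runnable).
def utf8_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def utf8_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = ["Hello world!", "naïve café", "  leading spaces", ""]
mismatches = find_mismatches(utf8_encode, utf8_encode, samples)
all_roundtrip = roundtrips(utf8_encode, utf8_decode, samples)
print(mismatches, all_roundtrip)
```

With the real libraries, the two callables would presumably be something like `lambda t: hf_tokenizer.encode(t, add_special_tokens=False)` for the HuggingFace side and `encoder.encode` for the TikToken side. Whether special tokens need to be switched off depends on the model, so treat that as an assumption to verify.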
I've tested it with several popular models including GPT-2, LLaMA, Mistral, and others. I hope this helps TikToken users who want to work with a broader range of tokenizers while keeping TikToken's performance benefits!
Feel free to check it out if you think it would be useful for the community. Happy to hear any feedback or suggestions!
[Note: This is purely a community contribution - I'm not affiliated with the TikToken team]
For months I've been searching for any documentation describing the format of the "vocab" section of tokenizer.json, or any sane code showing how to interpret it. Your code is a perfect example. Where have you been all this time? I am so thankful to you for your work!
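For readers with the same question, here is a minimal sketch of how the "vocab" section can be interpreted. It assumes a GPT-2 style byte-level BPE tokenizer.json, where `model.vocab` maps printable token strings to ids. It is not AutoTikTokenizer's actual code, and `vocab_to_byte_ranks` is a hypothetical name. The byte-to-character table is the one used by GPT-2's original encoder; tokenizers that are not byte-level (for example SentencePiece-style vocabularies) need different handling.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2's table mapping each of the 256 byte values to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Unprintable bytes are shifted to code points 256 and above.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


def vocab_to_byte_ranks(vocab: Dict[str, int]) -> Dict[bytes, int]:
    """Turn {token string: id} into {raw bytes: id} (hypothetical helper)."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return {bytes(char_to_byte[ch] for ch in token): idx
            for token, idx in vocab.items()}


# Tiny made-up stand-in for what json.load() returns for a tokenizer.json file.
tokenizer_json = {
    "model": {
        "type": "BPE",
        "vocab": {"H": 0, "i": 1, "\u0120": 2, "Hi": 3, "\u0120there": 4},
    }
}

ranks = vocab_to_byte_ranks(tokenizer_json["model"]["vocab"])
print(ranks)
```

The character `Ġ` (U+0120) in vocab keys is how byte 0x20, the space, is written in this scheme, so `"Ġthere"` becomes the bytes `b" there"`. A bytes-to-rank dictionary of this shape is what tiktoken's `Encoding` takes as `mergeable_ranks`, though the split pattern and special tokens still have to be supplied separately.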
Hi TikToken team! 👋
The library is available on PyPI (pip install autotiktokenizer) and is fully open source at: https://github.com/bhavnicksm/autotiktokenizer