
[Question] what to do when model doesn't have tokenizer.model? #2212

Open
steveepreston opened this issue Dec 29, 2024 · 2 comments

Comments

@steveepreston

steveepreston commented Dec 29, 2024

tokenizer.model is required in the YAML config, but many models don't ship a tokenizer.model (example: unsloth/Llama-3.2-1B).

In these cases, how can we use the tokenizer.json or tokenizer_config.json that are included with almost all models instead of tokenizer.model?
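To illustrate, this is the kind of fallback I have in mind (a minimal sketch; pick_tokenizer_file is a hypothetical helper, not an existing API in any library):

```python
from pathlib import Path

def pick_tokenizer_file(model_dir: str) -> Path:
    """Hypothetical helper: prefer tokenizer.model (SentencePiece / BPE ranks),
    and fall back to tokenizer.json (Hugging Face fast-tokenizer format)."""
    d = Path(model_dir)
    for name in ("tokenizer.model", "tokenizer.json"):
        candidate = d / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"No tokenizer.model or tokenizer.json found in {model_dir}"
    )
```

So a checkpoint directory that only contains tokenizer.json would still resolve to a usable file instead of failing outright.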

@RdoubleA
Contributor

RdoubleA commented Jan 1, 2025

In your case specifically, you can use the original Llama 3.2 1B tokenizer.model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based on the instruct model; otherwise, use the one from the base model). If unsloth modified any of the special tokens, then you will need a new tokenizer.model.

I don't believe you can load in the tokenizer without the tokenizer.model file, because it contains the BPE encoding itself.

@steveepreston
Author

@RdoubleA Thanks for the explanation, I understand that case now.
Here are some other models that don't have a tokenizer.model:

deepseek-ai/DeepSeek-V3
Qwen/QVQ
nvidia/Llama-3.1-Nemotron
openai/gpt2
mistralai/Mistral-Nemo
CohereForAI/c4ai
facebook/opt-125m

I'm not sure what should be done for these cases.
