
Relationship between tokenizer compression rate and final model performance #15

Open
nghuyong opened this issue Feb 28, 2024 · 2 comments
Comments

@nghuyong

The tokenizer evaluation section reports only the tokenizer's own intrinsic metrics, such as compression rate.

However, a tokenizer with a higher compression rate does not necessarily mean the model performs better. Could you also report results at the level of the final model?

For example, the BLEU scores in the sentencepiece experiments:

https://github.com/google/sentencepiece/blob/master/doc/experiments.md#english-to-japanese
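
For concreteness, the compression rate under discussion is usually measured as UTF-8 bytes (or characters) per token over a held-out corpus. A minimal sketch, assuming a SentencePiece model at the hypothetical path `tokenizer.model`:

```python
# Minimal sketch: compression rate as UTF-8 bytes per token on a held-out corpus.
# "tokenizer.model" is a hypothetical SentencePiece model path; higher bytes/token
# means the tokenizer packs more text into each token, i.e. higher compression.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def bytes_per_token(texts):
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(sp.encode(t)) for t in texts)
    return total_bytes / total_tokens

corpus = ["今天天气不错,适合出去走走。",
          "The quick brown fox jumps over the lazy dog."]
print(f"compression rate: {bytes_per_token(corpus):.2f} bytes/token")
```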

@bojone
Owner

bojone commented Feb 28, 2024

I don't have the compute to run that comparison experiment at the moment...

But going by the belief that compression is intelligence, a higher compression rate is equivalent to better performance (at least for LLMs).

@nghuyong
Author

nghuyong commented Feb 28, 2024

For LLMs the relationship may genuinely not be positive. For example, the paper Getting the most out of your tokenizer for pre-training and domain adaptation makes a related point:

It is important to note that higher compression rates could also lead to deteriorated downstream performance, since shorter sequences give less effective FLOPs to a model to reason (Goyal et al., 2023). This is a consequence of the modern Transformer decoder architecture in which every token requires an additional forward pass to generate. Therefore even seemingly low-information tokens might still provide gains on downstream tasks. This is evidenced by Goyal et al. (2023), who propose Pause Tokens, special empty tokens added to the context to enable the model to 'pause' its reasoning and add FLOPs during inference.
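
To make the FLOPs point in that passage concrete, here is a back-of-the-envelope sketch (my own illustration, not from the paper): a decoder-only model spends roughly 2 × parameter-count FLOPs per generated token, so a higher-compression tokenizer gives the model fewer forward passes over the same text.

```python
# Toy illustration (not from the paper): total forward-pass compute on a fixed
# piece of text scales with the number of tokens the tokenizer produces, so a
# higher-compression tokenizer leaves fewer "effective FLOPs" for reasoning.
n_params = 7e9        # hypothetical 7B-parameter decoder-only model
text_bytes = 1_000    # a fixed chunk of text, in UTF-8 bytes

for name, bytes_per_token in [("low-compression tokenizer", 3.0),
                              ("high-compression tokenizer", 5.0)]:
    n_tokens = text_bytes / bytes_per_token
    flops = 2 * n_params * n_tokens   # ~2 * params FLOPs per token forward pass
    print(f"{name}: {n_tokens:.0f} tokens, ~{flops:.2e} forward FLOPs")
```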

So does the "compression is intelligence" belief refer to the model's ability to compress information, which is not equivalent to the tokenizer's compression rate?
