encoding problem when training for Russian #254

Is it possible to train a tokenizer for Russian?

Comments
I think the tokenizer.train function should accept utf-8 encoding.
It should, but as far as I can tell it doesn't. I might be wrong though. I used utf-8 encoded txt files for training and got the result shown in the first comment.
Make sure the train function decodes the utf-8 encoded string.
You are using the byte-level BPE. It operates on bytes, not Unicode characters.
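(Illustrative sketch only, not from the thread: roughly what training a byte-level BPE on Russian text looks like with the tokenizers library, assuming a hypothetical russian.txt training file; the exact tokens will differ, but Cyrillic comes out as byte-level symbols rather than readable letters.)

>>> # Sketch: train a byte-level BPE and inspect the raw tokens it produces.
>>> from tokenizers import ByteLevelBPETokenizer
>>> tokenizer = ByteLevelBPETokenizer()
>>> tokenizer.train(files=["russian.txt"], vocab_size=30000, min_frequency=2)
>>> # Cyrillic letters are split into their UTF-8 bytes, so the tokens
>>> # look like "Ñģ", "Ð½", ... instead of human-readable Russian subwords.
>>> tokenizer.encode("слово").tokens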
@n1t0 @GladiatorX Thank you for your answers! I read the #203 comment. If I'm not mistaken, to decode token ids into human-readable subwords I have to merge two single-byte Unicode chars into one two-byte Unicode char, so that for example U+00D0 Ð and U+00BD ½ become U+043D н (the Russian letter [n]). It's rather inconvenient, but I see that this is by-design behaviour.
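(Roughly what that merge looks like in plain Python, assuming every character of the token stands for its own byte value, as Ð and ½ do; markers such as Ġ need the full byte-level table, or the decoder shown below.)

>>> # Sketch: undo the byte-level mapping by hand for a token like "Ð½".
>>> # Ð (U+00D0) and ½ (U+00BD) stand for the raw bytes 0xD0 0xBD,
>>> # which together are the UTF-8 encoding of "н" (U+043D).
>>> bytes(ord(c) for c in 'Ð½').decode('utf-8')
'н'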
You can also probably use a ByteLevel decoder:

>>> from tokenizers.decoders import ByteLevel
>>> decoder = ByteLevel()
>>> decoder.decode([ 'ĠÑģлово' ])
' слово'
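(A small follow-up sketch, not from the thread: the same decoder also handles the token strings produced by encoding, assuming the hypothetical tokenizer trained in the earlier sketch.)

>>> encoding = tokenizer.encode(' слово')   # tokenizer from the training sketch above
>>> decoder.decode(encoding.tokens)         # decoder = ByteLevel() as above
' слово'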
@n1t0 Great! Thanks a lot!
I used the same code to train it on the Marathi language, but when running the above code to decode the output into human-readable tokens, my Colab session crashes with this warning: thread '' panicked at 'no entry found for key', /_w/tokenizers/tokenizers/tokenizers/src/pre_tokenizers/byte_level.rs:183:26. Try reproducing it by decoding the following tokens:
@GladiatorX The byte-level decoder can only decode byte-level tokens, not any random string. Your examples contain spaces, and these are not part of the byte-level alphabet.
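(To make that concrete: in the byte-level alphabet a space is written as the symbol Ġ, which decodes to a space, while a literal space character is not part of the alphabet at all, which is what triggers the panic above. A tiny sketch reusing the decoder from the earlier example:)

>>> decoder.decode(['Ġ'])
' '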
Closing this issue as everything should be resolved. Feel free to reopen otherwise!