ByteLevelBPETokenizer with Greek gives weird symbols. #223

Closed
gdet opened this issue Apr 8, 2020 · 1 comment

gdet commented Apr 8, 2020

Hello,

I followed the steps in this article https://huggingface.co/blog/how-to-train to train a model for the Greek language. All the files I used are UTF-8 encoded. When using ByteLevelBPETokenizer I get weird symbols. I read in other issues here that this is normal, but there is not a single normal character in my merges.txt file. Also, when I print the tokens to check whether a sentence is tokenized correctly, I get this:

 result = tokenizer.encode("Γεια τι κάνεις;")
 print(result.tokens)
 ['<s>', 'ÎĵÎµÎ¹Î±', 'ĠÏĦÎ¹', 'ĠÎºÎ¬Î½ÎµÎ¹ÏĤ', ';', '</s>']

Is this normal, or is ByteLevelBPETokenizer not suitable for Greek characters? Also, is it possible to transform this output into a readable string to check that it is correct?
Example of merges.txt:

 ĠÏĦ Î¿
 ÏĦ Î·
 ĠÎ ½
 ĠÏĦ Î¿Ïħ

Thank you

gdet changed the title from "ByteLevelBPETokenizer with Greek problem." to "ByteLevelBPETokenizer with Greek give weird symbols." on Apr 8, 2020
gdet changed the title from "ByteLevelBPETokenizer with Greek give weird symbols." to "ByteLevelBPETokenizer with Greek gives weird symbols." on Apr 8, 2020

julien-c commented Apr 8, 2020

Yes, this is to be expected. See @n1t0's answer in this thread for context.
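
To illustrate why the symbols appear and how to read them back: a byte-level tokenizer works on the UTF-8 bytes of the text, and each byte is shown as one printable character. Every Greek letter takes two bytes, so it shows up as two unrelated-looking symbols, and a leading space shows up as `Ġ`. The sketch below is a standalone reconstruction that assumes the GPT-2 style byte-to-character table; the helper names `to_byte_level` and `from_byte_level` are made up for this example and are not part of the library.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable, non-space characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are moved to 256 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Hypothetical helper: show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Hypothetical helper: turn byte-level symbols back into readable text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


# Pieces chosen by hand for illustration; a trained tokenizer picks its own splits.
tokens = [to_byte_level(piece) for piece in ["Γεια", " τι", " κάνεις", ";"]]
print(tokens)

# Join the tokens before decoding, since one token may end in the middle of a character.
print(from_byte_level("".join(tokens)))
```

With the library itself, calling `tokenizer.decode(result.ids)` should return the readable string directly, so this table is only needed when inspecting `merges.txt` or raw tokens by hand.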
