ByteLevelBPETokenizer with Greek gives weird symbols. #223

gdet · 2020-04-08T08:07:19Z

Hello,

I followed the steps in this article, https://huggingface.co/blog/how-to-train, to train a model for the Greek language. All the files I used are UTF-8 encoded. When I use ByteLevelBPETokenizer, I get weird symbols. I read in other issues here that this is normal, but there is not a single normal character in my merges.txt file. Also, when I print the tokens to check whether a word is tokenized correctly, I get this:

 result = tokenizer.encode("Γεια τι κάνεις;")
 print(result.tokens)
 ['<s>', 'ÎĵÎµÎ¹Î±', 'ĠÏĦÎ¹', 'ĠÎºÎ¬Î½ÎµÎ¹ÏĤ', ';', '</s>']

Is this normal, or is ByteLevelBPETokenizer not suitable for Greek characters? Also, is it possible to transform this output into a readable string to check whether it is correct?
Example of merges.txt:

 ĠÏĦ Î¿
 ÏĦ Î·
 ĠÎ ½
 ĠÏĦ Î¿Ïħ

Thank you


julien-c · 2020-04-08T14:53:07Z

Yes, this is to be expected. See @n1t0's answer in this thread for context.
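For anyone landing here with the same question, here is a minimal sketch of why the tokens look like this and how to turn them back into readable text. It assumes the tokenizer uses the GPT-2 style byte-to-character table, which is consistent with the output above (for example, a leading space shows up as `Ġ`). The helper names `bytes_to_unicode`, `to_byte_level`, and `from_byte_level` are made up for this illustration and are not part of the tokenizers API. Each Greek letter is two bytes in UTF-8, and each byte is displayed as one stand-in character, so `Γ` becomes `Îĵ`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style table)."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~' (0x21-0x7E), 0xA1-0xAC, and 0xAE-0xFF.
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control bytes, space, etc. are shifted to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Turn a byte-level token back into readable text.

    A single token may end in the middle of a multi-byte character,
    so undecodable bytes are replaced instead of raising an error.
    """
    raw = bytes(CHAR_TO_BYTE[c] for c in token)
    return raw.decode("utf-8", errors="replace")


# Rebuild tokens shaped like the ones in the report, then decode them.
tokens = ["<s>"] + [to_byte_level(p) for p in ["Γεια", " τι", " κάνεις", ";"]] + ["</s>"]
print(tokens)
print([from_byte_level(t) for t in tokens])
```

The same reasoning applies to merges.txt: its entries are pieces of this byte-level alphabet, and a single entry can hold only part of a character, so the file is not expected to contain readable Greek. With a trained tokenizer, calling `tokenizer.decode(result.ids)` should give the readable string back without any manual mapping.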

gdet changed the title ~~ByteLevelBPETokenizer with Greek problem.~~ ByteLevelBPETokenizer with Greek give weird symbols. Apr 8, 2020

gdet changed the title ~~ByteLevelBPETokenizer with Greek give weird symbols.~~ ByteLevelBPETokenizer with Greek gives weird symbols. Apr 8, 2020

n1t0 closed this as completed Apr 10, 2020

kaunghtetsan275 mentioned this issue May 17, 2020

Is ByteLevelBPETokenizer working with cyrillics? #268

Closed
