
encoding problem when training for Russian #254

Closed
janyfe opened this issue Apr 28, 2020 · 11 comments

Comments


janyfe commented Apr 28, 2020

from tokenizers import ByteLevelBPETokenizer

# paths = [txt files with some text in Russian]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(".", "dbg_bpe")

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(
    './dbg_bpe-vocab.json',
    './dbg_bpe-merges.txt'
)

z = tokenizer.encode('честное слово написано прям руками')

print(z.tokens)

['ÑĩÐµÑģÑĤ', 'Ð½Ð¾Ðµ', 'ĠÑģÐ»Ð¾Ð²Ð¾', 'ĠÐ½Ð°Ð¿Ð¸ÑģÐ°Ð½Ð¾', 'ĠÐ¿ÑĢÑıÐ¼', 'ĠÑĢÑĥÐºÐ°Ð¼Ð¸']

Is it possible to train tokenizer for Russian?

janyfe changed the title from "encoding problem when train for Russian" to "encoding problem when training for Russian" on Apr 28, 2020
janyfe (Author) commented Apr 28, 2020

import ftfy
repaired = [ftfy.fix_text(elem) for elem in z.tokens]

for b, g in zip(z.tokens, repaired):
    print(b, g)

ÑĩÐµÑģÑĤ ÑĩеÑģÑĤ
Ð½Ð¾Ðµ ное
ĠÑģÐ»Ð¾Ð²Ð¾ ĠÑģлово
ĠÐ½Ð°Ð¿Ð¸ÑģÐ°Ð½Ð¾ ĠнапиÑģано
ĠÐ¿ÑĢÑıÐ¼ ĠпÑĢÑıм
ĠÑĢÑĥÐºÐ°Ð¼Ð¸ ĠÑĢÑĥками

ftfy.explain_unicode('Ð½Ð¾Ðµ')
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00BD  ½       [No] VULGAR FRACTION ONE HALF
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00BE  ¾       [No] VULGAR FRACTION THREE QUARTERS
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00B5  µ       [Ll] MICRO SIGN

ftfy.explain_unicode('ное')
U+043D  н       [Ll] CYRILLIC SMALL LETTER EN
U+043E  о       [Ll] CYRILLIC SMALL LETTER O
U+0435  е       [Ll] CYRILLIC SMALL LETTER IE
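The Ð/½ pairs above are exactly what GPT-2-style byte-level encoding produces: every UTF-8 byte is mapped to a single printable character. A minimal sketch of that table, following the well-known GPT-2 bytes_to_unicode construction (an illustration, not the library's internal code):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable character, GPT-2 style."""
    # Bytes that are already printable map to themselves:
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # control bytes, space, DEL, etc.
            bs.append(b)
            cs.append(256 + n)   # shift into unused code points (Ġ, Ñ, ģ, ...)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte2char = bytes_to_unicode()
# 'н' (U+043D) is two UTF-8 bytes, 0xD0 0xBD, so it surfaces as 'Ð½':
print(''.join(byte2char[b] for b in 'ное'.encode('utf-8')))  # Ð½Ð¾Ðµ
```

Note that the space byte 0x20 maps to 'Ġ' (U+0120), which is why word-initial tokens in the vocabulary start with that character.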

@GladiatorX

I think the tokenizer.train function should accept utf-8 encoding.

janyfe (Author) commented May 7, 2020

It should, but as far as I can tell it doesn't. I might be wrong, though. I used utf-8 encoded txt files for training and got the result shown in the first comment.

@GladiatorX

Make sure the train function decodes the utf-8 encoded string.

n1t0 (Member) commented May 7, 2020

You are using the byte-level BPE. It operates with bytes, not Unicode characters.
Please read this: #203 (comment)
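To illustrate the point in plain Python (an illustration, not the library's code): each Cyrillic character is two UTF-8 bytes, and the byte-level model only ever sees those bytes, so its merges and its vocabulary entries are byte sequences, not characters:

```python
s = 'слово'
data = s.encode('utf-8')
print(len(s), len(data))           # 5 characters, 10 bytes
print([hex(b) for b in data])      # the units the BPE actually operates on
```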

janyfe (Author) commented May 7, 2020

@n1t0 @GladiatorX Thank you for your answers!

I read the #203 comment. If I'm not mistaken, to decode token ids into human-readable subwords I have to merge two single-byte Unicode chars into one two-byte Unicode char, so that for example U+00D0 Ð and U+00BD ½ become U+043D н (the Russian letter [en]). It's rather inconvenient, but I see that this is by-design behaviour.
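For tokens made only of Latin-1-range characters, that merge is a one-liner, because such characters map to the byte with the same value (a quick sketch; tokens containing remapped characters like 'Ġ' need the real ByteLevel decoder mentioned later in the thread):

```python
token = 'Ð½Ð¾Ðµ'                 # byte-level form of 'ное'
raw = token.encode('latin-1')    # each char's code point IS the original byte
print(raw.decode('utf-8'))       # ное
```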

n1t0 (Member) commented May 7, 2020

You can also probably use a ByteLevel decoder (the part used by the ByteLevelBPETokenizer to decode) if you want to see the human-readable tokens:

>>> from tokenizers.decoders import ByteLevel
>>> decoder = ByteLevel()
>>> decoder.decode([ 'ĠÑģÐ»Ð¾Ð²Ð¾' ])
' слово'

janyfe (Author) commented May 7, 2020

@n1t0 Great! Thanks a lot!

@GladiatorX

I used the same code to train on the Marathi language, but when running the above code to decode to human-readable tokens, my Colab session crashes with this warning: thread '' panicked at 'no entry found for key', /_w/tokenizers/tokenizers/tokenizers/src/pre_tokenizers/byte_level.rs:183:26

Try reproducing it by decoding the following tokens:

  • Ġ१९ॠ«
  • ण वत

decoder.decode([ 'ण वत' ])

n1t0 (Member) commented May 10, 2020

@GladiatorX the byte-level decoder can only decode byte-level tokens, not any random string. Your examples contain spaces, and these are not part of the byte-level alphabet.
decoder.decode takes a list of tokens to decode, so
decoder.decode([ 'ण', 'वत' ]) will behave as you would expect.

n1t0 (Member) commented May 12, 2020

Closing this issue as everything should be resolved. Feel free to reopen otherwise!
