
encoding problem when training for Russian #254

Closed
janyfe opened this issue Apr 28, 2020 · 11 comments

Comments


janyfe commented Apr 28, 2020

from tokenizers import ByteLevelBPETokenizer

# paths = [txt files with some text in Russian]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(".", "dbg_bpe")

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(
    './dbg_bpe-vocab.json',
    './dbg_bpe-merges.txt'
)

z = tokenizer.encode('честное слово написано прям руками')

print(z.tokens)

['ÑĩÐµÑģÑĤ', 'Ð½Ð¾Ðµ', 'ĠÑģÐ»Ð¾Ð²Ð¾', 'ĠÐ½Ð°Ð¿Ð¸ÑģÐ°Ð½Ð¾', 'ĠÐ¿ÑĢÑıÐ¼', 'ĠÑĢÑĥÐºÐ°Ð¼Ð¸']

Is it possible to train tokenizer for Russian?

janyfe changed the title from "encoding problem when train for Russian" to "encoding problem when training for Russian" on Apr 28, 2020
janyfe (Author) commented Apr 28, 2020

import ftfy
repaired = [ftfy.fix_text(elem) for elem in z.tokens]

for b, g in zip(z.tokens, repaired):
    print(b, g)

ÑĩÐµÑģÑĤ ÑĩеÑģÑĤ
Ð½Ð¾Ðµ ное
ĠÑģÐ»Ð¾Ð²Ð¾ ĠÑģлово
ĠÐ½Ð°Ð¿Ð¸ÑģÐ°Ð½Ð¾ ĠнапиÑģано
ĠÐ¿ÑĢÑıÐ¼ ĠпÑĢÑıм
ĠÑĢÑĥÐºÐ°Ð¼Ð¸ ĠÑĢÑĥками

ftfy.explain_unicode('Ð½Ð¾Ðµ')
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00BD  ½       [No] VULGAR FRACTION ONE HALF
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00BE  ¾       [No] VULGAR FRACTION THREE QUARTERS
U+00D0  Ð       [Lu] LATIN CAPITAL LETTER ETH
U+00B5  µ       [Ll] MICRO SIGN

ftfy.explain_unicode('ное')
U+043D  н       [Ll] CYRILLIC SMALL LETTER EN
U+043E  о       [Ll] CYRILLIC SMALL LETTER O
U+0435  е       [Ll] CYRILLIC SMALL LETTER IE
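The Ð/½ pairs above are exactly what GPT-2-style byte-level encoding produces: every UTF-8 byte is mapped to a single printable character. A minimal sketch of that table, following the well-known GPT-2 bytes_to_unicode construction (an illustration, not the library's internal code):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable character, GPT-2 style."""
    # Bytes that are already printable map to themselves:
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # control bytes, space, DEL, etc.
            bs.append(b)
            cs.append(256 + n)   # shift into unused code points (Ġ, Ñ, ģ, ...)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte2char = bytes_to_unicode()
# 'н' (U+043D) is two UTF-8 bytes, 0xD0 0xBD, so it surfaces as 'Ð½':
print(''.join(byte2char[b] for b in 'ное'.encode('utf-8')))  # Ð½Ð¾Ðµ
```

Note that the space byte 0x20 maps to 'Ġ' (U+0120), which is why word-initial tokens in the vocabulary start with that character.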

@GladiatorX

I think the tokenizer.train function should accept utf-8 encoding.

janyfe (Author) commented May 7, 2020

It should, but as far as I can tell it doesn't. I might be wrong, though. I used utf-8 encoded txt files for training and got the result shown in the first comment.

@GladiatorX

Make sure the train function decodes the utf-8 encoded string.

n1t0 (Member) commented May 7, 2020

You are using the byte-level BPE. It operates with bytes, not Unicode characters.
Please read this: #203 (comment)
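To illustrate the point in plain Python (an illustration, not the library's code): each Cyrillic character is two UTF-8 bytes, and the byte-level model only ever sees those bytes, so its merges and its vocabulary entries are byte sequences, not characters:

```python
s = 'слово'
data = s.encode('utf-8')
print(len(s), len(data))           # 5 characters, 10 bytes
print([hex(b) for b in data])      # the units the BPE actually operates on
```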

janyfe (Author) commented May 7, 2020

@n1t0 @GladiatorX Thank you for your answers!

I read the #203 comment. If I'm not mistaken, to decode token ids into human-readable subwords I have to merge two single-byte Unicode chars into one two-byte Unicode char, so that for example U+00D0 Ð and U+00BD ½ become U+043D н (the Russian letter [en]). It's rather inconvenient, but I see that this is by-design behaviour.
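For tokens made only of Latin-1-range characters, that merge is a one-liner, because such characters map to the byte with the same value (a quick sketch; tokens containing remapped characters like 'Ġ' need the real ByteLevel decoder mentioned later in the thread):

```python
token = 'Ð½Ð¾Ðµ'                 # byte-level form of 'ное'
raw = token.encode('latin-1')    # each char's code point IS the original byte
print(raw.decode('utf-8'))       # ное
```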

n1t0 (Member) commented May 7, 2020

You can also probably use a ByteLevel decoder (the part used by the ByteLevelBPETokenizer to decode) if you want to see the human-readable tokens:

>>> from tokenizers.decoders import ByteLevel
>>> decoder = ByteLevel()
>>> decoder.decode([ 'ĠÑģÐ»Ð¾Ð²Ð¾' ])
' слово'

janyfe (Author) commented May 7, 2020

@n1t0 Great! Thanks a lot!

@GladiatorX

I used the same code to train on the Marathi language, but when running the above code to decode to human-readable tokens, my Colab session crashes with this warning: thread '' panicked at 'no entry found for key', /_w/tokenizers/tokenizers/tokenizers/src/pre_tokenizers/byte_level.rs:183:26

Try reproducing it by decoding the following tokens:

  • Ġ१९ॠ«
  • ण वत

decoder.decode([ 'ण वत' ])

n1t0 (Member) commented May 10, 2020

@GladiatorX the byte-level decoder can only decode byte-level tokens, not any random string. Your examples contain spaces, and these are not part of the byte-level alphabet.
decoder.decode takes a list of tokens to decode, so
decoder.decode([ 'ण', 'वत' ]) will behave as you would expect.

n1t0 (Member) commented May 12, 2020

Closing this issue as everything should be resolved. Feel free to reopen otherwise!
