Releases: huggingface/tokenizers

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug related to the saving of the WordPiece model in 0.5.1: the vocab.txt file was named
    vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When no name is specified,
    the files get more generic names, such as vocab.json or merges.txt.
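A sketch of the new behavior, written against a recent version of the library (where the argument is called prefix; the folder and the "my-model" name are made up for illustration):

```python
import tempfile

from tokenizers.models import BPE

# An empty BPE model is enough to demonstrate how vocabularies are saved.
model = BPE()
folder = tempfile.mkdtemp()

# Without a name, the files get the generic names vocab.json / merges.txt.
generic_files = model.save(folder)

# With a name, it is used as a prefix for the saved file names.
named_files = model.save(folder, "my-model")
```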

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency across the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer).
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
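A short sketch of the renamed lowercase option, BPE dropout, and len() on Encoding, trained on a throwaway corpus (the corpus contents and vocab_size are made up; exact token counts depend on training):

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer

# A throwaway corpus file, just so there is something to train on.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("hello world\nHello World\n" * 100)
    corpus = f.name

# `lowercase` (formerly `do_lowercase`) and BPE `dropout` are constructor options.
tokenizer = ByteLevelBPETokenizer(lowercase=True, dropout=0.1)
tokenizer.train(files=[corpus], vocab_size=300, show_progress=False)

encoding = tokenizer.encode("Hello World")
# Encoding now supports len(), i.e. the number of tokens produced.
n_tokens = len(encoding)
```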

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer.

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the class WordPieceTrainer that prevented BertWordPieceTokenizer from being trained. (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)

Python v0.4.0

10 Feb 21:12

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation. (Huge thanks to @ljos with #131)
  • Improved typings

Python v0.3.0

05 Feb 19:03

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on the given delimiter (Works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now
    provide a list of overflowing Encoding objects, ready to be processed by a language model just like the main
    Encoding.
  • Provide a mapping to the original string offsets using:

        output = tokenizer.encode(...)
        print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers: #99 by @kdexd
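A minimal sketch combining the two additions above, CharDelimiterSplit and WordLevel (the vocabulary and the "|" delimiter are made up for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import CharDelimiterSplit

# WordLevel simply maps each token to its id; unknown tokens map to [UNK].
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

# CharDelimiterSplit splits the input on a single character, like str.split("|").
tokenizer.pre_tokenizer = CharDelimiterSplit("|")

# "foo" is not in the vocabulary, so it becomes [UNK].
encoding = tokenizer.encode("hello|world|foo")
```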

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation

Python v0.2.1

22 Jan 21:13
  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5.

Python v0.2.0

20 Jan 14:24

In this release, we fixed some inconsistencies between BPETokenizer and the original Python version of this tokenizer. If you created your own vocabulary with this tokenizer, you will need to either train a new one, or use a modified version where you set the PreTokenizer back to Whitespace (instead of WhitespaceSplit).
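The difference between the two pre-tokenizers can be seen directly; a sketch using the pre_tokenize_str helper available in later releases:

```python
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "hello, world"

# Whitespace splits on word boundaries, so punctuation becomes its own piece.
by_word = [piece for piece, _ in Whitespace().pre_tokenize_str(text)]

# WhitespaceSplit splits strictly on whitespace characters, like str.split().
by_space = [piece for piece, _ in WhitespaceSplit().pre_tokenize_str(text)]
```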

Python v0.1.1

12 Jan 07:37
  • Fix a bug where special tokens got split while encoding.