Releases · huggingface/tokenizers
Python v0.5.2
Fixes:
- We introduced a bug related to the saving of the WordPiece model in 0.5.1: the `vocab.txt` file was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.
Python v0.5.1
Changes:
- The `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
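A minimal sketch of the new behavior, assuming a `ByteLevelBPETokenizer` trained on a placeholder `data.txt` and the 0.5.x `save(directory, name=None)` helper on the implementation classes (exact signatures and file prefixes may differ):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(["data.txt"])         # placeholder training corpus

# With an explicit name, the files get prefixed, e.g. "my-bpe-vocab.json" / "my-bpe-merges.txt"
tokenizer.save("./models", "my-bpe")

# Without a name (new in 0.5.1), generic file names are used: "vocab.json" / "merges.txt"
tokenizer.save("./models")
```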
Python v0.5.0
Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`); see the sketch after this list
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performance.
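A short sketch of the renamed `lowercase` argument, the new `dropout` option, and `__len__` on `Encoding`. It assumes both options are keyword arguments on `ByteLevelBPETokenizer`, and `data.txt` is a placeholder corpus:

```python
from tokenizers import ByteLevelBPETokenizer

# `lowercase` replaces the former `do_lowercase`; `dropout` enables BPE dropout (#149)
tokenizer = ByteLevelBPETokenizer(lowercase=True, dropout=0.1)
tokenizer.train(["data.txt"])         # placeholder corpus file

encoding = tokenizer.encode("Hello world!")
print(len(encoding))                  # __len__ on Encoding returns the number of tokens (#139)
```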
Python v0.4.2
Fixes:
- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained (cf #137)
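With the `WordPieceTrainer` fix, training a `BertWordPieceTokenizer` works again. A minimal sketch, where `corpus.txt` and `vocab_size` are placeholders:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
# Training previously failed because of the WordPieceTrainer bug (#137)
tokenizer.train(["corpus.txt"], vocab_size=30000)
```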
Python v0.4.1
Fixes:
- Fix a bug related to the punctuation in `BertWordPieceTokenizer` (thanks to @Mansterteddy with #134)
Python v0.4.0
Python v0.3.0
Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improved truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just as the main `Encoding` (see the sketch after this list).
- Provide mapping to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
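A sketch of the improved overflow handling, assuming an already trained `tokenizer` and the `enable_truncation` helper (parameter names may differ slightly in this early version):

```python
# Keep at most 8 tokens per Encoding, with a 2-token overlap between chunks
tokenizer.enable_truncation(max_length=8, stride=2)

output = tokenizer.encode("a rather long sentence that will not fit into eight tokens")
print(output.tokens)                  # the main, truncated Encoding
for extra in output.overflowing:      # the overflowing Encodings, ready for the model too
    print(extra.tokens)
```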
Bug fixes:
- Fix a bug with `IndexableString`
- Fix a bug with truncation
Python v0.2.1
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5
Python v0.2.0
In this release, we fixed some inconsistencies between the `BPETokenizer` and the original Python version of this tokenizer. If you created your own vocabulary using this tokenizer, you will need to either train a new one, or use a modified version where you set the `PreTokenizer` back to `Whitespace` (instead of `WhitespaceSplit`). A sketch of this change is shown below.
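A sketch of how the pre-tokenizer could be switched back, assuming the core `Tokenizer` API, the `BPE.from_files` constructor from this era, and placeholder vocabulary paths:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Reload an existing vocabulary ("vocab.json" / "merges.txt" are placeholder paths)
tokenizer = Tokenizer(BPE.from_files("vocab.json", "merges.txt"))

# Restore the previous pre-tokenization behavior
tokenizer.pre_tokenizer = Whitespace()
```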
Python v0.1.1
- Fix a bug where special tokens get split while encoding