Releases: huggingface/tokenizers

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug related to the saving of the WordPiece model in 0.5.1: the vocab.txt file was named
    vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When no name is specified,
    the files get more generic names, such as vocab.json or merges.txt.
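A sketch of the new behavior, written against a recent version of the library (where the argument is called prefix; the folder and the "my-model" name are made up for illustration):

```python
import tempfile

from tokenizers.models import BPE

# An empty BPE model is enough to demonstrate how vocabularies are saved.
model = BPE()
folder = tempfile.mkdtemp()

# Without a name, the files get the generic names vocab.json / merges.txt.
generic_files = model.save(folder)

# With a name, it is used as a prefix for the saved file names.
named_files = model.save(folder, "my-model")
```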

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency across the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer).
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
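A short sketch of the renamed lowercase option, BPE dropout, and len() on Encoding, trained on a throwaway corpus (the corpus contents and vocab_size are made up; exact token counts depend on training):

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer

# A throwaway corpus file, just so there is something to train on.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("hello world\nHello World\n" * 100)
    corpus = f.name

# `lowercase` (formerly `do_lowercase`) and BPE `dropout` are constructor options.
tokenizer = ByteLevelBPETokenizer(lowercase=True, dropout=0.1)
tokenizer.train(files=[corpus], vocab_size=300, show_progress=False)

encoding = tokenizer.encode("Hello World")
# Encoding now supports len(), i.e. the number of tokens produced.
n_tokens = len(encoding)
```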

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer.

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the class WordPieceTrainer that prevented BertWordPieceTokenizer from being trained. (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)

Python v0.4.0

10 Feb 21:12

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation. (Huge thanks to @ljos with #131)
  • Improved typings

Python v0.3.0

05 Feb 19:03

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on the given delimiter (Works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now
    provide a list of overflowing Encoding objects, ready to be processed by a language model just like the main
    Encoding.
  • Provide a mapping to the original string offsets using:

        output = tokenizer.encode(...)
        print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers: #99 by @kdexd
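A minimal sketch combining the two additions above, CharDelimiterSplit and WordLevel (the vocabulary and the "|" delimiter are made up for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import CharDelimiterSplit

# WordLevel simply maps each token to its id; unknown tokens map to [UNK].
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

# CharDelimiterSplit splits the input on a single character, like str.split("|").
tokenizer.pre_tokenizer = CharDelimiterSplit("|")

# "foo" is not in the vocabulary, so it becomes [UNK].
encoding = tokenizer.encode("hello|world|foo")
```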

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation

Python v0.2.1

22 Jan 21:13
  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5.

Python v0.2.0

20 Jan 14:24

In this release, we fixed some inconsistencies between BPETokenizer and the original Python version of this tokenizer. If you created your own vocabulary with this tokenizer, you will need to either train a new one, or use a modified version where you set the PreTokenizer back to Whitespace (instead of WhitespaceSplit).
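The difference between the two pre-tokenizers can be seen directly; a sketch using the pre_tokenize_str helper available in later releases:

```python
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "hello, world"

# Whitespace splits on word boundaries, so punctuation becomes its own piece.
by_word = [piece for piece, _ in Whitespace().pre_tokenize_str(text)]

# WhitespaceSplit splits strictly on whitespace characters, like str.split().
by_space = [piece for piece, _ in WhitespaceSplit().pre_tokenize_str(text)]
```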

Python v0.1.1

12 Jan 07:37
  • Fix a bug where special tokens got split while encoding.