Python - Add Changelog
n1t0 committed Feb 24, 2020
1 parent 999088e commit be08d95
Showing 1 changed file with 69 additions and 0 deletions: bindings/python/CHANGELOG.md
# v0.5.1

## Changes:
- The `name` argument is now optional when saving a `Model`'s vocabulary. When no name is specified,
the files get more generic names, like `vocab.json` or `merges.txt` (see the sketch below).
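
A minimal sketch of both saving modes, assuming a `CharBPETokenizer` with a `save(directory, name=None)` helper and a hypothetical training file `data.txt`:

```
from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()
tokenizer.train(["data.txt"])

# With a name, the files are prefixed: my-bpe-vocab.json / my-bpe-merges.txt
tokenizer.save("./output", "my-bpe")

# Without a name (new in v0.5.1): generic vocab.json / merges.txt
tokenizer.save("./output")
```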

# v0.5.0

## Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149); see the sketch after this list
- Added a new `Strip` normalizer
- `do_lowercase` has been renamed to `lowercase` for consistency across the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Exposed `__len__` on `Encoding` (cf #139)
- Improved padding performance.
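
A short sketch of the new options, with illustrative values (the `dropout` probability and the `data.txt` file are assumptions):

```
from tokenizers import ByteLevelBPETokenizer

# `dropout` enables BPE-dropout; `lowercase` replaces the old `do_lowercase`
tokenizer = ByteLevelBPETokenizer(dropout=0.1, lowercase=True)
tokenizer.train(["data.txt"])

encoding = tokenizer.encode("Hello world!")
print(len(encoding))  # __len__ is now exposed on Encoding
```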

## Fixes:
- #145: Decoding was buggy on `BertWordPieceTokenizer`.
- #152: Some documentation and examples were still using the old `BPETokenizer`.

## How to migrate:
- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of `do_lowercase`, as in the example below.
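
For example, with everything else unchanged:

```
from tokenizers import CharBPETokenizer

# Before (v0.4.x):
# tokenizer = CharBPETokenizer(do_lowercase=True)

# After (v0.5.0):
tokenizer = CharBPETokenizer(lowercase=True)
```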

# v0.4.2

## Fixes:
- Fix a bug in the `WordPieceTrainer` class that prevented `BertWordPieceTokenizer` from being trained (cf #137); see the sketch below.
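
With the fix, training works again; a minimal sketch assuming a hypothetical local `data.txt`:

```
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(["data.txt"], vocab_size=30000)
```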

# v0.4.1

## Fixes:
- Fix a bug related to punctuation in `BertWordPieceTokenizer` (thanks to @Mansterteddy with #134)

# v0.4.0

## Changes:
- Replaced all `.new()` class methods with a proper `__new__` implementation. (Huge thanks to @ljos with #131)
- Improved typings

## How to migrate:
- Remove `.new` from all class instantiations, as in the example below
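
A hedged sketch of the change (exact constructor arguments may differ across versions; `BPE.empty()` and `Whitespace` are assumptions for illustration):

```
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE

# Before (v0.3.x), construction went through .new():
# tokenizer = Tokenizer.new(BPE.empty())
# tokenizer.pre_tokenizer = pre_tokenizers.Whitespace.new()

# After (v0.4.0), plain constructors:
tokenizer = Tokenizer(BPE.empty())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```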

# v0.3.0

## Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on a given delimiter, like `.split(delimiter)` would (see the sketch after this list).
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improved truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:
```
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
- Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
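
A minimal sketch of the new `CharDelimiterSplit` (the delimiter is chosen arbitrarily; written in the plain-constructor style that v0.4.0 introduces, whereas in v0.3.0 itself this went through `.new()`):

```
from tokenizers.pre_tokenizers import CharDelimiterSplit

# Pre-tokenizes "hello|world" the way "hello|world".split("|") would
pre_tokenizer = CharDelimiterSplit("|")
```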

## Fixes:
- Fix a bug with `IndexableString`
- Fix a bug with truncation

## How to migrate:
- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a `List[Encoding]` instead of an `Optional[Encoding]` (illustrated below)
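
A hedged illustration, assuming `tokenizer` is any trained tokenizer with truncation enabled:

```
# e.g. tokenizer.enable_truncation(max_length=128)

output = tokenizer.encode("a very long sequence that gets truncated")

# Before: output.overflowing was an Optional[Encoding]
# Now: a (possibly empty) list of ready-to-process Encodings
for overflowing in output.overflowing:
    print(overflowing.ids)
```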

# v0.2.1

## Fixes:
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5
