Releases · OpenNMT/Tokenizer
Tokenizer 1.24.0
New features
- Add `verbose` flag in file tokenization APIs to log progress every 100,000 lines
- [Python] Add `options` property to `Tokenizer` instances
- [Python] Add class `pyonmttok.SentencePieceTokenizer` to help create a tokenizer compatible with SentencePiece
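A minimal sketch of these three additions in the Python API (the file and model paths are placeholders, and the exact `tokenize_file` signature is assumed from the release note):

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# New options property: inspect the configuration of an existing instance.
print(tokenizer.options)

# New verbose flag: progress is logged every 100,000 lines.
tokenizer.tokenize_file("input.txt", "output.txt", verbose=True)

# New helper class producing SentencePiece-compatible tokenization
# ("sp.model" is a placeholder for a trained SentencePiece model).
sp_tokenizer = pyonmttok.SentencePieceTokenizer("sp.model")
```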
Fixes and improvements
- Fix deserialization into `Token` objects that was sometimes incorrect
- Fix Windows compilation
- Fix Google Test integration that was sometimes installed as part of `make install`
- [Python] Update pybind11 to 2.6.2
- [Python] Update ICU to 66.1
- [Python] Compile ICU with optimization flags
Tokenizer 1.23.0
Changes
- Drop Python 2 support
New features
- Publish Python wheels for macOS
Fixes and improvements
- Improve performance in all tokenization modes (up to 2x faster)
- Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
- Fix a regression introduced in 1.20 where `segment_alphabet_*` options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both the Hiragana and Katakana scripts and should not trigger a segmentation)
- Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using `preserve_segmented_tokens` and the word is segmented by both a `segment_*` option and BPE
- Fix incorrect tokenization when using `support_prior_joiners` and some joiners are within protected sequences
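For context on the last fix, a hedged sketch of the `support_prior_joiners` option: with it enabled, joiner marks (￭) already present in the input are treated as attachment points rather than as regular characters (the output shown is illustrative, not verified):

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    support_prior_joiners=True,
)

# "pre￭" carries a prior joiner, marking it as attached to the next token.
tokens, _ = tokenizer.tokenize("pre￭ tokenized text")
print(tokens)  # illustrative output: ['pre￭', 'tokenized', 'text']
```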
Tokenizer 1.22.2
Fixes and improvements
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
Tokenizer 1.22.1
Fixes and improvements
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with the Kangxi and Kanbun scripts (see the `segment_alphabet` option)
Tokenizer 1.22.0
Changes
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
New features
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
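A minimal sketch of reproducible subword regularization with the new `set_random_seed` function (the BPE model path is a placeholder, and `bpe_dropout` is used here as the stochastic segmentation being made deterministic):

```python
import pyonmttok

# Fixing the global seed makes stochastic segmentation such as BPE dropout
# produce the same output across runs.
pyonmttok.set_random_seed(42)

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="codes.bpe",  # placeholder BPE model
    bpe_dropout=0.1,
    joiner_annotate=True,
)
tokens, _ = tokenizer.tokenize("unbelievable")
```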
Fixes and improvements
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
Tokenizer 1.21.0
New features
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
Fixes and improvements
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiners`
- [Python] Implement the `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
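A short sketch of the Python-side changes, assuming `Token` can be constructed from a surface string:

```python
import pyonmttok

# Arguments other than mode are keyword-only, so options must be named.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# __hash__ is now consistent with __eq__: equal tokens collapse in a set.
a = pyonmttok.Token("Hello")
b = pyonmttok.Token("Hello")
assert a == b
assert len({a, b}) == 1
```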
Tokenizer 1.20.0
Changes
- The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
  - ICU is now required to improve performance and Unicode support
  - SentencePiece is now integrated as a Git submodule and linked statically to the project
  - Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
  - The project is compiled in `Release` mode by default
  - Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile the tests)
New features
- Accept any Unicode script aliases in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
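A hedged sketch of the improved `Token` class (the attribute name `join_right` is assumed from the library's joiner terminology):

```python
import pyonmttok

# Attributes can be set on the instance and are shown by __repr__.
token = pyonmttok.Token("Hello")
token.join_right = True
print(repr(token))

# Copy constructor: build an identical, independent Token.
copy = pyonmttok.Token(token)
assert copy == token
```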
Fixes and improvements
- [Python] Accept a `None` value for the `segment_alphabet` argument
Tokenizer 1.19.0
New features
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
Fixes and improvements
- Include "Half-width kana" in Katakana script detection
Tokenizer 1.18.5
Fixes and improvements
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
Tokenizer 1.18.4
Fixes and improvements
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
(This is the first release to be created on GitHub. See the release notes for previous tags in CHANGELOG.md.)