Releases · OpenNMT/Tokenizer
Tokenizer 1.24.0
New features
- Add `verbose` flag in file tokenization APIs to log progress every 100,000 lines
- [Python] Add `options` property to `Tokenizer` instances
- [Python] Add class `pyonmttok.SentencePieceTokenizer` to help create a tokenizer compatible with SentencePiece
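A minimal sketch of these three additions in the Python API (the file and model paths are placeholders, and the exact `tokenize_file` signature is assumed from the release note):

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# New options property: inspect the configuration of an existing instance.
print(tokenizer.options)

# New verbose flag: progress is logged every 100,000 lines.
tokenizer.tokenize_file("input.txt", "output.txt", verbose=True)

# New helper class producing SentencePiece-compatible tokenization
# ("sp.model" is a placeholder for a trained SentencePiece model).
sp_tokenizer = pyonmttok.SentencePieceTokenizer("sp.model")
```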
Fixes and improvements
- Fix deserialization into `Token` objects that was sometimes incorrect
- Fix Windows compilation
- Fix Google Test integration that was sometimes installed as part of `make install`
- [Python] Update pybind11 to 2.6.2
- [Python] Update ICU to 66.1
- [Python] Compile ICU with optimization flags
Tokenizer 1.23.0
Changes
- Drop Python 2 support
New features
- Publish Python wheels for macOS
Fixes and improvements
- Improve performance in all tokenization modes (up to 2x faster)
- Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
- Fix a regression introduced in 1.20 where `segment_alphabet_*` options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both the Hiragana and Katakana scripts and should not trigger a segmentation)
- Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using `preserve_segmented_tokens` and the word is segmented by both a `segment_*` option and BPE
- Fix incorrect tokenization when using `support_prior_joiners` and some joiners are within protected sequences
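For context on the last fix, a hedged sketch of the `support_prior_joiners` option: with it enabled, joiner marks (￭) already present in the input are treated as attachment points rather than as regular characters (the output shown is illustrative, not verified):

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    support_prior_joiners=True,
)

# "pre￭" carries a prior joiner, marking it as attached to the next token.
tokens, _ = tokenizer.tokenize("pre￭ tokenized text")
print(tokens)  # illustrative output: ['pre￭', 'tokenized', 'text']
```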
Tokenizer 1.22.2
Fixes and improvements
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
Tokenizer 1.22.1
Fixes and improvements
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with the Kangxi and Kanbun scripts (see the `segment_alphabet` option)
Tokenizer 1.22.0
Changes
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
New features
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
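A minimal sketch of reproducible subword regularization with the new `set_random_seed` function (the BPE model path is a placeholder, and `bpe_dropout` is used here as the stochastic segmentation being made deterministic):

```python
import pyonmttok

# Fixing the global seed makes stochastic segmentation such as BPE dropout
# produce the same output across runs.
pyonmttok.set_random_seed(42)

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="codes.bpe",  # placeholder BPE model
    bpe_dropout=0.1,
    joiner_annotate=True,
)
tokens, _ = tokenizer.tokenize("unbelievable")
```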
Fixes and improvements
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
Tokenizer 1.21.0
New features
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
Fixes and improvements
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiners`
- [Python] Implement the `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
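A short sketch of the Python-side changes, assuming `Token` can be constructed from a surface string:

```python
import pyonmttok

# Arguments other than mode are keyword-only, so options must be named.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# __hash__ is now consistent with __eq__: equal tokens collapse in a set.
a = pyonmttok.Token("Hello")
b = pyonmttok.Token("Hello")
assert a == b
assert len({a, b}) == 1
```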
Tokenizer 1.20.0
Changes
- The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
  - ICU is now required to improve performance and Unicode support
  - SentencePiece is now integrated as a Git submodule and linked statically to the project
  - Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
  - The project is compiled in `Release` mode by default
  - Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile the tests)
New features
- Accept any Unicode script aliases in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
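A hedged sketch of the improved `Token` class (the attribute name `join_right` is assumed from the library's joiner terminology):

```python
import pyonmttok

# Attributes can be set on the instance and are shown by __repr__.
token = pyonmttok.Token("Hello")
token.join_right = True
print(repr(token))

# Copy constructor: build an identical, independent Token.
copy = pyonmttok.Token(token)
assert copy == token
```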
Fixes and improvements
- [Python] Accept a `None` value for the `segment_alphabet` argument
Tokenizer 1.19.0
New features
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
Fixes and improvements
- Include "Half-width kana" in Katakana script detection
Tokenizer 1.18.5
Fixes and improvements
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
Tokenizer 1.18.4
Fixes and improvements
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
(This is the first release to be created on GitHub. See the release notes for previous tags in CHANGELOG.md.)