Commit
Add Tokenizer Helper class to ERNIE-Bot.SDK (#61)
* 📝 Add Tokenizer class for text tokenization

  Add a new Tokenizer class in the ERNIE-Bot.SDK namespace that provides methods for tokenizing text. The class includes a method, ApproxNumTokens, which calculates the approximate number of tokens in a given text. The method counts the number of Han characters and estimates the number of words based on whitespace. The result is the sum of the Han character count and 1.3 times the word count. The commit also adds a unit test for the ApproxNumTokens method.

* 🔧 Update Tokenizer to improve token counting accuracy

  - Improve accuracy of token counting in ApproxNumTokens method
  - Count Chinese characters using regular expression
  - Count English words excluding special characters
  - Adjust token count based on English word count
  - Update unit tests to reflect changes in token counting logic

* fix unicode
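
The heuristic described above (Han character count plus 1.3 times the word count) can be sketched roughly as follows. This is a Python illustration only, not the SDK's C# code: the function name `approx_num_tokens`, the two regular expressions, and the choice to round the word estimate up are all assumptions made for the example, since the commit message does not specify them.

```python
import math
import re

# Assumption: "Han characters" are approximated by the CJK Unified
# Ideographs block (U+4E00..U+9FFF); the SDK may use a wider range.
HAN_PATTERN = re.compile(r"[\u4e00-\u9fff]")

# Assumption: an "English word" is a run of ASCII letters, digits, or
# apostrophes, so punctuation and other special characters are skipped.
WORD_PATTERN = re.compile(r"[A-Za-z0-9']+")


def approx_num_tokens(text: str) -> int:
    """Estimate a token count as Han characters + 1.3 * English words."""
    han_count = len(HAN_PATTERN.findall(text))
    word_count = len(WORD_PATTERN.findall(text))
    # Assumption: the fractional word estimate is rounded up.
    return han_count + math.ceil(word_count * 1.3)


if __name__ == "__main__":
    # Two Han characters and one English word.
    print(approx_num_tokens("你好 world"))
```

Each Han character counts as one token, while each English word counts as 1.3 tokens on average, which reflects the tendency of subword tokenizers to split longer words into several pieces.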