Add Tokenizer Helper class to ERNIE-Bot.SDK #61

Merged: 3 commits, Nov 9, 2023

xbotter (Collaborator) commented Nov 9, 2023

Summary:

This pull request adds a static Tokenizer class to the ERNIE-Bot.SDK project. Its ApproxNumTokens method returns an approximate token count for a given text, taking into account whitespace, Chinese characters, and other letters. Because the class is static, the method can be called directly without instantiation.

Changes:

  • Added Tokenizer.cs to the src/ERNIE-Bot.SDK directory. This file contains the implementation of the Tokenizer class.
  • Added TokenizerTests.cs to the tests/ERNIE-Bot.SDK.Tests directory. This file contains unit tests for the Tokenizer class.
  • Implemented the ApproxNumTokens method in the Tokenizer class. It approximates the number of tokens in a text by counting whitespace, Chinese characters, and other letters.
  • Added a unit test for ApproxNumTokens in the TokenizerTests class. The test verifies that the method returns the expected result for a given input text.
  • Developers using ERNIE-Bot.SDK can now call ApproxNumTokens to estimate the number of tokens in a text.

Add a new Tokenizer class in the ERNIE-Bot.SDK namespace that provides methods for tokenizing text. The class includes a method, ApproxNumTokens, which calculates the approximate number of tokens in a given text. The method counts the Han characters and estimates the number of words from whitespace; the result is the Han character count plus 1.3 times the word count. A unit test for the ApproxNumTokens method is added alongside the class.
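
The formula described above can be sketched as follows. This is a minimal Python illustration of the arithmetic only, not the C# code in Tokenizer.cs. The name `approx_num_tokens`, the Han character range, and the choice to strip Han characters before splitting on whitespace are all assumptions made for the sketch; only the "Han count plus 1.3 times word count" rule comes from the description.

```python
import re

# Assumption: "Han characters" is taken to mean the CJK Unified Ideographs
# block (U+4E00 to U+9FFF). The merged C# code may use a different range.
HAN_CHAR = re.compile(r"[\u4e00-\u9fff]")

# Multiplier quoted in the PR description: estimated tokens per word.
TOKENS_PER_WORD = 1.3


def approx_num_tokens(text: str) -> float:
    """Estimate tokens as han_count + 1.3 * word_count (hypothetical helper)."""
    han_count = len(HAN_CHAR.findall(text))
    # Assumption: Han characters are replaced by spaces before splitting,
    # so they are not counted a second time as whitespace-delimited words.
    remainder = HAN_CHAR.sub(" ", text)
    word_count = len(remainder.split())
    return han_count + TOKENS_PER_WORD * word_count
```

Under these assumptions, a text with two Han characters and one English word yields 2 + 1.3 × 1 = 3.3. Whether the SDK rounds the result to an integer is not stated in the description, so the sketch leaves it as a float.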

github-actions bot added the sdk and test labels on Nov 9, 2023

- Improve accuracy of token counting in ApproxNumTokens method
- Count Chinese characters using regular expression
- Count English words excluding special characters
- Adjust token count based on English word count
- Update unit tests to reflect changes in token counting logic
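
The counting steps listed in this commit message might look roughly like the sketch below, again in Python rather than the SDK's C#. The two regular expressions, the function names, and the decision to treat a "word" as a run of ASCII letters are assumptions for illustration; the commit message only says that Chinese characters are counted with a regular expression and that English words are counted with special characters excluded. The 1.3 multiplier is the one quoted in the PR description.

```python
import re

# Assumption: Chinese characters are matched by the CJK Unified Ideographs range.
HAN_CHAR = re.compile(r"[\u4e00-\u9fff]")

# Assumption: an "English word" is a run of ASCII letters, so digits,
# punctuation, and other special characters never count as words.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def count_han_and_words(text: str) -> tuple[int, int]:
    """Return (Chinese character count, English word count) for text."""
    return len(HAN_CHAR.findall(text)), len(ENGLISH_WORD.findall(text))


def approx_num_tokens_regex(text: str) -> float:
    """Hypothetical estimate: Han count plus 1.3 tokens per English word."""
    han_count, word_count = count_han_and_words(text)
    return han_count + 1.3 * word_count


# "Hello" and "rocks" are words; "!", "," and "." are ignored;
# the two Han characters are counted individually.
print(count_han_and_words("Hello, 世界! It rocks."))
```

With this definition of a word, punctuation-only input contributes nothing to the estimate, which is one plausible reading of "excluding special characters".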

xbotter merged commit 0e7f5c0 into main on Nov 9, 2023
2 checks passed