Add Tokenizer Helper class to ERNIE-Bot.SDK (#61)
* 📝 Add Tokenizer class for text tokenization

Add a new Tokenizer class to ERNIE-Bot.SDK (namespace ERNIE_Bot.SDK) for estimating token counts. Its ApproxNumTokens method returns the approximate number of tokens in a given text: it counts the Han characters, estimates the number of words by splitting on whitespace, and returns the Han character count plus 1.3 times the word count. A unit test for ApproxNumTokens is included.

* 🔧 Update Tokenizer to improve token counting accuracy

- Improve accuracy of token counting in ApproxNumTokens method
- Count Chinese characters using regular expression
- Count English words excluding special characters
- Adjust token count based on English word count
- Update unit tests to reflect changes in token counting logic

* fix unicode
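
The estimate in its final form (the return expression in Tokenizer.cs) can be written out as a formula and checked by hand against the string used in the unit test. The symbols N_han and N_word are introduced here only for illustration; they do not appear in the code.

```latex
\[
  \text{tokens} \approx N_{\text{han}} + \left\lfloor 1.3 \cdot N_{\text{word}} \right\rfloor
\]
% Test string: 8 ideographs, then 5 space-separated Basic Latin words ("string." counts as one word)
\[
  8 + \lfloor 1.3 \cdot 5 \rfloor = 8 + \lfloor 6.5 \rfloor = 8 + 6 = 14
\]
```

Because the fractional part is dropped, a single English word contributes 1 token, not 1.3.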
xbotter authored Nov 9, 2023
1 parent f3ca395 commit 0e7f5c0
Showing 2 changed files with 39 additions and 0 deletions.
23 changes: 23 additions & 0 deletions src/ERNIE-Bot.SDK/Tokenizer.cs
@@ -0,0 +1,23 @@
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ERNIE_Bot.SDK
{
    /// <summary>
    /// This class provides methods for tokenizing text.
    /// </summary>
    public static class Tokenizer
    {
        public static int ApproxNumTokens(string text)
        {
            int chinese = Regex.Matches(text, @"\p{IsCJKUnifiedIdeographs}").Count; // one token per CJK ideograph
            int english = Regex.Replace(text, @"[^\p{IsBasicLatin}-]", " ") // blank out everything outside Basic Latin
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Count(w => !string.IsNullOrWhiteSpace(w) && w != "-" && w != "_"); // skip lone "-" and "_"

            return chinese + (int)Math.Floor(english * 1.3); // about 1.3 tokens per word, rounded down
        }
    }
}
16 changes: 16 additions & 0 deletions tests/ERNIE-Bot.SDK.Tests/TokenizerTests.cs
@@ -0,0 +1,16 @@
using ERNIE_Bot.SDK;

namespace ERNIE_Bot.SDK.Tests
{
    public class TokenizerTests
    {
        [Fact]
        public void TestApproxNumTokens()
        {
            string text = "这是一段测试文字This is a test string.";
            int expected = 14;
            int actual = Tokenizer.ApproxNumTokens(text);
            Assert.Equal(expected, actual);
        }
    }
}
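
The single test case exercises only the mixed Chinese/English path. The following console sketch is not part of the commit; it shows how the same heuristic behaves on a few edge inputs. `TokenizerSketch` and `Program` are hypothetical names, and the method body is a local copy of `ApproxNumTokens` so that the example compiles on its own.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical local copy of the heuristic from Tokenizer.cs, so the sketch is self-contained.
static class TokenizerSketch
{
    public static int ApproxNumTokens(string text)
    {
        // Each CJK unified ideograph counts as one token.
        int chinese = Regex.Matches(text, @"\p{IsCJKUnifiedIdeographs}").Count;
        // Everything outside Basic Latin becomes a space; the rest is split into words.
        int english = Regex.Replace(text, @"[^\p{IsBasicLatin}-]", " ")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(w => !string.IsNullOrWhiteSpace(w) && w != "-" && w != "_");
        return chinese + (int)Math.Floor(english * 1.3);
    }
}

static class Program
{
    static void Main()
    {
        // Mixed text from the unit test: 8 ideographs + floor(5 * 1.3)
        Console.WriteLine(TokenizerSketch.ApproxNumTokens("这是一段测试文字This is a test string.")); // 14
        // Empty input: nothing to count
        Console.WriteLine(TokenizerSketch.ApproxNumTokens(""));            // 0
        // Chinese only: one token per ideograph
        Console.WriteLine(TokenizerSketch.ApproxNumTokens("你好"));         // 2
        // English only: floor(2 * 1.3)
        Console.WriteLine(TokenizerSketch.ApproxNumTokens("hello world")); // 2
        // A lone "-" is filtered out, leaving two words
        Console.WriteLine(TokenizerSketch.ApproxNumTokens("a - b"));       // 2
    }
}
```

Note that words are split on the space character only, so punctuation attached to a word (as in "string.") does not add to the count.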
