LLM Tokenziers in C++ (1): Unicode Regexp

LLM Tokenizers in C++ (0): Port from Python

Regexp for Unicode

The first step of BPE tokenizing is to split the text into candidate tokens. The class GPT2Tokenizer uses Python’s regexp package to do so.

The following regexp defines a candidate token.

self.pat = re.compile(
  r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

With the triple-quote, the authors do not have to escape the quote mark. The prefix r before the triple-quote means the string is treated as a raw string, so all other eescape codes will be ignored.

The general structure of this regexp is a sequence of sub-regexp’s separated by the logical-or mark, |. The first few sub-regexp’s correspond to commonly-used subwords such as 've.

\p{L}, according to this StackOverflow answer, mean a unicode letter. Similarly, \p{N} is a number. \s means a space character, which could be a space, a tab, or a newline. \S mean non-space character.

So, ?\p{L}+ means a sequence of one or more letters prefixed with or without a whitespace. This represents a “word”. Similarly, ?\p{N}+ represents a number without signs or the dot.

[^\s\p{L}\p{N}] means a character which is not a space, a letter, or a number. This leaves us the puctunations. Therefore, ?[^\s\p{L}\p{N}]+ is a sequence of successive punctuators prefixed with or without a whitespace.

\s+ means one or more spaces.

\s+(?!\S) uses the look-ahead assertion (?!...). It matches one or more spaces followed by, but excluding, a non-space character. With the existence of this sub-regexp, the successive of N spaces followed by a word is recognized as two tokens -- \s+(?!\S) matches the first token consisting of N-1 spaces, and ?\p{L}+ matches the second token consisting the last space and the word. The purpose is to have the second word recognized as a token with a leading space, which indicates that the word is not the first word of a sentence. If we remove \s+(?!\S), the N spaces will be recognized as the first token, and the word as the second one, without the leading space.

Unfortunately, re2 does NOT support look-ahead.

(?!re) before text not matching re (NOT SUPPORTED)

This will slightly change the idea into that the successive two or more spaces separate sentences, but not words.

Unicode Regexp in C++

[std::regex](https://en.cppreference.com/w/cpp/regex) does not accept Unicode regexp \p{L} or \p{N}. If we do so, the progrma will abort with the error_escape and the message

the expression contains an invalid escaped character or a trailing escape

This seems a known problem, and the workaround is boost::wregex. Unfortunately, Boost is not part of the iOS SDK and I do not want to bring it in as a dependency of my Xcode project. Moreover, according to this answer,

The Boost.Regex documentation explicitly states that there's no support for Unicode-specific character classes when using boost::wregex. If you want this functionality, you'll need to build Boost.Regex with ICU support enabled then use the boost::u32regex type instead of boost::wregex.

No! I do not want to build Boost from source code because it is huge.

I then remembered several years ago, I read Russ Cox’s great notes about doing regular expression the right way and his work RE2.

RE2 is small. It supports Unicode regexps. More importantly, I can build it using CMake for both my macOS and iOS (Simulator)! The following commands builds RE2 for the host.

cmake -B b -S . -DCMAKE_INSTALL_PREFIX=b/install
cmake —build b —target install

The following commands builds RE2 for iOS Simulator. You can change iphonesimulator into iphones to make it build for iOS devices.

cmake -B build-ios -S . \
      -DCMAKE_SYSTEM_NAME=iOS \
      -DCMAKE_OSX_SYSROOT="$(xcodebuild -version -sdk iphonesimulator  Path)" \
      -DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 \
      -DCMAKE_IOS_INSTALL_COMBINED=YES \
      -DCMAKE_INSTALL_PREFIX=build-ios/install

cmake --build build-ios --target install

Side-by-Side

The following Python code is copy-n-pasted from the Transformers’ tokenzier repository.

import regex as re

pat = re.compile(
  r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

def f(text):
  for token in re.findall(pat, text):
    print(f"token=\"{token}\"")

f("we'd  see   you say 世界你好真实好的很啊")

The output is as the following.

token="we"
token="'d"
token="  "
token="see"
token="   "
token="you"
token=" say"
token=" 世界你好真实好的很啊"

The following is the corresponding C++ code.

#include <string>
#include <iostream>
#include <re2/re2.h>
#include <re2/stringpiece.h>

int main() {
  std::string w;
  std::string text = "we'd  see   you say 世界你好真实好的很啊";
  re2::StringPiece input(text);

  RE2 re("('s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+\\(?!\\S\\)|\\s+)");
  assert(re.ok());  // compiled; if not, see re.error();

  std::string var;
  int value;
  while (RE2::FindAndConsume(&input, re, &w)) {
    std::cout << "token=\"" << w << "\"" << std::endl;
  }
}

The following command builds it and links the RE2 static libarary.

clang++ -std=c++20 b.cc \
 -I ~/w/re2/b/install/include \
 -L ~/w/re2/b/install/lib -lre2 -o b

The output is identical to the above from the Python code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.md

1.md

LLM Tokenziers in C++ (1): Unicode Regexp

Regexp for Unicode

Unicode Regexp in C++

Side-by-Side

Files

1.md

Latest commit

History

1.md

File metadata and controls

LLM Tokenziers in C++ (1): Unicode Regexp

Regexp for Unicode

Unicode Regexp in C++

Side-by-Side