From a2960790227bd7fa5bbe4644af15c1210d3050b9 Mon Sep 17 00:00:00 2001
From: Gautier Dagan
Date: Thu, 14 Dec 2023 17:43:37 +0000
Subject: [PATCH] add readme to benchmark

---
 README.md            |  8 +++++---
 benchmarks/README.md | 23 +++++++++++++++++++++++
 2 files changed, 28 insertions(+), 3 deletions(-)
 create mode 100644 benchmarks/README.md

diff --git a/README.md b/README.md
index fcfc22c..ae2af97 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,13 @@
 
 ## Overview
 
-`bpeasy` is a Python package that provides a tokenizer utility, implementing in 300 lines of rust an efficient version of Byte Pair Encoding (BPE). The main implementation largely follows the huggingface tokenizers library, but makes opionated decisions to simplify the tokenizer training specifically to:
+`bpeasy` is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in roughly 300 lines of Rust. The implementation largely follows the Hugging Face `tokenizers` library, but makes opinionated decisions to simplify tokenizer training, specifically to:
 
 1. Treat text data at the byte-level first --- all text is converted to bytes before training rather than using a character-level approach (like in Huggingface).
-2. Always use a regex-based pre-tokenizer. This is a customisable regex that is applied to the text before training. This regex decides where to split the text and limits what kind of tokens are possible. This is technically possible in Huggingface but is not well documented. We also use the `fancy-regex` crate which supports a richer set of regex features than the `regex` crate used in Huggingface.
-3. Uses `int64` types for counting to allow for training on much larger datasets without the risk of overflow.
+2. Always use a regex-based split pre-tokenizer. This customisable regex is applied to the text before training; it decides where the text may be split and so limits which tokens are possible. This is technically possible in Hugging Face but not well documented. We also use the `fancy-regex` crate, which supports a richer set of regex features than the `regex` crate used in Hugging Face.
+3. Use `int64` types for counting, allowing training on much larger datasets without the risk of overflow.
+
+You can think of `bpeasy` as the `tiktoken` training code that was never released.
 
 ## Installation
 

diff --git a/benchmarks/README.md b/benchmarks/README.md
new file mode 100644
index 0000000..08bee32
--- /dev/null
+++ b/benchmarks/README.md
@@ -0,0 +1,23 @@
+# Benchmarks on the C4 dataset
+
+Using varying vocab sizes, from 5k to 100k.
+
+| Library/Operation               | Mean time (s) | Standard deviation (s) |
+|---------------------------------|---------------|------------------------|
+| HuggingFace Train               | 0.7370        | ±1.5506                |
+| BPEasy Train                    | 0.6528        | ±0.3870                |
+| HuggingFace Encode              | 0.6247        | ±0.0515                |
+| BPEasy Encode (uses `tiktoken`) | 0.2679        | ±0.0357                |
+
+|                          | Normalised Bytes/Token | Standard deviation |
+|--------------------------|------------------------|--------------------|
+| BPEasy Bytes/Token vs HF | 1.0009                 | ±5.5e-05           |
+
+## Reproducing the benchmarks
+
+```bash
+pip install tokenizers
+pip install bpeasy
+
+python benchmarks/benchmark.py
+```
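
For readers of the Overview hunk above, here is a minimal sketch of the byte-level, regex-split training flow it describes. The `bpeasy.train_bpe(iterator, regex, max_token_length, vocab_size)` entry point and its bytes-to-id return value are assumptions based on the package's public README rather than anything shown in this patch, and the GPT-style split pattern is purely illustrative:

```python
import bpeasy

# Any iterator of strings works, so large corpora can be streamed
# rather than loaded into memory all at once.
corpus = iter(
    [
        "BPE merges the most frequent byte pair at every step.",
        "The split regex decides which boundaries merges may not cross.",
    ]
)

# Illustrative GPT-style split pattern. It is evaluated by the
# `fancy-regex` crate on the Rust side, which supports \p{L} / \p{N}
# classes and the (?!\S) lookahead.
split_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Assumed signature: (iterator, regex, max_token_length, vocab_size).
vocab = bpeasy.train_bpe(corpus, split_regex, 128, 10_000)

# `vocab` is assumed to map token bytes to integer ids, tiktoken-style.
print(len(vocab))
```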
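
The "BPEasy Encode (uses `tiktoken`)" row in the benchmark table indicates that encoding is served through `tiktoken` rather than bpeasy itself. A sketch of that glue, assuming the bytes-to-id `vocab` and `split_regex` from the sketch above; `tiktoken.Encoding` and its keyword parameters are tiktoken's documented constructor, but how the benchmark script wires this up is an assumption:

```python
import tiktoken

def to_tiktoken_encoding(
    vocab: dict[bytes, int], pattern: str
) -> tiktoken.Encoding:
    """Wrap a bytes->id BPE vocab in a tiktoken Encoding for fast encode."""
    return tiktoken.Encoding(
        name="bpeasy",              # arbitrary label for this encoding
        pat_str=pattern,            # same split regex used at training time
        mergeable_ranks=vocab,      # token bytes -> rank/id
        special_tokens={},          # no special tokens in this sketch
    )

# Usage (hypothetical, continuing the training sketch):
# enc = to_tiktoken_encoding(vocab, split_regex)
# ids = enc.encode("hello world")
```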