Decode regression #1564

Open
daulet opened this issue Jul 10, 2024 · 13 comments

@daulet

daulet commented Jul 10, 2024

This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.

I maintain Go bindings for this library, and by sheer luck I had benchmarks in place when I started. At some point I noticed a regression in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused it to this PR. Below is the benchmark used to find it.

Regression details:

```
decode                  time:   [3.9277 µs 3.9409 µs 3.9558 µs]
                        change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
                        Performance has regressed.
```

While decode is pretty fast (on the order of microseconds), a +240% slowdown is fairly big, and I wonder if we can gain back that performance.

Benchmark code (`tokenizers/benches/decode_benchmark.rs`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

fn decode(tokenizer: &Tokenizer, ids: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    // black_box keeps the inputs opaque to the optimizer, so the decode
    // call is actually measured on every iteration.
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box(vec![2829, 4419, 14523, 2058, 1996, 13971, 3899]),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

Add this to `Cargo.toml` and run with `cargo bench decode`:

```toml
[[bench]]
name = "decode_benchmark"
harness = false
```
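
(If criterion isn't already a dev-dependency of the crate, you'll also need an entry like the following; the version here is illustrative, not prescriptive.)

```toml
[dev-dependencies]
criterion = "0.5"
```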

The tokenizer file is copied from here.

@ArthurZucker
Collaborator

Oh wow, thanks a million for the clear report!
Yeah, 240% is really not good! Okay, I'll investigate, but in the meantime, if you have a PR with a fix, that would be super good!

@ArthurZucker
Collaborator

I'll add this to the benches! 🆙

@ArthurZucker
Collaborator

I checked out the commit you mentioned and had to create a custom branch to test it: test-old-decode. I cherry-picked the test I added in fast-decode-fix, which has just the benchmark; I will post the results here!

@ArthurZucker
Collaborator

[image: benchmark results]
For now it seems like the issue was fixed in between.

@daulet
Author

daulet commented Jul 13, 2024

> For now it seems like the issue was fixed in between.

You mean you don't see a regression at head? I just synced to confirm the performance drop. Don't look at the change %, since that seems to report the change since the last run of the benchmark; just compare absolute values. For example, here is what I got on two consecutive runs at head:

```
decode                  time:   [3.8542 µs 3.8795 µs 3.9159 µs]
                        change: [+234.61% +236.83% +240.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
decode                  time:   [3.7643 µs 3.7762 µs 3.7891 µs]
                        change: [-3.8222% -2.8277% -2.1576%] (p = 0.00 < 0.05)
                        Performance has improved.
```
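
(Side note, assuming criterion's standard baseline flags: comparisons across revisions are more reliable with named baselines, e.g. `cargo bench -- --save-baseline before` on the old commit and `cargo bench -- --baseline before` on the new one, so the change % is computed against a fixed reference rather than whatever ran last.)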

@ArthurZucker
Collaborator

I'll check again! I was getting 630 ns for both branches, but let me make sure of that!

@daulet
Author

daulet commented Jul 13, 2024

Oh, I just checked the test-old-code branch, and it contains the commit (#938) that introduced the regression. You want to branch from one commit earlier than that :)
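
(For reference, one way to get there, assuming you have the merge commit of #938 locally: `git checkout <merge-commit>^` moves to its first parent, i.e. the tree just before the PR landed.)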

@ArthurZucker
Collaborator

Yeah, I think I cherry-picked it so I could run it before/after. I'll check again.

@daulet
Author

daulet commented Aug 9, 2024

Since encode perf was improved in the 0.20 release, any plans to look at this?

@ArthurZucker
Collaborator

Yep, for sure 😉 As you've seen, I was more focused on encode, but I will pick this back up.

@daulet
Author

daulet commented Nov 5, 2024

@ArthurZucker any updates? :)

sftse mentioned this issue Dec 30, 2024
@sftse
Contributor

sftse commented Dec 30, 2024

@ArthurZucker
You were not able to reproduce the regression because the tokenizer in the test-old-code branch is missing a decoder and is not the original one supplied in this issue. I can reproduce the slowdown, but short of changing the API, maybe to one based on iterators, there doesn't seem to be a simple fix. The decoder API changed to return a `Vec<String>` instead of a `String`, so some regression is expected.
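
To make the cost concrete, here is a minimal sketch of why that API shape matters; this is not the actual tokenizers code, and both function names are made up for illustration:

```rust
// Hypothetical old-style decode: append every piece into one String,
// so the output buffer is allocated once and grown amortized.
fn decode_into_string(tokens: &[&str]) -> String {
    let mut out = String::new();
    for t in tokens {
        out.push_str(t); // no per-token allocation
    }
    out
}

// Hypothetical new-style decode: one String allocation per token,
// plus a final concat that copies all the bytes a second time.
fn decode_into_vec(tokens: &[&str]) -> String {
    let pieces: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    pieces.concat()
}

fn main() {
    let tokens = ["hello", ",", " world"];
    assert_eq!(decode_into_string(&tokens), decode_into_vec(&tokens));
}
```

The per-token allocations are what an iterator-based API could avoid while still letting callers observe individual pieces.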

With the changes from #1707, the regression is a bit lower: +210%.

@ArthurZucker
Collaborator

Yep, sorry, I don't really have much bandwidth to go back to this right now! Super sorry for not answering sooner. Again, if anyone wants to have a look, that would help!
