Decode regression #1564

Open
daulet opened this issue Jul 10, 2024 · 13 comments

@daulet

daulet commented Jul 10, 2024

This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.

I maintain Go bindings for this library, and by sheer luck I had benchmarks in place when I started. At some point I noticed a regression in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused it to this PR. Below is the benchmark used to find it.

Regression details:

```
decode                  time:   [3.9277 µs 3.9409 µs 3.9558 µs]
                        change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
                        Performance has regressed.
```

While decode is pretty fast (on the order of microseconds), a +240% slowdown is fairly big, and I wonder if we can gain back that performance.

Benchmark code (`tokenizers/benches/decode_benchmark.rs`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

fn decode(tokenizer: &Tokenizer, ids: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    // black_box keeps the inputs opaque to the optimizer, so the decode
    // call is actually measured on every iteration.
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box(vec![2829, 4419, 14523, 2058, 1996, 13971, 3899]),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

Add this to `Cargo.toml` and run with `cargo bench decode`:

```toml
[[bench]]
name = "decode_benchmark"
harness = false
```
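
(If criterion isn't already a dev-dependency of the crate, you'll also need an entry like the following; the version here is illustrative, not prescriptive.)

```toml
[dev-dependencies]
criterion = "0.5"
```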

The tokenizer file is copied from here.

@ArthurZucker
Collaborator

Oh wow, thanks a million for the clear report!
Yeah, 240% is really not good! Okay, I'll investigate, but in the meantime, if you have a PR with a fix, that would be super good!

@ArthurZucker
Collaborator

I'll add this to the benches! 🆙

@ArthurZucker
Collaborator

I checked out the commit you mentioned and had to create a custom branch to test it: test-old-decode. I cherry-picked the test I added in fast-decode-fix, which has just the benchmark; I will post the results here!

@ArthurZucker
Collaborator

[image: benchmark results]
For now it seems like the issue was fixed in between.

@daulet
Author

daulet commented Jul 13, 2024

> For now it seems like the issue was fixed in between.

You mean you don't see a regression at head? I just synced to confirm the performance drop. Don't look at the change %, since that seems to report the change since the last run of the benchmark; just compare absolute values. For example, here is what I got on two consecutive runs at head:

```
decode                  time:   [3.8542 µs 3.8795 µs 3.9159 µs]
                        change: [+234.61% +236.83% +240.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
decode                  time:   [3.7643 µs 3.7762 µs 3.7891 µs]
                        change: [-3.8222% -2.8277% -2.1576%] (p = 0.00 < 0.05)
                        Performance has improved.
```
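
(Side note, assuming criterion's standard baseline flags: comparisons across revisions are more reliable with named baselines, e.g. `cargo bench -- --save-baseline before` on the old commit and `cargo bench -- --baseline before` on the new one, so the change % is computed against a fixed reference rather than whatever ran last.)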

@ArthurZucker
Collaborator

I'll check again! I was getting 630 ns for both branches, but let me make sure of that!

@daulet
Author

daulet commented Jul 13, 2024

Oh, I just checked the test-old-code branch, and it contains the commit (#938) that introduced the regression. You want to branch from one commit earlier than that :)
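
(For reference, one way to get there, assuming you have the merge commit of #938 locally: `git checkout <merge-commit>^` moves to its first parent, i.e. the tree just before the PR landed.)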

@ArthurZucker
Collaborator

Yeah, I think I cherry-picked it so I could run it before/after. I'll check again.

@daulet
Author

daulet commented Aug 9, 2024

Since encode perf was improved in the 0.20 release, any plans to look at this?

@ArthurZucker
Collaborator

Yep, for sure 😉 As you've seen, I was more focused on encode, but I will pick this back up.

@daulet
Author

daulet commented Nov 5, 2024

@ArthurZucker any updates? :)

sftse mentioned this issue Dec 30, 2024
@sftse
Contributor

sftse commented Dec 30, 2024

@ArthurZucker
You were not able to reproduce the regression because the tokenizer in the test-old-code branch is missing a decoder and is not the original one supplied in this issue. I can reproduce the slowdown, but short of changing the API, maybe to one based on iterators, there doesn't seem to be a simple fix. The decoder API changed to return a `Vec<String>` instead of a `String`, so some regression is expected.
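
To make the cost concrete, here is a minimal sketch of why that API shape matters; this is not the actual tokenizers code, and both function names are made up for illustration:

```rust
// Hypothetical old-style decode: append every piece into one String,
// so the output buffer is allocated once and grown amortized.
fn decode_into_string(tokens: &[&str]) -> String {
    let mut out = String::new();
    for t in tokens {
        out.push_str(t); // no per-token allocation
    }
    out
}

// Hypothetical new-style decode: one String allocation per token,
// plus a final concat that copies all the bytes a second time.
fn decode_into_vec(tokens: &[&str]) -> String {
    let pieces: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    pieces.concat()
}

fn main() {
    let tokens = ["hello", ",", " world"];
    assert_eq!(decode_into_string(&tokens), decode_into_vec(&tokens));
}
```

The per-token allocations are what an iterator-based API could avoid while still letting callers observe individual pieces.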

With the changes from #1707, the regression is a bit lower: +210%.

@ArthurZucker
Collaborator

Yep, sorry, I don't really have much bandwidth to go back to this right now! Super sorry for not answering sooner. Again, if anyone wants to have a look, that would help!
