[Fix metaspace prepending scheme] ⛓️‍💥⛓️‍💥 #1568

ArthurZucker · 2024-07-11T11:51:53Z

Fixes Metaspace again

HuggingFaceDocBuilderDev · 2024-07-11T12:54:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Narsil · 2024-07-22T06:56:03Z

tokenizers/src/pre_tokenizers/metaspace.rs

-                        normalized.prepend(&self.str_rep);
-                    }
+        let result = if self.prepend_scheme == PrependScheme::First {
+            let first = &mut pretokenized.splits[0];


This will panic if splits is empty, won't it?

Would changing pretokenized.split(|_, mut normalized| { to pretokenized.split(|i, mut normalized| { ... if i == 0 be enough of a change?

Also, for good measure: the reason it checks offsets.original == 0 instead of i == 0 is that users can send already-pretokenized strings, meaning split n can still be "the first token". At this stage we cannot tell whether the splits came from our own pre-tokenization or from the user; only the offsets retain that information.
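To make the difference between the two checks concrete, here is a minimal, self-contained sketch. It does not use the tokenizers crate: the Split struct, the original_start field, and both function names are hypothetical stand-ins, and it assumes that each user-supplied word carries offsets starting at 0. The index-based variant uses first_mut() so that an empty list of splits is a no-op rather than a panic.

```rust
/// Hypothetical stand-in for one pre-tokenized split (not the tokenizers crate type).
#[derive(Debug, Clone, PartialEq)]
struct Split {
    text: String,
    /// Start offset of this split relative to the original input it came from.
    original_start: usize,
}

/// The Metaspace replacement character.
const REPLACEMENT: &str = "▁";

/// Index-based check: only the split at position 0 gets the prefix.
/// `first_mut()` returns `None` on an empty slice, so this cannot panic,
/// unlike indexing with `splits[0]`.
fn prepend_by_index(splits: &mut [Split]) {
    if let Some(first) = splits.first_mut() {
        if !first.text.starts_with(REPLACEMENT) {
            first.text.insert_str(0, REPLACEMENT);
        }
    }
}

/// Offset-based check: every split whose original offset is 0 gets the prefix.
/// Under the assumption above, this also covers split n of user-pretokenized input.
fn prepend_by_offset(splits: &mut [Split]) {
    for split in splits.iter_mut() {
        if split.original_start == 0 && !split.text.starts_with(REPLACEMENT) {
            split.text.insert_str(0, REPLACEMENT);
        }
    }
}
```

With two user-supplied words that both start at offset 0, prepend_by_index only prefixes the first one, while prepend_by_offset prefixes both, which is the behavior the offset check is meant to preserve.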

Narsil · 2024-07-22T06:57:03Z

bindings/python/tests/bindings/test_tokenizer.py

@@ -493,6 +493,7 @@ def test_splitting(self):
        tokenizer.pre_tokenizer.split = False
        tokenizer.add_tokens([AddedToken("<REPR_END>", rstrip=True, lstrip=True)])
        assert tokenizer.encode("<REPR_END>inform<s>. Hey.       .", add_special_tokens=False).tokens == [
+            "▁",


This seems like a bug, no?

<REPR_END> is a token on its own, so we shouldn't prepend ▁ to it, should we?

ArthurZucker added 2 commits July 10, 2024 20:36

attempt do do something

7fe9066

rusty fix

0a545ec

ArthurZucker changed the title ~~attempt do do something~~ [Fix metaspace prepending scheme] Jul 11, 2024

ArthurZucker added 5 commits July 11, 2024 17:22

better fix

2f29cdd

update

7e83218

remove prints

1639ae2

splitting still an issue

6c713d2

nit

52a6f2e

ArthurZucker mentioned this pull request Jul 12, 2024

Llama-3 offset-mapping needs fixing #1553

Closed

ArthurZucker added 3 commits July 15, 2024 09:48

nit clippy

5d9ea9f

crate publick

f650f13

Merge branch 'main' into fix-pretokenizer

90210c9

ArthurZucker changed the title ~~[Fix metaspace prepending scheme]~~ [Fix metaspace prepending scheme] ⛓️‍💥⛓️‍💥 Jul 15, 2024

ArthurZucker added 4 commits July 15, 2024 12:38

add a new test

a5de5b6

fix tests

88abf90

clippy

0dc66cc

fix

1321f22

ArthurZucker marked this pull request as ready for review July 15, 2024 11:40

ArthurZucker requested a review from Narsil July 15, 2024 12:24

Narsil reviewed Jul 22, 2024

Butanium mentioned this pull request Aug 14, 2024

Space after unnormalized token is added when use_fast=True for Llama tokenizer #1613

Closed

github-actions bot added the Stale label Aug 22, 2024

github-actions bot closed this Aug 27, 2024
