-
Notifications
You must be signed in to change notification settings - Fork 836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix metaspace prepending scheme] βοΈβπ₯βοΈβπ₯ #1568
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
normalized.prepend(&self.str_rep); | ||
} | ||
let result = if self.prepend_scheme == PrependScheme::First { | ||
let first = &mut pretokenized.splits[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will panic if splits is empty, isn't
pretokenized.split(|_, mut normalized| {
->
pretokenized.split(|i, mut normalized| {
...
if i == 0
Enough for a change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for good measure, the reason it checks offsets.original == 0 instead of i==0
is because users can send pre pretokenized strings, meanings you can have split n still being "the first token". (Since it's transparent for us at this stage if it was pretiokenized strings from us, or from the users, only the offsets retain that information)
@@ -493,6 +493,7 @@ def test_splitting(self): | |||
tokenizer.pre_tokenizer.split = False | |||
tokenizer.add_tokens([AddedToken("<REPR_END>", rstrip=True, lstrip=True)]) | |||
assert tokenizer.encode("<REPR_END>inform<s>. Hey. .", add_special_tokens=False).tokens == [ | |||
"β", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a bug no ?
<REPR_END> is a token on it's own, we shouldn't prepend with _
no ?
Fixes Metaspace again