In French, dashes appear in some constructions, and recasepunc loses them. Here is a reproduction of the bug:
$ cat input.txt
salut toto comment vas-tu
y a-t-il quelqu'un ici
$ python recasepunc.py predict fr.22000 < input.txt
WARNING: reverting to cpu as cuda is not available
Some weights of the model checkpoint at flaubert/flaubert_base_uncased were not used when initializing FlaubertModel: ['pred_layer.proj.weight', 'pred_layer.proj.bias']
- This IS expected if you are initializing FlaubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FlaubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Salut Toto, Comment vastu ?
Y atil quelqu' un ici ?
The dashes in vas-tu and a-t-il are removed.
A space is added after the apostrophe in quelqu' un.
Can you confirm this?
Regards,
Étienne
This is a byproduct of how tokenization is performed in the Flaubert model. I am afraid the only way to handle it is to retrain a model with a different tokenizer (such as CamemBERT).
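A rough sketch of the failure mode (this is a simplified illustration, not the actual Flaubert tokenizer, whose pipeline is more involved): if the tokenizer's normalization drops hyphens and detaches apostrophes, then the token stream no longer carries enough information for recasepunc to reconstruct the original spelling when it rejoins the pieces. The `normalize` and `round_trip` helpers below are hypothetical names used for the demonstration.

```python
import re

def normalize(text):
    # Hypothetical normalization step: hyphens are discarded and a space
    # is introduced after apostrophes, roughly matching the observed output.
    return re.sub(r"'", "' ", text.replace("-", ""))

def round_trip(text):
    # recasepunc rejoins tokens produced from the normalized text, so the
    # original hyphenation cannot be recovered at this stage.
    return " ".join(normalize(text).split())

print(round_trip("vas-tu"))         # vastu
print(round_trip("quelqu'un ici"))  # quelqu' un ici
```

Since the loss happens inside the tokenizer, there is nothing recasepunc can patch after the fact; hence the suggestion to retrain on top of a tokenizer that preserves these characters, such as CamemBERT's.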