Suppress warnings from LUKE for unexpected keys #24703

Conversation

ryokan0123
Contributor

ryokan0123 commented Jul 7, 2023

What does this PR do?

Suppress the warnings when instantiating the LUKE models by adding _keys_to_ignore_on_load_unexpected.

Problem

Currently, when you instantiate certain LUKE models from the Hugging Face Hub, such as

from transformers import AutoModel
model = AutoModel.from_pretrained("studio-ousia/mluke-base-lite")

you receive a warning indicating that a bunch of weights were not loaded.

Some weights of the model checkpoint at studio-ousia/mluke-base-lite were not used when initializing LukeModel: [
'luke.encoder.layer.0.attention.self.w2e_query.weight', 'luke.encoder.layer.0.attention.self.w2e_query.bias', 
'luke.encoder.layer.0.attention.self.e2w_query.weight', 'luke.encoder.layer.0.attention.self.e2w_query.bias', 
'luke.encoder.layer.0.attention.self.e2e_query.weight', 'luke.encoder.layer.0.attention.self.e2e_query.bias', 
...]

This seems to depend on the logging settings and is observed in Google Colab notebooks.
https://colab.research.google.com/drive/1kYN3eGhx5tzEMnGkUz2jPsdmFyEBwxFA?usp=sharing

This behavior is expected since these weights are optional and only loaded when use_entity_aware_attention is set to True. However, it has caused confusion among users, as evidenced by the following issues:
studio-ousia/luke#174
https://huggingface.co/studio-ousia/mluke-base/discussions/2#63be8cc6c26a8a4d713ee08a

Solution

I added _keys_to_ignore_on_load_unexpected to LukePreTrainedModel to ignore some unexpected keys in the pretrained weights.
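For reference, the change boils down to a single class attribute. A minimal sketch (the exact regexes used in this PR may differ):

class LukePreTrainedModel(PreTrainedModel):
    ...
    # Regexes matched against checkpoint keys the instantiated model does not
    # expect; matching keys are dropped without the "weights not used" warning.
    _keys_to_ignore_on_load_unexpected = [
        r"encoder\.layer\.\d+\.attention\.self\.(w2e|e2w|e2e)_query",
    ]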

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

I believe this should not be done this way. These keys should be used only if the default behavior in the modeling code will have different keys than the canonical (original) checkpoints on the Hub.

But before further discussion, let's check one thing first:

the config in studio-ousia/mluke-base-lite has

"use_entity_aware_attention": true,

Are you sure this is the checkpoint that causes confusion ..?

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

My wording above is not precise. I will update that comment.

These keys should be used only if:

  • a model loaded from a checkpoint saved with from_pretrained (without changing the config during loading) would have some unexpected weight keys.
  • an HF checkpoint is created that has some extra keys (in order to respect the original non-HF checkpoint) which are not really used in the model (and the HF modeling code is written to avoid having such unused keys)

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

I have run

from transformers import AutoModel
model = AutoModel.from_pretrained("studio-ousia/mluke-base-lite")

but didn't receive any warning.

@ryokan0123
Contributor Author

ryokan0123 commented Jul 7, 2023

Thanks @ydshieh for taking a look at the PR!

Are you sure this is the checkpoint that causes confusion ..?

When I look at the latest version of the config of the following model, I find "use_entity_aware_attention": false.
https://huggingface.co/studio-ousia/mluke-base-lite/blob/3775c9b1470636e206c38cbb1b964ba883421164/config.json#L33

but didn't receive any warning.

The following Google Colab notebook shows the warning.
https://colab.research.google.com/drive/1kYN3eGhx5tzEMnGkUz2jPsdmFyEBwxFA?usp=sharing
Probably it depends on some logging settings given by the environment, but it does show the warnings in some cases.

@ryokan0123
Contributor Author

ryokan0123 commented Jul 7, 2023

These keys should be used only if:

  • a model loaded from a checkpoint saved with from_pretrained (without changing the config during loading) would have some unexpected weight keys.
  • an HF checkpoint is created that has some extra keys (in order to respect the original non-HF checkpoint) which are not really used in the model (and the HF modeling code is written to avoid having such unused keys)

I believe that this PR is similar to the second point mentioned above.

The HF checkpoint is derived from the original checkpoint generated by the original repository. The checkpoint contains additional keys (luke.encoder.layer.*.attention.self.*_query.*), which are only utilized when the entity-aware attention mechanism is enabled during fine-tuning.
Entity-aware attention is an optional feature and is disabled by default, because that is the setting used in the original paper.

I would like to address the problem of the confusing and overwhelming warnings that appear even under this default behavior.
I would appreciate your further elaboration on why this cannot be addressed using _keys_to_ignore_on_load_unexpected, or any alternative solutions you might have in mind.

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

OK I see. We have to use LukeForMaskedLM or AutoModelForMaskedLM to see the warning.
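For example, something like:

from transformers import AutoModelForMaskedLM

# Loading with the masked-LM head reproduces the "weights not used" warning.
model = AutoModelForMaskedLM.from_pretrained("studio-ousia/mluke-base-lite")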

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

We can't change these kinds of keys just because a Hub model repo author uploaded problematic weights/config files.
You can ask the author to correct (clean up) the model weights and re-upload.

If we make the change the way it is done in this PR, we won't get any warning when a real problem occurs, and such bugs won't be detected.

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

The HF checkpoint is derived from the original checkpoint generated by the original repository. The checkpoint contains additional keys (luke.encoder.layer.*.attention.self.*_query.*), which are only utilized when the entity-aware attention mechanism is enabled during fine-tuning.

I didn't check the original repo (I am not the one who added that model to transformers). But a Hub repo like luke-base has

"use_entity_aware_attention": true,

Also, the default value in LukeConfig.__init__ is True.

@ryokan0123
Contributor Author

ryokan0123 commented Jul 7, 2023

Let me share more context on this problem.

The weights uploaded to the HF repo are supposed to work whether use_entity_aware_attention is True or False, and the config files just specify the default value.
The warnings are currently raised as expected, but I want to suppress them because this situation is the intended behavior.

I am from the same group as the author of LukeModel, and the HF weights were uploaded by me, so I am sure this follows the intention of the original model.

In summary, when some weights should be ignored as the correct behavior, what is the right way to handle that?

@ryokan0123
Contributor Author

If we make the change the way it is done in this PR, we won't get any warning when a real problem occurs, and such bugs won't be detected.

I understand that this is a risk, but couldn't that be mitigated by specifying the correct regex?

@ydshieh
Collaborator

ydshieh commented Jul 7, 2023

The problem here is that the config and the model weights on the Hub have inconsistent values. If the model were created with that value set to false, it would not have those extra keys.

It is unclear how the Hub author ended up with such an inconsistency. The fix should happen there.

Hope this explanation makes things clear.

But thank you for your willingness to fix this and make transformers better ❤️

@ryokan0123
Contributor Author

I believe there is still some misunderstanding.

The problem here is that the config and the model weights on the Hub have inconsistent values.

The inconsistency is intentional, as having optional extra weights is part of the model's features.
Users can choose whether or not to use the extra weights.

If the model were created with that value set to false, it would not have those extra keys.

Those extra keys (weights) are optional.
Even though the model has use_entity_aware_attention=False by default, we'd like to give users an option to enable use_entity_aware_attention=True to check the effect.

@ryokan0123
Contributor Author

To be clearer, the extra weights are in this part.

if self.use_entity_aware_attention:
    self.w2e_query = nn.Linear(config.hidden_size, self.all_head_size)
    self.e2w_query = nn.Linear(config.hidden_size, self.all_head_size)
    self.e2e_query = nn.Linear(config.hidden_size, self.all_head_size)

These weights are NOT used at pretraining time, but can optionally be introduced at fine-tuning time.
For users to be able to choose freely between the options, the checkpoint should include the extra weights, but that causes unnecessary warnings when use_entity_aware_attention = False...

@ryokan0123
Contributor Author

I apologize for any confusion caused by my previous explanation, but I would like to request @NielsRogge's opinion on how to handle these warnings. He helped introduce LUKE in transformers.

@ydshieh
Collaborator

ydshieh commented Jul 8, 2023

These weights are NOT used at pretraining time,

So those weights are not even trained during pretraining time ..? I am a bit confused here. Or are they trained for LUKE but not mLUKE?

These weights are NOT used at pretraining time, but can optionally be introduced at fine-tuning time.
For users to be able to choose freely between the options, the checkpoint should include the extra weights

In this case, the original model weights (the checkpoint in the Hub repo studio-ousia/mluke-base-lite) should not include those extra weights (the opposite of the current situation), and the config should have use_entity_aware_attention=False (as it currently does).

  • When a user wants to fine-tune with the use_entity_aware_attention option, they can load the checkpoint with it set to True at runtime: the model will then have these extra weights at runtime (but with a different warning saying some weights are randomly initialized).

I am wondering what prevents you from removing those extra weights from studio-ousia/mluke-base-lite if they are never used.

@ryokan0123
Contributor Author

Thank you for your patience.
I know the model is doing something unusual...

What is entity-aware attention?

LUKE and mLUKE take word tokens as well as entity tokens.
At pretraining time, both types of tokens go through the same self-attention computation (token-to-token attention).

At fine-tuning time, we can optionally add entity-aware attention.
This mechanism uses different attention weights for word-to-word, word-to-entity, entity-to-word, and entity-to-entity tokens.
The weights for these different types of attention are initialized by copying the token-to-token attention obtained during pretraining.
This is done by the following lines of the conversion script.

# Initialize the query layers of the entity-aware self-attention mechanism
for layer_index in range(config.num_hidden_layers):
    for matrix_name in ["query.weight", "query.bias"]:
        prefix = f"encoder.layer.{layer_index}.attention.self."
        state_dict[prefix + "w2e_" + matrix_name] = state_dict[prefix + matrix_name]
        state_dict[prefix + "e2w_" + matrix_name] = state_dict[prefix + matrix_name]
        state_dict[prefix + "e2e_" + matrix_name] = state_dict[prefix + matrix_name]

So, the checkpoints include these copied weights regardless of whether users enable entity-aware attention at fine-tuning time.
Also this is the reason why we do not want to initialize the new weights randomly.

So those weights are not even trained during pretraining time ..? I am a bit confused here. Or are they trained for LUKE but not mLUKE?

Both LUKE and mLUKE are pretrained without entity-aware attention, but they can still use entity-aware attention by initializing new weights with the corresponding pretrained ones.

Why is the default value of use_entity_aware_attention different in LUKE and mLUKE?

We set the default value to be consistent with the original papers that proposed each model.
LUKE uses entity-aware attention because it performs better in monolingual settings, but mLUKE does not as it did not give consistent gains in cross-lingual tasks.

I am wondering what prevents you from removing those extra weights from studio-ousia/mluke-base-lite if they are never used.

Although we set the default value of use_entity_aware_attention to False in studio-ousia/mluke-base-lite, we still want to allow users to try whether entity-aware attention is useful in their own settings.

However as reported in the PR description, some users find the warning confusing...
So we would like to remove this confusion.

Perhaps there are alternative approaches to achieve this goal other than setting _keys_to_ignore_on_load_unexpected, such as:

  • redefining the initialization behavior of LukeModel so that it copies the token-to-token attention weights when the entity-aware attention weights are missing from the checkpoint but use_entity_aware_attention=True. Then we could remove the copied weights from the checkpoints.
  • adding more detailed warning messages on what the ignored weights mean.

I would greatly appreciate any advice!

@ydshieh
Collaborator

ydshieh commented Jul 10, 2023

Hi @ryokan0123. Thank you for the detailed information. Looking at the following 3 points you mentioned:

To make sure: those extra weights in studio-ousia/mluke-base-lite are neither pretrained (yes, as you mentioned) nor fine-tuned. If this is the case:

  1. Both LUKE and mLUKE are pretrained without entity-aware attention
  2. by initializing new weights with the corresponding pretrained ones.
  3. (Although we set the default value of use_entity_aware_attention to False ...) we still want to allow users to try whether entity-aware attention is useful in their own settings.

what you described in point 3 could easily be achieved by a user just specifying config.use_entity_aware_attention at runtime - this doesn't require the weights to be in the checkpoint. It will just show a warning

Some weights of ... were not initialized from the model checkpoint at {pretrained_model_name_or_path} and are newly initialized ...

And this (different) warning makes sense and should be kept.

Let me know if you have any further questions about the above suggested way to (optionally) enable the non-trained entity-aware attention weights.
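Concretely, something like this (a rough sketch; from_pretrained forwards unrecognized keyword arguments to the config):

from transformers import AutoModel

# Enable entity-aware attention at load time: the three *_query projections are
# then created in the model and, if absent from the checkpoint, randomly
# initialized (triggering the "newly initialized" warning quoted above).
model = AutoModel.from_pretrained(
    "studio-ousia/mluke-base-lite", use_entity_aware_attention=True
)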

@ryokan0123
Contributor Author

ryokan0123 commented Jul 10, 2023

Yes, I know that is possible.
However, the important point is that those new weights must be initialized by copying the weights obtained during pretraining.
This is exactly what we want to do here.

If the new weights were randomly initialized, model performance would degrade, as the model would have to learn how to attend to other tokens from scratch during fine-tuning.
We cannot randomly initialize the new weights, and that's why we copy the weights here.

# Initialize the query layers of the entity-aware self-attention mechanism
for layer_index in range(config.num_hidden_layers):
    for matrix_name in ["query.weight", "query.bias"]:
        prefix = f"encoder.layer.{layer_index}.attention.self."
        state_dict[prefix + "w2e_" + matrix_name] = state_dict[prefix + matrix_name]
        state_dict[prefix + "e2w_" + matrix_name] = state_dict[prefix + matrix_name]
        state_dict[prefix + "e2e_" + matrix_name] = state_dict[prefix + matrix_name]

So, to achieve this and suppress warnings, I think there are some options🤔

  • leave the copied weights in the checkpoint and set _keys_to_ignore_on_load_unexpected (this PR, an easy path)
  • remove the copied weights from the checkpoint and override init_weights or post_init in LukeModel to include the copying operation (which needs a bit of work)
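A rough sketch of the copying step for the second option (the hook name below is hypothetical, and it would have to run after from_pretrained has populated the pretrained query projections):

def _copy_entity_aware_query_weights(self):
    # Hypothetical post-loading hook on LukeModel: reuse the pretrained
    # token-to-token query projection for the entity-aware query projections.
    for layer in self.encoder.layer:
        attention = layer.attention.self
        for module in (attention.w2e_query, attention.e2w_query, attention.e2e_query):
            module.weight.data.copy_(attention.query.weight.data)
            module.bias.data.copy_(attention.query.bias.data)

This would keep the checkpoint clean while preserving the copy-based initialization, at the cost of some LUKE-specific loading logic.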

@ydshieh
Collaborator

ydshieh commented Jul 10, 2023

OK, thank you for the detailed information. I finally understand why you need those weights in the checkpoint, as they are copied from trained weights.

I will have to think a bit more, but I feel the best option is to add an extra log message to explain the situation.

I will come back to you.

@ydshieh
Collaborator

ydshieh commented Jul 11, 2023

@sgugger @amyeroberts

Could you take a look at the following and see if you have any comments? I tried to make it short, but still needed to explain things 🙏

Summary:

  • In studio-ousia/mluke-base-lite (LukeModel - a checkpoint from the original author):
    • the checkpoint contains some keys w2e_query etc. (for entity-aware attention)
    • the config has use_entity_aware_attention=False:
      • from_pretrained gives an "unexpected keys" warning during loading.
  • entity-aware attention is never used during pre-training
    • the checkpoint contains those w2e_query weights by copying weight values from other pre-trained weights
    • (so they still make some sense and might be helpful for fine-tuning)
  • The model author wants to avoid the confusing warning (about unexpected keys).

Two suggested actions:

  • (easy) add _keys_to_ignore_on_load_unexpected as done in this PR
  • (more work)
    • remove those w2e_query weights from the checkpoint studio-ousia/mluke-base-lite
    • override from_pretrained to copy some weight values to the target weights (at the end of from_pretrained) - when config.use_entity_aware_attention=True + w2e_query keys are found
    • we will get a missing-keys warning during loading, but we add an explanation mentioning that the weights are copied

The second approach may not be worth the effort (too much work). The first one isn't really good either, as _keys_to_ignore_on_load_unexpected is not designed for such a situation (IMO).

@sgugger
Collaborator

sgugger commented Jul 11, 2023

Note that on main, the code sample provided at the beginning does not issue any warnings (just infos) since the class used (LukeModel) is not the same as the class of the checkpoint (LukeForMaskedLM). It's only when loading a LukeForMaskedLM model that the warning appears.

As for how to deal with this, the checkpoint mentioned does not use those extra weights (as seen here in the config) so it should probably not have them in the state dict. You can use the variant parameter in from_pretrained to offer two different files for the weights if you wanted to make one version with the extra weights, for users who would like to continue fine-tuning with those extra weights. That weight file should be named pytorch_model.<variant_name>.bin.
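For illustration (the variant name below is made up), the two-file setup could look roughly like this from the user side:

from transformers import AutoModel

# Default file pytorch_model.bin: no copied *_query weights, no warning.
model = AutoModel.from_pretrained("studio-ousia/mluke-base-lite")

# Opt-in file pytorch_model.entity_aware.bin: keeps the copied weights for
# users who want to fine-tune with entity-aware attention enabled.
model = AutoModel.from_pretrained(
    "studio-ousia/mluke-base-lite",
    variant="entity_aware",
    use_entity_aware_attention=True,
)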

@ryokan0123
Contributor Author

I see, it seems the sample code only issues warnings on Colab notebooks.
Apologies for the confusion.

Thank you, @sgugger, for the suggested solution. Using the variant parameter seems like a better solution.
I also appreciate @ydshieh taking the time to handle this PR!
I will consider the suggested solution and close this PR.

ryokan0123 closed this Jul 13, 2023