Loss goes to 0 when using LinformerLM #25

Closed

terencenwz opened this issue Apr 9, 2021 · 2 comments

Comments
terencenwz commented Apr 9, 2021

Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after 1 epoch. Or am I using it wrongly? Thank you.

These are my settings, followed by a rough sketch of my training step:

from linformer_pytorch import LinformerLM

model = LinformerLM(
        num_tokens=ntoken, # Number of tokens in the LM
        input_size=args.seq_len, # Dimension 1 of the input
        channels=args.embsize, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=16, # The second dimension of the P_bar matrix from the paper
        dim_ff=args.nhid, # Dimension in the feed forward network
        dropout_ff=args.dropout, # Dropout for feed forward network
        nhead=8, # Number of attention heads
        depth=12, # How many times to run the model
        dropout=args.dropout, # How much dropout to apply to P_bar after softmax
        activation="relu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="none", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        emb_dim=None, # If you want the embedding dimension to be different than the channels for the Linformer
        causal=True, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
        method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
        ff_intermediate=None, # See the section below for more information
        )
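
And roughly the training step, in case the problem is on my side. This is a simplified sketch, not my exact script; the optimizer and data loading are placeholders, and it assumes model(inputs) returns per-token logits of shape (batch_size, seq_len, num_tokens):

import torch
import torch.nn.functional as F

# Placeholder optimizer; the real run uses my own settings and schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(batch):
    # batch: LongTensor of token ids, shape (batch_size, seq_len + 1)
    inputs, targets = batch[:, :-1], batch[:, 1:]   # shift by one for next-token prediction
    logits = model(inputs)                          # assumed shape: (batch_size, seq_len, ntoken)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()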
tatp22 (Owner) commented Apr 9, 2021

Hi @terencenwz!

I see that you have opened an issue about a similar problem over at lucidrains/linear-attention-transformer#6, so this might be related to the way you are running your tests; I am not sure. The loss should not go down to 0 right away, though.

On another note, there are some caveats you should know about when using the Linformer for causal language modelling. Check out #15 and #16 for more information.
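
To give a rough picture of what those issues describe: the E and F matrices project the keys and values across the sequence (length) dimension, so every compressed key/value is a mixture of all positions, including future ones, and a standard causal mask no longer has anything to attach to. Here is a toy sketch of that leakage mechanism with random projections (just an illustration, not the exact code in this repo):

import torch

torch.manual_seed(0)
n, d, k = 8, 4, 3                      # sequence length, head dim, projection dim

q = torch.randn(n, d)
keys = torch.randn(n, d)
values = torch.randn(n, d)
E_proj = torch.randn(k, n)             # projects keys across the sequence dimension
F_proj = torch.randn(k, n)             # projects values across the sequence dimension

def linformer_attention(keys, values):
    # Attention over k compressed positions; each compressed key/value
    # already mixes information from all n original positions.
    p_bar = torch.softmax(q @ (E_proj @ keys).t() / d ** 0.5, dim=-1)   # (n, k)
    return p_bar @ (F_proj @ values)                                     # (n, d)

out = linformer_attention(keys, values)

# Perturb only the LAST token. In a truly causal model the output at
# position 0 would be unchanged; here it is not, i.e. future info leaks.
keys2, values2 = keys.clone(), values.clone()
keys2[-1] += 1.0
values2[-1] += 1.0
out2 = linformer_attention(keys2, values2)

print((out[0] - out2[0]).abs().max())  # strictly > 0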

terencenwz (Author)

Hi, thanks for the reply. I believe the linear-attention-transformer issue is a slightly different problem, as there the loss goes to infinity instead of 0. I have run quite a number of different transformer variants, including the original model, with this setup and got comparable losses.

I think the problem might be explained by #16 (comment), where there is some leakage of future information. The loss did not go down to 0 right away; it took slightly more than 1 epoch (around 30k update steps).

tatp22 closed this as completed Apr 10, 2021