Loss goes to 0 when using LinformerLM #25

Closed

terencenwz opened this issue Apr 9, 2021 · 2 comments

Comments
terencenwz commented Apr 9, 2021

Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after 1 epoch. Or am I using it wrongly? Thank you.

These are my settings, followed by a rough sketch of my training step:

from linformer_pytorch import LinformerLM

model = LinformerLM(
        num_tokens=ntoken, # Number of tokens in the LM
        input_size=args.seq_len, # Dimension 1 of the input
        channels=args.embsize, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=16, # The second dimension of the P_bar matrix from the paper
        dim_ff=args.nhid, # Dimension in the feed forward network
        dropout_ff=args.dropout, # Dropout for feed forward network
        nhead=8, # Number of attention heads
        depth=12, # How many times to run the model
        dropout=args.dropout, # How much dropout to apply to P_bar after softmax
        activation="relu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="none", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        emb_dim=None, # If you want the embedding dimension to be different than the channels for the Linformer
        causal=True, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
        method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
        ff_intermediate=None, # See the section below for more information
        )
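
And roughly the training step, in case the problem is on my side. This is a simplified sketch, not my exact script; the optimizer and data loading are placeholders, and it assumes model(inputs) returns per-token logits of shape (batch_size, seq_len, num_tokens):

import torch
import torch.nn.functional as F

# Placeholder optimizer; the real run uses my own settings and schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(batch):
    # batch: LongTensor of token ids, shape (batch_size, seq_len + 1)
    inputs, targets = batch[:, :-1], batch[:, 1:]   # shift by one for next-token prediction
    logits = model(inputs)                          # assumed shape: (batch_size, seq_len, ntoken)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()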
tatp22 (Owner) commented Apr 9, 2021

Hi @terencenwz!

I see that you have opened an issue about a similar problem over at lucidrains/linear-attention-transformer#6, so this might be related to the way you are running your tests; I am not sure. The loss should not go down to 0 right away, though.

On another note, there are some caveats you should know about when using the Linformer for causal language modelling. Check out #15 and #16 for more information.
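
To give a rough picture of what those issues describe: the E and F matrices project the keys and values across the sequence (length) dimension, so every compressed key/value is a mixture of all positions, including future ones, and a standard causal mask no longer has anything to attach to. Here is a toy sketch of that leakage mechanism with random projections (just an illustration, not the exact code in this repo):

import torch

torch.manual_seed(0)
n, d, k = 8, 4, 3                      # sequence length, head dim, projection dim

q = torch.randn(n, d)
keys = torch.randn(n, d)
values = torch.randn(n, d)
E_proj = torch.randn(k, n)             # projects keys across the sequence dimension
F_proj = torch.randn(k, n)             # projects values across the sequence dimension

def linformer_attention(keys, values):
    # Attention over k compressed positions; each compressed key/value
    # already mixes information from all n original positions.
    p_bar = torch.softmax(q @ (E_proj @ keys).t() / d ** 0.5, dim=-1)   # (n, k)
    return p_bar @ (F_proj @ values)                                     # (n, d)

out = linformer_attention(keys, values)

# Perturb only the LAST token. In a truly causal model the output at
# position 0 would be unchanged; here it is not, i.e. future info leaks.
keys2, values2 = keys.clone(), values.clone()
keys2[-1] += 1.0
values2[-1] += 1.0
out2 = linformer_attention(keys2, values2)

print((out[0] - out2[0]).abs().max())  # strictly > 0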

terencenwz (Author)

Hi, thanks for the reply. I believe the linear-attention-transformer issue is a slightly different problem, as there the loss goes to infinity instead of 0. I have run quite a number of different transformer variants, including the original model, with this setup and got comparable losses.

I think the problem might be explained by #16 (comment), where there is some leakage of future information. The loss did not go down to 0 right away; it took slightly more than 1 epoch (around 30k update steps).

tatp22 closed this as completed Apr 10, 2021