Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after one epoch. Or am I using it wrongly? Thank you.
These are my settings
model = LinformerLM(
    num_tokens=ntoken,          # Number of tokens in the LM
    input_size=args.seq_len,    # Dimension 1 of the input
    channels=args.embsize,      # Dimension 2 of the input
    dim_d=None,                 # Overrides the inner dim of the attention heads. If None, defaults to channels // nhead, as in the "Attention Is All You Need" paper
    dim_k=16,                   # The second dimension of the P_bar matrix from the paper
    dim_ff=args.nhid,           # Dimension of the feed-forward network
    dropout_ff=args.dropout,    # Dropout for the feed-forward network
    nhead=8,                    # Number of attention heads
    depth=12,                   # How many times to run the model
    dropout=args.dropout,       # How much dropout to apply to P_bar after softmax
    activation="relu",          # Which activation to use. Currently only gelu and relu are supported, and only on the ff network
    checkpoint_level="C0",      # Which checkpoint level to use. See the README for more information
    parameter_sharing="none",   # What level of parameter sharing to use. See the README for more information
    k_reduce_by_layer=0,        # Going down `depth`, how much to reduce `dim_k` by for the `E` and `F` matrices. Has a minimum value of 1
    full_attention=False,       # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
    include_ff=True,            # Whether or not to include the feed-forward layer
    w_o_intermediate_dim=None,  # If not None, use 2 w_o matrices, so that instead of `dim*nhead, channels` you have `dim*nhead, w_o_int` and `w_o_int, channels`
    emb_dim=None,               # If you want the embedding dimension to differ from the channels of the Linformer
    causal=True,                # Make this a causal Linformer, where the upper right of the P_bar matrix is masked out
    method="learnable",         # How to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
    ff_intermediate=None,       # See the README for more information
)
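In case the problem is on my side, here is roughly how I train it. This is only a minimal sketch, assuming LinformerLM takes a (batch, seq_len) tensor of token ids and returns (batch, seq_len, num_tokens) logits; ntoken and args are the same placeholders as in the settings above.

# Minimal training sketch (assumption: token ids in, per-position logits out).
import torch
import torch.nn.functional as F

tokens = torch.randint(0, ntoken, (8, args.seq_len))   # dummy batch of token ids
logits = model(tokens)                                  # assumed shape: (8, seq_len, ntoken)

# Standard next-token objective: predict the token at position t+1 from positions <= t,
# i.e. targets are the inputs shifted by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, ntoken),   # predictions for positions 0 .. seq_len-2
    tokens[:, 1:].reshape(-1),            # targets for positions 1 .. seq_len-1
)
loss.backward()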
I see that you have opened an issue about a similar problem: lucidrains/linear-attention-transformer#6. So this might be related to the way you are running your tests? I am not sure; the loss should not go down to 0 right away...
But on another note, there are some caveats you should know about when using the Linformer for causal LM. Check out #15 and #16 for more information.
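Roughly, the concern raised in those issues is that the E and F matrices project K and V along the sequence dimension, so every compressed key/value is a mixture of all n positions, including future ones, and masking P_bar afterwards cannot undo that mixing. Here is a toy sketch of that idea (shapes follow the Linformer paper; this is not the exact code in this repo):

# Toy illustration: a learned projection over the sequence axis mixes future positions
# into every compressed key, regardless of any later masking of P_bar.
import torch

n, d, k = 8, 4, 2
E = torch.randn(k, n)        # projection over the sequence dimension, as in the paper

K1 = torch.randn(n, d)       # keys for some sequence
K2 = K1.clone()
K2[-1] += 1.0                # perturb only the last (future-most) position

# Each of the k compressed keys is a linear combination of all n positions, so
# changing the last token changes the keys that every earlier query attends to.
print(torch.allclose(E @ K1, E @ K2))   # False -> future information reaches P_bar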
Hi, thanks for the reply. I believe the linear-attention-transformer issue is a slightly different problem, since there the loss goes to infinity rather than 0. I have run quite a number of different transformer variants, including the original model, and got comparable losses.
I think the problem might be explained by #16 (comment),
where there is some leakage of future information. The loss did not go down to 0 right away; it took slightly more than one epoch (around 30k update steps).
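A quick probe I can run to confirm this (again just a sketch, with the same assumed token-ids-in / logits-out interface as in my training snippet above): change only the last token of a sequence and check whether the logits at earlier positions move. With a truly causal model they should be identical.

# Leakage probe: with a causal model, logits at positions < t must not change when
# only the token at position t changes. eval() disables dropout for a fair comparison.
import torch

model.eval()
with torch.no_grad():
    a = torch.randint(0, ntoken, (1, args.seq_len))
    b = a.clone()
    b[0, -1] = (b[0, -1] + 1) % ntoken            # change only the final token

    out_a, out_b = model(a), model(b)
    diff = (out_a[:, :-1] - out_b[:, :-1]).abs().max().item()
    print(f"max change at earlier positions: {diff:.3e}")   # > 0 suggests leakage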