Releases: tatp22/linformer-pytorch
Full attention option
Added an option to the Linformer to compare it with full attention. Note that this takes O(n^2) time and space, where n is the sequence length.
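A minimal sketch of how the two modes might be compared side by side; the constructor arguments shown here (including the `full_attention` flag) follow the repository's README-style interface and should be treated as assumptions:

```python
import torch
from linformer_pytorch import Linformer

# Linear attention: O(n*k) time and memory via the learned E/F projections.
linear_model = Linformer(
    input_size=512,   # sequence length n
    channels=64,      # embedding dimension
    dim_k=128,        # projection dimension k
    nhead=4,
    depth=2,
)

# Same architecture, but with vanilla softmax attention for comparison.
# This is O(n^2) in time and memory, so keep n modest when enabling it.
full_model = Linformer(
    input_size=512,
    channels=64,
    dim_k=128,
    nhead=4,
    depth=2,
    full_attention=True,
)

x = torch.randn(1, 512, 64)
print(linear_model(x).shape, full_model(x).shape)  # both torch.Size([1, 512, 64])
```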
Added option to save visualization
Added the option to save the visualization to a file
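For example (a sketch only; the `plot_all_heads` method and its `show`/`save_file` arguments are taken from the visualizer usage described in the release below and may differ in this version):

```python
import torch
from linformer_pytorch import Linformer, Visualizer

model = Linformer(input_size=512, channels=64, dim_k=128, nhead=4, depth=2)
vis = Visualizer(model)

x = torch.randn(1, 512, 64)
y = model(x, visualize=True)  # run a forward pass while recording attention

# Write the attention-head plot to disk instead of only displaying it.
vis.plot_all_heads(show=False, save_file="./attn_heads.png")
```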
Added Visualizer, fixed bug
Added the Visualizer class, which lets you see all of the attention heads.
Also fixed a bug in how the E and F matrices were calculated: they were being built with shape (n, d) when they should have been (n, k).
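A sketch of how the visualizer might be used; the `visualize=True` forward flag and the `plot_all_heads` method name are assumptions based on the repository's documented usage:

```python
import torch
from linformer_pytorch import Linformer, Visualizer

model = Linformer(input_size=512, channels=64, dim_k=128, nhead=4, depth=2)
vis = Visualizer(model)

x = torch.randn(1, 512, 64)
y = model(x, visualize=True)  # record each head's (n, k) attention matrix
vis.plot_all_heads()          # plot every attention head across all layers
```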
0.7.0
Along with updating the README, I changed the default behavior for the inner head dimension. Instead of requiring an explicit value, it now works as in the "Attention Is All You Need" paper: the number of channels is divided by the number of heads, and that per-head dimension is used inside each attention head.
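In other words (a small illustrative snippet, with `channels` and `nhead` standing in for the constructor arguments):

```python
channels, nhead = 64, 4

# Default inner head dimension, as in "Attention Is All You Need":
# split the channel dimension evenly across the heads.
dim_d = channels // nhead  # 16, used inside each attention head
```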
Added activation to MHAttention
Added both the ReLU and GELU activation function options to the multihead attention block.
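A sketch of selecting the activation, assuming it is chosen with a string argument on the constructor (the `activation` argument name is an assumption for this version):

```python
from linformer_pytorch import Linformer

# The multihead attention block can now use either activation function.
model_gelu = Linformer(input_size=512, channels=64, dim_k=128, nhead=4,
                       depth=2, activation="gelu")
model_relu = Linformer(input_size=512, channels=64, dim_k=128, nhead=4,
                       depth=2, activation="relu")
```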
Can decrease k by layer
Added the ability to reduce the value of dim_k by layer, via the k_reduce_by_layer flag. This was alluded to in Figure 1 of the paper, where the normalized cumulative eigenvalue increases with layer depth, meaning that deeper layers can potentially get away with lower projection dimensions.
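A sketch of the intended effect, assuming k_reduce_by_layer subtracts a fixed amount from dim_k at each successive layer (the exact clamping behavior is an assumption):

```python
dim_k, depth, k_reduce_by_layer = 128, 4, 16

# Deeper layers use a smaller projection dimension, since the paper's
# Figure 1 suggests their attention matrices are effectively lower rank.
for layer in range(depth):
    effective_k = max(dim_k - layer * k_reduce_by_layer, 1)
    print(f"layer {layer}: k = {effective_k}")
# layer 0: k = 128, layer 1: k = 112, layer 2: k = 96, layer 3: k = 80
```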
Added weight sharing options and pos enc
Added the none, headwise, kv, and layerwise parameter sharing options. Also added positional encodings.
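A sketch of choosing a sharing scheme: the string values come from this release, the comments describe the sharing schemes as defined in the Linformer paper, and the constructor usage is an assumption (positional encodings are omitted here).

```python
from linformer_pytorch import Linformer

# "none":      every layer and head learns its own E and F projections.
# "headwise":  all heads within a layer share one E and one F.
# "kv":        keys and values share a single projection per layer (E = F).
# "layerwise": one projection is shared across all layers, heads, keys, and values.
model = Linformer(
    input_size=512,
    channels=64,
    dim_k=128,
    nhead=4,
    depth=2,
    parameter_sharing="layerwise",  # or "none", "headwise", "kv"
)
```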
E, F matrix calculation changed
The way that the E and F matrices are calculated has changed. Before, they were identity matrices, but with this release they are created the way the paper's authors recommend: as linear layers with Xavier initialization.
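Conceptually, the change amounts to something like the following standalone sketch (not the library's actual code):

```python
import torch
from torch import nn

def make_projection(input_size: int, dim_k: int) -> nn.Linear:
    """Build an E or F projection: a linear map from the sequence length n
    down to k, initialized with Xavier (Glorot) init as the authors suggest."""
    proj = nn.Linear(input_size, dim_k, bias=False)
    nn.init.xavier_normal_(proj.weight)
    return proj

# E/F used to be identity-like matrices; now they are learned linear layers.
E = make_projection(input_size=512, dim_k=128)
keys = torch.randn(1, 64, 512)  # (batch, d, n): project along the sequence dim
print(E(keys).shape)            # torch.Size([1, 64, 128])
```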