Releases: tatp22/linformer-pytorch

Full attention option

27 Jun 17:44

Added an option to the Linformer to compare it against full attention. Watch out: this now takes O(n^2) time and space, where n is the sequence length.
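A minimal usage sketch, assuming the option is exposed as a `full_attention` flag on the model constructor (the parameter names below follow my reading of the README and may differ in your version):

```python
import torch
from linformer_pytorch import Linformer

# Sketch only: parameter names are assumptions and may differ in your version.
model = Linformer(
    input_size=512,       # sequence length n
    channels=64,
    dim_k=128,
    full_attention=True,  # assumed flag: fall back to standard O(n^2) attention
)
x = torch.randn(1, 512, 64)
y = model(x)              # quadratic in the sequence length with this flag set
```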

Added option to save visualization

23 Jun 16:12

Added the option to save the visualization to a file
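A rough sketch of the intended usage; the `visualize`, `plot_all_heads`, and `save_file` names are assumptions about the API, not confirmed signatures:

```python
import torch
from linformer_pytorch import Linformer, Visualizer

# Sketch only: argument and method names here are assumptions.
model = Linformer(input_size=256, channels=16, dim_k=64, visualize=True)
vis = Visualizer(model)

x = torch.randn(1, 256, 16)
model(x)                                     # forward pass populates the attention maps
vis.plot_all_heads(save_file="./heads.png")  # new option: write the plot to disk
```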

Added Visualizer, fixed bug

22 Jun 22:35

Added the visualizer class, which lets you see all of the attention heads.

Also fixed a bug in how the E and F matrices were calculated: they were being created with shape (n, d), but they should have been (n, k). This has since been fixed.
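For context, a minimal shape sketch (independent of the library's actual module layout) of why the projections must be (n, k): they compress the sequence-length dimension of the keys and values, not the head dimension.

```python
import torch

n, d, k = 512, 64, 128          # sequence length, head dim, projected length

K = torch.randn(n, d)           # keys
V = torch.randn(n, d)           # values
E = torch.randn(n, k)           # projection for K: shape (n, k), not (n, d)
F = torch.randn(n, k)           # projection for V

K_proj = E.transpose(0, 1) @ K  # (k, d): sequence length reduced from n to k
V_proj = F.transpose(0, 1) @ V  # (k, d)

Q = torch.randn(n, d)
scores = Q @ K_proj.transpose(0, 1) / d ** 0.5   # (n, k) instead of (n, n)
out = torch.softmax(scores, dim=-1) @ V_proj     # (n, d)
```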

0.7.0

21 Jun 14:52

As well as updating the README, I updated the default behavior for calculating the inner head dimension. Instead of requiring the value to be given explicitly, it now works as in the "Attention Is All You Need" paper: the number of channels is divided by the number of heads, and the resulting dimension goes into each attention head.
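In other words, a sketch of the arithmetic (not the library's exact code):

```python
# Default inner head dimension, mirroring "Attention Is All You Need":
# split the channel dimension evenly across the heads.
channels, nhead = 64, 8
assert channels % nhead == 0
dim_head = channels // nhead   # 8: each head works on an 8-dimensional slice
```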

Added activation to MHAttention

20 Jun 16:14

Added both the ReLU and GELU activation function options to the multihead attention block.
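Roughly, the option boils down to a switch like the following; the actual flag name and wiring inside MHAttention may differ:

```python
import torch.nn as nn

def get_activation(name="gelu"):
    # Illustrative switch between the two supported activations;
    # the library's real parameter name is an assumption here.
    return nn.GELU() if name == "gelu" else nn.ReLU()
```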

Can decrease k by layer

17 Jun 21:53

Added the k_reduce_by_layer flag, which reduces the value of dim_k at each successive layer. This was alluded to in Figure 1 of the paper, where the normalized cumulative eigenvalue index increases with depth, meaning that we can potentially get away with lower projection dimensions at deeper layers.
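A sketch of the per-layer reduction this flag implies (variable names are illustrative, not the library's internals):

```python
# Each layer projects keys/values down to a smaller k than the layer before.
dim_k, k_reduce_by_layer, depth = 128, 16, 6
for layer in range(depth):
    k_layer = max(dim_k - layer * k_reduce_by_layer, 1)
    print(f"layer {layer}: projecting keys/values down to k={k_layer}")
```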

Added weight sharing options and pos enc

17 Jun 21:45

Added the none, headwise, kv, and layerwise parameter sharing options. Also added positional encodings.
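An illustrative sketch of what the sharing modes mean, following the Linformer paper's description; the library's exact wiring may differ:

```python
import torch.nn as nn

def make_EF(n, k, num_heads, mode="layerwise"):
    # Illustrative only: how the four sharing modes could allocate the
    # E/F projections for one layer.
    proj = lambda: nn.Linear(n, k, bias=False)
    if mode == "none":              # a separate E and F for every head
        return [(proj(), proj()) for _ in range(num_heads)]
    if mode == "headwise":          # one (E, F) pair shared by all heads
        E, F = proj(), proj()
        return [(E, F)] * num_heads
    # "kv" and "layerwise": a single projection shared by keys and values
    # ("layerwise" additionally reuses it across every layer).
    E = proj()
    return [(E, E)] * num_heads
```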

E, F matrix calculation changed

17 Jun 21:44

The way that the E and F matrices are calculated was changed. Before, they were identity matrices; with this release, they are created the way the paper's authors recommend: as linear layers with Xavier initialization.
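A sketch of what that looks like (the helper name here is illustrative):

```python
import torch.nn as nn

def get_EF(input_size, dim_k):
    # Illustrative helper: project the sequence-length dimension (input_size)
    # down to dim_k with a Xavier-initialized linear layer, replacing the
    # earlier identity-matrix behavior.
    lin = nn.Linear(input_size, dim_k, bias=False)
    nn.init.xavier_normal_(lin.weight)
    return lin
```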