
# Transformers from scratch

This repository implements the Transformer architecture, described in "Attention Is All You Need", from scratch in PyTorch.

The Vision Transformer (ViT) model, described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", extends the base Transformer architecture implemented here.
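
ViT's main change at the input is to treat fixed-size image patches as tokens. As a rough sketch of that idea (the class name, default sizes, and the strided-convolution trick below are illustrative choices, not necessarily this repository's actual code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping 16x16 patches and project
    each patch to the model dimension, yielding one token per patch."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is
        # equivalent to flattening each patch and applying a shared
        # linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```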

For the ViT variant, I place layer normalization inside the residual connections (i.e. before the attention and feed-forward sub-layers) rather than between residual blocks, which is the placement used in the original Transformer paper. This Pre-LN arrangement is analyzed in "On Layer Normalization in the Transformer Architecture", which shows it keeps gradient magnitudes smaller and better behaved at initialization than the original Post-LN layout.
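
The difference is easiest to see side by side. Below is a minimal sketch of the two placements using PyTorch's built-in attention module; the class names and the 4x MLP width are illustrative assumptions, not necessarily what this repository uses:

```python
import torch.nn as nn

def mlp(dim):
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm applied inside the residual branch, before each
    sub-layer (the placement used for the ViT variant here)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp(dim)

    def forward(self, x):
        # x + Sublayer(LN(x)): the residual path stays unnormalized
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm between residual blocks, as in the original
    "Attention Is All You Need" layout."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp(dim)

    def forward(self, x):
        # LN(x + Sublayer(x)): normalization sits on the residual path
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.mlp(x))
        return x
```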