
# Transformers from scratch

This repository implements the Transformer architecture, described in "Attention Is All You Need", from scratch in PyTorch.

The Vision Transformer (ViT) model, described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", extends the base Transformer architecture implemented here.
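
ViT's main change at the input is to treat fixed-size image patches as tokens. As a rough sketch of that idea (the class name, default sizes, and the strided-convolution trick below are illustrative choices, not necessarily this repository's actual code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping 16x16 patches and project
    each patch to the model dimension, yielding one token per patch."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is
        # equivalent to flattening each patch and applying a shared
        # linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```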

For the ViT variant, I place layer normalization inside the residual connections (i.e. before the attention and feed-forward sub-layers) rather than between residual blocks, which is the placement used in the original Transformer paper. This Pre-LN arrangement is analyzed in "On Layer Normalization in the Transformer Architecture", which shows it keeps gradient magnitudes smaller and better behaved at initialization than the original Post-LN layout.
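
The difference is easiest to see side by side. Below is a minimal sketch of the two placements using PyTorch's built-in attention module; the class names and the 4x MLP width are illustrative assumptions, not necessarily what this repository uses:

```python
import torch.nn as nn

def mlp(dim):
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm applied inside the residual branch, before each
    sub-layer (the placement used for the ViT variant here)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp(dim)

    def forward(self, x):
        # x + Sublayer(LN(x)): the residual path stays unnormalized
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm between residual blocks, as in the original
    "Attention Is All You Need" layout."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp(dim)

    def forward(self, x):
        # LN(x + Sublayer(x)): normalization sits on the residual path
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.mlp(x))
        return x
```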