Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (code), (paper), (blog)
Show and Tell: A Neural Image Caption Generator (code), (paper)
Deep Visual-Semantic Alignments for Generating Image Descriptions (code), (paper)
Encoder: ResNet-50 (contrastive learning)
Decoder: LSTM or Transformer
Synthetic data
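
The notes above mention contrastive learning for the encoder but do not say which objective is meant. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over paired image and caption embeddings, where the i-th image and the i-th caption form the positive pair and all other captions in the batch act as negatives. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy embeddings are hypothetical and not taken from any of the listed papers or code bases. It uses only the standard library so the arithmetic is easy to follow.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.1):
    """Mean InfoNCE loss in the image-to-text direction.

    Assumption: image_embs[i] and text_embs[i] are a matching pair, and
    every other caption in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i with every caption, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching caption (index i) as the target.
        total += log_sum_exp - logits[i]
    return total / len(imgs)


# Toy 2-D embeddings: aligned pairs versus the same captions in swapped order.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
# When every caption looks identical, the loss reduces to log(batch size).
uniform_loss = info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In this toy setup the aligned pairs give a loss close to zero and the swapped pairs give a much larger one, which is the behaviour a contrastively trained encoder is meant to exploit. A real setup would compute the embeddings with the ResNet-50 encoder and a text encoder, and would typically average the image-to-text and text-to-image directions.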