Attention Is All You Need
Vaswani et al. · 2017
Introduces the Transformer, an architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Transformer · Self-Attention · Multi-Head Attention
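A minimal NumPy sketch of the scaled dot-product attention the paper is built on; the shapes, weight matrices, and variable names here are illustrative, not the paper's exact configuration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Toy example: 4 tokens, model dimension 8 (dimensions chosen for illustration).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8)
```

Multi-head attention runs several such projections in parallel and concatenates the heads before a final linear layer.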
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al. · 2018
Proposes a bidirectional transformer pre-training approach that achieves state-of-the-art results on eleven NLP tasks.
Pre-training · Bidirectional · Fine-tuning
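A sketch of BERT's masked-language-model corruption step (roughly 15% of positions are selected; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). The token id for [MASK] and the -100 ignore label are illustrative assumptions, not fixed by the paper.

```python
import numpy as np

MASK_ID = 103  # hypothetical [MASK] id; real ids depend on the vocabulary

def mask_tokens(token_ids, vocab_size, rng, mask_prob=0.15):
    """BERT-style masked-LM corruption over an integer id array."""
    token_ids = token_ids.copy()
    labels = np.full_like(token_ids, -100)       # -100 = position not scored
    picked = rng.random(token_ids.shape) < mask_prob
    labels[picked] = token_ids[picked]           # model must recover originals
    roll = rng.random(token_ids.shape)
    token_ids[picked & (roll < 0.8)] = MASK_ID            # 80% -> [MASK]
    rand = picked & (roll >= 0.8) & (roll < 0.9)          # 10% -> random token
    token_ids[rand] = rng.integers(0, vocab_size, size=rand.sum())
    return token_ids, labels                     # remaining 10% left unchanged

rng = np.random.default_rng(0)
ids, labels = mask_tokens(rng.integers(5, 1000, size=16), vocab_size=1000, rng=rng)
```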
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy et al. · 2020
Applies the transformer architecture directly to sequences of image patches for image classification at scale.
ViT · Image Patches · Vision Transformer
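A sketch of the patch-extraction step that turns an image into the "16x16 words" a ViT embeds as tokens; the 224x224 input size is an illustrative assumption.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # (nH, nW, patch, patch, C)
    return img.reshape(-1, patch * patch * C)     # (num_patches, patch*patch*C)

img = np.random.default_rng(0).random((224, 224, 3))
patches = image_to_patches(img)
print(patches.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
# A learned linear projection then maps each patch to the model dimension.
```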
Proximal Policy Optimization Algorithms
Schulman et al. · 2017
Presents PPO, a family of policy gradient methods for RL that balances simplicity and performance.
PPO · Policy Gradient · Clipping
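A minimal sketch of PPO's clipped surrogate objective, L = E[min(r·A, clip(r, 1-eps, 1+eps)·A)] with probability ratio r = pi_new(a|s) / pi_old(a|s); the toy batch values are illustrative, not from any real rollout.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated PPO clipped surrogate, so gradient descent maximizes it."""
    ratio = np.exp(logp_new - logp_old)               # probability ratio r
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # loss to minimize

# Toy batch of 32 transitions.
rng = np.random.default_rng(0)
loss = ppo_clip_loss(rng.normal(size=32), rng.normal(size=32), rng.normal(size=32))
print(loss)
```

The clip keeps the updated policy from moving too far from the one that collected the data, which is the paper's stated trade-off between simplicity and stability.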
GPT-4 Technical Report
OpenAI · 2023
Describes GPT-4, a large-scale multimodal model that accepts image and text inputs and produces text outputs.
LLM · Multimodal · RLHF