Overview
This chapter introduces image foundation models through the lens of the Vision Transformer (ViT). If language models treat a sentence as a sequence of tokens, ViT does something surprisingly similar for images: it turns an image into a sequence of small patches and lets a Transformer reason over them.
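To make the patch-as-token idea concrete, here is a minimal sketch of the patchify step using NumPy. The function name and shapes are illustrative, not from any particular library; in a real ViT each flattened patch would then pass through a learned linear projection into the model's embedding dimension.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Each patch becomes one 'token' of length patch_size * patch_size * C,
    analogous to a word token in a language model.
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # Tile the image: (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of length 768,
# the sequence a standard ViT-Base encoder operates on.
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```

The resulting sequence of 196 patch tokens is what the Transformer encoder attends over, just as it would attend over word tokens in a sentence.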
Lectures:
- ViT Basics: Patches, Tokens, and Encoders
- Pretraining Image Foundation Models
- Fine-Tuning ViT in Practice
- Multimodal Foundation Models: Mixing Text and Images
- Interpreting Vision Transformers
