Overview
This chapter introduces image foundation models through the lens of the Vision Transformer (ViT). If language models treat a sentence as a sequence of tokens, ViT does something surprisingly similar for images: it turns an image into a sequence of small patches and lets a Transformer reason over them.
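To make the patch-as-token idea concrete, here is a minimal sketch of the patchify step using NumPy. The function name and shapes are illustrative, not from any particular library; in a real ViT each flattened patch would then pass through a learned linear projection into the model's embedding dimension.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Each patch becomes one 'token' of length patch_size * patch_size * C,
    analogous to a word token in a language model.
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # Tile the image: (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of length 768,
# the sequence a standard ViT-Base encoder operates on.
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```

The resulting sequence of 196 patch tokens is what the Transformer encoder attends over, just as it would attend over word tokens in a sentence.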
Lectures:
- ViT Basics: Patches, Tokens, and Encoders
- Pretraining Image Foundation Models
- Fine-Tuning ViT in Practice
- Multimodal Foundation Models: Mixing Text and Images
- Interpreting Vision Transformers
