Overview

This chapter introduces image foundation models through the lens of the Vision Transformer (ViT). If language models treat a sentence as a sequence of tokens, ViT does something surprisingly similar for images: it turns an image into a sequence of small patches and lets a Transformer reason over them.
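To make the "image as a sequence of patches" idea concrete, here is a minimal NumPy sketch of the patch-splitting step only. The function name `patchify`, the 224×224 RGB input, and the 16×16 patch size are illustrative assumptions, not something this chapter specifies; a full ViT would also apply a learned linear projection and add position embeddings, which are omitted here.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (left to right, top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Separate each spatial axis into (patch index, offset within patch).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two patch indices to the front: (rows, cols, p, p, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the patch grid into a sequence and each patch into a vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# Hypothetical input: a 224x224 RGB image filled with placeholder values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Under these assumptions, the image becomes 196 "tokens" of 768 numbers each, which plays the same role as a sequence of word tokens in a language model.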


Lectures: