Encoder Training¶
An encoder turns a raw input \(x\) into a useful representation:
\[
z = f_\theta(x),
\]
where \(f_\theta\) is the encoder network and \(z\) is the learned feature vector.
Instead of training a new model from scratch for every task, we often train one strong encoder first and reuse it many times.
In practice, an encoder can support:
- representation learning,
- classification or regression,
- retrieval and nearest-neighbor search,
- segmentation or detection with task-specific heads,
- modality alignment such as image-text matching.
This is why encoder pretraining matters so much: once the representation is good, many downstream tasks become easier.
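The reuse pattern above can be sketched in a few lines. This is a toy NumPy illustration, not a real model: the "encoder" is a fixed random linear map standing in for a pretrained network, and the two heads are hypothetical task-specific layers. The point is only that the representation is computed once and shared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a fixed map from raw inputs to features.
# In practice this is a deep pretrained network; random weights here
# are stand-ins just to show the reuse pattern.
W_enc = rng.normal(size=(8, 4))          # 8-dim input -> 4-dim representation

def encode(x):
    return np.tanh(x @ W_enc)            # shared representation z = f(x)

# Task-specific heads reuse the same representation.
W_cls = rng.normal(size=(4, 3))          # classification head (3 classes)
W_reg = rng.normal(size=(4, 1))          # regression head

x = rng.normal(size=(5, 8))              # batch of 5 raw inputs
z = encode(x)                            # computed once...
logits = z @ W_cls                       # ...reused by the classifier
scores = z @ W_reg                       # ...and by the regressor
```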
Why Pretrain an Encoder?¶
Labels are expensive, but raw data is abundant. In medicine, this gap is especially large: hospitals may store millions of images, while only a small fraction have reliable expert annotations.
Self-supervised learning uses the raw data itself as the training signal. The goal is to learn features that are:
- stable under small irrelevant changes,
- different for genuinely different content,
- useful across many tasks.
That is the basic foundation-model idea for vision: learn general features first, specialize later.
After pretraining, we can freeze the encoder and test it with a small classifier, or fine-tune it for a specific task.
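The frozen-encoder evaluation is often called a linear probe. Below is a minimal NumPy sketch under toy assumptions: the frozen "encoder" is again a random linear map, the labels are synthetic, and the probe is fit by least squares rather than gradient descent. Only the probe's weights are learned; the encoder is never updated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" encoder (random weights as a stand-in).
W_enc = rng.normal(size=(6, 3))

def encode(x):
    return np.tanh(x @ W_enc)            # no updates to W_enc: it stays frozen

# Small labeled set for the downstream task (toy binary labels).
x = rng.normal(size=(100, 6))
y = (x[:, 0] > 0).astype(float)

# Linear probe: fit only a linear classifier on the frozen features.
z = encode(x)
w, *_ = np.linalg.lstsq(z, y, rcond=None)
acc = np.mean(((z @ w) > 0.5) == y)      # probe accuracy on the toy task
```

Fine-tuning would instead update `W_enc` together with the probe, which usually helps when enough labeled data is available.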
Why This Matters in Medicine¶
This is especially useful in medical AI because:
- unlabeled data is common,
- labels are expensive and sometimes noisy,
- a single encoder may be reused across tasks,
- transfer across scanners, hospitals, and modalities matters.
An encoder pretrained on large image collections can later support diagnosis, retrieval, report alignment, or fine-tuning on a small labeled dataset.
High-Level Idea of Self-Supervised Learning¶
Self-supervised learning creates supervision from the data itself. Instead of asking for a human label, we build an artificial target from the same sample.
For example, we may take one image, create two augmented views, and ask the model to recognize that they came from the same underlying object. In that sense, self-supervision often means making a sample into its own positive example.
The exact target can vary: match two views, recover masked content, predict a teacher output, or align two modalities. But the common idea is simple: the data teaches the encoder how to represent itself.
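The two-views idea can be made concrete with a toy augmentation: additive noise stands in for crops, flips, and color jitter. With a small noise scale, the two views of one sample stay much closer to each other than to an unrelated sample, which is exactly the structure the encoder is asked to exploit.

```python
import numpy as np

rng = np.random.default_rng(2)

def augment(x):
    # Toy augmentation: small additive noise stands in for crops / jitter.
    return x + 0.1 * rng.normal(size=x.shape)

x = rng.normal(size=16)                  # one "image"
v1, v2 = augment(x), augment(x)          # two views of the same sample

other = rng.normal(size=16)              # an unrelated sample
d_views = np.linalg.norm(v1 - v2)        # small: same underlying content
d_other = np.linalg.norm(v1 - other)     # large: different content
```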
PCA as a Simple Self-Supervised Encoder¶
Even Principal Component Analysis can be viewed as a simple self-supervised method.
If \(X\) is a centered data matrix, then
\[
XX^\top = U \Lambda U^\top
\]
is an eigendecomposition of the sample-similarity matrix \(XX^\top\). Two samples have a large entry in \(XX^\top\) when their inner product is large, so this matrix captures which samples are similar.
The top columns of \(U\) give a low-dimensional representation that preserves the strongest structure in the data. In that sense, PCA learns a linear encoder from unlabeled data: it uses the similarity structure already present in \(X\), without any external labels.
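This view of PCA can be checked numerically. Using the SVD \(X = U S V^\top\), we have \(XX^\top = U S^2 U^\top\), so the scaled top columns of \(U\) coincide with projecting \(X\) onto the top right singular vectors. The projection matrix is the learned linear "encoder".

```python
import numpy as np

rng = np.random.default_rng(3)

# Centered data matrix X (n samples, d features).
X = rng.normal(size=(50, 10))
X = X - X.mean(axis=0)

# SVD X = U S V^T implies X X^T = U S^2 U^T,
# so U holds the eigenvectors of the sample-similarity matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z = U[:, :k] * S[:k]                     # k-dim representation of each sample
W = Vt[:k].T                             # the learned linear "encoder"

# Encoding (centered) data is just a matrix product: Z = X W.
assert np.allclose(Z, X @ W)
```

No labels were used anywhere: the representation comes entirely from the similarity structure already present in \(X\).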
Self-Supervised Training Methods¶
A clean way to organize self-supervised training methods is by the target the encoder is asked to match.
Reconstruction and Masked Modeling Methods¶
Core idea: hide part of the input and train the encoder to recover what is missing. We introduced this idea earlier in the context of masked language modeling.
Typical methods: MAE, BEiT, iBOT (hybrid: masking + self-distillation).
Simple memory aid: recover what was hidden.
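A minimal sketch of the masking setup, under toy assumptions: the input is a vector of 16 "patches", 75% of positions are hidden (an MAE-style ratio), and the loss is computed only on the hidden positions. The naive mean-of-visible predictor is a hypothetical baseline; a real model would predict the targets from the visible patches.

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(size=16)                  # one input, e.g. 16 flattened "patches"

# Hide 75% of the positions (an MAE-style masking ratio).
idx = rng.permutation(16)
mask = np.zeros(16, dtype=bool)
mask[idx[:12]] = True                    # True = hidden from the model

# The model sees only the visible patches and must predict the hidden ones.
target = x[mask]                                    # reconstruction targets
baseline = np.full(target.shape, x[~mask].mean())   # naive predictor: mean of visible
loss = np.mean((baseline - target) ** 2)            # MSE on masked positions only
```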
Contrastive Methods¶
Core idea: pull matching views close together and push non-matching examples apart. We will cover this method in more detail later.
This is the classic positive-pair and negative-pair setup.
Typical methods: SimCLR, MoCo, CLIP (multimodal contrastive, image-text alignment).
Simple memory aid: same image close, different images apart.
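The standard contrastive objective is the InfoNCE loss: each sample's matching view is treated as the correct "class" among all views in the batch. Here is a minimal NumPy sketch (the SimCLR loss additionally symmetrizes over both views and uses in-batch negatives from both sides; this version keeps only the core term). The temperature value is an illustrative choice.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE: for each i, (z1[i], z2[i]) is the positive pair,
    all other rows of z2 serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau             # cosine similarities, scaled by temperature
    # Cross-entropy with the diagonal (the matching pair) as the correct class.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(5)
z = rng.normal(size=(8, 4))

aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # matched views
shuffled = info_nce(z, np.roll(z, 1, axis=0))               # mismatched pairs
```

Matched views give a much lower loss than mismatched ones, which is what drives the encoder to make views of the same sample similar.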
Self-Distillation¶
Core idea: make two views match without using explicit negative pairs.
These methods avoid collapse through asymmetry, such as a teacher-student setup, stop-gradient, or predictor heads. We will cover this method in more detail later.
Typical methods: BYOL, SimSiam, DINO.
Simple memory aid: match two views without negatives.
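The BYOL-style asymmetry can be sketched as follows. This is a conceptual NumPy illustration, not a trainable implementation: the online branch would receive gradients, the target branch is treated as a constant (stop-gradient), and after each optimizer step the target weights are updated as a slow exponential moving average (EMA) of the online weights. The momentum value is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)

# Online encoder W (trained by gradients) and target encoder W_t (EMA copy).
W = rng.normal(size=(8, 4))
W_t = W.copy()

def cosine_loss(p, z):
    # Negative cosine similarity; z is treated as a constant (stop-gradient).
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

x1 = rng.normal(size=(16, 8))                 # view 1
x2 = x1 + 0.1 * rng.normal(size=x1.shape)     # view 2

p = x1 @ W                                    # online branch (would get gradients)
z = x2 @ W_t                                  # target branch (no gradients)
loss = cosine_loss(p, z)

# After each optimizer step on W, the target follows as a slow EMA:
m = 0.99
W_t = m * W_t + (1 - m) * W                   # the asymmetry that prevents collapse
```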
Clustering and Prototype Methods¶
Core idea: map features to shared prototypes or cluster assignments, then make different views agree on those assignments.
Typical methods: SwAV, DeepCluster.
Simple memory aid: learn through stable pseudo-labels.
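The prototype idea can be illustrated with a hard nearest-prototype assignment (real methods like SwAV use soft, balanced assignments, e.g. via Sinkhorn-Knopp; this sketch keeps only the core intuition). Two views of the same sample should land on the same prototype, and the assignment acts as a pseudo-label.

```python
import numpy as np

rng = np.random.default_rng(7)

# A small set of shared, unit-norm prototype vectors.
prototypes = rng.normal(size=(5, 4))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

z1 = rng.normal(size=(10, 4))                 # features of view 1
z2 = z1 + 0.05 * rng.normal(size=z1.shape)    # features of view 2 (slightly perturbed)

def assign(feats):
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return np.argmax(feats @ prototypes.T, axis=1)  # nearest-prototype code

a1, a2 = assign(z1), assign(z2)
agreement = np.mean(a1 == a2)                 # training pushes this toward 1.0
```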
Redundancy-Reduction Methods¶
Core idea: make views agree while also encouraging different feature dimensions to carry different information.
Typical methods: Barlow Twins, VICReg.
Simple memory aid: match views, but do not let features collapse into copies of each other.
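The Barlow Twins loss makes this concrete: compute the cross-correlation matrix between the standardized features of the two views, push its diagonal toward 1 (views agree) and its off-diagonal entries toward 0 (dimensions stay non-redundant). The sketch below follows that recipe in NumPy; the trade-off weight is an illustrative choice.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Cross-correlation loss: identity on the diagonal, zero elsewhere."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)   # standardize each dimension
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    c = z1.T @ z2 / n                              # d x d cross-correlation matrix
    on_diag = np.sum((np.diag(c) - 1) ** 2)        # invariance: views should agree
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy reduction
    return on_diag + lam * off_diag

rng = np.random.default_rng(8)
z = rng.normal(size=(64, 6))
loss = barlow_twins_loss(z, z + 0.1 * rng.normal(size=z.shape))
```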
Hybrid Methods¶
Some influential methods combine multiple ideas.
- DINO is mostly self-distillation, but also has prototype-like behavior.
- iBOT combines masked modeling and self-distillation.
- CLIP extends contrastive learning to multiple modalities.
The exact boundaries are not always strict, but this taxonomy is a useful mental map.
| Family | Core question | Representative methods |
|---|---|---|
| Contrastive | Which views should be close, and which should be far? | SimCLR, MoCo, CLIP |
| Non-contrastive / self-distillation | How can two views match without negatives? | BYOL, SimSiam, DINO |
| Clustering / prototypes | Can two views get the same prototype? | SwAV, DeepCluster |
| Redundancy reduction | Can views match while features stay non-redundant? | Barlow Twins, VICReg |
| Reconstruction / masked modeling | Can the model infer what was hidden? | MAE, BEiT |
| Hybrid | Can we combine masking, distillation, or prototypes? | iBOT, DINO-style variants |
References and Further Reading¶
- Chen et al., A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020.
- He et al., Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020.
- Grill et al., Bootstrap Your Own Latent, NeurIPS 2020.
- Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
- He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022.
- Radford et al., Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.



