
Encoder Training


An encoder turns a raw input \(x\) into a useful representation:

\[ z = f_\theta(x) \]

Instead of training a new model from scratch for every task, we often train one strong encoder first and reuse it many times.
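As a concrete picture of \(z = f_\theta(x)\), here is a minimal NumPy sketch, assuming a hypothetical one-layer encoder whose random weights `W` stand in for a real pretrained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder f_theta: W is an untrained placeholder,
# illustrating only the mapping x -> z.
W = rng.normal(size=(16, 64))   # 64-dim input -> 16-dim representation

def encode(x):
    """z = f_theta(x): one linear layer followed by ReLU."""
    return np.maximum(W @ x, 0.0)

x = rng.normal(size=64)         # a raw input
z = encode(x)
print(z.shape)                  # (16,)
```

The same `encode` function could then feed any of the downstream uses listed below.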

In practice, an encoder can support:

  • representation learning,
  • classification or regression,
  • retrieval and nearest-neighbor search,
  • segmentation or detection with task-specific heads,
  • modality alignment such as image-text matching.

This is why encoder pretraining matters so much: once the representation is good, many downstream tasks become easier.

Why Pretrain an Encoder?

Labels are expensive, but raw data is abundant. In medicine, this gap is especially large: hospitals may store millions of images, while only a small fraction have reliable expert annotations.

Self-supervised learning uses the raw data itself as the training signal. The goal is to learn features that are:

  • stable under small irrelevant changes,
  • different for genuinely different content,
  • useful across many tasks.

That is the basic foundation-model idea for vision: learn general features first, specialize later.

After pretraining, we can freeze the encoder and test it with a small classifier, or fine-tune it for a specific task.
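A minimal sketch of the frozen-encoder evaluation idea, assuming a hypothetical random projection as the "pretrained" encoder and a nearest-centroid rule as a stand-in for the small classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" encoder: a fixed random projection stands in for f_theta.
W = rng.normal(size=(8, 32))
encode = lambda X: np.maximum(X @ W.T, 0.0)   # weights are never updated

# Tiny labeled set for the probe: two synthetic classes with shifted means.
X0 = rng.normal(loc=-1.0, size=(50, 32))
X1 = rng.normal(loc=+1.0, size=(50, 32))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

Z = encode(X)                                  # frozen features
centroids = np.stack([Z[y == c].mean(0) for c in (0, 1)])

def probe(x):
    """Nearest-centroid 'probe' on top of the frozen features."""
    d = np.linalg.norm(centroids - encode(x), axis=1)
    return int(d.argmin())

acc = np.mean([probe(x) == t for x, t in zip(X, y)])
print(acc)
```

If the frozen features separate the classes well, even this trivial probe scores high; that is exactly what linear-probe evaluations of pretrained encoders measure.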

Why This Matters in Medicine

This is especially useful in medical AI because:

  • unlabeled data is common,
  • labels are expensive and sometimes noisy,
  • a single encoder may be reused across tasks,
  • transfer across scanners, hospitals, and modalities matters.

An encoder pretrained on large image collections can later support diagnosis, retrieval, report alignment, or fine-tuning on a small labeled dataset.

High-Level Idea of Self-Supervised Learning

Self-supervised learning creates supervision from the data itself. Instead of asking for a human label, we build an artificial target from the same sample.

For example, we may take one image, create two augmented views, and ask the model to recognize that they came from the same underlying object. In that sense, self-supervision often means making a sample into its own positive example.

The exact target can vary: match two views, recover masked content, predict a teacher output, or align two modalities. But the common idea is simple: the data teaches the encoder how to represent itself.

PCA as a Simple Self-Supervised Encoder

Even Principal Component Analysis can be viewed as a simple self-supervised method.

[Figure: PCA]

If \(X\) is a centered data matrix, then

\[ XX^\top = U D U^\top \]

is an eigendecomposition of the sample-similarity matrix \(XX^\top\). Entry \((i, j)\) of \(XX^\top\) is the inner product of samples \(i\) and \(j\), so this matrix records how similar each pair of samples is.

The top columns of \(U\) give a low-dimensional representation that preserves the strongest structure in the data. In that sense, PCA learns a linear encoder from unlabeled data: it uses the similarity structure already present in \(X\), without any external labels.
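This view can be checked numerically. The sketch below, on toy data, eigendecomposes \(XX^\top\) and confirms that the top columns of \(U\), scaled by the square roots of the eigenvalues, match the usual PCA scores from the SVD of \(X\) up to sign:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 100 samples in 5 dimensions, centered.
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

# Eigendecomposition of the sample-similarity (Gram) matrix X X^T.
G = X @ X.T
evals, U = np.linalg.eigh(G)          # returned in ascending order
order = np.argsort(evals)[::-1]
U, evals = U[:, order], evals[order]

# Top-k columns of U, scaled by sqrt(eigenvalue), give the k-dim embedding.
k = 2
Z = U[:, :k] * np.sqrt(evals[:k])

# Cross-check against the SVD route: PCA scores = U_svd * singular values.
Us, s, Vt = np.linalg.svd(X, full_matrices=False)
Z_svd = Us[:, :k] * s[:k]
print(np.allclose(np.abs(Z), np.abs(Z_svd)))   # True (equal up to sign flips)
```

The absolute values are compared because eigenvectors are only defined up to a sign.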

Self-Supervised Training Methods

A clean way to organize self-supervised training methods is by the target the encoder is asked to match.

Reconstruction and Masked Modeling Methods

Core idea: hide part of the input and train the encoder to recover what is missing. We introduced this idea earlier with masked language modeling.

Typical methods: MAE, BEiT, iBOT (hybrid: masking + self-distillation).

Simple memory aid: recover what was hidden.
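A toy sketch of the masked-reconstruction objective, with a hypothetical untrained linear encoder/decoder pair standing in for the transformers used in MAE; the point is only where the mask and the loss enter:

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(size=12)            # a toy "image" as a flat vector
mask = rng.random(12) < 0.75       # hide ~75% of positions (MAE-style ratio)
x_visible = np.where(mask, 0.0, x) # masked entries replaced by zeros

# Hypothetical untrained encoder/decoder pair (MAE uses transformers here).
W_enc = rng.normal(size=(6, 12)) * 0.1
W_dec = rng.normal(size=(12, 6)) * 0.1
x_hat = W_dec @ np.maximum(W_enc @ x_visible, 0.0)

# The loss is computed only on the hidden positions: recover what was hidden.
loss = np.mean((x_hat[mask] - x[mask]) ** 2)
print(loss >= 0.0)
```

Training would backpropagate this loss through the decoder and encoder; only the encoder is kept afterwards.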

[Figure: masked language modeling]

Contrastive Methods

Core idea: make matching views close and non-matching examples far apart. We will cover contrastive methods in more detail later.

This is the classic positive-pair and negative-pair setup.

Typical methods: SimCLR, MoCo, CLIP (multimodal contrastive, image-text alignment).

Simple memory aid: same image close, different images apart.
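The positive-pair/negative-pair setup can be sketched as an InfoNCE-style loss, assuming hypothetical untrained embeddings for two views of four images:

```python
import numpy as np

rng = np.random.default_rng(4)

def l2_normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Embeddings of two augmented views of the same 4 images (hypothetical,
# untrained): row i of Za and row i of Zb form a positive pair.
Za = l2_normalize(rng.normal(size=(4, 8)))
Zb = l2_normalize(rng.normal(size=(4, 8)))

tau = 0.1                            # temperature
logits = Za @ Zb.T / tau             # cosine similarity of every (i, j) pair

# InfoNCE: each row is a classification problem whose correct "class" is
# the matching view j = i; the other rows act as negatives.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(loss > 0.0)
```

Minimizing this loss pulls each diagonal (positive) similarity up relative to the off-diagonal (negative) ones.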

[Figure: CLIP]

Self-Distillation

Core idea: make two views match without using explicit negative pairs.

These methods avoid collapse through asymmetry, such as a teacher-student setup, stop-gradient, or predictor heads. We will cover self-distillation in more detail later.

Typical methods: BYOL, SimSiam, DINO.

Simple memory aid: match two views without negatives.
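One teacher-student update with stop-gradient and an EMA teacher can be sketched as follows (hypothetical linear encoders; real methods such as BYOL use deep networks plus a predictor head):

```python
import numpy as np

rng = np.random.default_rng(5)

# Online ("student") and target ("teacher") encoders start identical.
W_online = rng.normal(size=(4, 8)) * 0.1
W_target = W_online.copy()            # teacher weights are never backpropped

x1, x2 = rng.normal(size=8), rng.normal(size=8)   # two views of one sample

z_online = W_online @ x1
z_target = W_target @ x2              # stop-gradient: treated as a constant

# Gradient of ||W_online x1 - z_target||^2 w.r.t. W_online only.
grad = 2.0 * np.outer(z_online - z_target, x1)
W_online -= 0.01 * grad               # student update

# Teacher follows the student via an exponential moving average (BYOL-style).
m = 0.99
W_target = m * W_target + (1 - m) * W_online
print(W_target.shape)
```

Because the target branch receives no gradient and moves only slowly, the trivial constant solution is avoided even without negative pairs.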

[Figure: DINO]

Clustering and Prototype Methods

Core idea: map features to shared prototypes or cluster assignments, then make different views agree on those assignments.

Typical methods: SwAV, DeepCluster.

Simple memory aid: learn through stable pseudo-labels.
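A toy sketch of prototype assignment, with hypothetical random features and prototypes; SwAV additionally balances assignments across the batch, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(6)

K, d = 3, 8
prototypes = rng.normal(size=(K, d))   # shared, learnable prototype vectors

# Features of two views of the same sample (hypothetical, untrained).
z1, z2 = rng.normal(size=d), rng.normal(size=d)

def assign(z):
    """Soft assignment of a feature to the K prototypes (softmax over dots)."""
    scores = prototypes @ z
    e = np.exp(scores - scores.max())
    return e / e.sum()

q1, q2 = assign(z1), assign(z2)

# Training signal: cross-entropy pushing view 2's assignment toward
# view 1's (SwAV uses a swapped prediction between the two views).
loss = -np.sum(q1 * np.log(q2 + 1e-12))
print(loss > 0.0)
```

The pseudo-labels are the assignments themselves: both views of one sample should land on the same prototype.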

Redundancy-Reduction Methods

Core idea: make views agree while also encouraging different feature dimensions to carry different information.

Typical methods: Barlow Twins, VICReg.

Simple memory aid: match views, but do not let features collapse into copies of each other.
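The redundancy-reduction objective of Barlow Twins can be sketched by building the cross-correlation matrix between two views' standardized features, then pulling its diagonal toward 1 and its off-diagonal toward 0 (toy, hypothetical embeddings below):

```python
import numpy as np

rng = np.random.default_rng(7)

N, d = 64, 5
Z1 = rng.normal(size=(N, d))             # view-1 embeddings (hypothetical)
Z2 = Z1 + 0.1 * rng.normal(size=(N, d))  # view-2: slightly perturbed copies

def standardize(Z):
    return (Z - Z.mean(0)) / Z.std(0)

# Cross-correlation between the two views' feature dimensions.
C = standardize(Z1).T @ standardize(Z2) / N

# Barlow Twins objective: diagonal -> 1 (views agree per dimension),
# off-diagonal -> 0 (dimensions stay non-redundant).
lam = 0.005
on_diag = np.sum((np.diag(C) - 1.0) ** 2)
off_diag = np.sum((C - np.diag(np.diag(C))) ** 2)
loss = on_diag + lam * off_diag
print(loss >= 0.0)
```

The off-diagonal term is what stops the features from collapsing into copies of one another.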

Hybrid Methods

Some influential methods combine multiple ideas.

  • DINO is mostly self-distillation, but also has prototype-like behavior.
  • iBOT combines masked modeling and self-distillation.
  • CLIP extends contrastive learning to multiple modalities.

The exact boundaries are not always strict, but this taxonomy is a useful mental map.

| Family | Core question | Representative methods |
| --- | --- | --- |
| Contrastive | Which views should be close, and which should be far? | SimCLR, MoCo, CLIP |
| Non-contrastive / self-distillation | How can two views match without negatives? | BYOL, SimSiam, DINO |
| Clustering / prototypes | Can two views get the same prototype? | SwAV, DeepCluster |
| Redundancy reduction | Can views match while features stay non-redundant? | Barlow Twins, VICReg |
| Reconstruction / masked modeling | Can the model infer what was hidden? | MAE, BEiT |
| Hybrid | Can we combine masking, distillation, or prototypes? | iBOT, DINO-style variants |

References and Further Reading