What is Self-Supervised Learning?

Self-supervised learning is a training paradigm in which the model derives its own labels from the structure of the data instead of relying on human annotation. It is the technique used to pre-train large language models and many modern vision models.

Why It Matters

Self-supervised learning is what made the LLM revolution possible. Labeling data manually is expensive and doesn't scale to the trillions of tokens needed for pre-training. By creating learning signals from the data itself (predict the next token, fill in masked words, match image crops), self-supervised learning lets models train on web-scale collections of unlabeled text and images.

How It Works

Self-supervised learning creates a pretext task — an automatically generated prediction problem — from unlabeled data:

For language models:

Next-token prediction (GPT, LLaMA, Claude) — given preceding text, predict the next token. The "label" is simply the actual next token in the corpus.
Masked language modeling (BERT) — randomly mask 15% of tokens and train the model to predict them from surrounding context.
Denoising (T5, BART) — corrupt the input (mask spans, shuffle sentences) and train the model to reconstruct the original.
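
As a rough illustration of how masked language modeling turns raw text into (input, label) pairs, here is a minimal sketch. The function name mask_tokens, the whitespace tokenization, and the use of None for "no label at this position" are assumptions made for this example, not any library's API. It is also simplified: BERT's actual recipe replaces only most of the selected tokens with [MASK] and swaps some for random tokens or leaves them unchanged.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for a simplified masked-LM pretext task.

    labels[i] holds the original token where position i was masked,
    and None where the model is not asked to predict anything.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)   # hide the token from the model
            labels.append(tok)       # the hidden token becomes the label
        else:
            corrupted.append(tok)
            labels.append(None)      # no loss is computed at this position
    return corrupted, labels


if __name__ == "__main__":
    # Whitespace splitting stands in for a real subword tokenizer (assumption).
    sentence = "the quick brown fox jumps over the lazy dog".split()
    corrupted, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
    print(corrupted)
    print(labels)
```

The labels come entirely from the original text, which is the point: no annotator is involved at any step.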

For vision models:

Contrastive learning (SimCLR, MoCo) — learn representations where different augmented views of the same image are similar and views of different images are dissimilar.
Masked image modeling (MAE, BEiT) — mask patches of an image and predict the missing pixels or features.
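
The contrastive objective can be sketched as a cross-entropy over similarity scores, often called an InfoNCE-style loss. The version below is a simplified, one-directional sketch in plain Python: it assumes the embeddings have already been computed by some encoder, and the names cosine, info_nce, and temperature are illustrative choices rather than the exact formulation of any particular paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a, views_b, temperature=0.1):
    """Mean cross-entropy where views_a[i] must pick views_b[i] out of all views_b.

    views_a[i] and views_b[i] are embeddings of two views of the same item
    (the positive pair); every other entry of views_b acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]  # negative log-probability of the positive
    return total / len(views_a)


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    b_matched = [[0.9, 0.1], [0.1, 0.9]]   # each row is close to its partner in a
    b_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # partners swapped
    print(info_nce(a, b_matched))
    print(info_nce(a, b_shuffled))
```

The loss is small when each embedding is closest to its own partner and large when the pairing is wrong, which is the signal that pushes matching views together and non-matching ones apart.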

For multimodal models:

Image-text matching (CLIP) — learn aligned representations by pulling each image toward its own text description and away from the descriptions of other images.

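Self-supervised learning sits between supervised learning (which needs human-provided labels) and classical unsupervised learning (which finds structure, such as clusters, without any prediction targets). It is technically unsupervised, since no human labels are involved, but it trains with a supervised-style loss on labels generated automatically from the data.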

Example

When GPT-4 was pre-trained, every sentence in the training corpus automatically became a series of training examples, one for each token position: "The cat sat on the" → predict "mat"; "The cat sat on" → predict "the"; and so on. (Whole words are shown here for readability; real models operate on subword tokens.) No human needed to label anything, because the text itself provides the signal.
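
The way one sentence expands into several next-token examples can be sketched in a few lines. This is only an illustration: the helper name next_token_examples is hypothetical, and splitting on whitespace is an assumption standing in for the subword tokenizers that real models use.

```python
def next_token_examples(tokens):
    """Return one (context, target) pair per position after the first token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Whitespace splitting is a stand-in for a real tokenizer (assumption).
    tokens = "The cat sat on the mat".split()
    for context, target in next_token_examples(tokens):
        print(" ".join(context), "->", target)
```

In practice these pairs are not materialized one by one; a model typically processes the whole sequence once and predicts every position in parallel, but the training signal is the same.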