
What is Self-Supervised Learning?
Self-supervised learning is a training paradigm in which the labels are derived automatically from the structure of the data itself, without human annotation. It is the technique used to pre-train large language models and many modern vision models.
Why It Matters
Self-supervised learning is what made the LLM revolution possible. Labeling data manually is expensive and doesn't scale to the trillions of tokens needed for pre-training. By creating learning signals from the data itself — predict the next token, fill in masked words, match image crops — self-supervised learning unlocks virtually unlimited training data from the open web.
How It Works
Self-supervised learning creates a pretext task — an automatically generated prediction problem — from unlabeled data:
For language models:
- Next-token prediction (GPT, LLaMA, Claude) — given preceding text, predict the next token. The "label" is simply the actual next token in the corpus.
- Masked language modeling (BERT) — randomly mask 15% of tokens and train the model to predict them from surrounding context.
- Denoising (T5, BART) — corrupt the input (mask spans, shuffle sentences) and train the model to reconstruct the original.
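As a rough illustration of how a masked language modeling pretext task turns raw text into (input, label) pairs, here is a minimal sketch. The function name `mask_tokens`, the whitespace tokenization, and the fixed seed are assumptions made for this example. It also simplifies BERT's actual recipe, which replaces only most of the selected tokens with the mask token and leaves or randomizes the rest.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for a masked-LM pretext task.

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is computed only where the input was hidden.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return corrupted, labels

# Hypothetical input; a real pipeline would use a subword tokenizer.
tokens = "the cat sat on the mat because it was warm and quiet there today".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)
```

The labels come entirely from the original text: no annotation step is involved, only a corruption step that hides part of the input.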
For vision models:
- Contrastive learning (SimCLR, MoCo) — learn representations where different augmented views of the same image are similar and views of different images are dissimilar.
- Masked image modeling (MAE, BEiT) — mask patches of an image and predict the missing pixels or features.
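The contrastive objective can be sketched as an InfoNCE-style loss: each embedding should score highest against the embedding of the other view of the same image. This is a simplified, one-directional version written in plain Python for readability. The function names, the toy 2-D embeddings, and the temperature value are assumptions for illustration, and SimCLR's full loss additionally uses the other samples within each view as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match positives[i].

    Every other row of `positives` acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

# Toy embeddings standing in for encoder outputs of two augmented views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(view_a, view_b)         # matching views are paired
shuffled = info_nce(view_a, view_b[::-1])  # pairs deliberately mismatched
print(aligned < shuffled)
```

The loss is low when the two views of each image agree and high when the pairing is scrambled, which is the signal that drives the encoder during training.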
For multimodal models:
- Image-text contrastive learning (CLIP) — learn aligned representations by pulling each image toward its paired text description and away from the other descriptions in the batch.
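For reference, the symmetric contrastive loss commonly used for image-text pairs can be written as below. The notation is an assumption introduced here: $N$ is the batch size, $s_{ij}$ is the cosine similarity between the embedding of image $i$ and the embedding of text $j$, and $\tau$ is a temperature.

```latex
\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left[
  -\log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)}
  \;-\; \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)}
\right]
```

The first term asks each image to pick out its own text among all texts in the batch, and the second asks each text to pick out its own image. The pairing itself is the label.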
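Self-supervised learning sits between supervised learning (needs human-provided labels) and classic unsupervised learning (finds structure such as clusters without any labels). It is technically unsupervised, since no human annotation is involved, but it trains with a supervised-style loss on automatically generated labels.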
Example
When GPT-4 was pre-trained, every sentence in the training corpus automatically yielded one training example per token: "The cat sat on the" → predict "mat"; "The cat sat on" → predict "the"; and so on for each shorter prefix. No human needed to label anything; the text itself provides the signal.
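The prefix-to-next-token expansion described above can be sketched in a few lines. The function name `next_token_examples` is hypothetical, and splitting on whitespace is an assumption for readability; production models operate on subword tokens and compute all positions in a single pass instead of materializing each prefix.

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) pairs, one per predicted position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "The cat sat on the mat".split()
examples = next_token_examples(sentence)

for context, target in examples:
    print(" ".join(context), "->", target)
```

Each pair uses the text that actually follows as its label, so a sequence of n tokens yields n - 1 examples with no annotation effort.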