
What is Self-Supervised Learning?
Self-supervised learning is a training paradigm in which the model generates its own labels from the structure of the data, without requiring human annotation. It is the core technique used to pre-train large language models and modern vision models.
Why It Matters
Self-supervised learning is what made the LLM revolution possible. Labeling data manually is expensive and doesn't scale to the trillions of tokens needed for pre-training. By creating learning signals from the data itself (predict the next token, fill in masked words, match image crops), self-supervised learning unlocks virtually unlimited training data from the open web.
How It Works
Self-supervised learning creates a pretext task, an automatically generated prediction problem, from unlabeled data:
For language models:
- Next-token prediction (GPT, LLaMA, Claude): given the preceding text, predict the next token. The "label" is simply the actual next token in the corpus.
- Masked language modeling (BERT): randomly mask 15% of tokens and train the model to predict them from the surrounding context.
- Denoising (T5, BART): corrupt the input (mask spans, shuffle sentences) and train the model to reconstruct the original.
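The key point in all three tasks is that the labels come for free from the raw text. A minimal pure-Python sketch of the first two (the function names, the `[MASK]` placeholder string, and whole words standing in for tokens are illustrative choices, not any library's API):

```python
import random

def next_token_pairs(tokens):
    # Next-token prediction: inputs are all tokens but the last;
    # the labels are the same sequence shifted left by one position.
    return tokens[:-1], tokens[1:]

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=1):
    # BERT-style masking (simplified): randomly replace ~15% of tokens.
    # The original token at each masked position becomes the label;
    # unmasked positions get no label (no loss is computed there).
    rng = random.Random(seed)
    corrupted, labels = [], []
    for t in tokens:
        if rng.random() < p:
            corrupted.append(mask_token)
            labels.append(t)
        else:
            corrupted.append(t)
            labels.append(None)
    return corrupted, labels

tokens = "the cat sat on the mat".split()

inputs, targets = next_token_pairs(tokens)
# inputs  = ['the', 'cat', 'sat', 'on', 'the']
# targets = ['cat', 'sat', 'on', 'the', 'mat']

corrupted, labels = mask_tokens(tokens)
# Some positions are now '[MASK]'; labels holds the originals there.
```

Real pre-training pipelines work on integer token IDs and batch tensors, but the label construction is exactly this shifting and masking.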
For vision models:
- Contrastive learning (CLIP, SimCLR): learn representations where different views of the same image are similar and different images are dissimilar.
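The contrastive objective can be sketched with an InfoNCE-style loss: treat the positive pair as the correct class in a softmax over similarities. A minimal pure-Python version with toy 2-dimensional embeddings (the embeddings, the function names, and the temperature value are illustrative assumptions):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    # InfoNCE: cross-entropy that pulls the positive pair together
    # and pushes the negatives apart. The positive sits at index 0.
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

anchor   = [1.0, 0.0]   # embedding of one view of an image
positive = [0.9, 0.1]   # embedding of an augmented view of the same image
negative = [0.0, 1.0]   # embedding of a different image

loss_good = contrastive_loss(anchor, positive, [negative])
loss_bad  = contrastive_loss(anchor, negative, [positive])
assert loss_good < loss_bad  # aligned views yield a lower loss
```

SimCLR applies this over augmented image pairs within a batch; CLIP uses the same idea across modalities, matching an image embedding to its caption's text embedding.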