
What is Unsupervised Learning?
Unsupervised learning is a machine learning paradigm in which a model learns patterns and structure from unlabeled data — without being told what the "correct" output should be. The model discovers hidden relationships on its own.
Why It Matters
Most real-world data is unlabeled, because labeling is expensive and time-consuming. Unsupervised learning makes it possible to extract value from that data without a labeling step. It is used for customer segmentation, anomaly detection, data compression, and, critically, the pre-training phase of modern LLMs, where models learn language structure from unlabeled text.
How It Works
Without labels, unsupervised algorithms look for structure in data:
- Clustering — group similar data points together (e.g., K-means, DBSCAN). The algorithm finds natural groupings without being told what groups exist.
- Dimensionality reduction — compress high-dimensional data into fewer dimensions while preserving important structure (e.g., PCA, t-SNE, autoencoders). Used for visualization and feature extraction.
- Anomaly detection — identify data points that don't fit the learned pattern. Useful for fraud detection and quality control.
- Generative modeling — learn the underlying distribution of data to generate new, similar examples (GANs, VAEs).
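As one concrete instance of the anomaly detection item above, the following is a minimal sketch that learns what "typical" looks like from unlabeled numbers (their median and median absolute deviation) and flags values that sit far from it. The function name `mad_anomalies`, the sample readings, and the 3.5 cutoff are assumptions chosen for illustration; the cutoff is a common rule of thumb, not a universal setting.

```python
import statistics

def mad_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    The "learned pattern" here is just the median and the median
    absolute deviation (MAD) of the data; no labels are involved.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All values (or most of them) are identical; no spread to score against.
        return []
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # under a normal distribution.
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]
anomalies = mad_anomalies(readings)
print(anomalies)
```

Median-based statistics are used here instead of mean and standard deviation because a single extreme value would otherwise distort the very baseline it is being compared against.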
Self-supervised learning (used in LLM pre-training) is usually treated as a form of unsupervised learning: no human-provided labels are needed, because the model derives its training targets from the data itself, for example by predicting masked or next tokens.
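To illustrate how targets can be derived from unlabeled text, here is a small sketch that turns a token sequence into next-token and masked-token training examples. The helper names, the whitespace tokenization, the `[MASK]` string, and the context size of 3 are all assumptions made for the example; real LLM pipelines use subword tokenizers and far longer contexts.

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) pairs: each token is the label for the tokens before it."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

def masked_example(tokens, position, mask_token="[MASK]"):
    """Hide one token; the hidden token becomes the label to predict."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

# Unlabeled text: nobody annotated this sentence.
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)
masked_input, masked_target = masked_example(tokens, position=2)

for context, target in pairs:
    print(context, "->", target)
print(masked_input, "->", masked_target)
```

Every label in the output comes from the sentence itself, which is why this setup needs no human annotation.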
Example
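A retailer runs K-means clustering on customer purchase data to discover customer segments without pre-defining any categories. The algorithm only returns numbered clusters for a chosen number of clusters k; analysts then inspect each cluster and give it a name (for example budget shoppers, premium buyers, occasional browsers). These segments then inform targeted marketing campaigns.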
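The retailer scenario can be sketched with a small K-means implementation. This is an illustrative version under stated assumptions, not a production implementation: the customer figures are made up, the two features (average order value, orders per month) are hypothetical, and centroids are seeded with a deterministic farthest-point rule so the run is reproducible, where many libraries use randomized seeding instead.

```python
import math

def kmeans(points, k, max_iters=100):
    """Cluster `points` into k groups using the standard assign/update loop."""
    # Deterministic seeding (an assumption of this sketch): start from the first
    # point, then repeatedly add the point farthest from all chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical customers: (average order value, orders per month).
customers = [
    (12.0, 8.0), (15.0, 9.0), (11.0, 7.0),      # small, frequent orders
    (120.0, 4.0), (135.0, 5.0), (128.0, 3.0),   # large orders
    (60.0, 1.0), (65.0, 1.0), (58.0, 2.0),      # mid-sized, rare orders
]

labels, centroids = kmeans(customers, k=3)
print(labels)
print(centroids)
```

The output is only a cluster index per customer plus the cluster centers. Names such as "budget shoppers" are an interpretation added afterwards, and on real data the features would normally be scaled first so that one column does not dominate the distance.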
Related
See also: Supervised Learning, Machine Learning, Embedding, Latent Space