
What is Unsupervised Learning?
Unsupervised learning is a machine learning paradigm in which a model learns patterns and structure from unlabeled data, without being told what the "correct" output should be. The model discovers hidden relationships on its own.
Why It Matters
Most real-world data is unlabeled, because labeling is expensive and time-consuming. Unsupervised learning makes it possible to extract value from that data without labeling it first. It's used for customer segmentation, anomaly detection, data compression, and, critically, the pre-training phase of modern LLMs, where models learn language structure from unlabeled text.
How It Works
Without labels, unsupervised algorithms look for structure in data:
- Clustering: group similar data points together (e.g., K-means, DBSCAN). The algorithm finds natural groupings without being told what groups exist.
- Dimensionality reduction: compress high-dimensional data into fewer dimensions while preserving important structure (e.g., PCA, t-SNE, autoencoders). Used for visualization and feature extraction.
- Anomaly detection: identify data points that don't fit the learned pattern. Useful for fraud detection and quality control.
- Generative modeling: learn the underlying distribution of data to generate new, similar examples (GANs, VAEs).
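To make the clustering and anomaly detection ideas concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The function names `kmeans` and `nearest_distance`, the sample points, the choice of the first `k` points as starting centroids, and the distance threshold of 2.0 are all assumptions made for illustration; production code would normally use a library implementation with randomized initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster (x, y) tuples with Lloyd's algorithm.

    Assumption: the first k points serve as initial centroids so the
    sketch is deterministic. Real implementations usually randomize.
    """
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


def nearest_distance(point, centroids):
    """Distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical unlabeled data: two loose groups, no labels supplied.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]

# Anomaly detection on top of the learned structure: a point far from
# every centroid does not fit the pattern. The 2.0 cutoff is arbitrary.
candidate = (5.0, 0.0)
is_anomaly = nearest_distance(candidate, centroids) > 2.0
print(is_anomaly)
```

The algorithm never sees a label: the groups emerge from alternating the assignment and update steps, and the same centroids then give a simple reference for what "fits the learned pattern."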
Self-supervised learning (used in LLM pre-training) is technically a form of unsupervised learning: the model creates its own labels from the raw data, for example by hiding a token and predicting it, or by predicting the next token from the ones before it.
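The way labels are derived from raw text can be sketched in a few lines. The helper names `next_token_pairs` and `mask_token`, the whitespace tokenization, and the `[MASK]` placeholder string are assumptions for illustration; real LLM pipelines use subword tokenizers and operate on token IDs.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs: the target is simply the next token."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]
        pairs.append((context, tokens[i]))
    return pairs


def mask_token(tokens, position, mask="[MASK]"):
    """Hide one token; the hidden token becomes the label to predict."""
    masked = list(tokens)  # copy so the input is left untouched
    target = masked[position]
    masked[position] = mask
    return masked, target


# Assumption: naive whitespace splitting stands in for a real tokenizer.
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)

masked, target = mask_token(tokens, position=2)
print(masked, "->", target)
```

No human annotation is involved in either case: every training target is a piece of the original text, which is why this approach scales to very large unlabeled corpora.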