What is Unsupervised Learning?

Unsupervised learning is a machine learning paradigm in which a model learns patterns and structure from unlabeled data — without being told what the "correct" output should be. The model discovers hidden relationships on its own.

Why It Matters

Most real-world data is unlabeled, because labeling is expensive and time-consuming. Unsupervised learning makes it possible to extract value from that data. It is used for customer segmentation, anomaly detection, data compression, and, critically, the pre-training phase of modern LLMs, where models learn language structure from unlabeled text.

How It Works

Without labels, unsupervised algorithms look for structure in data:

Clustering — group similar data points together (e.g., K-means, DBSCAN). The algorithm finds natural groupings without being told what groups exist.
Dimensionality reduction — compress high-dimensional data into fewer dimensions while preserving important structure (e.g., PCA, t-SNE, autoencoders). Used for visualization and feature extraction.
Anomaly detection — identify data points that don't fit the learned pattern. Useful for fraud detection and quality control.
Generative modeling — learn the underlying distribution of data to generate new, similar examples (GANs, VAEs).
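
To make the anomaly detection item above concrete, here is a minimal sketch in Python. It assumes the "learned pattern" is simply the mean and standard deviation of the data, and flags values that lie far from the mean. The transaction amounts, the function name find_anomalies, and the threshold of 2.5 are illustrative assumptions, not part of any particular library or production method.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Flag values whose z-score exceeds the threshold.

    The "pattern" learned here is just the mean and standard deviation
    of the data; no labels mark which values are anomalous.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts (made up for illustration).
amounts = [20.0, 22.5, 19.0, 21.0, 23.0, 18.5, 20.5, 22.0, 19.5, 21.5, 250.0]
outliers = find_anomalies(amounts)
print(outliers)  # [250.0]
```

One caveat on this design: an extreme value inflates the standard deviation it is measured against, so a plain z-score can miss anomalies in small samples. Median-based statistics are a common, more robust alternative.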

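Self-supervised learning (used in LLM pre-training) is often treated as a form of unsupervised learning: no human-provided labels are needed, because the training targets come from the data itself. The model is asked to predict masked or next tokens, and the original text supplies the correct answer.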
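
The way training targets come out of raw text can be sketched in a few lines of Python. This is an illustration only: it assumes whitespace tokenization and a fixed context window, whereas real LLM pipelines use subword tokenizers and much longer contexts. The function names next_token_pairs and mask_token are hypothetical.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) training pairs.

    The target for each context is simply the token that follows it
    in the text, so no human labeling is involved.
    """
    tokens = text.split()  # simplification: split on whitespace
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]
        pairs.append((context, tokens[i]))
    return pairs


def mask_token(tokens, position, mask="[MASK]"):
    """Hide one token; the hidden token becomes the prediction target."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask
    return masked, target


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat

masked, hidden = mask_token(["the", "cat", "sat"], 1)
print(masked, "->", hidden)  # ['the', '[MASK]', 'sat'] -> cat
```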

Example

A retailer uses K-means clustering on customer purchase data to automatically discover customer segments (budget shoppers, premium buyers, occasional browsers) without pre-defining any categories. These segments then inform targeted marketing campaigns.
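
The retailer scenario can be sketched with a small, self-contained K-means implementation in Python. Everything specific here is an assumption made for illustration: the two features (average order value and orders per month), the nine made-up customers, and the deterministic choice of starting centroids. A real project would more likely use a library implementation, scale the features first, and choose the number of clusters from the data.

```python
import math


def kmeans(points, k, iterations=20):
    """Basic K-means (Lloyd's algorithm) on a list of numeric tuples."""
    # Assumption for reproducibility: start from k points spaced evenly
    # through the list instead of a random initialization.
    centroids = [points[i * len(points) // k] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place

        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids

    labels = [
        min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
    ]
    return centroids, labels


# Hypothetical customers: (average order value, orders per month).
customers = [
    (15, 8), (18, 9), (12, 7),        # small, frequent orders
    (120, 3), (135, 4), (110, 2),     # large orders
    (40, 1), (35, 0.5), (45, 1),      # infrequent orders
]

centroids, labels = kmeans(customers, k=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The algorithm only returns numbered clusters. Names such as "budget shoppers" or "premium buyers" are an interpretation that an analyst adds afterwards by inspecting each cluster's centroid.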