
What is Pre-training?
Pre-training is the initial phase of training an AI model on a large, general-purpose dataset before it is adapted for specific tasks. During pre-training, the model learns broad patterns (language structure, visual concepts, or domain knowledge) that serve as a foundation for later specialization.
Why It Matters
Pre-training is what makes foundation models possible. Because the heavy compute investment of training on broad data is made only once, the resulting model can be cheaply adapted to thousands of different tasks. Without pre-training, every new application would require training from scratch, an expensive and data-hungry process.
How It Works
For large language models, pre-training typically uses self-supervised learning:
- Data: trillions of tokens from web pages, books, code, and other text sources.
- Objective: predict the next token given the preceding context (autoregressive, as in GPT) or predict masked tokens (masked language modeling, as in BERT).
- Scale: training runs for weeks or months across thousands of GPUs, costing millions of dollars for frontier models.
- Output: a general-purpose model with broad knowledge and capabilities, but not yet aligned to be helpful or safe.
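The autoregressive objective above can be sketched with a toy example. This is a minimal illustration, not a real training loop: the corpus, the bigram "model", and the smoothing choice are all hypothetical stand-ins for a neural network trained on trillions of tokens. What it shows is the core quantity being minimized, the average negative log-likelihood of each next token given its context.

```python
import numpy as np

# Toy corpus and vocabulary (hypothetical example data).
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
ids = [vocab.index(t) for t in tokens]

# A toy "model": bigram counts turned into next-token probabilities.
# A real pre-trained model replaces this table with a neural network.
V = len(vocab)
counts = np.ones((V, V))  # add-one smoothing so no probability is zero
for prev, nxt in zip(ids, ids[1:]):
    counts[prev, nxt] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Pre-training objective: average negative log-likelihood of each
# next token given the preceding context.
nll = -np.mean([np.log(probs[prev, nxt]) for prev, nxt in zip(ids, ids[1:])])
print(f"next-token loss: {nll:.3f}")
```

Lower loss means the model assigns higher probability to the tokens that actually follow; pre-training at scale is, in essence, driving this number down over a vast corpus.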
After pre-training, models typically undergo:
- Supervised fine-tuning (SFT): training on curated instruction-response pairs
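As a rough sketch of what an instruction-response pair looks like in practice: the field names and prompt template below are illustrative assumptions, not a standard format. A common detail is that the loss during SFT is computed only on the response portion, with the prompt tokens masked out.

```python
# A hypothetical instruction-response pair as found in SFT datasets.
example = {
    "instruction": "Summarize: Pre-training teaches broad patterns.",
    "response": "Models learn general patterns before specialization.",
}

# Concatenate into one training sequence. During SFT, the loss is
# typically applied only to the response; here the mask is built
# per character purely for illustration (real pipelines mask tokens).
prompt = f"User: {example['instruction']}\nAssistant: "
full_text = prompt + example["response"]
loss_mask = [0] * len(prompt) + [1] * len(example["response"])
print(full_text)
```

Masking the prompt keeps the model from being trained to reproduce instructions; it learns only to produce good responses given them.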