
What is an Activation Function?
An activation function is a mathematical function applied to each neuron's output in a neural network to introduce non-linearity. Without activation functions, a neural network of any depth would be equivalent to a single linear transformation, unable to learn the complex patterns needed for tasks like image recognition or language understanding.
Why It Matters
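Activation functions are what make neural networks powerful. A network of purely linear operations can only learn linear relationships. Activation functions enable non-linear computation, allowing sufficiently large networks to approximate a very broad class of functions (in theory, any continuous function on a bounded domain, to arbitrary accuracy). The choice of activation function affects training speed, gradient flow, and model performance. For example, transformers such as BERT and GPT use GELU rather than ReLU, a smooth alternative whose gradient is not exactly zero for negative inputs.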
How It Works
Why non-linearity is needed:
- Linear function: f(x) = wx + b
- Stacking linear functions: f(g(x)) = w₂(w₁x + b₁) + b₂ = w₃x + b₃, where w₃ = w₂w₁ and b₃ = w₂b₁ + b₂ (still linear!)
- Adding non-linearity between layers enables learning complex patterns
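The collapse described above can be checked numerically. The following is a minimal sketch with scalar weights chosen arbitrarily for illustration (they do not come from any real model): two stacked linear layers match a single linear function everywhere, while placing ReLU between them breaks that equivalence.

```python
# Two scalar "layers" with arbitrary illustrative weights (assumed values).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def layer1(x):
    return w1 * x + b1

def layer2(h):
    return w2 * h + b2

# Parameters of the single linear function the stack collapses to.
w3 = w2 * w1
b3 = w2 * b1 + b2

def collapsed(x):
    return w3 * x + b3

def relu(x):
    return max(0.0, x)

def with_relu(x):
    # Same two layers, but with a non-linearity in between.
    return layer2(relu(layer1(x)))

# Without an activation, the stack and the collapsed function agree.
for x in (-2.0, 0.0, 1.5):
    assert abs(layer2(layer1(x)) - collapsed(x)) < 1e-12

# With ReLU, they disagree wherever layer1 produces a negative value.
linear_result = collapsed(-2.0)
relu_result = with_relu(-2.0)
```

At x = -2.0 the first layer outputs a negative value, so ReLU clips it and the two results differ; no single pair (w₃, b₃) can reproduce that behavior for all inputs.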
Common activation functions:
ReLU (Rectified Linear Unit):
- f(x) = max(0, x)
- Simple, fast, and effective: a common default for hidden layers, especially in CNNs
- Problem: "dying ReLU", where a neuron whose pre-activation is negative for every input always outputs 0, receives zero gradient, and stops learning
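As a rough illustration of the dying ReLU problem, the sketch below uses a hypothetical single neuron whose bias is so negative that its pre-activation never rises above zero on the sample inputs. The weight, bias, and inputs are made-up values, and the derivative at exactly 0 is taken to be 0 by convention.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    # (the value at exactly 0 is a convention; 0 is assumed here).
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron: a large negative bias keeps w*x + b below zero.
w, b = 0.5, -10.0
inputs = [-1.0, 0.0, 2.0, 4.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Gradient of the neuron's output with respect to w is relu_grad(z) * x.
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
```

Every output and every weight gradient is zero here, so gradient descent has no signal with which to move `w` or `b` back into a useful range.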
GELU (Gaussian Error Linear Unit):
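- f(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution
- Smooth, ReLU-like curve; it can be read as the expected value of a stochastic regularizer that randomly keeps or zeroes the input with probability Φ(x), but the function itself is deterministic
- Used in BERT, the GPT family, and many other transformers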
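The GELU formula can be evaluated directly with the error function from Python's standard library, since Φ(x) = 0.5 · (1 + erf(x / √2)). This is a minimal scalar sketch of the exact form, not the implementation used by any particular framework (many use a faster approximation).

```python
import math

def std_normal_cdf(x):
    # Phi(x) for the standard normal distribution, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * std_normal_cdf(x)

def relu(x):
    return max(0.0, x)

# Compare GELU and ReLU on a few sample points.
samples = [-3.0, -1.0, 0.0, 1.0, 3.0]
comparison = [(x, gelu(x), relu(x)) for x in samples]
```

For large positive inputs GELU is nearly identical to ReLU, while for moderately negative inputs it returns a small negative value instead of exactly 0.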
SiLU / Swish:
- f(x) = x · σ(x), where σ is the sigmoid function
- Similar in shape to GELU; popular in vision models and used in LLaMA (as the gate in its SwiGLU feed-forward layers)
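SiLU follows the same "input times a gate" pattern as GELU, with the sigmoid as the gate. A minimal scalar sketch, assuming inputs of moderate size (this naive sigmoid can overflow for very large negative values):

```python
import math

def sigmoid(x):
    # Naive form; adequate for moderate |x|.
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

# Sample values on both sides of zero.
values = {x: silu(x) for x in (-4.0, -1.0, 0.0, 1.0, 4.0)}
```

Like GELU, SiLU dips slightly below zero for negative inputs and approaches the identity for large positive ones.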
Sigmoid:
- f(x) = 1 / (1 + e^(-x))
- Outputs between 0 and 1, so it can be read as a probability
- Problem: vanishing gradients in deep networks, because its derivative is at most 0.25 and approaches 0 when the input is far from zero
Tanh:
- f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Outputs between -1 and 1
- Often better than sigmoid for hidden layers because its output is zero-centered, but it still saturates for large inputs and suffers from vanishing gradients
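The vanishing gradient issue for sigmoid and tanh can be made concrete with their derivatives. The sketch below is a deliberate simplification: it ignores the weights and multiplies only the activation derivatives across a hypothetical 10-layer chain, which is enough to show how quickly the signal can shrink.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x == 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; peaks at 1.0 when x == 0.
    return 1.0 - math.tanh(x) ** 2

depth = 10  # assumed number of layers, for illustration only

# Best case: every pre-activation sits exactly at 0.
best_case_sigmoid = sigmoid_grad(0.0) ** depth
best_case_tanh = tanh_grad(0.0) ** depth

# Saturated case: a single layer with a large pre-activation.
saturated_sigmoid = sigmoid_grad(6.0)
saturated_tanh = tanh_grad(6.0)
```

Even in the best case the sigmoid chain scales the gradient by 0.25 per layer, which is below one in a million after 10 layers. Tanh does better near zero, but both derivatives collapse toward 0 once the input saturates.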
Softmax:
- Converts a vector of raw scores into probabilities that sum to 1
- Used as the final layer for classification tasks
- Also central to the attention mechanism in transformers, where it turns attention scores into weights that sum to 1
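A common way to compute softmax is to subtract the largest score before exponentiating, which leaves the result unchanged but avoids overflow. The following is a small sketch over a plain Python list; the scores are made-up example values.

```python
import math

def softmax(scores):
    # Subtracting the max does not change the result but keeps exp() in range.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for three classes.
probs = softmax([2.0, 1.0, 0.1])

# Very large scores would overflow a naive exp(), but work here.
large = softmax([1000.0, 1000.0])
```

The outputs preserve the ordering of the input scores, are all positive, and sum to 1 (up to floating-point rounding).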
Choosing an activation function:
- Hidden layers: GELU (transformers), SiLU (newer models), ReLU (CNNs, general)
- Output layer: softmax (multi-class classification), sigmoid (binary classification), linear (regression)
Example
In a neural network classifying images, each layer computes weighted sums and then applies ReLU: any negative result becomes 0, and positive results pass through unchanged. This simple non-linearity, stacked across dozens of layers, enables the network to build up increasingly abstract features: early layers tend to detect edges, intermediate layers textures and object parts, and the final layers whole objects.
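The "weighted sum, then ReLU" step for a single layer can be sketched in a few lines. The input values, weights, and biases below are hypothetical and tiny; a real image model would have far more of each and would learn the weights from data.

```python
def relu(x):
    return max(0.0, x)

def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU.

    weights holds one row of input weights per neuron.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(z))
    return outputs

# Hypothetical three-pixel input and a layer with two neurons.
pixels = [0.2, 0.8, 0.5]
weights = [
    [1.0, -1.0, 0.0],  # neuron 0: weighted sum is negative, so ReLU outputs 0
    [0.5, 0.5, 0.5],   # neuron 1: weighted sum is positive, so it passes through
]
biases = [0.0, -0.25]

hidden = dense_relu(pixels, weights, biases)
```

Feeding `hidden` into another call to `dense_relu` with a second set of weights would give the next layer, which is how the stacking in the example above is built up.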