
What is an Activation Function?
An activation function is a mathematical function applied to each neuron's output in a neural network to introduce non-linearity. Without activation functions, a neural network, regardless of depth, would be equivalent to a single linear transformation, unable to learn complex patterns like image recognition or language understanding.
Why It Matters
Activation functions are what make neural networks powerful. A network of purely linear operations can only learn linear relationships. Activation functions enable non-linear computation, allowing deep networks to approximate virtually any function. The choice of activation function affects training speed, gradient flow, and model performance; GELU replaced ReLU in transformers for good reasons.
How It Works
Why non-linearity is needed:
- Linear function: f(x) = wx + b
- Stacking linear functions: f(g(x)) = w₂(w₁x + b₁) + b₂ = (w₂w₁)x + (w₂b₁ + b₂) = w₃x + b₃ (still linear!)
- Adding non-linearity between layers enables learning complex patterns
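The collapse of stacked linear layers can be verified numerically. A minimal NumPy sketch (variable names are mine, weights are random for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation between them.
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Apply the layers one after the other.
stacked = w2 @ (w1 @ x + b1) + b2

# Collapse both into one equivalent linear layer: w3 = w2 w1, b3 = w2 b1 + b2.
w3 = w2 @ w1
b3 = w2 @ b1 + b2
collapsed = w3 @ x + b3

# Identical outputs: the extra layer added no expressive power.
assert np.allclose(stacked, collapsed)
```

Inserting any non-linear function between the two layers breaks this collapse, which is exactly what an activation function does.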
Common activation functions:
ReLU (Rectified Linear Unit):
- f(x) = max(0, x)
- Simple, fast, and effective; the default for most deep learning
- Problem: "dying ReLU", where a neuron outputs 0 for all inputs; its gradient is then 0 everywhere, so its weights stop updating
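ReLU is simple enough to write in one line. A minimal NumPy sketch (function name is mine) showing the function and its gradient, whose zero region is the source of the dying-ReLU problem:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives become 0, positives pass through.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 0 for negative inputs, 1 for positive inputs.
    # A neuron stuck in the negative region gets zero gradient and stops learning.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y = relu(x)       # array([0., 0., 0., 1.5, 3.])
g = relu_grad(x)  # array([0., 0., 0., 1., 1.])
```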
GELU (Gaussian Error Linear Unit):