What is an Activation Function?

An activation function is a mathematical function applied to a neuron's weighted sum of inputs to produce its output, introducing non-linearity into a neural network. Without activation functions, a neural network, regardless of depth, would be equivalent to a single linear transformation and could not learn the complex patterns needed for tasks such as image recognition or language understanding.

Why It Matters

Activation functions are what make neural networks expressive. A network of purely linear operations can only learn linear relationships. Activation functions enable non-linear computation, which lets sufficiently large networks approximate a very broad class of functions. The choice of activation function affects training speed, gradient flow, and model performance. For example, transformer models such as BERT and GPT use GELU rather than ReLU; GELU is smooth and keeps a small gradient for moderately negative inputs, where ReLU's gradient is exactly zero.

How It Works

Why non-linearity is needed:

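Linear function: f(x) = wx + b
Stacking linear functions: with g(x) = w₁x + b₁ and f(x) = w₂x + b₂, f(g(x)) = w₂(w₁x + b₁) + b₂ = w₃x + b₃, where w₃ = w₂w₁ and b₃ = w₂b₁ + b₂ (still linear!)
Adding a non-linearity between layers breaks this collapse, so extra layers add real representational power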
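
The collapse described above can be checked numerically. The following is a minimal sketch using scalar layers; the parameter values and helper names (linear, stacked, stacked_with_relu) are illustrative assumptions, not taken from any library.

```python
def linear(w, b):
    """Return the scalar linear function x -> w*x + b."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

# Illustrative parameters (arbitrary values chosen for this sketch).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

layer1 = linear(w1, b1)
layer2 = linear(w2, b2)

# Two stacked linear layers collapse into one: w3 = w2*w1, b3 = w2*b1 + b2.
w3, b3 = w2 * w1, w2 * b1 + b2
collapsed = linear(w3, b3)

def stacked(x):
    return layer2(layer1(x))

def stacked_with_relu(x):
    return layer2(relu(layer1(x)))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_matches = all(abs(stacked(x) - collapsed(x)) < 1e-12 for x in xs)
relu_matches = all(abs(stacked_with_relu(x) - collapsed(x)) < 1e-12 for x in xs)

print(linear_matches)  # True: the stack is just another linear function
print(relu_matches)    # False: ReLU in between breaks the collapse
```

With ReLU between the layers, the output for negative pre-activations no longer follows the single line w₃x + b₃, which is what lets deeper networks represent more than a linear map.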

Common activation functions:

ReLU (Rectified Linear Unit):

f(x) = max(0, x)
Simple, fast, and effective; a common default for hidden layers in deep networks
Problem ("dying ReLU"): a neuron whose weighted sum is negative for every input outputs 0, receives zero gradient, and stops learning
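
As a rough illustration of the dying ReLU problem, the sketch below uses a hypothetical single neuron whose large negative bias keeps its weighted sum below zero for all of the sample inputs. The weight, bias, and inputs are made-up values.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of ReLU; the value at exactly x = 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron: the bias is so negative that w*x + b < 0 for all inputs below.
w, b = 0.5, -10.0
inputs = [-3.0, 0.0, 2.0, 5.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # [0.0, 0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0, 0.0]
```

Because every gradient is zero, a gradient-based update would leave w and b unchanged, so the neuron cannot recover on these inputs.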

GELU (Gaussian Error Linear Unit):

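f(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution
Smooth, ReLU-like curve: it scales each input by the probability Φ(x) instead of gating it by sign, an idea motivated by stochastic regularizers such as dropout
Used in BERT, the GPT family, and many other transformers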
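
The GELU formula can be written with the error function from Python's standard library, since Φ(x) = 0.5 · (1 + erf(x / √2)). The sketch below also includes the commonly used tanh-based approximation; the function names gelu and gelu_tanh are chosen for this example.

```python
import math

def gelu(x):
    # Exact form: x * Phi(x), with Phi expressed through the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh-based approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, gelu(x), gelu_tanh(x))
```

Unlike ReLU, GELU returns small negative values for moderately negative inputs (for example at x = -1) rather than exactly 0.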

SiLU / Swish:

f(x) = x · σ(x), where σ is the sigmoid function
Similar in shape to GELU; popular in vision models such as EfficientNet and used inside the gated SwiGLU feed-forward layers of LLaMA
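
A minimal sketch of SiLU, written directly from the formula above. The helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU / Swish: the input scaled by its own sigmoid.
    return x * sigmoid(x)

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(x, silu(x))
```

For large positive x the sigmoid factor approaches 1, so SiLU behaves like the identity; for negative x it dips slightly below zero before flattening out toward 0.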

Sigmoid:

f(x) = 1 / (1 + e^(-x))
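Outputs between 0 and 1, so the result is often interpreted as a probability
Problem: vanishing gradients in deep networks, because its derivative is at most 0.25 and approaches 0 when |x| is large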
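
To make the vanishing-gradient issue concrete, the sketch below computes the sigmoid derivative, σ(x) · (1 − σ(x)), and a best-case product of derivatives across a stack of sigmoid layers. It deliberately ignores the weight terms that also appear in backpropagation, and the depth of 10 is an arbitrary choice for illustration.

```python
import math

def sigmoid(x):
    # Numerically stable form that avoids overflow in exp for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)      # the derivative peaks at x = 0
depth = 10                        # hypothetical number of stacked sigmoid layers
upper_bound = max_grad ** depth   # best case for the product of activation derivatives

print(max_grad)      # 0.25
print(upper_bound)   # below 1e-6 even in the best case
```

Away from x = 0 the derivative is smaller still, so in practice the product shrinks faster than this bound suggests.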

Tanh:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Outputs between -1 and 1, centered on 0
Often better than sigmoid for hidden layers because its outputs are zero-centered, but it still saturates for large |x| and suffers from vanishing gradients
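
The sketch below checks the tanh formula against Python's built-in math.tanh and evaluates its derivative, 1 − tanh²(x). The function names are illustrative.

```python
import math

def tanh_from_definition(x):
    # Direct transcription of the formula; fine for moderate x, but
    # math.tanh is preferable in practice because exp can overflow.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

print(tanh_from_definition(0.5), math.tanh(0.5))
print(tanh_grad(0.0))  # 1.0, compared with a peak of 0.25 for sigmoid
print(tanh_grad(5.0))  # close to 0: the function has saturated
```

The larger peak derivative is one reason tanh tends to train better than sigmoid in hidden layers, while the near-zero derivative at x = 5 shows that saturation remains a problem.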

Softmax:

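softmax(x)ᵢ = e^(xᵢ) / Σⱼ e^(xⱼ)
Converts a vector of raw scores into probabilities that sum to 1
Used as the final layer for multi-class classification tasks
Also used in the attention mechanism of transformers, where it turns attention scores into weights that sum to 1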
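
A minimal softmax sketch in plain Python. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the example scores are made up.

```python
import math

def softmax(scores):
    # Shifting by the max avoids overflow in exp without changing the output.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 up to floating-point rounding

# Large scores that would overflow a naive exp still work.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

The ordering of the scores is preserved, and larger gaps between scores produce more peaked distributions.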

Choosing an activation function:

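Hidden layers: GELU (transformers such as BERT and GPT), SiLU (many recent vision and language models), ReLU (CNNs and a reasonable general-purpose default)
Output layer: softmax (multi-class classification), sigmoid (binary classification), linear, i.e. no activation (regression)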

Example

In a neural network that classifies images, each layer computes weighted sums and then applies ReLU: negative results become 0 and positive results pass through unchanged. This simple non-linearity, stacked across dozens of layers, lets the network build up a hierarchy of features: typically edges in the early layers, textures and object parts in the intermediate layers, and whole objects in the final layers.
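
The per-layer computation in this example can be sketched as a single fully connected layer followed by ReLU. The function name dense_relu and all input, weight, and bias values are hypothetical; a real image classifier would use convolutional layers and far larger tensors.

```python
def relu(x):
    return max(0.0, x)

def dense_relu(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then ReLU."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(z))
    return outputs

# Hypothetical inputs and parameters for a layer with two neurons.
pixels = [0.5, -1.0, 2.0]
weights = [[1.0, 2.0, 0.0],
           [0.5, 0.0, 1.0]]
biases = [0.0, -0.25]

hidden = dense_relu(pixels, weights, biases)
print(hidden)  # [0.0, 2.0]: the first weighted sum is -1.5 and is clipped to 0
```

Feeding hidden into another call to dense_relu with its own weights would model the next layer in the stack.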