Activation Functions: The Decision Makers of Neural Networks
The Story: Your Brain’s Light Switches
Imagine you’re in a huge house with millions of tiny light switches. Each switch decides whether to turn on a light or keep it off. Some switches are very picky—they only turn on for bright sunshine. Others are more relaxed—they’ll turn on even for a small candle flame.
Activation functions are exactly like these switches! They help neural networks decide what information to pass forward and what to ignore.
What Are Activation Functions?
Think of a neural network as a team of workers passing messages. Each worker receives a number, does some math, and then asks: “Should I pass this message on? And how loudly?”
The activation function is the rule each worker uses to answer that question.
Why Do We Need Them?
Without activation functions, a neural network would be like a calculator that can only add and multiply. It couldn’t learn curves, patterns, or anything interesting!
Simple Example:
- Without activation: Input 5 → always gives same boring output
- With activation: Input 5 → “Hmm, is this important? Let me decide!”
Input → [Math: "What did I receive?"] → [Activation Function: "Should I care about it?"] → Output
Sigmoid: The Gentle Squisher
The Story
Imagine a tube of toothpaste. No matter how hard you squeeze (big number) or how gently you press (small number), the toothpaste always comes out between “nothing” and “a full squeeze.”
That’s Sigmoid! It squishes ANY number into a range between 0 and 1.
How It Works
Sigmoid(x) = 1 / (1 + e^(-x))
What this means:
- Very negative number → Output near 0 (almost off)
- Very positive number → Output near 1 (almost fully on)
- Zero → Output is exactly 0.5 (halfway)
Example
| Input | Output | Meaning |
|---|---|---|
| -10 | 0.00 | “Nope, ignore this!” |
| 0 | 0.50 | “Hmm, I’m unsure” |
| +10 | 1.00 | “Yes! Pass it on!” |
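Here is a quick sketch that reproduces the table (the helper name `sigmoid` is just for illustration):

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10, 0, 10]:
    print(x, round(float(sigmoid(x)), 4))
# -10 -> 0.0   (almost off)
#   0 -> 0.5   (unsure)
#  10 -> 1.0   (almost fully on)
```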
When to Use Sigmoid
- Binary classification (Yes/No questions)
- When you need probabilities (0% to 100%)
The Problem
Sigmoid has a vanishing gradient problem. When inputs are very big or very small, the network learns VERY slowly—like trying to push a car uphill through mud.
Tanh: Sigmoid’s Cooler Sibling
The Story
Tanh is like Sigmoid, but instead of toothpaste, imagine a seesaw. It can tilt down (-1), stay flat (0), or tilt up (+1).
How It Works
Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What this means:
- Very negative → Output near -1
- Very positive → Output near +1
- Zero → Output is exactly 0
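A quick check with NumPy's built-in `np.tanh` (which computes exactly this formula):

```python
import numpy as np

# np.tanh implements (e^x - e^(-x)) / (e^x + e^(-x)) directly.
for x in [-10, 0, 10]:
    print(x, round(float(np.tanh(x)), 4))
# -10 -> -1.0   (tilted all the way down)
#   0 ->  0.0   (flat)
#  10 ->  1.0   (tilted all the way up)
```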
Sigmoid vs Tanh
| Feature | Sigmoid | Tanh |
|---|---|---|
| Output Range | 0 to 1 | -1 to +1 |
| Zero-centered? | No | Yes |
| When to use | Final layer for probability | Hidden layers |
Why Tanh is Often Better
Tanh is zero-centered. This means the average output is around 0, which helps the network learn faster!
Example:
- Sigmoid: “The answer is somewhere between 0 and 1”
- Tanh: “The answer is somewhere between negative and positive”
ReLU Family: The Simplest Heroes
ReLU (Rectified Linear Unit)
The Story
Imagine a bouncer at a club. If you’re under 0 (negative), you can’t enter—you get 0. If you’re 0 or above, you pass through unchanged!
ReLU(x) = max(0, x)
How It Works
| Input | Output | Rule |
|---|---|---|
| -5 | 0 | Negative? Block it! |
| 0 | 0 | Zero stays zero |
| +5 | +5 | Positive? Let it through! |
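The whole function is one line of NumPy (the helper name `relu` is just for illustration):

```python
import numpy as np

def relu(x):
    """The bouncer: block negatives (return 0), let positives through unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-5, 0, 5])))   # [0 0 5]
```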
Why ReLU is Popular
- Super fast to compute
- No vanishing gradient for positive values
- Works great in most situations
The “Dying ReLU” Problem
Sometimes neurons get stuck at 0 forever, like a light switch that breaks in the "off" position. If a neuron's input stays negative, its output is 0 and so is its gradient, so its weights stop updating and it never recovers.
Leaky ReLU: Fixing the Dying Problem
Instead of blocking negatives completely, Leaky ReLU lets a tiny bit through.
Leaky ReLU(x) = max(0.01x, x)
Example:
- Input: -10 → Output: -0.1 (not zero!)
- Input: +10 → Output: +10 (unchanged)
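A minimal sketch of that rule (the slope 0.01 is the usual default, but it is a choice, not a law):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Let a small fraction of negative values 'leak' through."""
    return np.maximum(negative_slope * x, x)

print(leaky_relu(np.array([-10.0, 10.0])))   # [-0.1  10. ]
```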
Parametric ReLU (PReLU)
Like Leaky ReLU, but the “leak amount” is learned by the network!
PReLU(x) = max(αx, x)
Where α (alpha) is a learnable parameter.
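In a real framework α is a trainable parameter updated by gradient descent alongside the weights (for example, `torch.nn.PReLU` in PyTorch, which initializes α to 0.25). The NumPy sketch below just fixes α to show the shape of the function:

```python
import numpy as np

def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is meant to be learned during training."""
    return np.maximum(alpha * x, x)

# alpha is fixed here only for illustration; a framework would learn it.
print(prelu(np.array([-10.0, 10.0]), alpha=0.25))   # [-2.5  10. ]
```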
ELU (Exponential Linear Unit)
For negative inputs, ELU uses a smooth curve instead of a straight line.
ELU(x) = x if x > 0
ELU(x) = α(e^x - 1) if x ≤ 0
Benefit: Smoother gradients, better learning!
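A small sketch of the piecewise definition above (α = 1 is the common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Linear for positive inputs; a smooth curve bounded below by -alpha for negatives."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(np.round(elu(np.array([-5.0, -1.0, 0.0, 5.0])), 3))
# [-0.993 -0.632  0.     5.   ]
```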
GELU: The Smart Gatekeeper
The Story
GELU is like a thoughtful bouncer who doesn’t just look at whether you’re positive or negative. It considers probability—“How likely is it that this value is important?”
GELU(x) = x · Φ(x)
Where Φ(x) is the standard normal cumulative distribution function: the probability that a standard normal random variable is less than x. (In practice a fast tanh-based approximation is often used.)
Why GELU is Special
- Used in BERT and GPT (GPT being the model family behind ChatGPT!)
- Smoother than ReLU
- Better for language tasks
Simple Understanding
| Input | GELU says… |
|---|---|
| Very negative | “Probably not important, near 0” |
| Near zero | “Maybe important, partial pass” |
| Very positive | “Definitely important, full pass!” |
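The exact formula fits in a few lines using Python's standard library, since Φ(x) can be written with the error function `math.erf`:

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3, 0, 3]:
    print(x, round(gelu(x), 4))
# -3 -> about -0.004  (very negative: almost fully blocked)
#  0 ->  0.0          (borderline: gate is at 0.5, and 0 * 0.5 = 0)
#  3 -> about  2.996  (very positive: passed through almost unchanged)
```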
Swish: Self-Gated Activation
The Story
Swish multiplies the input by its own Sigmoid! It’s like asking the input to rate itself.
Swish(x) = x · Sigmoid(x)
How It Works
| Input | Sigmoid(Input) | Swish Output |
|---|---|---|
| -5 | 0.007 | -0.03 |
| 0 | 0.5 | 0 |
| +5 | 0.993 | 4.97 |
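The table comes straight from the definition; here is a small sketch that reproduces it:

```python
import numpy as np

def swish(x):
    """The input gates itself: multiply x by its own sigmoid."""
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

print(np.round(swish(np.array([-5.0, 0.0, 5.0])), 2))   # [-0.03  0.    4.97]
```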
Why Swish Works Well
- Non-monotonic: Can actually decrease before increasing
- Smooth: No sharp corners
- Found by Google using automated search!
Softmax: The Probability Distributor
The Story
Imagine you have 5 friends voting for their favorite ice cream. Softmax takes their votes and turns them into percentages that add up to 100%.
How It Works
Softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
In simple terms: “How big is this number compared to ALL the numbers?”
Example
Raw scores (logits): [2, 1, 0.5]
After Softmax: [0.63, 0.23, 0.14]
This means:
- First option: 63% chance
- Second option: 23% chance
- Third option: 14% chance
Total: 100%!
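A short sketch that reproduces those numbers (subtracting the maximum before exponentiating is a standard trick to avoid overflow; it doesn't change the result):

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # shift by the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(np.round(softmax(np.array([2.0, 1.0, 0.5])), 2))   # [0.63 0.23 0.14]
```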
When to Use Softmax
- Multi-class classification (picking ONE answer from many)
- The final layer when you need probabilities
- When choices are mutually exclusive (can only pick one)
Logits: The Raw Scores
What Are Logits?
Logits are the raw, unprocessed outputs before applying Softmax or Sigmoid.
The Story
Think of logits as raw scores before they're converted into probabilities.
| Student | Raw Score (Logit) | After Softmax (Probability) |
|---|---|---|
| Alice | 10 | ≈ 99.2% |
| Bob | 5 | ≈ 0.7% |
| Carol | 3 | ≈ 0.1% |
Notice how Softmax amplifies the gaps: even a modest lead in raw score becomes a huge lead in probability.
Why Are Logits Important?
- More numerically stable during training
- Carry more information than probabilities
- Loss functions like Cross-Entropy work with logits
Logits → Probabilities Pipeline
Neural Network → Logits (raw numbers, any value) → Softmax → Probabilities (each between 0 and 1, summing to 1)
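Here is a sketch of that pipeline, including one reason losses are usually computed from logits: combining the shift-by-max trick with a log-softmax keeps everything numerically stable.

```python
import numpy as np

logits = np.array([10.0, 5.0, 3.0])      # raw network outputs (any real values)

# Convert to probabilities with a numerically stable softmax.
shifted = logits - logits.max()          # shifting by the max avoids overflow in exp
probs = np.exp(shifted) / np.exp(shifted).sum()
print(np.round(probs, 3))                # [0.992 0.007 0.001] -- sums to 1

# Cross-entropy for true class 0, computed directly from logits via log-softmax:
log_probs = shifted - np.log(np.exp(shifted).sum())
print(-log_probs[0])                     # small loss: class 0 already dominates
```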
Comparison Chart
```mermaid
graph TD
    A[Activation Functions] --> B[For Hidden Layers]
    A --> C[For Output Layer]
    B --> D[ReLU Family]
    B --> E[GELU/Swish]
    C --> F[Sigmoid]
    C --> G[Softmax]
    D --> H[ReLU: Fast & Simple]
    D --> I[Leaky ReLU: No dying]
    D --> J[ELU: Smooth negatives]
    F --> K[Binary: Yes/No]
    G --> L[Multi-class: Pick one]
```
Quick Reference Table
| Function | Output Range | Best For | Watch Out |
|---|---|---|---|
| Sigmoid | 0 to 1 | Binary output | Vanishing gradient |
| Tanh | -1 to +1 | Hidden layers | Vanishing gradient |
| ReLU | 0 to ∞ | Most cases | Dying neurons |
| Leaky ReLU | -∞ to ∞ | When ReLU dies | Extra parameter |
| GELU | -0.17 to ∞ | Transformers | Slower to compute |
| Swish | -0.28 to ∞ | Deep networks | Slower to compute |
| Softmax | 0 to 1 (sum=1) | Multi-class | Only for output |
Key Takeaways
- Activation functions add non-linearity: they let networks learn complex patterns
- Sigmoid/Tanh squish values but can slow learning (vanishing gradients)
- ReLU is fast and effective but neurons can "die"
- GELU/Swish are modern favorites for transformers and deep networks
- Softmax converts scores to probabilities for classification
- Logits are raw scores before probability conversion
You Did It!
You now understand the gatekeepers of neural networks! Every time an AI makes a decision, activation functions are working behind the scenes, deciding what’s important and what’s not.
Next time you use ChatGPT or unlock your phone with face recognition, remember: millions of tiny activation functions are saying “Yes!”, “No!”, or “Maybe!” to help make that magic happen.