Activation Functions: The Decision Makers of Neural Networks
The Story: Your Brain’s Light Switches
Imagine you’re in a huge house with millions of tiny light switches. Each switch decides whether to turn on a light or keep it off. Some switches are very picky—they only turn on for bright sunshine. Others are more relaxed—they’ll turn on even for a small candle flame.
Activation functions are exactly like these switches! They help neural networks decide what information to pass forward and what to ignore.
What Are Activation Functions?
Think of a neural network as a team of workers passing messages. Each worker receives a number, does some math, and then asks: “Should I pass this message on? And how loudly?”
The activation function is the rule each worker uses to answer that question.
Why Do We Need Them?
Without activation functions, a neural network would be like a calculator that can only add and multiply. It couldn’t learn curves, patterns, or anything interesting!
Simple Example:
- Without activation: Input 5 → always gives same boring output
- With activation: Input 5 → “Hmm, is this important? Let me decide!”
Input → [Math: "What did I receive?"] → [Activation Function: "Should I care about it?"] → Output
Sigmoid: The Gentle Squisher
The Story
Imagine a tube of toothpaste. No matter how hard you squeeze (big number) or how gently you press (small number), the toothpaste always comes out between “nothing” and “a full squeeze.”
That’s Sigmoid! It squishes ANY number into a range between 0 and 1.
How It Works
Sigmoid(x) = 1 / (1 + e^(-x))
What this means:
- Very negative number → Output near 0 (almost off)
- Very positive number → Output near 1 (almost fully on)
- Zero → Output is exactly 0.5 (halfway)
Example
| Input | Output | Meaning |
|---|---|---|
| -10 | 0.00 | “Nope, ignore this!” |
| 0 | 0.50 | “Hmm, I’m unsure” |
| +10 | 1.00 | “Yes! Pass it on!” |
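Here is a quick sketch that reproduces the table (the helper name `sigmoid` is just for illustration):

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10, 0, 10]:
    print(x, round(float(sigmoid(x)), 4))
# -10 -> 0.0   (almost off)
#   0 -> 0.5   (unsure)
#  10 -> 1.0   (almost fully on)
```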
When to Use Sigmoid
- Binary classification (Yes/No questions)
- When you need probabilities (0% to 100%)
The Problem
Sigmoid has a vanishing gradient problem. When inputs are very big or very small, the network learns VERY slowly—like trying to push a car uphill through mud.
Tanh: Sigmoid’s Cooler Sibling
The Story
Tanh is like Sigmoid, but instead of toothpaste, imagine a seesaw. It can tilt down (-1), stay flat (0), or tilt up (+1).
How It Works
Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What this means:
- Very negative → Output near -1
- Very positive → Output near +1
- Zero → Output is exactly 0
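A quick check with NumPy's built-in `np.tanh` (which computes exactly this formula):

```python
import numpy as np

# np.tanh implements (e^x - e^(-x)) / (e^x + e^(-x)) directly.
for x in [-10, 0, 10]:
    print(x, round(float(np.tanh(x)), 4))
# -10 -> -1.0   (tilted all the way down)
#   0 ->  0.0   (flat)
#  10 ->  1.0   (tilted all the way up)
```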
Sigmoid vs Tanh
| Feature | Sigmoid | Tanh |
|---|---|---|
| Output Range | 0 to 1 | -1 to +1 |
| Zero-centered? | No | Yes |
| When to use | Final layer for probability | Hidden layers |
Why Tanh is Often Better
Tanh is zero-centered. This means the average output is around 0, which helps the network learn faster!
Example:
- Sigmoid: “The answer is somewhere between 0 and 1”
- Tanh: “The answer is somewhere between negative and positive”
ReLU Family: The Simplest Heroes
ReLU (Rectified Linear Unit)
The Story
Imagine a bouncer at a club. If you’re under 0 (negative), you can’t enter—you get 0. If you’re 0 or above, you pass through unchanged!
ReLU(x) = max(0, x)
How It Works
| Input | Output | Rule |
|---|---|---|
| -5 | 0 | Negative? Block it! |
| 0 | 0 | Zero stays zero |
| +5 | +5 | Positive? Let it through! |
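The whole function is one line of NumPy (the helper name `relu` is just for illustration):

```python
import numpy as np

def relu(x):
    """The bouncer: block negatives (return 0), let positives through unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-5, 0, 5])))   # [0 0 5]
```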
Why ReLU is Popular
- Super fast to compute
- No vanishing gradient for positive values
- Works great in most situations
The “Dying ReLU” Problem
Sometimes neurons get stuck at 0 forever, like a light switch that breaks in the "off" position. If a neuron's input stays negative, its output is 0 and so is its gradient, so its weights stop updating and it never recovers.
Leaky ReLU: Fixing the Dying Problem
Instead of blocking negatives completely, Leaky ReLU lets a tiny bit through.
Leaky ReLU(x) = max(0.01x, x)
Example:
- Input: -10 → Output: -0.1 (not zero!)
- Input: +10 → Output: +10 (unchanged)
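A minimal sketch of that rule (the slope 0.01 is the usual default, but it is a choice, not a law):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Let a small fraction of negative values 'leak' through."""
    return np.maximum(negative_slope * x, x)

print(leaky_relu(np.array([-10.0, 10.0])))   # [-0.1  10. ]
```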
Parametric ReLU (PReLU)
Like Leaky ReLU, but the “leak amount” is learned by the network!
PReLU(x) = max(αx, x)
Where α (alpha) is a learnable parameter.
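In a real framework α is a trainable parameter updated by gradient descent alongside the weights (for example, `torch.nn.PReLU` in PyTorch, which initializes α to 0.25). The NumPy sketch below just fixes α to show the shape of the function:

```python
import numpy as np

def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is meant to be learned during training."""
    return np.maximum(alpha * x, x)

# alpha is fixed here only for illustration; a framework would learn it.
print(prelu(np.array([-10.0, 10.0]), alpha=0.25))   # [-2.5  10. ]
```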
ELU (Exponential Linear Unit)
For negative inputs, ELU uses a smooth curve instead of a straight line.
ELU(x) = x if x > 0
ELU(x) = α(e^x - 1) if x ≤ 0
Benefit: Smoother gradients, better learning!
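A small sketch of the piecewise definition above (α = 1 is the common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Linear for positive inputs; a smooth curve bounded below by -alpha for negatives."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(np.round(elu(np.array([-5.0, -1.0, 0.0, 5.0])), 3))
# [-0.993 -0.632  0.     5.   ]
```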
GELU: The Smart Gatekeeper
The Story
GELU is like a thoughtful bouncer who doesn’t just look at whether you’re positive or negative. It considers probability—“How likely is it that this value is important?”
GELU(x) = x · Φ(x)
Where Φ(x) is the standard normal cumulative distribution function: the probability that a standard normal random variable is less than x. (In practice a fast tanh-based approximation is often used.)
Why GELU is Special
- Used in BERT and GPT (GPT being the model family behind ChatGPT!)
- Smoother than ReLU
- Better for language tasks
Simple Understanding
| Input | GELU says… |
|---|---|
| Very negative | “Probably not important, near 0” |
| Near zero | “Maybe important, partial pass” |
| Very positive | “Definitely important, full pass!” |
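The exact formula fits in a few lines using Python's standard library, since Φ(x) can be written with the error function `math.erf`:

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3, 0, 3]:
    print(x, round(gelu(x), 4))
# -3 -> about -0.004  (very negative: almost fully blocked)
#  0 ->  0.0          (borderline: gate is at 0.5, and 0 * 0.5 = 0)
#  3 -> about  2.996  (very positive: passed through almost unchanged)
```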
Swish: Self-Gated Activation
The Story
Swish multiplies the input by its own Sigmoid! It’s like asking the input to rate itself.
Swish(x) = x · Sigmoid(x)
How It Works
| Input | Sigmoid(Input) | Swish Output |
|---|---|---|
| -5 | 0.007 | -0.03 |
| 0 | 0.5 | 0 |
| +5 | 0.993 | 4.97 |
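The table comes straight from the definition; here is a small sketch that reproduces it:

```python
import numpy as np

def swish(x):
    """The input gates itself: multiply x by its own sigmoid."""
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

print(np.round(swish(np.array([-5.0, 0.0, 5.0])), 2))   # [-0.03  0.    4.97]
```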
Why Swish Works Well
- Non-monotonic: Can actually decrease before increasing
- Smooth: No sharp corners
- Found by Google using automated search!
Softmax: The Probability Distributor
The Story
Imagine you have 5 friends voting for their favorite ice cream. Softmax takes their votes and turns them into percentages that add up to 100%.
How It Works
Softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
In simple terms: “How big is this number compared to ALL the numbers?”
Example
Raw scores (logits): [2, 1, 0.5]
After Softmax: [0.63, 0.23, 0.14]
This means:
- First option: 63% chance
- Second option: 23% chance
- Third option: 14% chance
Total: 100%!
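A short sketch that reproduces those numbers (subtracting the maximum before exponentiating is a standard trick to avoid overflow; it doesn't change the result):

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # shift by the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

print(np.round(softmax(np.array([2.0, 1.0, 0.5])), 2))   # [0.63 0.23 0.14]
```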
When to Use Softmax
- Multi-class classification (picking ONE answer from many)
- The final layer when you need probabilities
- When choices are mutually exclusive (can only pick one)
Logits: The Raw Scores
What Are Logits?
Logits are the raw, unprocessed outputs before applying Softmax or Sigmoid.
The Story
Think of logits as raw scores before they're converted into probabilities.
| Student | Raw Score (Logit) | After Softmax (Probability) |
|---|---|---|
| Alice | 10 | ≈ 99.2% |
| Bob | 5 | ≈ 0.7% |
| Carol | 3 | ≈ 0.1% |
Notice how Softmax amplifies the gaps: even a modest lead in raw score becomes a huge lead in probability.
Why Are Logits Important?
- More numerically stable during training
- Carry more information than probabilities
- Loss functions like Cross-Entropy work with logits
Logits → Probabilities Pipeline
Neural Network → Logits (raw numbers, any value) → Softmax → Probabilities (each between 0 and 1, summing to 1)
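Here is a sketch of that pipeline, including one reason losses are usually computed from logits: combining the shift-by-max trick with a log-softmax keeps everything numerically stable.

```python
import numpy as np

logits = np.array([10.0, 5.0, 3.0])      # raw network outputs (any real values)

# Convert to probabilities with a numerically stable softmax.
shifted = logits - logits.max()          # shifting by the max avoids overflow in exp
probs = np.exp(shifted) / np.exp(shifted).sum()
print(np.round(probs, 3))                # [0.992 0.007 0.001] -- sums to 1

# Cross-entropy for true class 0, computed directly from logits via log-softmax:
log_probs = shifted - np.log(np.exp(shifted).sum())
print(-log_probs[0])                     # small loss: class 0 already dominates
```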
Comparison Chart
```mermaid
graph TD
    A[Activation Functions] --> B[For Hidden Layers]
    A --> C[For Output Layer]
    B --> D[ReLU Family]
    B --> E[GELU/Swish]
    C --> F[Sigmoid]
    C --> G[Softmax]
    D --> H[ReLU: Fast & Simple]
    D --> I[Leaky ReLU: No dying]
    D --> J[ELU: Smooth negatives]
    F --> K[Binary: Yes/No]
    G --> L[Multi-class: Pick one]
```
Quick Reference Table
| Function | Output Range | Best For | Watch Out |
|---|---|---|---|
| Sigmoid | 0 to 1 | Binary output | Vanishing gradient |
| Tanh | -1 to +1 | Hidden layers | Vanishing gradient |
| ReLU | 0 to ∞ | Most cases | Dying neurons |
| Leaky ReLU | -∞ to ∞ | When ReLU dies | Extra parameter |
| GELU | -0.17 to ∞ | Transformers | Slower to compute |
| Swish | -0.28 to ∞ | Deep networks | Slower to compute |
| Softmax | 0 to 1 (sum=1) | Multi-class | Only for output |
Key Takeaways
- Activation functions add non-linearity: they let networks learn complex patterns
- Sigmoid/Tanh squish values but can slow learning (vanishing gradients)
- ReLU is fast and effective but neurons can "die"
- GELU/Swish are modern favorites for transformers and deep networks
- Softmax converts scores to probabilities for classification
- Logits are raw scores before probability conversion
You Did It!
You now understand the gatekeepers of neural networks! Every time an AI makes a decision, activation functions are working behind the scenes, deciding what’s important and what’s not.
Next time you use ChatGPT or unlock your phone with face recognition, remember: millions of tiny activation functions are saying “Yes!”, “No!”, or “Maybe!” to help make that magic happen.