
Activation Functions: The Decision Makers of Neural Networks

The Story: Your Brain’s Light Switches

Imagine you’re in a huge house with millions of tiny light switches. Each switch decides whether to turn on a light or keep it off. Some switches are very picky—they only turn on for bright sunshine. Others are more relaxed—they’ll turn on even for a small candle flame.

Activation functions are exactly like these switches! They help neural networks decide what information to pass forward and what to ignore.


What Are Activation Functions?

Think of a neural network as a team of workers passing messages. Each worker receives a number, does some math, and then asks: “Should I pass this message on? And how loudly?”

The activation function is the rule each worker uses to answer that question.

Why Do We Need Them?

Without activation functions, a neural network would be like a calculator that can only add and multiply. It couldn’t learn curves, patterns, or anything interesting!

Simple Example:

  • Without activation: no matter how many layers you stack, the network can only draw straight lines
  • With activation: each layer can ask, “Hmm, is this important? Let me decide!”
Input → [Math] → [Activation Function] → Output
         ↑              ↑
     "What did      "Should I
      I receive?"    care about it?"
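
Here’s a tiny sketch of that idea in plain Python (just an illustration): stack two layers that only multiply and add and they collapse into a single straight-line formula, but put even a simple activation like ReLU between them and the collapse breaks.

def linear(x, w, b):
    """One 'worker' that can only multiply and add."""
    return w * x + b

def relu(x):
    """A simple activation: block negatives, pass positives."""
    return max(0, x)

# Two purely linear layers: y = 2x + 1, then z = 3y - 4
for x in (-2.0, 0.0, 5.0):
    stacked = linear(linear(x, 2.0, 1.0), 3.0, -4.0)
    single  = linear(x, 6.0, -1.0)    # the whole stack is just z = 6x - 1
    print(stacked, single)            # always identical

# Insert ReLU between the layers and the outputs (-4, -1, 29)
# no longer lie on any single straight line.
for x in (-2.0, 0.0, 5.0):
    print(linear(relu(linear(x, 2.0, 1.0)), 3.0, -4.0))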

Sigmoid: The Gentle Squisher

The Story

Imagine a tube of toothpaste. No matter how hard you squeeze (big number) or how gently you press (small number), the toothpaste always comes out between “nothing” and “a full squeeze.”

That’s Sigmoid! It squishes ANY number into a range between 0 and 1.

How It Works

Sigmoid(x) = 1 / (1 + e^(-x))

What this means:

  • Very negative number → Output near 0 (almost off)
  • Very positive number → Output near 1 (almost fully on)
  • Zero → Output is exactly 0.5 (halfway)

Example

Input   Output   Meaning
 -10     0.00    “Nope, ignore this!”
   0     0.50    “Hmm, I’m unsure”
 +10     1.00    “Yes! Pass it on!”
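
If you want to check those numbers yourself, here’s a quick sketch of Sigmoid in plain Python:

import math

def sigmoid(x):
    """Squish any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-10, 0, 10):
    print(x, round(sigmoid(x), 2))   # -10 -> 0.0, 0 -> 0.5, 10 -> 1.0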

When to Use Sigmoid

  • Binary classification (Yes/No questions)
  • When you need probabilities (0% to 100%)

The Problem

Sigmoid has a vanishing gradient problem. When inputs are very big or very small, the curve is almost flat, so the gradient (the learning signal) is nearly zero and the network learns VERY slowly, like trying to push a car uphill through mud.


Tanh: Sigmoid’s Cooler Sibling

The Story

Tanh is like Sigmoid, but instead of toothpaste, imagine a seesaw. It can tilt down (-1), stay flat (0), or tilt up (+1).

How It Works

Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

What this means:

  • Very negative → Output near -1
  • Very positive → Output near +1
  • Zero → Output is exactly 0
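
A quick plain-Python sketch of Tanh (Python’s math module also has a built-in math.tanh that does the same thing):

import math

def tanh(x):
    """Squish any real number into the range (-1, 1), centered on 0."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in (-10, 0, 10):
    print(x, round(tanh(x), 2))      # -10 -> -1.0, 0 -> 0.0, 10 -> 1.0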

Sigmoid vs Tanh

Feature          Sigmoid                       Tanh
Output range     0 to 1                        -1 to +1
Zero-centered?   No                            Yes
When to use      Final layer for probability   Hidden layers

Why Tanh is Often Better

Tanh is zero-centered. This means the average output is around 0, which helps the network learn faster!

Example:

  • Sigmoid: “The answer is somewhere between 0 and 1”
  • Tanh: “The answer is somewhere between negative and positive”

ReLU Family: The Simplest Heroes

ReLU (Rectified Linear Unit)

The Story

Imagine a bouncer at a club. If you’re under 0 (negative), you can’t enter—you get 0. If you’re 0 or above, you pass through unchanged!

ReLU(x) = max(0, x)

How It Works

Input   Output   Rule
  -5      0      Negative? Block it!
   0      0      Zero stays zero
  +5     +5      Positive? Let it through!
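
The bouncer rule is one line of code; here’s a quick sketch:

def relu(x):
    """Negative? Block it (return 0). Zero or positive? Pass it through."""
    return max(0, x)

for x in (-5, 0, 5):
    print(x, relu(x))                # -5 -> 0, 0 -> 0, 5 -> 5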

Why ReLU is Popular

  1. Super fast to compute
  2. No vanishing gradient for positive values
  3. Works great in most situations

The “Dying ReLU” Problem

Sometimes neurons get stuck at 0 forever. Like a light switch that breaks in the “off” position.


Leaky ReLU: Fixing the Dying Problem

Instead of blocking negatives completely, Leaky ReLU lets a tiny bit through.

Leaky ReLU(x) = max(0.01x, x)

Example:

  • Input: -10 → Output: -0.1 (not zero!)
  • Input: +10 → Output: +10 (unchanged)
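
A quick sketch, using the common slope of 0.01 for the negative side:

def leaky_relu(x, slope=0.01):
    """Let a small fraction of negative inputs leak through instead of blocking them."""
    return x if x > 0 else slope * x

print(leaky_relu(-10))   # -0.1  (not zero!)
print(leaky_relu(10))    # 10    (unchanged)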

Parametric ReLU (PReLU)

Like Leaky ReLU, but the “leak amount” is learned by the network!

PReLU(x) = max(αx, x)

Where α (alpha) is a learnable parameter.
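
The function itself looks just like Leaky ReLU; the difference is that a real framework treats α as a weight and updates it during training. In this sketch α is simply passed in by hand, and the 0.25 starting value is only a common choice, not a rule:

def prelu(x, alpha):
    """Like Leaky ReLU, but alpha is learned by the network instead of fixed by hand."""
    return x if x > 0 else alpha * x

# During training, alpha would start at some value (often around 0.25)
# and be updated by gradient descent just like any other weight.
print(prelu(-10, alpha=0.25))   # -2.5
print(prelu(10,  alpha=0.25))   # 10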

ELU (Exponential Linear Unit)

For negative inputs, ELU uses a smooth curve instead of a straight line.

ELU(x) = x           if x > 0
ELU(x) = α(e^x - 1)  if x ≤ 0

Benefit: Smoother gradients, better learning!
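
A quick sketch with the usual α = 1, so negative inputs curve smoothly toward -1 instead of being cut to 0:

import math

def elu(x, alpha=1.0):
    """Pass positives unchanged; bend negatives smoothly toward -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(5))              # 5
print(round(elu(-5), 2))   # -0.99  (approaches -1, never a hard cut to 0)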


GELU: The Smart Gatekeeper

The Story

GELU is like a thoughtful bouncer who doesn’t just look at whether you’re positive or negative. It considers probability—“How likely is it that this value is important?”

GELU(x) ≈ x · Φ(x)

Where Φ(x) is the probability that a standard normal random value is less than x (the Gaussian CDF).
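
Here’s a quick sketch using the exact Gaussian CDF via math.erf (many libraries use a fast tanh-based approximation instead, but the idea is the same):

import math

def gelu(x):
    """Weight the input by the probability that a standard normal value is below x."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal CDF
    return x * phi

for x in (-3, -1, 0, 1, 3):
    print(x, round(gelu(x), 3))
# -3 -> -0.004, -1 -> -0.159, 0 -> 0.0, 1 -> 0.841, 3 -> 2.996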

Why GELU is Special

  • Used in BERT and GPT (the AI models behind ChatGPT!)
  • Smoother than ReLU
  • Better for language tasks

Simple Understanding

Input           GELU says…
Very negative   “Probably not important, near 0”
Near zero       “Maybe important, partial pass”
Very positive   “Definitely important, full pass!”

Swish: Self-Gated Activation

The Story

Swish multiplies the input by its own Sigmoid! It’s like asking the input to rate itself.

Swish(x) = x · Sigmoid(x)

How It Works

Input   Sigmoid(Input)   Swish Output
  -5        0.007           -0.03
   0        0.5              0
  +5        0.993           +4.97
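
A quick sketch, which is literally the Sigmoid code multiplied by the input:

import math

def swish(x):
    """The input multiplied by its own sigmoid: the input 'gates' itself."""
    return x * (1.0 / (1.0 + math.exp(-x)))

for x in (-5, 0, 5):
    print(x, round(swish(x), 2))     # -5 -> -0.03, 0 -> 0.0, 5 -> 4.97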

Why Swish Works Well

  • Non-monotonic: Can actually decrease before increasing
  • Smooth: No sharp corners
  • Found by Google using automated search!

Softmax: The Probability Distributor

The Story

Imagine you have 5 friends voting for their favorite ice cream. Softmax takes their votes and turns them into percentages that add up to 100%.

How It Works

Softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

In simple terms: “How big is this number compared to ALL the numbers?”

Example

Raw scores (logits): [2, 1, 0.5]

After Softmax: [0.63, 0.23, 0.14]

This means:

  • First option: 63% chance
  • Second option: 23% chance
  • Third option: 14% chance

Total: 100%!
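
Here’s a quick sketch that reproduces the example above (subtracting the max first is a standard trick to keep the exponentials from overflowing; it doesn’t change the result):

import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]    # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 2) for p in softmax([2, 1, 0.5])])   # [0.63, 0.23, 0.14]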

When to Use Softmax

  • Multi-class classification (picking ONE answer from many)
  • The final layer when you need probabilities
  • When choices are mutually exclusive (can only pick one)

Logits: The Raw Scores

What Are Logits?

Logits are the raw, unprocessed outputs before applying Softmax or Sigmoid.

The Story

Think of logits as raw test scores before converting to grades.

Student   Raw Score (Logit)   After Softmax (Grade)
Alice            10                 99.2%
Bob               5                  0.7%
Carol             3                  0.1%

Why Are Logits Important?

  1. More numerically stable during training
  2. Carry more information than probabilities
  3. Loss functions like Cross-Entropy work with logits

Logits → Probabilities Pipeline

Neural Network → Logits → Softmax → Probabilities
                  ↑                      ↑
            Raw numbers          Adds up to 1
            (any value)          (0 to 1 each)
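
As a quick sketch of that pipeline (the numbers match the student table above; most training frameworks compute cross-entropy directly from the logits for exactly this stability reason):

import math

logits = [10.0, 5.0, 3.0]                            # raw network outputs: any real values
exps = [math.exp(x - max(logits)) for x in logits]   # shift by the max for stability
probs = [e / sum(exps) for e in exps]                # after softmax: each in (0, 1)
print([round(p, 3) for p in probs])                  # [0.992, 0.007, 0.001]
print(round(sum(probs), 3))                          # 1.0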

Comparison Chart

graph TD
    A[Activation Functions] --> B[For Hidden Layers]
    A --> C[For Output Layer]
    B --> D[ReLU Family]
    B --> E[GELU/Swish]
    C --> F[Sigmoid]
    C --> G[Softmax]
    D --> H[ReLU: Fast & Simple]
    D --> I[Leaky ReLU: No dying]
    D --> J[ELU: Smooth negatives]
    F --> K[Binary: Yes/No]
    G --> L[Multi-class: Pick one]

Quick Reference Table

Function     Output Range        Best For         Watch Out
Sigmoid      0 to 1              Binary output    Vanishing gradient
Tanh         -1 to +1            Hidden layers    Vanishing gradient
ReLU         0 to ∞              Most cases       Dying neurons
Leaky ReLU   -∞ to ∞             When ReLU dies   Extra hyperparameter (the slope)
GELU         ≈ -0.17 to ∞        Transformers     Slower to compute
Swish        ≈ -0.28 to ∞        Deep networks    Slower to compute
Softmax      0 to 1 (sum = 1)    Multi-class      Only for the output layer

Key Takeaways

  1. Activation functions add non-linearity — they let networks learn complex patterns

  2. Sigmoid/Tanh squish values but can slow learning (vanishing gradients)

  3. ReLU is fast and effective but neurons can “die”

  4. GELU/Swish are modern favorites for transformers and deep networks

  5. Softmax converts scores to probabilities for classification

  6. Logits are raw scores before probability conversion


You Did It!

You now understand the gatekeepers of neural networks! Every time an AI makes a decision, activation functions are working behind the scenes, deciding what’s important and what’s not.

Next time you use ChatGPT or unlock your phone with face recognition, remember: millions of tiny activation functions are saying “Yes!”, “No!”, or “Maybe!” to help make that magic happen.
