Kubernetes Disruption Management: Keeping Your Pods Safe & Sound 🛡️
The Playground Supervisor Story
Imagine you’re running a busy playground with kids playing on swings, slides, and seesaws. You’re the supervisor. Sometimes, you need to move kids around:
- The slide needs fixing → You gently ask kids to move to the swings (voluntary)
- A sudden storm comes → You rush everyone inside immediately (involuntary)
- You have a rule: At least 3 kids must always be playing outside (disruption budget)
Kubernetes treats your pods in much the same way!
1. Eviction and Node Pressure
What is Node Pressure?
Think of a node like a toy box. It can only hold so many toys (pods). When the box gets too full, something has to come out!
Node pressure happens when:
- 📦 Memory is running low (too many big toys)
- 💾 Disk space is filling up (no room for new toys)
- 🔢 Too many processes (too many toys making noise)
What is Eviction?
Eviction = Kubernetes politely (or not so politely) removing a pod from a node.
When a node feels “pressure,” it starts evicting pods to survive. It’s like your toy box pushing out some toys to avoid breaking!
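To make "pressure" concrete: the kubelet on each node compares resource signals (such as `memory.available`, `nodefs.available`, and `pid.available`) against eviction thresholds. The sketch below shows how those thresholds might look in a kubelet configuration file. The hard values mirror commonly documented Linux defaults, and the soft values are assumptions chosen for illustration, so check your own cluster's settings before relying on them.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing one triggers eviction with no grace period.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
# Soft thresholds (illustrative values): eviction starts only after the
# signal stays past the threshold for the matching grace period.
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

Hard thresholds are the "not so polite" case: pods evicted this way may not get their full termination grace period.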
graph TD
    A["Node Running Smoothly"] --> B{Memory Running Low?}
    B -->|Yes| C["Node Under Pressure!"]
    B -->|No| A
    C --> D["Start Evicting Pods"]
    D --> E["Pick Low-Priority Pods First"]
    E --> F["Remove Pod from Node"]
    F --> G["Node Recovers"]
Real Example
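Your node has 8GB of memory. Pods are already using 7.5GB, and a newly started pod begins using up to 1GB more.
What happens?
- Available memory drops below the kubelet's eviction threshold, and the node reports memory pressure
- The kubelet looks for pods to evict
- It ranks them: pods using more memory than they requested go first, then lower-priority pods
- It evicts pods until enough memory is free
- The node recovers, and the remaining pods keep running!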
Simple Rule: Kubernetes always tries to keep nodes healthy, even if it means removing some pods.
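Priority and resource requests are the two knobs a pod spec gives you over eviction order. As a rough sketch, the kubelet first considers pods whose memory usage exceeds their requests, then pod priority, so setting both makes a pod a less likely victim. The names `important-app` and `my-app` and the image below are hypothetical placeholders.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-app          # hypothetical name
value: 100000                  # higher value = evicted later
globalDefault: false
description: "Pods that should outlive lower-priority pods under node pressure."
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  priorityClassName: important-app
  containers:
    - name: web
      image: registry.example.com/my-app:1.0   # placeholder image
      resources:
        requests:
          memory: "256Mi"      # staying under this keeps the pod low on the eviction list
          cpu: "250m"
        limits:
          memory: "512Mi"
```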
2. Pod Eviction and Disruption
Types of Pod Eviction
There are two ways pods get evicted:
| Type | Cause | Speed |
|---|---|---|
| API-Initiated | Someone asked Kubernetes to remove it | Graceful |
| Node-Pressure | Node is struggling | Can be sudden |
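For the API-initiated case, "someone asked" means a client created an Eviction object on the pod's `eviction` subresource. Tools such as `kubectl drain` do this for you. The sketch below shows the request body in YAML for consistency with the rest of this guide (the API documentation shows it as JSON); the pod name is hypothetical.

```yaml
# Body POSTed to /api/v1/namespaces/default/pods/my-app-7d9f/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: my-app-7d9f            # hypothetical pod name
  namespace: default
```

This is not something you would `kubectl apply`. If a Pod Disruption Budget currently forbids the eviction, the API server is documented to answer with `429 Too Many Requests`, and the caller is expected to retry later.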
The Graceful Goodbye
When a pod is evicted gracefully:
- ⏰ Pod gets a warning (termination grace period)
- 📤 Pod finishes what it’s doing
- 💾 Pod saves its work
- 👋 Pod shuts down cleanly
Default grace period: 30 seconds
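If 30 seconds is not enough for your app to finish its work, the grace period can be raised in the pod spec. A minimal sketch, assuming a hypothetical app that needs about 10 seconds to stop receiving traffic before it shuts down:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # Total time the pod gets between the shutdown signal and a forced kill.
  terminationGracePeriodSeconds: 60
  containers:
    - name: web
      image: registry.example.com/my-app:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Runs before the container receives SIGTERM;
            # its duration counts against the grace period.
            command: ["sh", "-c", "sleep 10"]
```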
Disruption = Anything That Stops Your Pod
Disruption is the fancy word for “something stopped my pod from running.”
Think of it like this:
- Your swing stopped → That’s a disruption
- You chose to stop → Voluntary disruption
- Someone pushed you off → Involuntary disruption
3. Voluntary vs Involuntary Disruptions
Voluntary Disruptions (Planned Events)
You (or your team) caused this on purpose.
| Example | Why It Happens |
|---|---|
| `kubectl drain node` | Preparing node for maintenance |
| Deleting a deployment | You don’t need those pods anymore |
| Updating pod template | Rolling out new version |
| Node upgrade | Installing new Kubernetes version |
| Cluster scaling down | Saving costs |
Key Point: You have control. You can plan for these!
Involuntary Disruptions (Unplanned Events)
The universe happened to your pods.
| Example | Why It Happens |
|---|---|
| Hardware failure | Server died |
| Kernel panic | Operating system crashed |
| Node disappears | Network issues |
| Out of memory | Node pressure eviction |
| Cloud provider issues | VM deleted by accident |
Key Point: You can’t prevent these, but you CAN prepare for them!
graph TD
    A["Disruption Happens"] --> B{Was it planned?}
    B -->|Yes| C["Voluntary"]
    B -->|No| D["Involuntary"]
    C --> E["You control timing"]
    C --> F["Can use PDB"]
    D --> G["Unexpected event"]
    D --> H["Must be resilient"]
Why Does This Matter?
Voluntary disruptions respect your rules.
When you drain a node, Kubernetes:
- ✅ Checks your Pod Disruption Budgets before each eviction
- ✅ Retries blocked evictions until replacement pods are ready
- ✅ Evicts only as many pods at once as the budgets allow
Involuntary disruptions don’t care about rules.
When a node dies:
- ❌ No warning
- ❌ No graceful shutdown
- ❌ Pods just disappear
4. Pod Disruption Budgets (PDB)
The Safety Rule
Remember our playground rule? “At least 3 kids must always be playing.”
Pod Disruption Budget = Your rule for how many pods must stay running during voluntary disruptions.
How to Create a PDB
You have two ways to set your safety rule:
Option 1: Minimum Available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
“At least 2 pods must always be running”
Option 2: Maximum Unavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
“At most 1 pod can be down at a time”
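Both fields also accept a percentage instead of a fixed number, which is handy when the replica count changes over time. A single PDB can set only one of `minAvailable` or `maxUnavailable`, not both. A minimal sketch, with a hypothetical PDB name:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-percent     # hypothetical name
spec:
  minAvailable: "50%"          # a quoted string, not a number
  selector:
    matchLabels:
      app: my-app
```

Percentages are rounded up to a whole pod, so with 7 matching pods, "50%" means at least 4 must stay available.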
PDB in Action
Scenario: You have 5 pods. PDB says minAvailable: 3.
| Action | Allowed? |
|---|---|
| Evict 1 pod (4 remain) | ✅ Yes |
| Evict 2 pods (3 remain) | ✅ Yes |
| Evict 3 pods (2 remain) | ❌ No! Below minimum |
graph TD
    A["5 Pods Running"] --> B["Try to evict 1"]
    B --> C{3+ pods remain?}
    C -->|Yes, 4 remain| D["✅ Eviction Allowed"]
    D --> E["4 Pods Running"]
    E --> F["Try to evict 2 more"]
    F --> G{3+ pods remain?}
    G -->|No, only 2 would remain| H["❌ Eviction Blocked"]
    H --> I["✅ Only 1 more eviction allowed"]
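Kubernetes does this arithmetic for you and records the result in the PDB's status. The sketch below shows roughly what a trimmed `kubectl get pdb my-app-pdb -o yaml` might return for the 5-pod scenario. The `status` block is filled in by the cluster, so treat this as illustrative output rather than something to apply.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: my-app
status:
  expectedPods: 5              # pods matched by the selector
  currentHealthy: 5            # pods that are currently ready
  desiredHealthy: 3            # from minAvailable
  disruptionsAllowed: 2        # currentHealthy - desiredHealthy
```

When `disruptionsAllowed` reaches 0, further evictions are refused until more pods become healthy.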
When PDB Doesn’t Help
Important: PDBs only apply to voluntary disruptions that go through the Eviction API!
| Disruption Type | PDB Protects? |
|---|---|
| `kubectl drain` | ✅ Yes |
| Node upgrade (drains the node) | ✅ Yes |
| Deployment update | ❌ No (the rollout strategy sets the pace) |
| Node crash | ❌ No |
| Out of memory | ❌ No |
| Hardware failure | ❌ No |
For involuntary disruptions: Use multiple replicas spread across nodes!
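One way to spread replicas is pod anti-affinity, which asks the scheduler to avoid placing pods with the same label on the same node. A minimal sketch, assuming the `app: my-app` label used earlier and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft rule: the scheduler tries to honor it
          # but still places the pod if no other node is available.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: kubernetes.io/hostname   # one "basket" per node
      containers:
        - name: web
          image: registry.example.com/my-app:1.0     # placeholder image
```

With replicas on different nodes, a single node crash takes out one pod instead of all of them.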
Putting It All Together
The Complete Picture
graph TD
    A["Your Pods Running"] --> B{Disruption Type?}
    B -->|Voluntary| C["Check PDB"]
    B -->|Involuntary| D["No Protection"]
    C --> E{Budget Allows?}
    E -->|Yes| F["Pod Evicted Gracefully"]
    E -->|No| G["Eviction Blocked"]
    D --> H["Pod Dies Immediately"]
    F --> I["New Pod Scheduled"]
    H --> I
    G --> J["Wait & Retry"]
Best Practices Checklist
1. Always Set PDBs for Important Apps
- Web servers: `minAvailable: 2`
- Databases: `maxUnavailable: 1`
- Critical services: At least 50% available
2. Spread Pods Across Nodes
- Use `podAntiAffinity`
- Don’t put all eggs in one basket
3. Set Resource Requests
- Helps Kubernetes make smart eviction choices
- Prevents unexpected memory pressure
4. Plan for the Worst
- Assume nodes will die
- Test with chaos engineering
- Monitor pod disruptions
Summary: The Key Takeaways
| Concept | Simple Definition |
|---|---|
| Node Pressure | Node running out of resources (memory/disk) |
| Eviction | Removing a pod from a node |
| Voluntary Disruption | Planned pod removal (you control it) |
| Involuntary Disruption | Unplanned pod removal (accidents happen) |
| PDB | Your rule for minimum running pods during voluntary disruptions |
Remember This
PDBs are your seatbelt for voluntary disruptions.
They won’t save you in a crash (involuntary), but they’ll protect you during normal driving (planned maintenance).
You Did It! 🎉
You now understand:
- ✅ Why nodes evict pods (pressure!)
- ✅ How pod eviction works (graceful or sudden)
- ✅ The difference between voluntary and involuntary disruptions
- ✅ How to protect your apps with Pod Disruption Budgets
Next time someone says “We need to drain that node,” you’ll know exactly what happens to your pods!
