GPU and Special Hardware in Kubernetes
The Story of the Super Workshop 🏭
Imagine you have a giant workshop with many worker tables (these are your Kubernetes nodes). Most tables are good for regular work: cutting paper, writing, drawing. But sometimes you need special tools: a super-powerful laser cutter, a 3D printer, or a microscope.
In Kubernetes, GPUs and other special hardware are those special tools. Not every table has them, and you need a smart way to:
- Tell Kubernetes which tables have special tools (Node Feature Discovery)
- Let your projects use those tools properly (Device Plugins)
Let's explore this magical workshop!
What is a GPU? 🎮
A GPU (Graphics Processing Unit) is like a super-brain that's really good at doing many small tasks at once.
Simple Example:
- Your regular brain (CPU): Solves one hard math problem at a time
- GPU brain: Solves 1000 easy math problems ALL AT ONCE!
Why Do We Need GPUs?
| Task | CPU (Regular Brain) | GPU (Super Brain) |
|---|---|---|
| Training AI | 🐢 Slow (days) | 🚀 Fast (hours) |
| Video editing | 😴 Sluggish | ⚡ Smooth |
| Scientific math | 🐌 One by one | 🚀 Thousands together |
Device Plugins: The Tool Librarians 📚
The Problem
Kubernetes is smart, but it doesn't automatically know about special hardware. It's like having a librarian who knows about books but not about the 3D printer in the corner.
The Solution: Device Plugins!
A Device Plugin is like a special helper that tells Kubernetes:
"Hey! This node has 2 GPUs ready to use!"
graph TD
  A["🖥️ Node with GPU"] --> B["Device Plugin"]
  B --> C["📢 Tells Kubernetes"]
  C --> D["✅ GPU Available!"]
  D --> E["🚀 Pods Can Use GPU"]
How Device Plugins Work
Step 1: Discovery. The device plugin finds all the GPUs on the node.
Step 2: Registration. It tells the kubelet: "I manage GPUs!"
Step 3: Allocation. When a pod asks for a GPU, the plugin assigns it one.
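You can watch this happen on a real cluster. Once a device plugin registers, its resource shows up in the node's allocatable list (my-gpu-node below is a placeholder name):

# Show the resources the kubelet advertises to the scheduler
kubectl get node my-gpu-node -o jsonpath='{.status.allocatable}'
# Example output (abbreviated): {"cpu":"16","memory":"64Gi","nvidia.com/gpu":"2"}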
Real Example: NVIDIA Device Plugin
This is the most popular GPU plugin. It lets your pods use NVIDIA graphics cards.
# Installing the NVIDIA device plugin (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
spec:
  selector:
    matchLabels:
      name: nvidia-plugin
  template:
    metadata:
      labels:
        name: nvidia-plugin
    spec:
      containers:
        - name: nvidia-plugin
          image: nvidia/k8s-device-plugin
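This manifest is a simplified sketch; the official install (used in the practice example below) pins a version and runs in the kube-system namespace. Either way, you can check that one plugin pod is running per node:

kubectl get pods -l name=nvidia-plugin -o wide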
Requesting a GPU in Your Pod
Once the plugin is running, asking for a GPU is easy!
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-app
      image: my-ml-app
      resources:
        limits:
          nvidia.com/gpu: 1
🎯 Key Point: The nvidia.com/gpu: 1 line is like saying "I need 1 special tool from the GPU shelf!"
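Two details worth knowing: for extended resources like nvidia.com/gpu you only set limits (requests default to the same value), and GPUs cannot be fractional or overcommitted. Once the pod is running, you can look at its GPU from the inside, assuming the image ships NVIDIA's tools:

kubectl exec gpu-pod -- nvidia-smi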
Node Feature Discovery: The Detective 🔍
The Problem
Your cluster has 100 nodes. Some have GPUs. Some have fast SSDs. Some have special Intel features. How does Kubernetes know what each node can do?
Enter: Node Feature Discovery (NFD)!
NFD is like a detective that visits every node and creates a detailed report of its special abilities.
graph TD
  A["🔍 NFD Visits Node"] --> B["Checks Hardware"]
  B --> C["Finds: GPU ✅"]
  B --> D["Finds: Fast SSD ✅"]
  B --> E["Finds: Intel AVX ✅"]
  C --> F["🏷️ Adds Labels to Node"]
  D --> F
  E --> F
  F --> G["Scheduler Knows Everything!"]
What NFD Discovers
| Category | Examples |
|---|---|
| CPU | Intel, AMD, number of cores, special instructions |
| Memory | How much RAM, memory speed |
| Storage | SSD, NVMe, rotational drives |
| Network | Speed, SR-IOV capability |
| GPU | NVIDIA, AMD, model, memory |
| Custom | Your own special features! |
NFD Labels: The Name Tags
After NFD runs, your nodes get labels like name tags:
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/pci-1234.present=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
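To see which name tags one of your nodes actually received (my-node is a placeholder):

kubectl label node my-node --list | grep feature.node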
Installing Node Feature Discovery
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfd-worker
  namespace: node-feature-discovery
spec:
  selector:
    matchLabels:
      app: nfd-worker
  template:
    metadata:
      labels:
        app: nfd-worker
    spec:
      containers:
        - name: nfd-worker
          image: registry.k8s.io/nfd/node-feature-discovery:v0.14.0
          args:
            - "-feature-sources=all"
Using NFD Labels for Scheduling
Now you can tell Kubernetes: "Run this pod ONLY on nodes with GPUs!"
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  containers:
    - name: trainer
      image: my-ml-trainer
🧠 Fun Fact: 10de is NVIDIA's PCI vendor ID. NFD found it automatically!
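nodeSelector is a hard, all-or-nothing rule. If you would rather say "prefer GPU nodes, but run anywhere," the same NFD label works with node affinity. A minimal sketch (the pod name is made up for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: ml-training-soft
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: feature.node.kubernetes.io/pci-10de.present
                operator: In
                values: ["true"]
  containers:
    - name: trainer
      image: my-ml-trainer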
How They Work Together 🤝
Device Plugins and NFD are best friends!
graph TD
  A["Node Feature Discovery"] -->|Finds GPUs| B["Adds Labels"]
  B --> C["Scheduler Sees Labels"]
  D["Device Plugin"] -->|Registers GPUs| E["Kubelet Knows Count"]
  E --> F["Pods Can Request GPUs"]
  C --> G["Smart Scheduling!"]
  F --> G
The Complete Flow
- NFD scans the node and adds labels
- Device Plugin tells the kubelet about the GPU count
- You write a pod spec asking for a GPU
- Scheduler finds nodes with the GPU label
- Kubelet allocates an actual GPU to your pod
- Your pod runs with GPU power! 🚀
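And if no node can satisfy the request, the pod simply stays Pending; the scheduler explains why in the pod's events, typically with a message like "0/3 nodes are available: 3 Insufficient nvidia.com/gpu" (exact wording varies by Kubernetes version):

kubectl describe pod gpu-pod | grep -A 5 Events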
Common Device Plugins 📋
| Plugin | Hardware | What It Does |
|---|---|---|
| NVIDIA | GPU | Exposes NVIDIA graphics cards |
| AMD | GPU | Exposes AMD graphics cards |
| Intel | GPU/FPGA | Intel accelerators |
| SR-IOV | Network | Fast network cards |
| RDMA | Network | Ultra-fast networking |
Practice Example: ML Training Setup
Let's set up a cluster for machine learning!
Step 1: Install NFD
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.14.0"
Step 2: Install NVIDIA Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Step 3: Check Your Nodes
kubectl get nodes -o json | jq '.items[].metadata.labels' | grep feature
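The jq pipeline shows the NFD labels. To also see how many GPUs each node advertises (the backslash escapes the dots inside the resource name):

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"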
Step 4: Run Your ML Pod
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  containers:
    - name: pytorch
      image: pytorch/pytorch:latest
      resources:
        limits:
          nvidia.com/gpu: 2
      command: ["python", "train.py"]
Key Takeaways 🎯
- Device Plugins = Librarians that manage special hardware
- NFD = Detective that discovers what each node can do
- Labels = Name tags that help scheduling
- Together = Smart placement of GPU workloads!
Remember This Analogy:
- 🏭 Workshop = Cluster
- 🪑 Tables = Nodes
- 🔧 Special Tools = GPUs/Hardware
- 📚 Tool Inventory = Device Plugin
- 🔍 Inspector = Node Feature Discovery
- 🏷️ Labels = What tools each table has
Quick Reference
Request 1 GPU:
resources:
  limits:
    nvidia.com/gpu: 1
Target GPU Nodes:
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"
Check Available GPUs:
kubectl describe node <node-name> | grep nvidia
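Check Allocated GPUs (the "Allocated resources" section shows what running pods have already claimed; adjust the -A count as needed):
kubectl describe node <node-name> | grep -A 8 "Allocated resources"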
You now understand how Kubernetes finds and uses special hardware! 🎉
The detective (NFD) discovers what's special about each node, and the librarian (Device Plugin) makes sure your pods can use those special tools. Together, they make GPU workloads on Kubernetes magical! ✨
