Pod Lifecycle and Scheduling
Pod Lifecycle
Pod Phases
┌──────────────────────────────────────────────────────────┐
│ Pod Lifecycle Phases │
│ │
│ Pending → Running → Succeeded/Failed │
│ ↓ │
│ Unknown │
└──────────────────────────────────────────────────────────┘
Pending → Pod accepted but awaiting resource allocation, scheduling, and container image downloads.
Running → Pod bound to node, all containers in the Pod have been created, and at least one container is still running.
Succeeded → All containers in the Pod have terminated in success, and will not be restarted.
The “Succeeded” Pod status is typical for Pods run by a Job or CronJob.
Failed → All containers in the Pod have terminated, and at least one container has terminated in failure.
Unknown → The state of the Pod could not be obtained, typically because communication with its node was lost.
Container States
Within a pod, each container has a state:
# kubectl describe pod shows container states
Containers:
  nginx:
    State:          Running
      Started:      Mon, 01 Jan 2024 10:00:00 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
States:
- Waiting: Container not running (pulling image, waiting for init)
- Running: Container executing
- Terminated: Container finished or crashed
Complete Pod Lifecycle
User creates Pod
↓
API Server validates & stores
↓
Pod: Pending (nodeName=null)
↓
Scheduler assigns node
↓
Pod: Pending (nodeName=worker-1)
↓
kubelet pulls images
↓
Pod: Pending (containers creating)
↓
Init containers run (if any)
↓
Init containers succeed
↓
Main containers start
↓
Startup probe (if configured)
↓
Pod: Running
↓
Liveness & Readiness probes
↓
┌─────────────────┬─────────────────┐
│ Success path │ Failure path │
├─────────────────┼─────────────────┤
│ Containers exit │ Container crash │
│ with code 0 │ or killed │
│ ↓ │ ↓ │
│ Pod: Succeeded │ Pod: Failed │
│ (for Jobs) │ (restartPolicy │
│ │ determines │
│ │ next action) │
└─────────────────┴─────────────────┘
Container Restart Policy
spec:
  restartPolicy: Always # Always, OnFailure, Never
| Policy | Behavior | Use Case |
|---|---|---|
| Always | Restart on any termination | Long-running services |
| OnFailure | Restart only on failure (exit code != 0) | Batch jobs |
| Never | Never restart | One-time tasks |
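The policy table can be read as a small decision function. The sketch below is a simplified model rather than kubelet code: `should_restart` and `backoff_delays` are hypothetical helper names, and the backoff figures (10s, doubling, capped at 5 minutes) are the ones quoted in this section.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Return True if a container that exited with exit_code would be restarted."""
    if policy == "Always":
        return True                 # restart on any termination, even exit code 0
    if policy == "OnFailure":
        return exit_code != 0       # restart only on a non-zero exit code
    if policy == "Never":
        return False
    raise ValueError(f"unknown restartPolicy: {policy!r}")


def backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Seconds waited before each of the first `restarts` restarts (doubling, capped)."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


if __name__ == "__main__":
    for policy in ("Always", "OnFailure", "Never"):
        print(policy, should_restart(policy, 0), should_restart(policy, 1))
    print(backoff_delays(7))
```

Modelling the policy as a pure function makes it easy to see that `Always` and `OnFailure` differ only for exit code 0.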
Example:
restartPolicy: Always
Container exits with code 0 → Restart anyway
Container exits with code 1 → Restart
Container crashes → Restart
Backoff: 10s, 20s, 40s, 80s, 160s, max 5m
Pod Scheduling
Scheduling Process
1. Pod created (nodeName=null)
↓
2. Scheduler watches for unscheduled pods
↓
3. FILTERING PHASE
┌──────────────────────────┐
│ Filter unsuitable nodes: │
│ • Insufficient resources │
│ • Untolerated taints     │
│ • Node selector mismatch │
│ • Affinity rules violated│
└──────────┬───────────────┘
↓
4. SCORING PHASE
┌──────────────────────────┐
│ Rank remaining nodes: │
│ • Resource balance │
│ • Affinity preferences │
│ • Spread constraints │
└──────────┬───────────────┘
↓
5. SELECT highest-scored node
↓
6. BIND pod to node (update nodeName)
Scheduling Constraints
Control where pods are scheduled.
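Each constraint in this section plugs into the filter-then-score flow of the scheduling process. The following is a minimal sketch of that flow, assuming a toy model: the `Node` and `Pod` classes, the single CPU resource, and the "most free CPU wins" scoring rule are all illustrative assumptions, not the real scheduler plugins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # free CPU in millicores
    labels: dict = field(default_factory=dict)
    tainted: bool = False                # stands in for an untolerated NoSchedule taint


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    node_selector: dict = field(default_factory=dict)
    tolerates_taints: bool = False


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering phase: reject nodes that cannot run the pod at all."""
    if node.free_cpu_m < pod.cpu_request_m:
        return False                     # insufficient resources
    if node.tainted and not pod.tolerates_taints:
        return False                     # taint without toleration
    # nodeSelector: every requested label must match (AND logic)
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


def score(pod: Pod, node: Node) -> int:
    """Scoring phase (toy rule): prefer the node with the most CPU left over."""
    return node.free_cpu_m - pod.cpu_request_m


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the chosen node name, or None if the pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


if __name__ == "__main__":
    cluster = [Node("worker-1", 500), Node("worker-2", 2000, tainted=True), Node("worker-3", 1500)]
    print(schedule(Pod("web", 1000), cluster))
```

Hard constraints belong in `feasible`, soft preferences in `score`; that split is the main idea to carry into the mechanisms that follow.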
1. Node Selector (Simple)
Purpose: Select nodes with specific labels.
Uses labels on nodes and label selectors (nodeSelector) on pods. This is the simplest way to constrain pods to specific nodes.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "true"
    zone: us-east-1a
  containers:
  - name: app
    image: ml-app
# Label nodes first
kubectl label nodes worker-1 gpu=true zone=us-east-1a
kubectl label nodes worker-2 gpu=true zone=us-east-1a
# Pod will only schedule on nodes with BOTH labels
Use cases:
- Schedule GPU workloads only on GPU nodes
- Pin pods to specific availability zones
- Separate dev/prod workloads on different node pools
Limitations:
- All labels must match (AND logic only)
- No “preferred” scheduling (hard requirement only)
- Limited expressiveness
2. Node Affinity (Flexible)
Purpose: More expressive and flexible node selection with support for required and preferred rules.
Provides more control than nodeSelector with support for operators and soft/hard constraints.
spec:
  affinity:
    nodeAffinity:
      # HARD requirement (must match)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: node-type
            operator: NotIn
            values:
            - spot # Don't schedule on spot instances
      # SOFT preference (try to match, but not required)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100 # Higher weight = stronger preference (1-100)
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 50
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
Operators:
- In: Label value in list
- NotIn: Label value not in list
- Exists: Label key exists (ignore value)
- DoesNotExist: Label key doesn’t exist
- Gt: Greater than (numeric comparison)
- Lt: Less than (numeric comparison)
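To make the operator semantics concrete, here is a small sketch that evaluates `matchExpressions` against a node's labels. The function names are hypothetical and the code is a simplified model of the documented behavior (expressions inside one term are ANDed, terms in `nodeSelectorTerms` are ORed), not the scheduler's implementation.

```python
def match_expression(labels: dict, key: str, operator: str, values=()) -> bool:
    """Evaluate a single matchExpressions entry against node labels."""
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        # A node without the label also satisfies NotIn.
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator in ("Gt", "Lt"):
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False                 # missing or non-numeric label never matches
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator!r}")


def node_matches(labels: dict, node_selector_terms: list) -> bool:
    """Terms are ORed; the expressions inside one term are ANDed."""
    return any(
        all(match_expression(labels, **expr) for expr in term["matchExpressions"])
        for term in node_selector_terms
    )


if __name__ == "__main__":
    terms = [{"matchExpressions": [
        {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]},
        {"key": "node-type", "operator": "NotIn", "values": ["spot"]},
    ]}]
    print(node_matches({"kubernetes.io/os": "linux", "node-type": "on-demand"}, terms))
```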
Key difference from nodeSelector:
nodeSelector:
  gpu: "true"        ← Simple, hard requirement only
nodeAffinity:
  required:          ← Hard requirement (like nodeSelector)
  - gpu: "true"
  preferred:         ← Soft preference (NEW!)
  - ssd: "true"      ← Try to match, but okay if not available
Use cases:
- Prefer SSD nodes but allow HDD if needed
- Multiple OR conditions (nodeSelectorTerms is OR)
- Complex label matching logic
3. Pod Affinity/Anti-Affinity
Purpose: Define rules for Pod placement based on which Pods already run in a topology domain, rather than on node labels alone.
Schedule pods relative to OTHER pods, not just node labels.
Pod Affinity (Preferred Placement)
What it does: Attracts a Pod to the topology domain (node, zone, or region) where matching Pods already run.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache # Find pods with app=cache
        topologyKey: kubernetes.io/hostname # Co-locate on same node
Result:
If cache pod is on node-1:
→ This pod MUST also schedule on node-1
If no cache pod exists:
→ This pod cannot be scheduled (pending)
Common topologyKey values:
kubernetes.io/hostname → Same node
topology.kubernetes.io/zone → Same availability zone
topology.kubernetes.io/region → Same region
Use cases:
- Co-locate app with its cache (reduce latency)
- Schedule related microservices together
- Data locality requirements
Pod Anti-Affinity (Avoid Placement)
What it does: Prevents Pods from being scheduled on the same node, zone, or near other specific Pods (with a hostname topology key this gives at most one matching Pod per node, similar to how a DaemonSet spreads).
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web # Find pods with app=web
        topologyKey: kubernetes.io/hostname # Avoid same node
Result:
If web pod already on node-1:
→ This web pod MUST schedule on different node
Ensures high availability:
Node-1: web-pod-1
Node-2: web-pod-2 ← Spread across nodes
Node-3: web-pod-3
Use case: Spread replicas across nodes
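# Ensure replicas never share a node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web # Must match the anti-affinity labelSelector
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web # Illustrative container
        image: nginx
Soft vs Hard Rules: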
# HARD (required): Pod won't schedule if rule can't be satisfied
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
    matchLabels:
      app: cache
  topologyKey: kubernetes.io/hostname
# SOFT (preferred): Try to satisfy, but schedule anyway if impossible
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  podAffinityTerm:
    labelSelector:
      matchLabels:
        app: cache
    topologyKey: kubernetes.io/hostname
Common use cases:
- High Availability: Spread replicas across zones
- Performance: Co-locate related services
- Compliance: Separate production from non-production
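The role of `topologyKey` in required pod affinity and anti-affinity can be sketched as a set computation over topology domains. This is a simplified model under stated assumptions: `allowed_nodes` is a hypothetical helper, nodes are plain label dictionaries, and real scheduler details (namespaces, match expressions, self-matching pods) are left out.

```python
def allowed_nodes(nodes: dict, pods: list, selector: dict, topology_key: str, anti: bool = False) -> list:
    """Return node names permitted by a required (anti-)affinity rule.

    nodes: {node_name: node_labels}
    pods:  [(node_name, pod_labels), ...] for pods that are already running
    """
    # Topology domains (hostname, zone, ...) that already host a matching pod.
    occupied = {
        nodes[node_name].get(topology_key)
        for node_name, pod_labels in pods
        if all(pod_labels.get(k) == v for k, v in selector.items())
    }
    occupied.discard(None)

    result = []
    for name, labels in nodes.items():
        in_occupied_domain = labels.get(topology_key) in occupied
        # Affinity wants an occupied domain; anti-affinity wants a free one.
        if in_occupied_domain != anti:
            result.append(name)
    return sorted(result)


if __name__ == "__main__":
    nodes = {
        "node-1": {"kubernetes.io/hostname": "node-1", "topology.kubernetes.io/zone": "a"},
        "node-2": {"kubernetes.io/hostname": "node-2", "topology.kubernetes.io/zone": "a"},
        "node-3": {"kubernetes.io/hostname": "node-3", "topology.kubernetes.io/zone": "b"},
    }
    running = [("node-1", {"app": "cache"})]
    print(allowed_nodes(nodes, running, {"app": "cache"}, "kubernetes.io/hostname"))
    print(allowed_nodes(nodes, running, {"app": "cache"}, "topology.kubernetes.io/zone", anti=True))
```

Switching the topology key from hostname to zone widens the domain: affinity then accepts any node in the same zone, and anti-affinity rejects the whole zone.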
4. Taints and Tolerations
Purpose: Restrict Pod scheduling on nodes using a repel mechanism. Taints act as locks on nodes; tolerations are the keys for pods.
Taints (Node-Level Rule)
What it does: Applied to nodes to repel unwanted Pods.
Think of taints as “keep out” signs on nodes. Only pods with matching tolerations can schedule there.
# Taint a node
kubectl taint nodes <node-name> <key>=<value>:effect
# Examples:
kubectl taint nodes worker-1 workload=database:NoSchedule
kubectl taint nodes worker-2 dedicated=gpu:NoSchedule
kubectl taint nodes worker-3 maintenance=true:NoExecute
Taint Effects:
| Effect | Behavior | Use Case |
|---|---|---|
| NoSchedule | Don't schedule new pods (existing pods stay) | Reserve nodes for specific workloads |
| PreferNoSchedule | Try to avoid scheduling (soft) | Prefer not to use, but okay if needed |
| NoExecute | Evict existing pods + don't schedule new ones | Drain nodes for maintenance |
NoSchedule:
Existing pods: ✅ Stay running
New pods: ❌ Cannot schedule (unless toleration)
PreferNoSchedule:
Existing pods: ✅ Stay running
New pods: ⚠️ Avoid if possible, but can schedule
NoExecute:
Existing pods: ❌ Evicted (unless toleration)
New pods: ❌ Cannot schedule (unless toleration)
Tolerations (Pod-Level Rule)
What it does: Applied to Pods to allow scheduling on tainted nodes.
Tolerations are the “key” that unlocks tainted nodes.
spec:
  tolerations:
  # Exact match toleration
  - key: "workload"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
  # Tolerate any value for this key
  - key: "workload"
    operator: "Exists"
    effect: "NoSchedule"
  # Tolerate ALL taints (universal key)
  - operator: "Exists"
Toleration Operators:
Equal: # key, value, and effect must match
- key: "gpu"
  operator: "Equal"
  value: "nvidia"
  effect: "NoSchedule"
Exists: # Only key and effect must match (any value)
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"
Exists (no key): # Tolerates everything
- operator: "Exists"
Use Case 1: Dedicated Nodes
Scenario: Reserve nodes for database workloads only.
# Step 1: Taint database nodes
kubectl taint nodes db-node-1 workload=database:NoSchedule
kubectl taint nodes db-node-2 workload=database:NoSchedule
# Result: Regular pods CANNOT schedule on these nodes
# Step 2: Database pods get toleration
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
  containers:
  - name: postgres
    image: postgres:14
Result:
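db-node-1: ✅ only pods that tolerate workload=database
db-node-2: ✅ only pods that tolerate workload=database
worker-1: ✅ regular pods (postgres may also land here unless a nodeSelector or node affinity pins it to the db nodes)
worker-2: ✅ regular pods
Use Case 2: Node Maintenance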
Scenario: Evict all pods from a node being taken down.
# Taint with NoExecute effect
kubectl taint nodes worker-1 maintenance=true:NoExecute
# All pods without matching toleration are EVICTED
Result:
Before:
worker-1: [pod-1] [pod-2] [pod-3]
After taint:
worker-1: [] ← All pods evicted
worker-2: [pod-1] [pod-2] ← Rescheduled here
worker-3: [pod-3] ← Rescheduled here
Use Case 3: GPU Nodes
# Reserve GPU nodes for ML workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
# ML workload with toleration
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: ml-app # Illustrative container; resources are set per container, not under spec
    image: ml-app
    resources:
      limits:
        nvidia.com/gpu: 1
Use Case 4: Spot Instances
# Mark spot instances (can be terminated anytime)
kubectl taint nodes spot-1 node.kubernetes.io/instance-type=spot:NoSchedule
# Non-critical workload tolerates spot instances
spec:
  tolerations:
  - key: "node.kubernetes.io/instance-type"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
Common Built-in Taints:
Kubernetes automatically applies these taints:
# Node not ready
node.kubernetes.io/not-ready:NoExecute
# Unreachable node
node.kubernetes.io/unreachable:NoExecute
# PID pressure
node.kubernetes.io/pid-pressure:NoSchedule
# Memory pressure
node.kubernetes.io/memory-pressure:NoSchedule
# Disk pressure
node.kubernetes.io/disk-pressure:NoSchedule
# Network unavailable
node.kubernetes.io/network-unavailable:NoSchedule
# Unschedulable
node.kubernetes.io/unschedulable:NoSchedule
Taints vs Affinity:
| Mechanism | Purpose | Default Behavior |
|---|---|---|
| Taints + Tolerations | Repel pods (deny by default) | Pods CANNOT schedule |
| Node Affinity | Attract pods (allow by default) | Pods CAN schedule |
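The "deny by default" behavior of taints comes down to a matching rule between each taint and the pod's tolerations. The sketch below models that rule with plain dictionaries; `tolerates` and `schedulable` are hypothetical names, and the code is an approximation of the documented matching semantics rather than the scheduler's own code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # A toleration without an effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")   # Equal is the default operator
    if operator == "Exists":
        # Exists without a key tolerates every taint.
        return "key" not in toleration or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(tolerations: list, taints: list) -> bool:
    """A pod fits if every hard taint is tolerated; PreferNoSchedule is only a preference."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint["effect"] != "PreferNoSchedule"
    )


if __name__ == "__main__":
    db_taint = {"key": "workload", "value": "database", "effect": "NoSchedule"}
    exact = {"key": "workload", "operator": "Equal", "value": "database", "effect": "NoSchedule"}
    print(schedulable([], [db_taint]), schedulable([exact], [db_taint]))
```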
Taints: "Keep everyone out except those with permission"
Affinity: "Prefer these nodes, but others are okay"
5. Topology Spread Constraints
Purpose: Control pod distribution across failure-domains (e.g., regions, zones, nodes, etc.) to improve availability and resource utilization.
What it does: Spreads pods across topology domains (zones, nodes, regions) based on constraints.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
Key Parameters:
maxSkew: 1
# Maximum allowed difference in pod count between domains
# Lower = more even distribution
# Higher = more flexibility
topologyKey: topology.kubernetes.io/zone
# Label key that defines topology domains
# Common values:
# - kubernetes.io/hostname (spread across nodes)
# - topology.kubernetes.io/zone (spread across AZs)
# - topology.kubernetes.io/region (spread across regions)
whenUnsatisfiable: DoNotSchedule
# DoNotSchedule: Hard constraint (pod stays pending)
# ScheduleAnyway: Soft constraint (best effort)
labelSelector:
  matchLabels:
    app: web
# Which pods to consider when calculating spread
Example Result:
Configuration:
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
Without the constraint, four pods might land like this (unbalanced):
us-east-1a: [pod-1] [pod-2] [pod-3] ← 3 pods
us-east-1b: [pod-4] ← 1 pod
us-east-1c: [] ← 0 pods
Difference: 3 - 0 = 3 (violates maxSkew=1!)
With the constraint enforced at scheduling time, five pods are placed like this (running pods are never moved to rebalance):
us-east-1a: [pod-1] [pod-2] ← 2 pods
us-east-1b: [pod-3] [pod-4] ← 2 pods
us-east-1c: [pod-5] ← 1 pod
Difference: 2 - 1 = 1 ✅ (satisfies maxSkew=1)
Use Case 1: High Availability Across Zones
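apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web # Counted by the labelSelector below
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web # Illustrative container
        image: nginx
Result: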
Zone A: 2 pods ← maxSkew=1 ensures
Zone B: 2 pods ← no more than 1 pod
Zone C: 2 pods ← difference between zones
If Zone A fails:
✅ Still have 4/6 pods running (about 67% of capacity)
Use Case 2: Spread Across Nodes
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: cache
Result:
node-1: [cache-1] ← 1 pod per node
node-2: [cache-2] ← Avoids single point of failure
node-3: [cache-3]
node-4: [cache-4]
Use Case 3: Multiple Constraints
spec:
  topologySpreadConstraints:
  # Spread across zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  # Also spread across nodes within each zone
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway # Soft constraint
    labelSelector:
      matchLabels:
        app: web
whenUnsatisfiable Comparison:
DoNotSchedule:
# Hard constraint
# Pod stays Pending if constraint cannot be satisfied
# Use for: Critical availability requirements
ScheduleAnyway:
# Soft constraint (best effort)
# Pod schedules even if constraint violated
# Use for: Preferences, optimization
Topology Spread vs Pod Anti-Affinity:
| Feature | Topology Spread | Pod Anti-Affinity |
|---|---|---|
| Distribution | Even spread with maxSkew | Binary (same/different) |
| Flexibility | Gradual (maxSkew=1,2,3…) | All-or-nothing |
| Use Case | Balance across zones | Separate critical replicas |
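The "even spread with maxSkew" row can be made concrete with a few lines of arithmetic. The sketch below assumes skew is computed as the matching pod count in the candidate domain after placement minus the smallest count across domains; `skew_if_placed` and `allowed_domains` are hypothetical helper names, and node eligibility rules are ignored.

```python
def skew_if_placed(counts: dict, domain: str) -> int:
    """Skew that would result from placing one more matching pod in `domain`."""
    return counts.get(domain, 0) + 1 - min(counts.values())


def allowed_domains(counts: dict, max_skew: int) -> list:
    """Domains where a new pod keeps the skew within max_skew (DoNotSchedule behavior)."""
    return sorted(d for d in counts if skew_if_placed(counts, d) <= max_skew)


if __name__ == "__main__":
    zones = {"us-east-1a": 3, "us-east-1b": 1, "us-east-1c": 0}
    print(max(zones.values()) - min(zones.values()))   # current spread between fullest and emptiest zone
    print(allowed_domains(zones, max_skew=1))
    print(allowed_domains(zones, max_skew=2))
```

Raising `max_skew` widens the set of acceptable domains, which is the gradual flexibility that a binary anti-affinity rule cannot express.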
Pod Anti-Affinity:
✅ Good: "Never put 2 replicas on same node"
❌ Limited: Can't express "spread evenly"
Topology Spread:
✅ Good: "Spread evenly with max difference of 1"
✅ Flexible: Control degree of spreading
6. nodeName Field
Purpose: Directly assigns a Pod to a specific node (hard binding, no scheduling logic).
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-node-2 # ← Hard-coded assignment
  containers:
  - name: nginx
    image: nginx
How it works:
# Normal Scheduling
Pod created
↓
Scheduler evaluates all nodes
↓
Filtering phase (resource checks, taints, etc.)
↓
Scoring phase (best node selection)
↓
Pod assigned to chosen node
# With nodeName
Pod created with nodeName=worker-node-2
↓
Scheduler SKIPPED entirely
↓
Pod assigned directly to worker-node-2
↓
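kubelet on worker-node-2 starts pod
Characteristics:
✅ Bypasses scheduler completely
✅ No scheduler resource fit check (the kubelet can still reject the pod at admission, for example with reason OutOfcpu)
✅ NoSchedule taints are not enforced (NoExecute taints still evict pods that lack a toleration)
✅ No pod affinity or topology spread evaluation
❌ No validation if node exists
❌ No validation if node has capacity
❌ Pod fails if node doesn't exist or can't run it
When to use: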
✅ Good use cases:
- Debugging (force pod to specific node)
- DaemonSets in older Kubernetes releases (current releases place DaemonSet pods through the default scheduler with node affinity instead)
- Static pods (kubelet manages directly)
- Testing node-specific behavior
❌ Avoid for:
- Production workloads
- Applications requiring HA
- Anywhere scheduler intelligence is needed
Pod Priority and Preemption
The Problem
Scenario: Your cluster is at capacity (all resources used). A critical pod needs to start NOW, but there’s no room.
Cluster at 100% capacity:
node-1: [batch-job-1] [batch-job-2] [batch-job-3]
node-2: [batch-job-4] [batch-job-5] [batch-job-6]
node-3: [batch-job-7] [batch-job-8] [batch-job-9]
Critical API pod needs to start:
api-pod: Pending (no resources available)
Question: Should low-priority batch jobs block critical API?
Answer: No! Priority + Preemption solves this.
What is Pod Priority?
Pod Priority assigns importance levels to pods. Higher priority = more important.
Think of it like airline boarding:
Priority 2000000: First Class (critical services)
Priority 1000: Business (important apps)
Priority 0: Economy (batch jobs, default)
How it works:
# Step 1: Define PriorityClass (cluster-wide)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 1000000 # Higher number = higher priority
globalDefault: false # Don't apply to all pods automatically
description: "Critical production services"
# Step 2: Reference PriorityClass in Pod
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  priorityClassName: critical-priority # ← Uses priority value 1000000
  containers:
  - name: api
    image: api-server:v1
Built-in Priority Classes:
Kubernetes comes with two system priorities:
# View built-in priorities
kubectl get priorityclasses
NAME VALUE GLOBAL-DEFAULT
system-cluster-critical 2000000000 false
system-node-critical 2000001000 false
These are reserved for Kubernetes system components (kube-proxy, CoreDNS, etc.).
What is Preemption?
Preemption is when the scheduler evicts (kills) lower-priority pods to make room for higher-priority pods.
When it happens:
1. High-priority pod created
↓
2. Scheduler tries to find a node with resources
↓
3. No node has enough resources (cluster full)
↓
4. Scheduler looks for lower-priority pods to evict
↓
5. Evicts lower-priority pods to free resources
↓
6. Schedules high-priority pod
Preventing Preemption
Sometimes you want priority WITHOUT evicting other pods:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-eviction
value: 5000
preemptionPolicy: Never # ← Won't evict other pods
description: "Important but won't preempt others"
Behavior:
Pod with preemptionPolicy: Never
If resources available:
✅ Schedules immediately (uses priority for queue ordering)
If NO resources available:
❌ Stays Pending (won't evict lower-priority pods)
⏳ Waits for resources to free up naturally
Use case:
Important workload that should start first, but shouldn't
kick out other running workloads.
Example: Database backup job
- Important (run before other batch jobs)
- But shouldn't evict running application pods
Priority vs Preemption
| Feature | Purpose | Effect |
|---|---|---|
| Priority | Scheduling order | Which pod schedules first when resources available |
| Preemption | Resource reclamation | Whether to evict lower-priority pods |
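The two rows of the table can be illustrated with a toy model of a single full node. This is a deliberately simplified sketch: `schedule_order` and `pick_victims` are hypothetical names, only CPU is considered, and the real scheduler weighs additional factors (PodDisruptionBudgets, graceful termination, multiple nodes) when choosing victims.

```python
def schedule_order(pending: list) -> list:
    """Priority: higher-priority pods go to the front of the scheduling queue."""
    return [p["name"] for p in sorted(pending, key=lambda p: p["priority"], reverse=True)]


def pick_victims(running: list, needed_cpu_m: int, incoming_priority: int,
                 preemption_policy: str = "PreemptLowerPriority") -> list:
    """Preemption: choose lower-priority pods to evict until enough CPU is freed."""
    if preemption_policy == "Never":
        return []                                    # pod waits instead of evicting others
    candidates = sorted(
        (p for p in running if p["priority"] < incoming_priority),
        key=lambda p: p["priority"],                 # evict the lowest priority first
    )
    victims, freed = [], 0
    for pod in candidates:
        if freed >= needed_cpu_m:
            break
        victims.append(pod["name"])
        freed += pod["cpu_m"]
    # Evict nothing if even all lower-priority pods would not free enough room.
    return victims if freed >= needed_cpu_m else []


if __name__ == "__main__":
    running = [
        {"name": "batch-job-1", "priority": 0, "cpu_m": 500},
        {"name": "batch-job-2", "priority": 0, "cpu_m": 500},
        {"name": "web", "priority": 1000, "cpu_m": 500},
    ]
    print(pick_victims(running, needed_cpu_m=800, incoming_priority=1000000))
    print(pick_victims(running, 800, 1000000, preemption_policy="Never"))
```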
Priority without Preemption:
→ High-priority pod goes to front of queue
→ But waits if no resources available
→ preemptionPolicy: Never
Priority with Preemption (default):
→ High-priority pod goes to front of queue
→ Can evict lower-priority pods if needed
→ preemptionPolicy: PreemptLowerPriority
Default Priority
What happens if no priorityClassName specified?
# Pod without priority
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # No priorityClassName specified
  containers:
  - name: app
    image: myapp
Result:
Default priority: 0
Unless globalDefault: true is set on a PriorityClass:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 100
globalDefault: true # ← All pods without priorityClassName get this
Then pods default to priority: 100
Quality of Service (QoS)
Kubernetes assigns each Pod a QoS class based on the resource requests and limits of its containers. The kubelet uses these classes to decide which Pods to evict from a Node experiencing Node Pressure.
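As a rough preview of the three classes described in this section, the classification can be sketched as a function over each container's `resources` block. `qos_class` is a hypothetical helper; it compares quantities as plain strings (so "1" and "1000m" would count as different) and only looks at CPU and memory, which is a simplification of what Kubernetes does.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"requests": {"cpu": "1", "memory": "1Gi"},
                      "limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                      "limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{}]))
```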
QoS classes
1. Guaranteed (Highest Priority)
resources:
  requests:
    memory: "1Gi"
    cpu: "1"
  limits:
    memory: "1Gi" # Same as requests
    cpu: "1" # Same as requests
Requirements:
- Every container has both CPU and memory requests and limits
- For each resource, request == limit
2. Burstable
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi" # Different from requests
    cpu: "1"
Requirements:
- At least one container has a CPU or memory request or limit
- The Pod does not meet the Guaranteed criteria (for example, requests != limits)
3. BestEffort (Lowest Priority)
# No resources defined
containers:
- name: app
  image: nginx
Eviction Order (Resource Pressure)
Node runs low on memory/disk
↓
1. Evict BestEffort pods first
↓
2. Evict Burstable pods (exceeding requests)
↓
3. Evict Burstable pods (within requests)
↓
4. Evict Guaranteed pods (last resort)
Pod Disruption Budgets (PDB)
Purpose: Ensures minimum availability for a set of pods during voluntary disruptions that go through the Eviction API (e.g., node drains or cluster autoscaler scale-down).
Think of PDB as a “safety net” that prevents us from accidentally taking down too many pods at once.
What Are Voluntary vs Involuntary Disruptions?
Voluntary Disruptions:
✅ kubectl drain node-1 (admin draining node; uses the Eviction API, so the PDB is enforced)
✅ Cluster autoscaler scaling (removing nodes; PDB enforced)
✅ Node maintenance (planned downtime via drain; PDB enforced)
⚠️ kubectl delete pod my-pod (manual deletion; voluntary, but bypasses the PDB)
⚠️ Deployment rolling update (voluntary, but governed by the Deployment's own maxUnavailable/maxSurge, not the PDB)
Involuntary Disruptions (PDB does NOT protect against these):
❌ Node hardware failure (unexpected crash)
❌ Node runs out of resources (OOM, disk full)
❌ Network partition (network issues)
❌ Kernel panic (OS crash)
❌ Pod eviction due to pressure (system-driven)
Key point: A PDB is enforced only for voluntary disruptions that use the Eviction API, whether human-initiated or automated.
How It Works
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2 # OR use maxUnavailable: 1
  selector:
    matchLabels:
      app: web
Two ways to specify the budget:
| Field | Meaning | Example |
|---|---|---|
| minAvailable | Minimum pods that MUST stay running | minAvailable: 2 = At least 2 pods must be up |
| maxUnavailable | Maximum pods that CAN be down | maxUnavailable: 1 = At most 1 pod can be down |
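The budget arithmetic behind both fields can be sketched as follows. `allowed_disruptions` is a hypothetical helper; it assumes percentages are resolved against the expected replica count and rounded up, which matches how the Kubernetes documentation describes percentage handling, and it ignores everything else the disruption controller tracks.

```python
def allowed_disruptions(healthy: int, expected: int, min_available=None, max_unavailable=None) -> int:
    """How many pods may be evicted right now without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    def resolve(value) -> int:
        # Accept an absolute number or a percentage string such as "50%".
        if isinstance(value, str) and value.endswith("%"):
            percent = int(value[:-1])
            return (expected * percent + 99) // 100   # integer ceiling
        return int(value)

    if min_available is not None:
        desired_healthy = resolve(min_available)
    else:
        desired_healthy = expected - resolve(max_unavailable)
    return max(healthy - desired_healthy, 0)


if __name__ == "__main__":
    print(allowed_disruptions(3, 3, min_available=2))
    print(allowed_disruptions(10, 10, max_unavailable="30%"))
    print(allowed_disruptions(1, 1, min_available=1))
```

The last call shows why a single replica with `minAvailable: 1` can never be evicted: the allowance is already zero.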
3 replicas, minAvailable: 2
✅ Can disrupt 1 pod (2 remain)
❌ Cannot disrupt 2 pods (only 1 remains)
3 replicas, maxUnavailable: 1
✅ Can disrupt 1 pod
❌ Cannot disrupt 2 pods
What PDB Prevents
1. Draining nodes when it would violate budget
# 3 web pods: node-1 has 1, node-2 has 2
# PDB: minAvailable: 2
kubectl drain node-1
✅ Allowed (1 pod evicted, 2 remain on node-2)
kubectl drain node-2
❌ Second eviction refused (3 → 2 is allowed, 2 → 1 is not); drain keeps retrying until a replacement pod is ready elsewhere
Error: Cannot evict pod as it would violate the pod's disruption budget.
What PDB Does Not Prevent
1. Scaling down
# 3 replicas, PDB: minAvailable: 2
kubectl scale deployment web --replicas=1
⚠️ Not blocked: the Deployment controller deletes pods directly and does not consult the PDB
2. Aggressive rolling updates
# Deployment with 3 replicas
# PDB: minAvailable: 2
strategy:
  rollingUpdate:
    maxUnavailable: 2 # Can take down 2 pods at once
# Kubernetes does not adjust maxUnavailable to fit the PDB;
# choose a rollout strategy that stays within the budget yourself
Special Case: Single Replica
Important: With a single replica and minAvailable: 1, no eviction will ever be allowed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-pod-app
spec:
  replicas: 1 # Only 1 replica
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-pod-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: single-pod-app
Result:
kubectl drain node-1
❌ Error: Cannot evict pod single-pod-app-xxx
Reason: Would violate PodDisruptionBudget (0 would remain)
# To drain anyway, you must bypass PDB:
kubectl drain node-1 --disable-eviction
✅ Forces drain (ignores PDB)
Percentage Values
You can use percentages instead of absolute numbers:
spec:
  minAvailable: 50% # At least 50% of pods must be up
  # OR
  maxUnavailable: 30% # At most 30% can be down
Example:
10 replicas, minAvailable: 50%
→ Must keep at least 5 pods running
→ Can disrupt up to 5 pods
10 replicas, maxUnavailable: 30%
→ Can disrupt at most 3 pods
→ Must keep at least 7 pods running
Pod Lifecycle Hooks
Purpose
Lifecycle hooks allow you to run code at specific points in a container’s lifecycle:
- PostStart: Right after container starts
- PreStop: Right before container terminates
PostStart Hook
What it does: Runs immediately after a container is created.
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: app
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo 'Container started' > /tmp/startup.log"]
**Flow:** Container created → postStart hook + ENTRYPOINT start around the same time → container running
PreStop Hook
What it does: Executes immediately before a container is terminated.
Purpose: Helps manage pod state (finish in-flight requests, flush logs, close DB connections, etc.) before termination.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: nginx
    image: nginx
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; sleep 30"]
**Flow:** Pod deleted → preStop hook runs (blocking) → SIGTERM sent to container → wait for the remainder of terminationGracePeriodSeconds → SIGKILL if still not exited
- Important: preStop hook counts against terminationGracePeriodSeconds.
Better understanding:
# WRONG: This doesn't work as expected
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"] # 60s sleep
# What happens:
# t=0s: preStop starts
# t=30s: terminationGracePeriodSeconds expires
# t=30s: SIGKILL sent (preStop still running!)
# Result: Forcefully killed mid-cleanup!
# CORRECT: Ensure preStop + app shutdown < grace period
spec:
  terminationGracePeriodSeconds: 90 # Total time budget
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"] # 15s buffer
# Leaves 75s for app to handle SIGTERM and shutdown
Pod Deletion Flow with PreStop Hook
1. Pod deletion triggered (kubectl delete or eviction)
↓
2. Pod marked as Terminating
+ Removed from Service endpoints (stops receiving traffic)
↓
3. PARALLEL actions:
├─> preStop hook executed (if configured) ← BLOCKING
└─> Pod removed from iptables rules (no new connections)
↓
4. preStop completes (Kubernetes waits for preStop to finish)
↓
5. SIGTERM sent to all containers in pod
↓
6. Wait for the rest of terminationGracePeriodSeconds (default 30s, counted from the start of termination)
↓
7. If containers exit within timeout → Pod terminated gracefully
↓
8. If containers still running → SIGKILL sent (force kill)
↓
9. Pod removed from cluster (etcd)
Real-World Example: Web Server Graceful Shutdown
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  terminationGracePeriodSeconds: 60 # Give app time to shutdown
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/graceful-shutdown.sh"] # Drain connections
Script: graceful-shutdown.sh
#!/bin/sh
# 1. Stop accepting new connections
nginx -s quit # Graceful nginx shutdown
# 2. Wait for in-flight requests to complete
# (Nginx will finish active requests but won't accept new ones)
sleep 20
# 3. Flush logs
sync
# preStop completes, then SIGTERM sent
# Total: 20s preStop + 40s for SIGTERM = 60s grace period
What happens:
t=0s: kubectl delete pod web-server
t=0s: Pod removed from Service endpoints
t=0s: preStop starts
→ nginx -s quit (stop accepting new connections)
→ sleep 20 (finish active requests)
t=20s: preStop completes
t=20s: SIGTERM sent to nginx
t=30s: nginx exits gracefully
t=30s: Pod removed
✅ No dropped connections
✅ All requests completed
✅ Clean shutdown
Use Cases
PostStart:
- Register with service discovery
- Initialize local cache
- Send startup notification
PreStop:
- Finish in-flight requests
- Flush logs to remote storage
- Close database connections gracefully
- Deregister from service discovery
- Save application state
Termination Grace Period
What is terminationGracePeriodSeconds?
terminationGracePeriodSeconds defines the total time budget Kubernetes gives a pod to shut down gracefully before forcefully killing it.
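Because the grace period is one shared budget, the outcome of a shutdown can be estimated with simple arithmetic. The sketch below uses a hypothetical helper, `termination_outcome`, and assumes the preStop hook and the application's SIGTERM handling run back to back; it ignores finer kubelet details such as any short extension granted around the deadline.

```python
def termination_outcome(grace_s: int, pre_stop_s: int, shutdown_s: int) -> tuple:
    """Return (result, seconds elapsed) for a pod shutdown.

    grace_s:    terminationGracePeriodSeconds
    pre_stop_s: how long the preStop hook runs
    shutdown_s: how long the app needs after receiving SIGTERM
    """
    total = pre_stop_s + shutdown_s
    if total <= grace_s:
        return ("graceful", total)       # everything finished inside the budget
    return ("SIGKILL", grace_s)          # budget exhausted, container force-killed


if __name__ == "__main__":
    print(termination_outcome(60, 15, 40))
    print(termination_outcome(60, 15, 50))
    print(termination_outcome(30, 60, 5))
```

The third call is the case where the preStop hook alone outlasts the budget, so the application never gets a chance to handle SIGTERM cleanly.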
Key Concept:
terminationGracePeriodSeconds = Total time from deletion to forced kill
This includes:
1. Time for preStop hook execution
2. Time for application to handle SIGTERM
3. Time for cleanup operations
Default: 30 seconds
Critical Understanding: It’s a Total Budget
terminationGracePeriodSeconds: 60
Real behavior:
├─ preStop hook: 15s ┐
├─ SIGTERM handling: 40s ├─ Total: 55s (within 60s budget)
└─ Pod exits: t=55s ┘
If total exceeds 60s:
├─ preStop hook: 15s ┐
├─ SIGTERM handling: 50s ├─ Would be 65s...
└─ t=60s: SIGKILL! ┘ But killed at 60s!
Verifying Your Settings
Check if pods are being killed prematurely:
# Look for SIGKILL in pod events
kubectl describe pod my-pod
Events:
Type Reason Message
---- ------ -------
Normal Killing Stopping container app
Warning Failed Error: exit code 137
# Exit code 137 = 128 + 9 (SIGKILL)
# Means: Container was forcefully killed (grace period exceeded, or out of memory)
# Solution: If the grace period ran out, increase terminationGracePeriodSeconds
Special Cases
1. Immediate deletion (bypass grace period):
# Force delete without waiting
kubectl delete pod my-pod --grace-period=0 --force
# Warning: Skips graceful shutdown completely!
# Use only when pod is stuck or for non-critical workloads
2. Zero and very long grace periods:
spec:
  terminationGracePeriodSeconds: 0 # Not infinite, and not the default!
  # 0 means "stop immediately via the kill signal" (no chance to shut down)
  # Omitting the field gives the default (30s)
  # To have a very long grace period:
  terminationGracePeriodSeconds: 3600 # 1 hour
3. Per-container grace period:
Note: terminationGracePeriodSeconds is pod-level, not container-level.
All containers in a pod share the same grace period.
Pod with 2 containers:
terminationGracePeriodSeconds: 60
↓
Both containers get SIGTERM simultaneously
Both must exit within 60s total
Any container still running at 60s gets SIGKILL
Summary
Pod Lifecycle:
- Pending → Running → Succeeded/Failed
- Container states: Waiting, Running, Terminated
- Restart policies control container restart behavior
Scheduling:
- Filtering → Scoring → Selection → Binding
- Control via nodeSelector, affinity, taints/tolerations
- Topology spread for distribution
Resource Management:
- Priority classes for important workloads
- QoS classes determine eviction order
- PDBs protect availability during disruptions
Key Takeaways:
- Understand pod phases for troubleshooting
- Use scheduling constraints to control placement
- QoS classes are automatic based on resource definitions
- PDBs prevent over-disruption during maintenance
- Graceful shutdown requires proper termination handling