Rohit Kumar
•
25 February 2026
•
14 mins
A chaotic approach wastes time and risks misdiagnosis — especially critical in production. Always follow a structured methodology.
1. IDENTIFY → What exactly is broken?
2. GATHER → Collect logs, events, resource states
3. ANALYZE → Form a hypothesis from the data
4. FIX → Apply a targeted solution
5. VERIFY → Confirm fix, check for side effects
Always work top-down, from application to infrastructure:
Application Layer → Pod logs, exit codes, probe failures
↓
Service Layer → Endpoints, selectors, DNS resolution
↓
Node Layer → kubelet, resource pressure, node status
↓
Control Plane Layer → API server, scheduler, controller manager, etcd
↓
Infrastructure Layer → Network, storage, OS, hardware
| Status | Meaning | First Action |
|---|---|---|
Pending |
Not yet scheduled | kubectl describe pod → Events |
ContainerCreating |
Scheduled, pulling image or mounting volumes | kubectl describe pod |
ImagePullBackOff |
Cannot pull container image | Check image name, registry credentials |
CrashLoopBackOff |
Container starts then crashes repeatedly | kubectl logs --previous |
OOMKilled |
Exceeded memory limit | Increase memory limit or fix memory leak |
Error |
Container exited with non-zero code | kubectl logs for app error |
Terminating |
Pod stuck during deletion | Check for finalizers |
Unknown |
Node communication lost | Check node status |
Running but not ready |
Readiness probe failing | kubectl describe pod → probe events |
PendingRoot Causes & RCA:
# Step 1: Describe the pod — read the Events section
kubectl describe pod <pod-name> -n <namespace>
# Look for messages like:
# "0/3 nodes are available: 3 Insufficient cpu"
# "0/3 nodes are available: 3 Insufficient memory"
# Step 2: Check node allocatable resources
kubectl describe node <node-name>
# Look for: Allocatable vs Requests in the resource section
# Step 3: See what's consuming resources
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
Solution:
# Option A: Scale down other workloads
kubectl scale deployment <other-app> --replicas=1
# Option B: Reduce pod resource requests
kubectl edit deployment <deployment-name>
# Lower spec.containers[].resources.requests
# Option C: Add a new node to the cluster
# Then verify it joins
kubectl get nodes
# Describe pod — look for:
# "didn't match Pod's node affinity/selector"
kubectl describe pod <pod-name>
# Check what labels your nodes have
kubectl get nodes --show-labels
# Check what the pod is requiring
kubectl get pod <pod-name> -o yaml | grep -A 20 affinity
Solution:
# Option A: Add the required label to a node
kubectl label node <node-name> disk-type=ssd
# Option B: Fix the affinity rule in the deployment
kubectl edit deployment <deployment-name>
# Correct the matchExpressions values
# Describe pod — look for:
# "node(s) had taint {key:value}, that the pod didn't tolerate"
kubectl describe pod <pod-name>
# Check node taints
kubectl describe node <node-name> | grep Taints
Solution:
# Add toleration to pod spec
tolerations:
- key: "key"
operator: "Equal"
value: "value"
effect: "NoSchedule"
# Describe pod — look for:
# "persistentvolumeclaim <name> not found" or "unbound"
kubectl describe pod <pod-name>
# Check PVC status
kubectl get pvc -n <namespace>
# Check available PVs
kubectl get pv
Solution:
# Check if PV exists with matching storageClass, accessMode, and capacity
kubectl describe pvc <pvc-name>
kubectl describe pv <pv-name>
# If no matching PV exists, create one
# Or ensure your StorageClass provisioner is running
kubectl get pods -n kube-system | grep provisioner
ImagePullBackOff# Step 1: Get the exact error
kubectl describe pod <pod-name>
# Events will show: "Failed to pull image: <specific error>"
# Step 2: Verify the image name and tag manually
# Common mistakes: typos, wrong tag, private registry without credentials
# Check what image is specified
kubectl get pod <pod-name> -o yaml | grep image
# Verify the image exists in the registry
docker pull <image-name>:<tag>
Solution:
# Correct the image in the deployment
kubectl set image deployment/<name> \
<container-name>=<correct-image>:<correct-tag>
# Check if imagePullSecret is configured
kubectl get pod <pod-name> -o yaml | grep imagePullSecret
# Check if the secret exists
kubectl get secret -n <namespace>
Solution:
# Create registry credentials secret
kubectl create secret docker-registry regcred \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email> \
-n <namespace>
# Reference it in pod spec
spec:
imagePullSecrets:
- name: regcred
containers:
- name: app
image: private-registry.io/app:v1
CrashLoopBackOffThis is the most complex pod failure — the container starts but immediately crashes.
# Step 1: Get current logs
kubectl logs <pod-name> -n <namespace>
# Step 2: Get logs from the PREVIOUS crashed container (most useful)
kubectl logs <pod-name> -n <namespace> --previous
# Step 3: Check exit code and last state
kubectl describe pod <pod-name>
# Look for: Last State → Exit Code
Exit Code Reference:
| Exit Code | Meaning | Likely Cause |
|---|---|---|
0 |
Success | Misconfigured as a one-shot job |
1 |
General error | Application error — check logs |
137 |
SIGKILL (128+9) | OOMKilled or manual kill |
139 |
Segfault (128+11) | Application bug |
143 |
SIGTERM (128+15) | Graceful shutdown — check why |
255 |
Exit status out of range | Application startup failure |
kubectl logs <pod-name> --previous
# Read the stack trace or error message carefully
Solution: Fix the application bug, misconfiguration, or missing dependency.
kubectl describe pod <pod-name>
# Look for: "Liveness probe failed"
# And: "Container killed by liveness probe"
Solution:
# Increase initialDelaySeconds to give app more startup time
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60 # increase this
periodSeconds: 10
failureThreshold: 3
kubectl describe pod <pod-name>
# Look for: "configmap <name> not found"
# Or: "secret <name> not found"
Solution:
# Verify the ConfigMap/Secret exists in the correct namespace
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>
# Create if missing
kubectl create configmap <name> --from-literal=key=value -n <namespace>
# Override the container command to keep it alive for debugging
kubectl edit deployment <name>
# Temporarily change container command
spec:
containers:
- name: app
command: ["sleep", "3600"] # override normal entrypoint
# Now exec into the running container and debug manually
kubectl exec -it <pod-name> -- /bin/sh
# Run the actual application command manually to see the error
OOMKilled# Confirm OOMKilled
kubectl describe pod <pod-name>
# Look for: Last State: Terminated Reason: OOMKilled
# Check current memory usage vs limits
kubectl top pod <pod-name>
kubectl get pod <pod-name> -o yaml | grep -A 5 resources
RCA: Memory limit set too low for actual application usage.
Solution:
# Increase memory limit
kubectl edit deployment <deployment-name>
resources:
requests:
memory: "256Mi"
limits:
memory: "1Gi" # increase this
Long-term fix: Profile the application’s actual memory usage under load using
kubectl top podsand set limits at ~20-30% above peak observed usage.
Terminating# Check for finalizers blocking deletion
kubectl get pod <pod-name> -o yaml | grep finalizers
Solution:
# Remove finalizers to force deletion
kubectl patch pod <pod-name> \
-p '{"metadata":{"finalizers":[]}}' \
--type=merge
# Or force delete (use with caution in production)
kubectl delete pod <pod-name> --force --grace-period=0
Step 1: DNS Resolution
Step 2: Service Existence & Selector
Step 3: Endpoints Population
Step 4: Direct Pod Connectivity
Step 5: Network Policy Check
Step 6: kube-proxy Check
# Launch a debug pod with networking tools
kubectl run debug --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside the pod, test DNS
nslookup <service-name>
nslookup <service-name>.<namespace>.svc.cluster.local
# Test external DNS too
nslookup google.com
If DNS fails → CoreDNS issue:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Check the service selector
kubectl describe service <service-name> -n <namespace>
# Look for: Selector field
# Check the pod labels
kubectl get pods -n <namespace> --show-labels
# The selector MUST exactly match pod labels
# Even a single character difference breaks it
RCA Example:
# Service selector says:
selector:
app: my-app # ← looking for this label
# But pods have:
labels:
app: myapp # ← hyphen missing — NO MATCH
Solution:
# Fix the service selector
kubectl edit service <service-name>
# OR fix the pod labels
kubectl label pod <pod-name> app=my-app --overwrite
# Check if the service has any backend pods
kubectl get endpoints <service-name> -n <namespace>
# Healthy output looks like:
# NAME ENDPOINTS AGE
# my-service 10.244.1.5:80,10.244.2.3:80 5m
# If ENDPOINTS shows <none> — selector mismatch or no running pods
# Get pod IP
kubectl get pod <pod-name> -o wide
# From debug pod, try connecting directly to pod IP
wget -O- http://<pod-ip>:<port>
curl http://<pod-ip>:<port>
# If this fails → CNI plugin issue or Network Policy blocking
# List all network policies in the namespace
kubectl get networkpolicy -n <namespace>
# Describe to see ingress/egress rules
kubectl describe networkpolicy <policy-name> -n <namespace>
# Temporarily delete the policy to test
kubectl delete networkpolicy <policy-name> -n <namespace>
# If connectivity restores → policy was blocking traffic
# Restore and fix the policy rules
# Check kube-proxy pods on all nodes
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# Check kube-proxy logs
kubectl logs -n kube-system <kube-proxy-pod-name>
# Verify iptables rules exist for the service
# SSH into a node
sudo iptables -t nat -L KUBE-SERVICES | grep <service-cluster-ip>
# Get service details — verify port assignment
kubectl get service <service-name> -o wide
# Get node IP
kubectl get nodes -o wide
# Test from outside
curl http://<node-ip>:<nodeport>
Common RCAs:
| Issue | Check | Fix |
|---|---|---|
| Firewall blocking NodePort range (30000-32767) | Cloud security groups / firewall rules | Open port range in firewall |
| Wrong node IP used | kubectl get nodes -o wide |
Use INTERNAL-IP column |
| Pod not ready | kubectl get pods |
Fix readiness probe |
| Service selector wrong | kubectl describe svc |
Fix selector |
# Check ingress resource
kubectl describe ingress <ingress-name> -n <namespace>
# Check ingress controller pods are running
kubectl get pods -n ingress-nginx
# Check ingress controller logs
kubectl logs -n ingress-nginx <controller-pod-name>
# Verify backend services exist
kubectl get service -n <namespace>
# Check ingress class
kubectl get ingressclass
Common RCA — Missing ingressClassName:
# Add this to your ingress spec
spec:
ingressClassName: nginx # must match your controller
rules:
- host: myapp.example.com
NotReady State# Step 1: Identify affected nodes
kubectl get nodes
# Step 2: Get details
kubectl describe node <node-name>
# Look for: Conditions section — check Ready condition message
# Look for: Events section
```bash
ssh
sudo systemctl status