# Troubleshooting
When a Pod breaks, don't guess. Follow the evidence.
Kubernetes is incredibly transparent - it almost always tells you exactly what is wrong, provided you know where to look. This guide outlines a systematic workflow for diagnosing clusters.
## The Triage Flowchart
Before running random commands, mentally locate where the failure is occurring.
```mermaid
graph TD
Start["Pod Issue"] --> Status{"Check Pod Status"}
Status -- "Pending" --> Scheduler["Scheduler Issue<br/>(No resources, Taints?)"]
Status -- "ImagePullBackOff" --> Image["Registry Issue<br/>(Auth, Typo?)"]
Status -- "CrashLoopBackOff" --> App["App Issue<br/>(Bug, Config?)"]
Status -- "CreateContainerConfigError" --> Config["Config Issue<br/>(Missing Secret/Map?)"]
Status -- "Running" --> Net{"Network Issue?"}
Net -- "Can't connect" --> Service["Check Service/DNS"]
Net -- "500 Error" --> AppLogs["Check App Logs"]
```
## Phase 1: The "Big Three" Commands
In 90% of cases, you can solve the problem using just these three commands in order.
### 1. `kubectl describe pod <name>`
The "Crime Scene Report". This tells you what Kubernetes thinks happened. Look at the `Events` section at the bottom.
- Did the scheduler fail to find a node?
- Did the Liveness probe fail?
- Did the volume fail to mount?
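A minimal sketch, assuming a Pod named `my-pod` (swap in your own name):

```bash
# Inspect the pod; the Events section at the bottom is usually the answer
kubectl describe pod my-pod

# Show just the Events block and the lines after it
kubectl describe pod my-pod | grep -A 15 "Events:"
```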
### 2. `kubectl logs <name>`
The "Victim's Last Words". This shows the application's standard output (stdout).
- Did the app throw a Python stack trace?
- Did it say "Database Connection Refused"?
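A few standard variations worth knowing, again with placeholder names:

```bash
# Follow logs live, limiting the initial backlog
kubectl logs my-pod --follow --tail=100

# Multi-container Pods need an explicit container name
kubectl logs my-pod -c my-container
```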
**Pro Tip:** If your pod is in a restart loop, `kubectl logs` might show you the current (empty) container starting up. To see why the last one died, use:

```bash
kubectl logs my-pod --previous
```
### 3. `kubectl get pod <name> -o yaml`
The "Blueprint". Check the configuration.
- Did you typo the ConfigMap name?
- Are you pulling the `latest` tag by accident?
- Did you verify the `command` arguments?
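Both checks can be done without reading the whole YAML; `-o jsonpath` extracts a single field (pod name here is a placeholder):

```bash
# Dump the full object exactly as the cluster stored it
kubectl get pod my-pod -o yaml

# Extract just the image(s) to spot a stray "latest" tag
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].image}'
```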
## Phase 2: Decoding Common States
| State | Translation | Where to look |
|---|---|---|
| `Pending` | "I'm waiting for a home." The Scheduler cannot find a node that fits this Pod (CPU/Mem requests, Taints, or Affinity). | `kubectl describe pod` |
| `ImagePullBackOff` | "I can't get the package." The registry path is wrong, the tag doesn't exist, or you forgot the `imagePullSecret`. | `kubectl describe pod` |
| `CrashLoopBackOff` | "I started, but I died immediately." The app is buggy or misconfigured. | `kubectl logs` |
| `CreateContainerConfigError` | "I can't find my keys." You referenced a Secret or ConfigMap that doesn't exist. | `kubectl describe pod` |
| `OOMKilled` | "I ate too much." The app used more RAM than its limit allowed. | `kubectl describe pod` (look for Exit Code 137) |
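When a status string is not self-explanatory, ask Kubernetes for its own narration. Events can be filtered down to a single Pod (the name is a placeholder):

```bash
# Only events involving this specific Pod, oldest first
kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp
```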
## Phase 3: Advanced Debugging Tools
Sometimes logs aren't enough. You need to get inside.
### 1. `kubectl exec` (The Standard Way)
If the Pod is running (e.g., a web server that is returning 500 errors), jump inside to poke around.
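A quick sketch; the pod name, the `/healthz` path, and the presence of `curl` in the image are all assumptions for illustration:

```bash
# Interactive shell (sh, since bash may not exist in minimal images)
kubectl exec -it my-pod -- sh

# One-off command without a full session; assumes the image ships curl
kubectl exec my-pod -- curl -s http://localhost:8080/healthz
```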
### 2. `kubectl debug` (The Modern Way)
Use this when the Pod is crashing (`CrashLoopBackOff`) or runs a "distroless" image (no shell or bash installed).
`kubectl debug` spins up a new container attached to the broken Pod. It shares the process namespace but brings its own tools.
```bash
# Attach a "busybox" container to your broken "my-app" pod
kubectl debug -it my-app --image=busybox --target=my-app-container
```
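The same command has two other useful modes, shown here with placeholder names: copying a crashing Pod with its command overridden so it stays up long enough to inspect, and opening a shell on the node itself.

```bash
# Copy the crashing pod, overriding its command so it does not crash
kubectl debug my-app -it --copy-to=my-app-debug --container=my-app-container -- sh

# Shell onto the node; the host filesystem is mounted at /host
kubectl debug node/my-node -it --image=busybox
```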
### 3. Networking Debugging
If Service A can't talk to Service B:
- Check DNS: `nslookup my-service` (does it return an IP?)
- Check the Service selector: does the Service actually point to any Pods? Run `kubectl get endpoints my-service` (if this is empty, your labels are wrong).
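Neither check requires touching the broken Pods; a throwaway client Pod works. A sketch, assuming a Service named `my-service`:

```bash
# One-shot DNS check from inside the cluster; the pod removes itself afterwards
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup my-service

# Compare the Service selector with the labels your Pods actually carry
kubectl get svc my-service -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels
```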
## Phase 4: Exit Codes (The Cheat Sheet)
Computers speak in numbers. Here is how to translate them.
| Code | Meaning | Likely Cause |
|---|---|---|
| 0 | Success | The process finished its job and exited. (Normal for Jobs, bad for Deployments). |
| 1 | Application Error | Generic app crash. Check logs. |
| 137 | SIGKILL (OOM) | Out of Memory. Increase memory limits. |
| 143 | SIGTERM | Kubernetes asked the Pod to stop (normal during scale-down). |
| 255 | Node Error | The Node itself failed (disk full, network partition). |
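To read a container's last exit code without scrolling through full YAML (pod name is a placeholder):

```bash
# Prints the last exit code of the first container, e.g. 137 for OOMKilled
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```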
## Summary
- **Don't Panic.** Read the status.
- **Events First:** Always run `kubectl describe` before anything else.
- **Logs Second:** Check `kubectl logs` (and `--previous`).
- **Validate Config:** Ensure Secrets/ConfigMaps exist before the Pod starts.
- **Use `debug`:** Learn `kubectl debug` for difficult crashes.
**Tip: The "Rubber Duck" Method**

If you are stuck, read the `kubectl describe` output out loud line-by-line. You will almost always find the error staring you in the face (e.g., `MountVolume.SetUp failed for volume "secret-key": secret "my-secret" not found`).