Troubleshooting

When a Pod breaks, don't guess. Follow the evidence.

Kubernetes is incredibly transparent - it almost always tells you exactly what is wrong, provided you know where to look. This guide outlines a systematic workflow for diagnosing failing Pods.


The Triage Flowchart

Before running random commands, mentally locate where the failure is occurring.

graph TD
    Start["Pod Issue"] --> Status{"Check Pod Status"}

    Status -- "Pending" --> Scheduler["Scheduler Issue<br/>(No resources, Taints?)"]
    Status -- "ImagePullBackOff" --> Image["Registry Issue<br/>(Auth, Typo?)"]
    Status -- "CrashLoopBackOff" --> App["App Issue<br/>(Bug, Config?)"]
    Status -- "CreateContainerConfigError" --> Config["Config Issue<br/>(Missing Secret/Map?)"]
    Status -- "Running" --> Net{"Network Issue?"}

    Net -- "Can't connect" --> Service["Check Service/DNS"]
    Net -- "500 Error" --> AppLogs["Check App Logs"]

Phase 1: The "Big Three" Commands

In 90% of cases, you can solve the problem using just these three commands in order.

1. kubectl describe pod <name>

The "Crime Scene Report". This tells you what Kubernetes thinks happened. Look at the Events section at the bottom.

  • Did the scheduler fail to find a node?
  • Did the Liveness probe fail?
  • Did the volume fail to mount?
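
If the Events section has scrolled away or been trimmed, you can query events for a single Pod directly. A minimal sketch, assuming a Pod named my-pod (a placeholder):

# Only the events that involve this Pod, oldest first
kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp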

2. kubectl logs <name>

The "Victim's Last Words". This shows the application's standard output (stdout).

  • Did the app throw a Python stack trace?
  • Did it say "Database Connection Refused"?

Pro Tip

If your Pod is in a restart loop, kubectl logs may only show the freshly restarted (often empty) container starting up. To see why the previous one died, use: kubectl logs my-pod --previous
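
A few flags make logs much more useful in practice. A quick sketch; my-pod and the container name app are placeholders:

# Follow the stream, last 50 lines, with timestamps
kubectl logs my-pod -f --tail=50 --timestamps

# Multi-container Pods need -c to pick a container
kubectl logs my-pod -c app --previous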

3. kubectl get pod <name> -o yaml

The "Blueprint". Check the configuration.

  • Did you typo the ConfigMap name?
  • Are you pulling the latest tag by accident?
  • Are the command and args what you expect?
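
The full YAML is noisy; jsonpath lets you pull one suspect field at a time. A sketch, again assuming a Pod named my-pod:

# Which image (and tag) is the Pod actually running?
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].image}'

# What command and args did the container really receive?
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'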

Phase 2: Decoding Common States

State | Translation | Where to look
----- | ----------- | -------------
Pending | "I'm waiting for a home." The scheduler cannot find a node that fits this Pod (CPU/memory requests, taints, or affinity rules). | kubectl describe pod
ImagePullBackOff | "I can't get the package." The registry path is wrong, the tag doesn't exist, or you forgot the imagePullSecret. | kubectl describe pod
CrashLoopBackOff | "I started, but I died immediately." The app is buggy or misconfigured. | kubectl logs --previous
CreateContainerConfigError | "I can't find my keys." You referenced a Secret or ConfigMap that doesn't exist. | kubectl describe pod
OOMKilled | "I ate too much." The app used more RAM than its memory limit allowed. | kubectl describe pod (look for exit code 137)
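
If you suspect an OOM kill, the exit code is recorded in the container's last terminated state. A minimal sketch, assuming a single-container Pod named my-pod:

# Prints 137 if the previous container instance was OOM-killed
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'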

Phase 3: Advanced Debugging Tools

Sometimes logs aren't enough. You need to get inside.

1. kubectl exec (The Standard Way)

If the Pod is running (e.g., a web server that is returning 500 errors), jump inside to poke around.

kubectl exec -it my-pod -- /bin/sh
# Inside:
# curl localhost:8080
# cat /etc/config/app.conf
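
exec also runs one-off commands without an interactive session, which is handy for quick checks (the paths here are just examples):

# No shell session required
kubectl exec my-pod -- env
kubectl exec my-pod -- cat /etc/resolv.conf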

2. kubectl debug (The Modern Way)

Use this when the Pod is crashing (CrashLoopBackOff) or is built from a "distroless" image (no shell installed).

kubectl debug spins up an ephemeral debug container inside the broken Pod. With --target, it shares the target container's process namespace but brings its own tools.

# Attach a "busybox" container to your broken "my-app" pod
kubectl debug -it my-app --image=busybox --target=my-app-container
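
If the Pod restarts too quickly to attach to, another option is to clone it into a new Pod that you fully control. A sketch under the same placeholder names; my-app-debug is an arbitrary name for the copy:

# Clone the broken Pod and share its process namespace with a busybox container
kubectl debug my-app -it --image=busybox --copy-to=my-app-debug --share-processes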

3. Networking Debugging

If Service A can't talk to Service B:

  1. Check DNS: run nslookup my-service from inside a Pod (does it return an IP?).
  2. Check the Service selector: does the Service actually point to any Pods?
    • kubectl get endpoints my-service (if this is empty, your labels are wrong; see the sketch below).
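
Both checks can be run from a throwaway Pod, so you are testing cluster DNS rather than your own machine. A sketch, assuming a Service named my-service whose selector is app=my-app (both placeholders):

# 1. DNS: resolve the Service name from inside the cluster
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup my-service

# 2. Selector: compare what the Service wants with what the Pods carry
kubectl get svc my-service -o jsonpath='{.spec.selector}'
kubectl get pods -l app=my-app --show-labels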

Phase 4: Exit Codes (The Cheat Sheet)

Computers speak in numbers. Here is how to translate them.

Code | Meaning | Likely Cause
---- | ------- | -----------
0 | Success | The process finished its job and exited (normal for Jobs, bad for Deployments).
1 | Application error | Generic app crash. Check the logs.
137 | SIGKILL (128 + 9) | Out of memory. Raise the memory limit or fix the leak.
143 | SIGTERM (128 + 15) | Kubernetes asked the Pod to stop (normal during scale-down or rolling updates).
255 | Exit status out of range | Often a node-level failure (disk full, network partition); check both app logs and node health.
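
Codes above 128 encode a signal: exit code = 128 + signal number. You can translate the signal in any shell; 137 here is just the OOM example from the table:

# 137 - 128 = 9, and signal 9 is SIGKILL
kill -l $((137 - 128))   # prints: KILL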

Summary

  1. Don't Panic. Read the status.
  2. Events First: Always run kubectl describe before anything else.
  3. Logs Second: Check kubectl logs (and --previous).
  4. Validate Config: Ensure Secrets/ConfigMaps exist before the Pod starts.
  5. Use debug: Learn kubectl debug for difficult crashes.

Tip

The "Rubber Duck" Method

If you are stuck, read the kubectl describe output out loud line-by-line. You will almost always find the error staring you in the face (e.g., MountVolume.SetUp failed for volume "secret-key": secret "my-secret" not found).