Cluster Maintenance¶
Cluster maintenance should be repeatable, staged, and reversible.
This page focuses on safe day-2 operations for upgrades, node work, and certificate hygiene.
1) Backup and Recovery Readiness¶
Before control-plane changes, verify your recovery path.
For kubeadm-managed control planes, etcd snapshot workflow:
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
ETCDCTL_API=3 etcdctl snapshot status snapshot.db
Also validate restore procedure in a non-production environment.
2) Upgrade Sequencing¶
Follow Kubernetes version skew policy and vendor guidance.
Typical order:
- Upgrade control plane components.
- Upgrade worker nodes in controlled batches.
- Validate workload health between phases.
Golden rule: do not run kubelet newer than API server.
3) Node Maintenance Workflow¶
When patching or upgrading a node:
- Cordon the node.
- Drain workloads safely.
- Perform maintenance.
- Validate node health.
- Uncordon node.
kubectl cordon node-01
kubectl drain node-01 --ignore-daemonsets --delete-emptydir-data
# perform maintenance
kubectl uncordon node-01
Use PodDisruptionBudgets and rollout budgets so drains do not cause avoidable outages.
4) kubeadm Node Upgrade Pattern¶
Use target versions explicitly rather than hardcoded tutorial versions.
# control plane and worker workflows differ; follow kubeadm docs for your target version
sudo apt-get update
sudo apt-get install -y kubeadm=<target-version>
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=<target-version> kubectl=<target-version>
sudo systemctl restart kubelet
On RPM-based systems, use equivalent package manager commands.
5) Certificate Lifecycle¶
On kubeadm clusters, many internal certificates default to one-year validity.
Check regularly:
Renew as needed and restart affected components per kubeadm guidance:
6) OS and Kernel Patching¶
- Avoid unmanaged fleet-wide auto reboots.
- Reboot in waves.
- Use maintenance automation (for example, Kured) with disruption controls.
7) Post-Maintenance Validation¶
Validate:
- node readiness
- control-plane stability
- workload SLO indicators
- ingress and core service paths
Summary¶
- Maintenance is an operational workflow, not a one-off command.
- Backups and restore drills are part of upgrade safety.
- Respect version skew and execute upgrades in stages.
- Use cordon/drain/uncordon consistently.
- Keep certificate and OS patch routines scheduled and observable.