Scaling and HPA¶
Scaling in Kubernetes has three layers:
- workload scaling: change pod replicas
- node scaling: add or remove cluster nodes
- resource sizing: change CPU or memory requests per pod
This page focuses on workload scaling with Horizontal Pod Autoscaler (HPA).
Manual scaling¶
Manual scaling (setting the replica count directly, for example with kubectl scale) is still useful for planned events.
For live traffic variability, manual scaling does not react quickly enough.
How HPA works¶
HPA is a closed control loop driven by metrics:
flowchart LR
MS[Metrics Server\nor custom adapter] -->|current usage| HPA[HPA controller]
HPA -->|desired replicas| DEP[Deployment / StatefulSet]
DEP -->|pod metrics| MS
Loop steps:
- Read metrics for current pods (via Metrics API).
- Compare current utilization to the configured target.
- Compute desired replica count: desiredReplicas = ceil(currentReplicas × currentUtil / targetUtil).
- Apply stabilization window to avoid oscillation.
- Update the target workload replica count.
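The replica calculation in the loop steps can be sketched in a few lines of Python. This is a minimal illustration, not the controller's implementation: the function name desired_replicas and its parameters are hypothetical, and the sketch ignores the tolerance band, missing-metric handling, and stabilization that the real controller applies. The clamp to minimum and maximum replicas mirrors the minReplicas and maxReplicas fields of an HPA spec.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=1, max_replicas=10):
    """Return ceil(currentReplicas * currentUtil / targetUtil), clamped.

    Hypothetical helper for illustration only; utilization values are
    percentages of the pods' resource requests.
    """
    raw = math.ceil(current_replicas * current_util / target_util)
    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods at 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=12))    # 6
# 5 pods at 30% against a 60% target: scale in (2.5 rounds up to 3).
print(desired_replicas(5, 30, 60, min_replicas=2, max_replicas=12))    # 3
# 10 pods at 120%: the raw answer is 20, capped by max_replicas.
print(desired_replicas(10, 120, 60, min_replicas=2, max_replicas=12))  # 12
```

Because the result is rounded up, the formula errs toward extra capacity rather than too little.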
Prerequisites¶
HPA is only as good as metric quality.
Required baseline:
- metrics pipeline available (metrics-server for CPU or memory)
- workload has realistic resources.requests
- readiness probes are configured so new pods enter traffic safely
If requests are missing, percentage-based resource targets become unreliable.
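To see why requests matter, the sketch below models percentage utilization as total usage divided by total requests across pods. It is a simplified model under stated assumptions: the function name average_utilization is hypothetical, values are CPU millicores, and the real controller handles not-ready pods and missing samples in ways not shown here.

```python
def average_utilization(usage_millicores, request_millicores):
    """Percent utilization: summed usage over summed requests across pods.

    Hypothetical helper; each list holds one value per pod.
    """
    if len(usage_millicores) != len(request_millicores):
        raise ValueError("need one usage and one request value per pod")
    total_request = sum(request_millicores)
    if total_request == 0:
        # Without requests there is no denominator, so no percentage.
        raise ValueError("utilization is undefined when requests are missing")
    return 100 * sum(usage_millicores) / total_request


usage = [300, 300]  # two pods, each using 300m CPU

# Realistic requests of 500m per pod: utilization sits at the 60% target.
print(average_utilization(usage, [500, 500]))  # 60.0
# Requests set far too low (100m): the same load reads as 300%.
print(average_utilization(usage, [100, 100]))  # 300.0
```

The workload is identical in both calls; only the declared requests differ, yet the second case would drive the autoscaler to add replicas it does not need.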
HPA example¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
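This configuration scales up aggressively (no stabilization window, and the replica count may double every 60 seconds) and scales down more cautiously (a 300-second stabilization window, then at most 20 percent of replicas removed per 60 seconds) to reduce flapping.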
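The effect of the two Percent policies can be approximated with a short simulation. This is a rough sketch, not the controller's algorithm: the names next_replicas and trajectory are hypothetical, each step stands for one 60-second policy period, the stabilization windows are ignored, and the rounding (up when growing, down when shrinking) is an assumption of this model.

```python
def next_replicas(current, desired, up_percent=100, down_percent=20):
    """Move toward `desired`, limited by a percent policy for one period."""
    if desired > current:
        # Largest count allowed this period, rounded up (integer ceiling).
        ceiling = -(-current * (100 + up_percent) // 100)
        return min(desired, ceiling)
    if desired < current:
        # Smallest count allowed this period, rounded down.
        floor = current * (100 - down_percent) // 100
        return max(desired, floor)
    return current


def trajectory(start, desired, max_periods=50, **policy):
    """Replica counts per period until `desired` is reached."""
    steps = [start]
    for _ in range(max_periods):
        if steps[-1] == desired:
            break
        steps.append(next_replicas(steps[-1], desired, **policy))
    return steps


print(trajectory(2, 12))  # [2, 4, 8, 12]
print(trajectory(12, 2))  # [12, 9, 7, 5, 4, 3, 2]
```

Under these assumptions, growing from 2 to 12 replicas takes three periods, while shrinking back takes six, on top of the 300-second scale-down stabilization window.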
HPA troubleshooting¶
Common failure patterns:
- Unknown targets due to missing metrics pipeline
- very slow response because pods have long startup times
- oscillation caused by too-tight thresholds and no stabilization
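The oscillation case is easier to reason about with a toy model of scale-down stabilization, in which the autoscaler acts on the highest recommendation seen within a recent window rather than on the latest one. The sketch below is illustrative only: the function name stabilized is hypothetical, and the window is counted in evaluation ticks instead of seconds.

```python
from collections import deque


def stabilized(recommendations, window):
    """For each tick, return the highest recommendation in the last `window` ticks."""
    recent = deque(maxlen=window)  # old entries fall out automatically
    result = []
    for replicas in recommendations:
        recent.append(replicas)
        result.append(max(recent))
    return result


flapping = [6, 3, 6, 3, 6, 3]
print(stabilized(flapping, window=1))  # [6, 3, 6, 3, 6, 3]  no stabilization
print(stabilized(flapping, window=3))  # [6, 6, 6, 6, 6, 6]  flapping suppressed

sustained_drop = [6, 6, 3, 3, 3, 3]
print(stabilized(sustained_drop, window=3))  # [6, 6, 6, 6, 3, 3]
```

A short-lived dip never reaches the workload, while a sustained drop in load still takes effect once it has outlasted the window.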
HPA, VPA, and node autoscaling¶
- HPA scales pod count horizontally.
- VPA (Vertical Pod Autoscaler) adjusts pod resource requests over time. Do not use HPA and VPA together on the same CPU/memory signal; they will conflict. VPA is safe to combine with HPA when HPA uses custom/external metrics instead.
- A node autoscaler (Cluster Autoscaler or Karpenter) adds infrastructure capacity when pods cannot be scheduled.
Custom and external metrics¶
The built-in autoscaling/v2 HPA supports three metric types:
- Resource: CPU or memory utilization against pod requests.
- Pods: custom per-pod metric from an adapter (e.g. requests per second).
- External: metric from an external system (e.g. queue depth from SQS or Kafka lag).
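For Pods and External metrics that target an average value per pod, the replica count follows from dividing the total metric by that target. The sketch below assumes such an average-value target; the function name replicas_for_average_value and the sample numbers are hypothetical, and replica bounds and tolerance are left out.

```python
import math


def replicas_for_average_value(total_value, target_per_pod):
    """Replicas needed so that each pod carries at most `target_per_pod`."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    return math.ceil(total_value / target_per_pod)


# Pods metric: 3 pods serving 150 requests per second each, target 100 per pod.
print(replicas_for_average_value(3 * 150, 100))  # 5
# External metric: a queue holding 400 messages, target 100 messages per pod.
print(replicas_for_average_value(400, 100))      # 4
```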
For event-driven scaling needs beyond what HPA covers natively, consider KEDA (Kubernetes Event-driven Autoscaling). KEDA extends HPA with out-of-the-box scalers for message queues, databases, HTTP traffic, and 70+ other sources. Its ability to scale to zero when idle is a key advantage.
Practical guidance¶
- Start with CPU utilization targets around 50 to 70 percent
- Tune using real production latency and error metrics, not only CPU
- Set sensible min and max replica limits to protect cost and stability
- Validate behavior with load tests before relying on autoscaling in production
Summary¶
HPA is a control loop, not a magic switch. It works well when metrics are trustworthy, pod requests are accurate, and rollout health checks are disciplined.