Scaling and HPA¶
Scaling in Kubernetes has three layers:
- workload scaling: change pod replicas
- node scaling: add or remove cluster nodes
- resource sizing: change CPU or memory requests per pod
This page focuses on workload scaling with Horizontal Pod Autoscaler (HPA).
Manual scaling¶
Manual scaling is still useful for planned events such as launches, batch windows, or maintenance. For live traffic variability, however, manual scaling does not react quickly enough.
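A minimal example of pinning a replica count ahead of a planned event, assuming a Deployment named `web`:

```shell
# Pin the Deployment to a fixed replica count before the event
kubectl scale deployment/web --replicas=6

# Confirm the new replica count has been applied
kubectl get deployment web
```

Note that once an HPA targets this Deployment, the controller will override any manually set replica count.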
How HPA works¶
HPA watches metrics and adjusts replica count toward a target.
Typical loop:
- read metrics for current pods
- compare current utilization to desired target
- compute desired replica count
- update target Deployment or StatefulSet
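The core calculation in that loop is documented Kubernetes behavior: desired replicas are the current replicas scaled by the ratio of observed to target metric value, rounded up. A minimal sketch in Python (the function name is illustrative):

```python
import math

def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float) -> int:
    """Kubernetes HPA core formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
    """
    return math.ceil(current_replicas * (current_value / target_value))

# 4 pods at 90% average CPU against a 60% target -> scale to 6
print(desired_replicas(4, 90, 60))
```

The real controller then clamps this value between `minReplicas` and `maxReplicas` and applies any `behavior` rate limits before acting.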
Prerequisites¶
HPA is only as good as metric quality.
Required baseline:
- metrics pipeline available (`metrics-server` for CPU or memory)
- workload has realistic `resources.requests`
- readiness probes are configured so new pods enter traffic safely
If requests are missing, percentage-based resource targets become unreliable.
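For example, a container spec with explicit requests might look like this (the values are illustrative, not recommendations):

```yaml
# HPA utilization targets are computed against these requests,
# not against node capacity, so they must reflect real usage.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
```

With a 250m CPU request, a 60% utilization target means the HPA aims to keep average pod usage near 150m.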
HPA example¶
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
```
This configuration scales up aggressively and scales down more cautiously to reduce flapping.
HPA troubleshooting¶
Common failure patterns:
- `unknown` targets due to a missing metrics pipeline
- very slow response because pods have long startup times
- oscillation caused by too-tight thresholds and no stabilization
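To see what the controller is actually doing, inspect the HPA object directly (using the `web-hpa` name from the example above):

```shell
# TARGETS showing "unknown" means metrics are not being reported
kubectl get hpa web-hpa

# Events and conditions explain why scaling did or did not happen
kubectl describe hpa web-hpa
```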
HPA, VPA, and node autoscaling¶
- HPA scales pod count
- VPA adjusts pod resource requests
- node autoscaler or Karpenter adds infrastructure capacity
You can combine them, but do not let HPA and VPA act on the same metric (for example, CPU) without a deliberate design, or the two controllers will fight each other.
Practical guidance¶
- Start with CPU utilization targets around 50 to 70 percent
- Tune using real production latency and error metrics, not only CPU
- Set sensible min and max replica limits to protect cost and stability
- Validate behavior with load tests before relying on autoscaling in production
Summary¶
HPA is a control loop, not a magic switch. It works well when metrics are trustworthy, pod requests are accurate, and rollout health checks are disciplined.