Scaling and HPA¶

Scaling in Kubernetes has three layers:

workload scaling: change pod replicas
node scaling: add or remove cluster nodes
resource sizing: change CPU or memory requests per pod

This page focuses on workload scaling with Horizontal Pod Autoscaler (HPA).

Manual scaling¶

Manual scaling is still useful for planned events:

kubectl scale deployment web --replicas=8

For live traffic variability, manual scaling does not react quickly enough.

How HPA works¶

HPA is a closed control loop driven by metrics:

flowchart LR
    MS[Metrics Server\nor custom adapter] -->|current usage| HPA[HPA controller]
    HPA -->|desired replicas| DEP[Deployment / StatefulSet]
    DEP -->|pod metrics| MS

Loop steps:

Read metrics for current pods (via Metrics API).
Compare current utilization to the configured target.
Compute desired replica count: desiredReplicas = ceil(currentReplicas × currentUtil / targetUtil).
Apply stabilization window to avoid oscillation.
Update the target workload replica count.

Prerequisites¶

HPA is only as good as metric quality.

Required baseline:

metrics pipeline available (metrics-server for CPU or memory)
workload has realistic resources.requests
readiness probes are configured so new pods enter traffic safely

If requests are missing, percentage-based resource targets become unreliable.

HPA example¶

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

This configuration scales up aggressively and scales down more cautiously to reduce flapping.

HPA troubleshooting¶

kubectl get hpa
kubectl describe hpa web-hpa
kubectl top pods -l app=web

Common failure patterns:

Unknown targets due to missing metrics pipeline
very slow response because pods have long startup times
oscillation caused by too-tight thresholds and no stabilization

HPA, VPA, and node autoscaling¶

HPA scales pod count horizontally.
VPA (Vertical Pod Autoscaler) adjusts pod resource requests over time - do not use HPA and VPA together on the same CPU/memory signal; they will conflict. VPA is safe to combine with HPA when HPA uses custom/external metrics instead.
Node autoscaler or Karpenter adds infrastructure capacity when pods cannot be scheduled.

Custom and external metrics¶

The built-in autoscaling/v2 HPA supports three metric types:

Resource: CPU or memory utilization against pod requests.
Pods: custom per-pod metric from an adapter (e.g. requests per second).
External: metric from an external system (e.g. queue depth from SQS or Kafka lag).

For event-driven scaling needs beyond what HPA covers natively, consider KEDA (Kubernetes Event-driven Autoscaling). KEDA extends HPA with out-of-the-box scalers for message queues, databases, HTTP traffic, and 70+ other sources - scaling to zero when idle is a key advantage.

Practical guidance¶

Start with CPU utilization targets around 50 to 70 percent
Tune using real production latency and error metrics, not only CPU
Set sensible min and max replica limits to protect cost and stability
Validate behavior with load tests before relying on autoscaling in production

Summary¶

HPA is a control loop, not a magic switch. It works well when metrics are trustworthy, pod requests are accurate, and rollout health checks are disciplined.