Argo Rollouts Blue-Green Deployments: What Zero Downtime Actually Requires in Production

Kubernetes Deployments have a strategy: RollingUpdate that most teams call zero-downtime. It isn't. During a rolling update, old and new versions run simultaneously, serving traffic to the same endpoints. There is no atomicity. A request can hit v1 and a subsequent request from the same user can hit v2, mid-rollout, before you have any confidence the new version works. If you need to roll back, you initiate another rolling update in reverse, while live traffic continues hitting whatever mix of versions exists at that moment.

Blue-green deployments - real ones - mean a different thing: two complete environments run in parallel, traffic is switched atomically from old to new, and rollback is instant because the old environment still exists. Argo Rollouts implements this correctly. The gap between what built-in Deployments call a rollout and what blue-green actually means is where production incidents live.

This guide covers how to implement true blue-green with Argo Rollouts in an enterprise environment: the architecture, traffic routing, automated analysis gates, database schema discipline, and the failure modes that surface once you get past the demo.


Situation

A typical enterprise service has several properties that make rolling updates quietly dangerous:

  • Stateful clients: browsers with cached sessions, mobile apps with long-lived connections, APIs with mid-flight batch requests
  • Database schema coupling: the new binary may apply schema migrations that the old binary cannot work with
  • Downstream dependencies: other services may call your API and depend on response contracts that changed
  • Audit and compliance gates: some organizations require a human approval step before live traffic reaches new code

Rolling updates expose users to both versions simultaneously for the duration of the rollout. If the new version has a subtle bug that only appears under load, you discover it while it is already serving traffic - and rollback is slow. If the new version wrote a database migration, rollback becomes schema surgery.

Blue-green removes most of this. The new version runs in complete isolation, gets exercised, gets analyzed, and only receives user traffic after a deliberate gate. Rollback is a single operation that shifts traffic back - the old version never stopped running.

Argo Rollouts implements blue-green with a Rollout CRD that replaces your Deployment, a controller that manages two ReplicaSets (active and preview), and hooks for automated canary analysis, ingress traffic splitting, and promotion gates.


Mental Model

Think of blue-green as two pools of Pods behind two Kubernetes Services:

  • Active service (my-app): receives production traffic. Points to the current stable ReplicaSet.
  • Preview service (my-app-preview): receives no production traffic. Points to the new ReplicaSet.

When you push a new image, Argo Rollouts creates a new ReplicaSet, scales it up, and routes it only to the preview service. You run tests, automated analysis, smoke checks, or manual validation against the preview service. When you are confident, you promote: Argo Rollouts atomically patches the active service selector to point to the new ReplicaSet. The old ReplicaSet scales down after a configurable scaleDownDelaySeconds.

Rollback before promotion is trivial - you just abort the rollout and the preview ReplicaSet scales to zero. Rollback after promotion is a new rollout in reverse, which is fast because the old ReplicaSet can be retained for a time window before scale-down.

The critical insight: traffic routing and Pod lifecycle are separate concerns. Kubernetes normally conflates them - Pods join a Service as soon as they pass readiness. Argo Rollouts separates them explicitly. New Pods can be running and healthy from Kubernetes' perspective while receiving zero production traffic, because the active service selector still points to the old ReplicaSet. This is the mechanism that makes zero downtime real.


How Argo Rollouts Works

The Rollout CRD

A Rollout is structurally similar to a Deployment. The spec includes a Pod template, replicas count, and selector. The strategy block replaces RollingUpdate:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-org/my-app:v2.1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 512Mi
  strategy:
    blueGreen:
      activeService: my-app
      previewService: my-app-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 600
      previewReplicaCount: 3
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 100

Key fields:

  • activeService / previewService: names of the two Services Argo Rollouts will manage selectors on
  • autoPromotionEnabled: false: requires an explicit promote command or webhook call before traffic shifts - this is correct for enterprise environments
  • scaleDownDelaySeconds: 600: keeps the old ReplicaSet alive for 10 minutes after promotion, giving you a fast rollback window
  • previewReplicaCount: 3: run a smaller preview stack (half the active replica count) to save resources during validation

The two Services

These are standard Kubernetes Services. Argo Rollouts manages their selectors - specifically the rollouts-pod-template-hash label it adds to Pods in each ReplicaSet. You do not manage those selectors manually.

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
  namespace: production
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

At rollout time, Argo Rollouts patches both services to add a hash selector that uniquely identifies the active or preview ReplicaSet. You never need to touch these selectors - the controller owns them.
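For illustration, once the controller has reconciled, the active Service's spec would look roughly like the fragment below. The hash value is a made-up placeholder; the real one is derived from the Pod template and differs for each ReplicaSet.

```yaml
# Sketch of the active Service spec after Argo Rollouts has patched it.
# The hash value is a hypothetical placeholder, not real output.
spec:
  selector:
    app: my-app
    rollouts-pod-template-hash: 7d9f8c6b54   # added and owned by the controller
  ports:
  - port: 80
    targetPort: 8080
```

The preview Service gets the same selector shape with the new ReplicaSet's hash. At promotion, the controller rewrites the hash on the active Service to match it.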

The controller

Argo Rollouts installs a cluster-scoped controller that watches Rollout, AnalysisRun, AnalysisTemplate, and Experiment resources. The controller is the entity that:

  • Creates and scales ReplicaSets
  • Patches Service selectors during promotion
  • Fires AnalysisRuns against your metrics backend
  • Handles ingress weight annotations for weighted traffic splits (less relevant for pure blue-green, critical for canary strategies)

Install via Helm:

helm repo add argo https://argoproj.github.io/argo-helm
helm install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts \
  --create-namespace \
  --set dashboard.enabled=true \
  --set dashboard.ingress.enabled=true

The kubectl plugin gives you rollout visibility:

kubectl argo rollouts get rollout my-app -n production --watch

Automated Analysis: The Gate That Matters

autoPromotionEnabled: false means you must explicitly promote. In practice, you want automated analysis to make that decision for you - otherwise you have added ops overhead without adding confidence.

Argo Rollouts has an AnalysisTemplate and AnalysisRun system. You define metric queries, success conditions, and failure thresholds. The Rollout's prePromotionAnalysis block runs these against the preview environment before allowing promotion. If analysis fails, the rollout is aborted automatically.

AnalysisTemplate against Prometheus

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 30s
    count: 5
    successCondition: result[0] >= 0.99
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
  - name: p99-latency
    interval: 30s
    count: 5
    successCondition: result[0] <= 200
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m])) by (le)
          ) * 1000

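This template runs two checks against the preview service, each taking five measurements 30 seconds apart (roughly two minutes in total): the success rate must stay at or above 99%, and p99 latency must stay at or below 200ms. With failureLimit: 1, each metric tolerates one failed measurement; a second failure fails the analysis and aborts the rollout. Set failureLimit: 0 if a single failed measurement should abort.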

Wire it into the Rollout:

strategy:
  blueGreen:
    activeService: my-app
    previewService: my-app-preview
    autoPromotionEnabled: false
    prePromotionAnalysis:
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: my-app-preview
    postPromotionAnalysis:
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: my-app
    scaleDownDelaySeconds: 600

postPromotionAnalysis runs the same checks against the now-active service after promotion. If it fails, Argo Rollouts enters an aborted state and automatically switches traffic back to the previous stable ReplicaSet - this is a full automatic rollback. The old ReplicaSet must still be running (within scaleDownDelaySeconds) for this to be instant; if it has already been scaled down, recovery requires a new rollout from the previous image.

Load generation for meaningful analysis

Pre-promotion analysis only works if the preview service is receiving traffic to analyze. With pure blue-green and no traffic, your Prometheus queries return no data. You have two options:

Option 1: Synthetic load against preview. Run a load generator Job as part of your CI pipeline that fires requests at the preview service URL immediately after the new ReplicaSet is healthy. This is simpler and doesn't require exposing preview to users.

Option 2: Traffic mirroring. If your ingress controller supports traffic mirroring (NGINX does via nginx.ingress.kubernetes.io/mirror-target), you can mirror production requests to the preview service. The annotation copies every matching request; sampling only a percentage needs extra configuration. The mirror requests are fire-and-forget - users only see responses from the active service - but the preview service processes real traffic patterns and emits real metrics.

Traffic mirroring is more representative but more complex to set up, especially for stateful APIs where mirrored requests might cause side effects (double-writes to databases, double-sends to queues). Mirror only at the ingress boundary and make sure your preview deployment has side-effect isolation - separate database schema or feature-flagged no-op paths for writes.
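A minimal sketch of Option 1, assuming a curl-based loop is enough to produce measurable traffic. The Job name, image tag, target path, request rate, and duration are all illustrative assumptions, not values from this guide; a real load test would use a dedicated tool and representative endpoints.

```yaml
# Hypothetical synthetic-load Job aimed at the preview Service.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-app-preview-load        # hypothetical name
  namespace: production
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600     # clean up the finished Job after 10 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: load
        image: curlimages/curl:8.8.0   # assumed tag; pin one you have vetted
        env:
        - name: TARGET
          # Cluster-internal DNS name of the preview Service; the path is an assumption
          value: http://my-app-preview.production.svc.cluster.local/
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Send about one request per second for five minutes.
          end=$(( $(date +%s) + 300 ))
          while [ "$(date +%s)" -lt "$end" ]; do
            curl -s -o /dev/null --max-time 2 "$TARGET" || true
            sleep 1
          done
```

Create the Job from the pipeline once the preview ReplicaSet reports ready, and keep it running for at least as long as the analysis window so every measurement sees traffic.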


Traffic Routing in Enterprise Environments

Pure blue-green at the Service level works cleanly in internal environments. Enterprise environments typically have additional routing layers that need explicit handling.

Ingress controllers

NGINX Ingress and similar controllers route to a Service by name. Point your production Ingress at the active Service, and your preview Ingress at the preview Service:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: production
spec:
  rules:
  - host: my-app.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app          # active service - Argo Rollouts patches its selector
            port:
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-preview
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: preview-basic-auth
spec:
  rules:
  - host: my-app-preview.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-preview  # preview service
            port:
              number: 80

The auth-secret annotation gates the preview endpoint to authorized testers only. No production user can accidentally reach the preview environment.

Service meshes (Istio, Linkerd)

With Istio, blue-green works through the same Service selector swap that Argo Rollouts performs on promotion - no trafficRouting block is needed or supported under the blueGreen strategy (that field is canary-only). Configure your VirtualService to route to the active Service by name; Argo Rollouts' selector patch on the active Service is what shifts traffic at promotion time:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
  namespace: production
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app      # active service - Argo Rollouts patches its selector at promotion
        port:
          number: 80
      weight: 100

Istio's value in a blue-green context is connection-level draining: the sidecar proxy honors the updated Service endpoint list and drains in-flight requests from old Pods while routing new connections to the new Pods. Pair this with a preStop sleep (see failure modes below) to cover the endpoint propagation window. For weighted traffic splitting between active and preview, use the canary strategy instead.

External load balancers and API gateways

If traffic enters your cluster through an external load balancer (AWS ALB, GCP Load Balancer) or an API gateway (Kong, Ambassador, Apigee), the promotion swap at the Service selector level may not be visible to that layer - it routes to a NodePort or ClusterIP that doesn't change.

In this case, the external layer needs its own configuration update at promotion time. prePromotionAnalysis and postPromotionAnalysis are analysis gates rather than general lifecycle hooks, so for side effects like updating an external gateway, use notification webhooks or an AnalysisTemplate metric backed by the job provider:

strategy:
  blueGreen:
    prePromotionAnalysis:
      templates:
      - templateName: success-rate
      # args is a sibling of templates, not a child of the template entry
      args:
      - name: service-name
        value: my-app-preview
    # After automated analysis passes, a webhook can trigger external gateway updates
    # before Argo Rollouts performs the service selector swap

In practice, this usually means your CI/CD pipeline (Argo CD, GitHub Actions) handles the external gateway config update as a step coordinated with the Argo Rollouts promotion - not something Argo Rollouts manages internally.


The Hard Part: Database Schema Compatibility

Traffic routing is the easy part of blue-green. The hard part is database schema compatibility. If your new version writes a schema migration that the old version cannot handle, you do not have a blue-green deployment - you have a deployment where rollback breaks your database.

The rule for true blue-green deployability:

Every database change must be backward-compatible with the previous release during the overlap window.

This means:

  • Add columns, never remove or rename: old code ignores new columns; new code uses them. Never rename a column in the same release that migrates its data - use a multi-step expand/migrate/contract pattern across multiple releases.
  • New tables are safe: old code doesn't query them.
  • Index changes are usually safe: they don't affect read/write compatibility.
  • Constraint additions are dangerous: if you add a NOT NULL constraint or unique constraint, old code writing nulls or duplicates will fail. Ship the constraint as its own schema-only release, after the code that satisfies it is already live and the existing data has been verified clean.
  • Enum changes: adding enum values is safe only if old code tolerates values it does not recognize. Removing or reordering enum values is not safe.

The practical pattern most enterprise teams use:

Release N:   Add new_column (nullable, no default)
Release N+1: Write to both columns and backfill new_column; release N code still writes only old_column
Release N+2: Make new_column NOT NULL; drop old_column from application reads
Release N+3: Drop old_column from schema

Each of these releases can be blue-green deployed safely because each schema change is compatible with the prior release's binary.

If you cannot or will not follow this discipline, blue-green is unsafe for releases that touch the database. A Kubernetes Deployment rolling update is also unsafe in this scenario - it just fails more slowly and less visibly. The answer is the schema discipline, not a different deployment strategy.

Running migrations

Migrations should run before new Pods start receiving traffic, not during application startup. An init container does not fit that rule: it runs once per Pod, so several replicas would race to apply the same migration. A pre-rollout Job that runs the migration before Argo Rollouts creates the preview ReplicaSet is the right model:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-app-migrate-v2-1-0
  namespace: production
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: my-org/my-app:v2.1.0
        command: ["./migrate", "--up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: my-app-db
              key: url

Run this Job in your CI pipeline before updating the Rollout's image. If the migration fails, the Rollout never starts and production traffic is unaffected. If the migration succeeds, the new Pods start, run against the already-migrated schema, and enter the preview phase.


Enterprise Promotion Gates

autoPromotionEnabled: false creates a manual gate. In an enterprise environment, that gate should be enforced by policy, not by convention. Several patterns work:

Argo CD ApplicationSet with approval gates

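If you use Argo CD, the Rollout update can be ordered behind the migration Job with sync waves, and the sync itself can be made a manual step in production:

# On the Rollout manifest that Argo CD manages (the migration Job carries sync-wave "1"):
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"

Wave 2 resources are applied only after Wave 1 (the migration Job) succeeds. Sync waves only order resources; they do not pause for approval. The approval gate comes from leaving automated sync off for the production Application, so that a person or pipeline has to trigger the sync.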

Promotion via webhook in CI

The most common enterprise pattern: your CI/CD pipeline (GitHub Actions, GitLab CI, Tekton) drives the promotion explicitly after automated gates pass:

# In CI, after image is pushed and pre-promotion checks complete:
kubectl argo rollouts promote my-app -n production

Gate the CI step behind required approvals in your CI platform. This creates an audit trail (who approved, when, which pipeline run) that satisfies compliance requirements better than a kubectl command run by hand.
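As a sketch of what such a gated step could look like in GitHub Actions: the workflow name, file path, and the production environment (with required reviewers configured on it) are assumptions for illustration. The workflow also assumes the runner already has kubectl, the Argo Rollouts plugin, and cluster credentials.

```yaml
# .github/workflows/promote.yaml (hypothetical path and names)
name: promote-my-app
on:
  workflow_dispatch: {}
jobs:
  promote:
    runs-on: ubuntu-latest
    # Required reviewers configured on this environment hold the job until approved.
    environment: production
    steps:
    - name: Promote the rollout
      # Assumes earlier setup (not shown) installed kubectl, the Argo Rollouts
      # plugin, and cluster credentials for the CI service account.
      run: kubectl argo rollouts promote my-app -n production
```

The approval record lives with the workflow run, which gives auditors the approver, the timestamp, and the exact commit in one place.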

RBAC on the Rollout resource

Restrict who can promote by limiting update access to Rollout resources in production:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-promoter
  namespace: production
rules:
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts/status"]
  verbs: ["update", "patch"]

kubectl argo rollouts promote patches the Rollout resource directly, so the rollouts resource needs update/patch - not just rollouts/status. Give rollout-promoter to your CI service account only. Human engineers in production get read-only access to Rollouts by default; promotion requires going through the pipeline.
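Binding the Role to the pipeline's identity might look like the following. The ServiceAccount name and its namespace are hypothetical; substitute whatever identity your CI system actually uses.

```yaml
# Grants the rollout-promoter Role to a CI service account only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-promoter-ci      # hypothetical name
  namespace: production
subjects:
- kind: ServiceAccount
  name: ci-deployer              # hypothetical CI service account
  namespace: ci                  # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-promoter
```

Because this is a namespaced RoleBinding, the grant applies only to Rollouts in production.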


Architecture and Tradeoffs

Blue-green doubles your resource footprint during rollout. For the duration of the preview phase, you run two full ReplicaSets. previewReplicaCount lets you run a smaller preview stack, but if you need representative load testing, you need representative replica counts. Budget for 2x capacity during deployments or use auto-scaling to absorb the burst.

Session stickiness breaks atomicity. If your load balancer or ingress uses session affinity (sticky sessions by cookie or IP), users may continue hitting the old version after promotion because their session is pinned. Istio and NGINX both support session affinity configuration. For blue-green to work cleanly, session affinity should either be off, or your application should handle session tokens in a way that both versions can validate (a shared session store, or JWTs signed with a key both versions trust).

Blue-green is not canary. Blue-green is binary: all traffic or no traffic. Canary releases - where you shift 5% of traffic to the new version, observe, then increment - are a separate strategy. Argo Rollouts supports both. For most enterprise deployments, blue-green with automated analysis provides enough confidence. Canary is better when you need real user traffic to expose issues that synthetic load cannot reproduce.

The preview service is a real attack surface. Even gated behind basic auth, my-app-preview.internal is a live endpoint running production code against your production database. Treat it accordingly: don't disable security middleware for preview, don't point it at a writable staging database that shares data with production, and don't leave it permanently accessible after a successful rollout.

scaleDownDelaySeconds is your rollback window. After promotion, the old ReplicaSet stays alive for this duration. If you need to rollback within that window, it's fast - Argo Rollouts just patches the active service selector back to the old ReplicaSet. After the window expires, the old Pods are terminated and rollback requires a new rollout from the previous image tag, which takes as long as any fresh deployment. Set this long enough to cover your post-promotion observation period (10–30 minutes is typical).

AnalysisRun metrics must be routed to the preview service, not the active service. This sounds obvious but breaks in practice when your Prometheus or observability setup scrapes by Pod label rather than by Service. Argo Rollouts injects a rollouts-pod-template-hash label into Pods that distinguishes the active and preview ReplicaSets. Make sure your Prometheus ServiceMonitor selects by this label if you need per-ReplicaSet metric isolation.


Failure Modes to Plan For

Promotion with in-flight requests still hitting old Pods

After promotion, the active Service selector switches to the new ReplicaSet. Existing connections to old Pods are not forcibly terminated - they drain naturally as they complete. If your old Pods are slow to stop accepting new connections after the Service selector switches, a brief window exists where some requests reach old Pods.

Mitigate this with a short preStop hook that sleeps before SIGTERM:

lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]

This gives the Service's endpoint controller time to propagate the selector change to kube-proxy or the CNI before the old Pods stop accepting connections. Five seconds is usually sufficient for endpoint propagation in a healthy cluster; increase to 15–30 seconds in clusters with slow endpoint propagation.

Preview ReplicaSet never becomes healthy

If the new version has a crash loop or fails readiness checks, Argo Rollouts keeps the rollout in a Progressing state waiting for the preview ReplicaSet to reach its desired ready count. The active service is unaffected - users hit only the stable version. This is the correct behavior.

Detect this via the Rollout status:

kubectl argo rollouts get rollout my-app -n production
# Look for: Status: Progressing (or Degraded)

Alert on Rollouts stuck in Progressing for more than your normal startup window (typically your initialDelaySeconds plus a safety margin). A healthy rollout should reach preview-ready within 2–3 minutes for most services.
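One way to bound the wait on the Rollout itself, sketched under the assumption that a five-minute budget suits the service: progressDeadlineSeconds sets how long the controller waits for progress before marking the Rollout Degraded, and progressDeadlineAbort additionally aborts the update at that point. The 300-second value is illustrative.

```yaml
# Fragment of the Rollout spec; the 300s value is an illustrative assumption.
spec:
  progressDeadlineSeconds: 300   # default is 600
  progressDeadlineAbort: true    # abort the update when the deadline passes
```

With the abort enabled, a crash-looping preview ReplicaSet is cleaned up without waiting for someone to notice the stuck rollout.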

AnalysisRun fires but metrics are missing

If your preview service receives no traffic (no synthetic load, no mirroring), the Prometheus queries return an empty result. Argo Rollouts does not treat that as a clean pass or fail. Indexing result[0] on an empty vector makes the measurement error out, and consecutive errors count against consecutiveErrorLimit (default 4) until the AnalysisRun ends in Error and the rollout aborts. A query that divides zero by zero returns NaN instead, which fails a bare successCondition comparison and counts as a failed measurement. Either way, the outcome reflects missing data, not the health of the new version.

Set explicit successCondition expressions that handle empty result sets:

successCondition: len(result) == 0 || result[0] >= 0.99

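This passes the check when no data exists (no traffic yet) and enforces the threshold once data arrives. The tradeoff is that if traffic never arrives, every measurement passes on empty data, so verify separately that the load generator or mirror is actually running. Pair this with a minimum observation duration so the analysis doesn't complete immediately on no-data: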

metrics:
- name: success-rate
  interval: 30s
  count: 10          # must complete 10 measurements (5 minutes)
  initialDelay: 60s  # wait 60s before first measurement (let load generator ramp up)

Rollback after post-promotion analysis fails

If your postPromotionAnalysis fails after the active service is already pointing to the new version, Argo Rollouts automatically enters an aborted state and switches traffic back to the previous stable ReplicaSet - unlike pre-promotion failures, this is an actual automatic rollback, not just a status update. If the old ReplicaSet is still running (within scaleDownDelaySeconds), the revert is instant: just a Service selector patch. If it has already been scaled down, recovery takes longer.

After an aborted rollout, check state and then re-attempt or roll forward:

# Check what happened:
kubectl argo rollouts get rollout my-app -n production

# Roll back to the previous image explicitly if needed:
kubectl argo rollouts undo my-app -n production

Pre-populate your runbooks with these commands. In a real incident you want to run one command, not diagnose Argo Rollouts docs.

Multiple services deploying simultaneously

If service A depends on service B and both have blue-green rollouts in progress at the same time, the combination of active and preview versions creates four possible request paths (A-active→B-active, A-active→B-preview, A-preview→B-active, A-preview→B-preview). Automated analysis tests preview in isolation, not the cross-service preview combination.

For services with strong coupling, coordinate rollout sequencing: deploy service B to production first, confirm it, then deploy service A. This reduces the dependency surface during validation. If you cannot sequence them, ensure the preview environments of both services are internally addressable so integration tests can specifically target A-preview→B-preview paths.


Practical Implementation Path

Start with a non-critical service in staging. Port one Deployment to a Rollout resource, set up both Services, and practice the manual promotion flow before adding analysis. Understand the state machine: Progressing → Paused → Progressing (after you promote) → Healthy, with Degraded as the failure state.

Convert your Deployment to a Rollout without downtime. The recommended approach for production is workloadRef: create a Rollout that references the existing Deployment, and let Argo Rollouts scale down the Deployment progressively as it scales up the Rollout. Alternatively, delete the Deployment while orphaning its ReplicaSet, then apply the Rollout to adopt the running Pods. This second path is less well-trodden than workloadRef, so rehearse it in staging and confirm that no new Pods are created:

# Delete the Deployment but keep the ReplicaSet and Pods running (--cascade=orphan):
kubectl delete deployment my-app -n production --cascade=orphan

# Apply the Rollout - the controller adopts the orphaned ReplicaSet:
kubectl apply -f my-app-rollout.yaml

# Verify the existing Pods are adopted (no new Pods should be created):
kubectl argo rollouts get rollout my-app -n production

Do NOT use kubectl scale deployment --replicas=0 for a zero-downtime migration - scaling the Deployment to zero terminates all existing Pods before the Rollout creates replacements, causing a restart. The --cascade=orphan flag is what preserves the running Pods.
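A minimal sketch of the workloadRef approach, reusing the names from earlier in this guide. The Rollout takes its Pod template from the referenced Deployment, so no template block is needed; choosing scaleDown: progressively is an assumption about how you want the Deployment retired.

```yaml
# Rollout that references the existing Deployment instead of embedding a Pod template.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
    scaleDown: progressively   # other accepted values: never, onsuccess
  strategy:
    blueGreen:
      activeService: my-app
      previewService: my-app-preview
      autoPromotionEnabled: false
```

Once the Rollout is Healthy and serving traffic, the Deployment manifest can be removed from source control in a later change.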

Add analysis incrementally. Start with autoPromotionEnabled: false and promote manually for the first several rollouts. Once you understand what healthy metrics look like in your environment, add AnalysisTemplates with loose thresholds. Tighten thresholds over time as you build confidence. Don't set a 99.9% success rate threshold before you know your baseline.

Instrument the preview service specifically. Add a label to the preview Pods via the Rollout's template.metadata.labels (Argo Rollouts adds rollouts-pod-template-hash automatically, but you can add your own version: preview label) and configure Prometheus ServiceMonitors to scrape by that label. This ensures your AnalysisRun queries target preview metrics, not production metrics.

Build rollback into your runbooks now. Document: (1) how to check rollout status, (2) how to abort a rollout in preview, (3) how to roll back a promoted rollout, (4) who has permission to do each. Test the rollback path in staging - actually roll back a deployment. The first time you execute rollback should not be during an incident.

Set alerts on Rollout conditions, not just service health. An Argo Rollouts Degraded state may not immediately cause user-visible errors (the active service may still be healthy), but it indicates the deployment pipeline is broken. Alert on rollout_info{phase!="Healthy"} being non-zero for more than 5 minutes in any production namespace. (The metric is rollout_info with a phase label; the older rollout_phase metric exists but is deprecated.)
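If you run the Prometheus Operator, the alert could be expressed as a PrometheusRule along these lines. The rule name, namespace, and severity label are assumptions, and the controller's metrics endpoint must already be scraped.

```yaml
# Hypothetical alert on Rollouts that stay out of the Healthy phase.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-rollouts-alerts     # hypothetical name
  namespace: monitoring          # hypothetical namespace
spec:
  groups:
  - name: argo-rollouts
    rules:
    - alert: RolloutNotHealthy
      # rollout_info is 1 for the phase a Rollout is currently in
      expr: rollout_info{namespace="production", phase!="Healthy"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Rollout {{ $labels.name }} has been in phase {{ $labels.phase }} for more than 5 minutes"
```

A Rollout waiting on manual promotion reports the Paused phase, so either lengthen the for: window past your normal approval time or exclude Paused from the selector.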


Mastery Check

Your team runs a blue-green rollout for a service that handles financial transactions. The new version is up in preview. Pre-promotion AnalysisRun passes - 99.8% success rate, p99 latency under 150ms. You promote. Immediately after promotion, the on-call pager fires: error rate on the service spiked to 12% for about 90 seconds, then recovered. What happened?

Answer

This is connection drain. After the active service selector switched to the new ReplicaSet, old Pods were still receiving requests: in-flight ones, plus new ones routed by nodes that had not yet picked up the endpoint change. When those Pods began shutting down (SIGTERM fired), they dropped some connections without completing the responses. That this happened right after promotion implies the old ReplicaSet was scaled down almost immediately, that is, with a very short scaleDownDelaySeconds. The 90-second duration is consistent with a terminationGracePeriodSeconds of about that length.

The fix is a preStop sleep and ensuring your application handles SIGTERM by finishing in-flight requests before exiting:

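# Pod template spec: terminationGracePeriodSeconds is Pod-level, lifecycle is per container
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: my-app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]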

The preStop sleep gives kube-proxy time to remove the old Pod's endpoint before SIGTERM fires. The 60-second termination period gives in-flight requests time to complete. Together, they eliminate the spike. Your blue-green swap was atomic at the Service selector level - but the underlying network layer needed time to propagate that change to every node's iptables or IPVS rules. The preStop sleep bridges that gap.