Multi-Cloud Kubernetes: The Honest Take

A frazzled engineer juggles flaming servers labeled AWS, GCP, and Azure while a calm colleague runs a single well-organized cluster next door.

Every few months the conversation surfaces again in architecture review meetings: should we go multi-cloud? The pitch is seductive - avoid vendor lock-in, improve resilience, use the best compute from each provider, gain negotiating leverage at renewal time. Some of those reasons are real. Most organizations citing them have not actually stress-tested whether they apply to their situation.

Multi-cloud Kubernetes is a legitimate architecture for a narrow set of problems. For everyone else, it is a very expensive way to solve the wrong problem - and the bill arrives slowly enough that you don't notice until you're well past the point where reversing course is easy. By then, you've also hired two people specifically to maintain it, so reversing course is now a headcount conversation too.


First: Be Specific About What You Actually Mean

"Multi-cloud" covers several meaningfully different things, and failing to be precise about which one you're proposing is where most of the confusion starts. Slide decks rarely clarify this, because the answer affects the budget slide.

Active-active across clouds - workloads running simultaneously on GKE and EKS, with traffic distributed across both in real time. This is the hardest version and the one most frequently featured in vendor whitepapers, conveniently illustrated with clean diagrams and no on-call rotation attached.

Active-passive DR - primary workloads on one cloud, a standby cluster on another you can fail over to. More tractable than active-active, but the gap between "we have a cluster over there" and "we can actually fail over to it under pressure, at 2am, with half the team out sick" is where most teams discover what they forgot to plan for.

Portability as insurance - running on one cloud today but maintaining the posture that you could move if needed. This is the most frequently stated reason and the hardest to maintain without constant, conscious discipline. Like a gym membership: the value is real, but only if you actually use it. Most organizations review the concept quarterly in a slide deck and call it done.

Be honest about which one you're proposing before the architecture review. The operational cost varies by roughly an order of magnitude between them.


The Reasons People Give (And How They Actually Hold Up)

"We want to avoid vendor lock-in."

This is the most cited reason and the least examined.

Here's the thing: Kubernetes itself is remarkably consistent. A Deployment is a Deployment on EKS, GKE, and AKS. The API surface is stable, well-specified, and genuinely portable - it's one of the more impressive feats of standardization the industry has pulled off. Apply a manifest that uses only core API resources on one conformant cluster and it behaves the same way on another.

The lock-in lives at the layer below it. Nobody runs just the Kubernetes primitives in production. You run EKS with IRSA (IAM Roles for Service Accounts, which authenticates pods to AWS services via OIDC federation) and EBS volumes and ALB ingress annotations. Or GKE with Workload Identity (which binds Kubernetes service accounts to GCP IAM via a metadata server that intercepts credential requests) and Persistent Disk and Cloud Load Balancing. Or AKS with Microsoft Entra Workload ID (same OIDC-based concept, completely different implementation) and Azure Disk and Azure Load Balancer.

These are not interchangeable. The IAM integrations share a concept but differ in implementation across all three providers. The default storage classes and their CSI provisioners are different. The load balancer annotations are different. Kubernetes did its job and gave you a consistent control plane - it's the managed services each provider wraps around it that create the friction.
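To make the storage difference concrete, here is a minimal sketch that builds the "same" StorageClass for each provider. The provisioner strings are the ones published by each provider's disk CSI driver; the disk-type parameters and the `storage_class` helper are illustrative assumptions, so check them against the driver versions you actually run.

```python
# CSI provisioner names used by each provider's block-storage driver.
CSI_PROVISIONERS = {
    "eks": "ebs.csi.aws.com",
    "gke": "pd.csi.storage.gke.io",
    "aks": "disk.csi.azure.com",
}

# Disk-type parameters: illustrative choices (assumptions), not recommendations.
DISK_PARAMETERS = {
    "eks": {"type": "gp3"},
    "gke": {"type": "pd-balanced"},
    "aks": {"skuName": "StandardSSD_LRS"},
}


def storage_class(provider: str, name: str = "fast") -> dict:
    """Build a StorageClass manifest (as a plain dict) for one provider."""
    if provider not in CSI_PROVISIONERS:
        raise ValueError(f"unknown provider: {provider}")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": CSI_PROVISIONERS[provider],
        "parameters": DISK_PARAMETERS[provider],
        "volumeBindingMode": "WaitForFirstConsumer",
    }


# Same kind, same name, three different provisioners and parameter sets.
for provider in CSI_PROVISIONERS:
    manifest = storage_class(provider)
    print(provider, manifest["provisioner"], manifest["parameters"])
```

The Kubernetes-level fields are identical everywhere; everything provider-specific hides in `provisioner` and `parameters`, which is exactly the part a portable manifest can't share.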

The deeper irony: if you go multi-cloud specifically to avoid lock-in, you tend to end up locked in to the abstraction layer you built to paper over those provider differences - a lock-in entirely of your own creation, with no vendor SLA, no support contract, and one or two engineers who truly understand it. Congratulations, you have achieved full sovereignty over something you now have to maintain forever. You can file a support ticket with yourself.

The organizations that achieve genuine portability do it by running the common-denominator stack everywhere: Vault instead of cloud-native secrets managers, a software load balancer instead of provider-managed LBs, storage classes that work across environments. That is a real choice, and it works. It also means forgoing managed databases, managed message queues, and a long list of cloud-native services that dramatically reduce operational burden. Most teams are not willing to make that trade - and honestly, they usually shouldn't have to.

"We need resilience."

This one is real. Multi-cloud is almost never the right answer for it.

The failure modes that take down a cloud provider at scale are not the ones that typically affect production workloads day-to-day. Region-level outages happen; they are relatively rare and usually partial. The first line of defense is multi-region within a single cloud - which most organizations have not fully built out before they start discussing multi-cloud. Running in us-east-1 and eu-west-1 on AWS is dramatically simpler than splitting across AWS and GCP, and it handles the actual blast radius of most real incidents.

Multi-cloud resilience starts to make genuine sense in a narrow set of scenarios: a cloud provider's global systems fail entirely, an account gets suspended (it happens - and when it does, the support queue is not shorter just because you're on fire), or there's a pricing or policy change that requires rapid exit. These are real risks for some organizations and edge-case concerns for most.

The organizations for whom multi-cloud DR is genuinely worth the cost tend to have regulatory requirements around provider diversity, revenue exposure large enough that cloud availability risk has material financial impact, or legal requirements around geographic and provider distribution. If none of those apply, multi-region on one provider is almost certainly the right call - and cheaper to operate correctly.

"We can use the best compute from each provider."

Sometimes there's a genuine technical reason to run specific workloads on a second cloud - TPU access on GCP, a particular GPU SKU that's backordered on your primary provider, a managed AI service with no real equivalent elsewhere. This can be a legitimate, narrow architectural decision.

But model the egress costs before committing. Data moving between cloud providers is billed at internet egress rates by the sending side - AWS's list price for data leaving to the internet starts at roughly $0.09/GB, with similar tiered structures on GCP and Azure. Your egress bill does not care about your architectural ambitions. Move enough data between providers and the economics of "better compute elsewhere" get humbling fast.
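A back-of-the-envelope model is enough to start that conversation. The sketch below assumes a flat $0.09/GB rate and ignores free tiers, volume discounts, and committed-use pricing, so treat the result as a rough ceiling rather than a quote. The function name and the example volume are hypothetical.

```python
def monthly_egress_cost(gb_per_day: float, rate_per_gb: float = 0.09, days: int = 30) -> float:
    """Estimate monthly cross-cloud egress spend at a flat per-GB rate.

    Assumption: a single flat rate, with no tiering or free allowance.
    """
    return round(gb_per_day * days * rate_per_gb, 2)


# Hypothetical workload: 2 TB/day (2000 GB) of training data shipped to a second cloud.
print(monthly_egress_cost(2000))
```

At that volume the flat-rate estimate lands in the region of $5,400 a month, before the data has done anything useful on the other side.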

The question is never "does provider X do Y better?" It's "does provider X do Y enough better to justify egress costs, a cross-cloud data pipeline, and the overhead of maintaining real operational expertise across two cloud environments on your team?" That's a much harder bar to clear.

"We'll have negotiating leverage at renewal."

True, and worth almost nothing if the capability isn't real. Cloud providers have seen the "we're evaluating alternatives" conversation enough times to know the difference between a team with an operational multi-cloud setup and one that spun up a proof-of-concept cluster two years ago and never touched it.

The good news: credible leverage doesn't require a full active-active production architecture. It requires that the work to migrate is understood, scoped, and demonstrably achievable. That's a different - and much cheaper - investment than actually running production across two providers.


What Actually Gets Hard

Networking

Cross-cloud networking is where multi-cloud ambitions collide with physics fastest.

Latency between providers varies significantly by geography. Geographically proximate data centers across different providers - say, AWS us-east-1 and Azure East US - can show round-trip latencies under 5ms. Intercontinental cross-cloud paths regularly exceed 130ms. For async workloads and batch processing, this is usually fine. For services making synchronous calls across the cloud boundary, every hop pays that penalty - and in a microservices architecture with multiple layers of calls, it compounds in ways that become obvious only in production load tests, which is a bad time to learn about them.
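The compounding is simple arithmetic, which is what makes it easy to overlook. This sketch assumes each synchronous cross-cloud call costs one full round trip and that the calls happen sequentially; it ignores connection setup, retries, and queuing, all of which make things worse. The function name is hypothetical.

```python
def added_latency_ms(sequential_cross_cloud_calls: int, rtt_ms: float) -> float:
    """Extra request latency from sequential synchronous calls across the cloud boundary."""
    return sequential_cross_cloud_calls * rtt_ms


# A request path that crosses the boundary four times, one call after another.
print(added_latency_ms(4, 5))    # nearby regions: 20 ms added
print(added_latency_ms(4, 130))  # intercontinental: 520 ms added
```

Four hops at 5ms is noise. Four hops at 130ms is more than half a second before any service has done actual work.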

Within a single cluster, Kubernetes service discovery just works - it's one of those things the ecosystem got genuinely right. Spanning that across clouds is a different story. Kubernetes services are cluster-local by default. Bridging them requires additional infrastructure: Istio's multi-cluster setup requires a shared root of trust for cross-cluster mTLS and, when the clusters sit on separate networks, dedicated east-west gateways for cross-network traffic. Cilium Cluster Mesh takes a different approach: each cluster runs a clustermesh-apiserver that exposes its service and identity state to the others, while the eBPF datapath handles cross-cluster load balancing - lower ceremony than Istio but still real operational overhead that doesn't exist in a single-cluster world. Either way, your mesh control plane becomes a shared dependency across both clouds, which means its failure mode is now everyone's problem simultaneously.

Egress costs compound quietly. Cross-cloud data movement is billed at internet egress rates on the sending side. At meaningful scale, these charges stop being line items and start being conversations with your CFO.

Storage

Stateless workloads on Kubernetes are genuinely portable - deploy the same manifest, get the same behavior. This is not an accident; it reflects years of work on the container and orchestration layer to make it so. Stateful workloads, however, don't care about your portability story.

If your application writes to a Persistent Volume, that volume exists in one cloud. Getting it to another for DR means running a replication layer: Velero for Kubernetes resource and PV backups, database-native cross-cloud replication, or a storage layer like Rook/Ceph operating across environments. These work. They also add operational complexity, introduce their own failure modes, and have RPO/RTO characteristics that need to be understood and tested before you need them for real.
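One RPO characteristic worth understanding up front: periodic backups do not give you an RPO equal to the backup interval. The sketch below uses a deliberately simplified model (each backup captures state as of its start time, and a failure can land just before the next backup completes), and the function names are hypothetical. Real tools and databases have their own semantics, so measure rather than assume.

```python
def worst_case_rpo_minutes(backup_interval_min: float, backup_duration_min: float) -> float:
    """Worst-case data loss window for periodic backups under the simplified model.

    If the failure hits just before the next backup finishes, the newest usable
    backup reflects state from one interval plus one backup duration ago.
    """
    return backup_interval_min + backup_duration_min


def drill_meets_targets(measured_rpo_min: float, measured_rto_min: float,
                        target_rpo_min: float, target_rto_min: float) -> bool:
    """Compare numbers measured in a failover drill against the stated targets."""
    return measured_rpo_min <= target_rpo_min and measured_rto_min <= target_rto_min


# Hourly backups that take 10 minutes to complete, against a 60-minute RPO target.
rpo = worst_case_rpo_minutes(60, 10)
print(rpo)  # 70
print(drill_meets_targets(rpo, measured_rto_min=45, target_rpo_min=60, target_rto_min=60))  # False
```

Hourly backups against an hourly RPO target looks fine on a slide and fails on paper, before anyone has even run the drill.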

Most organizations that describe their stateful workloads as "multi-cloud" are actually running stateless Kubernetes workloads that connect to a cloud-native database on a primary provider. That's fine - it just means the database isn't multi-cloud, and database portability is its own (harder) problem.

Identity and Secrets

AWS IRSA, GKE Workload Identity, and AKS Microsoft Entra Workload ID all solve the same problem - how does this pod authenticate to cloud services without hardcoding credentials? - using architecturally similar but practically incompatible mechanisms. A pod authenticating to S3 via IRSA needs a completely different configuration to authenticate to GCS. Running workloads across both clouds means either adopting a provider-neutral layer (HashiCorp Vault, or External Secrets Operator configured to pull from multiple backends) or maintaining separate identity patterns per cloud. The latter is both an operational support burden and an expanded security audit surface - two things most teams are not looking to add.
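As a rough illustration of "same problem, incompatible configuration", the sketch below builds the ServiceAccount each provider expects. The annotation keys are the documented ones for IRSA, GKE Workload Identity, and Microsoft Entra Workload ID; the helper name and every identity value are placeholders. It also leaves out the cloud-side setup (IAM role trust policies, IAM bindings, federated credentials) and the pod label that AKS additionally requires.

```python
# Annotation key each provider reads from the Kubernetes ServiceAccount.
IDENTITY_ANNOTATION = {
    "eks": "eks.amazonaws.com/role-arn",         # value: an IAM role ARN
    "gke": "iam.gke.io/gcp-service-account",     # value: a GCP service account email
    "aks": "azure.workload.identity/client-id",  # value: an Entra application client ID
}


def service_account(provider: str, name: str, namespace: str, identity: str) -> dict:
    """Build a ServiceAccount manifest (as a dict) wired to one provider's identity system."""
    if provider not in IDENTITY_ANNOTATION:
        raise ValueError(f"unknown provider: {provider}")
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {IDENTITY_ANNOTATION[provider]: identity},
        },
    }


# Placeholder identities for the same logical workload on three clouds.
examples = {
    "eks": "arn:aws:iam::111122223333:role/reports-reader",
    "gke": "reports-reader@example-project.iam.gserviceaccount.com",
    "aks": "00000000-0000-0000-0000-000000000000",
}
for provider, identity in examples.items():
    print(service_account(provider, "reports", "default", identity)["metadata"]["annotations"])
```

One workload, three annotation keys, three identity formats, and three separate sets of cloud-side configuration that this sketch doesn't even show.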

Day-2 Operations

This is the one that gets glossed over most consistently. Standing up clusters on a second cloud is a project. Operating them indefinitely is a commitment that doesn't appear on any architecture diagram.

Your CI/CD pipelines now target multiple cluster contexts. Your upgrade schedule needs to track multiple managed Kubernetes release cadences - GKE, EKS, and AKS don't ship the same minor versions on the same timeline, and their upgrade processes differ in meaningful ways. Your on-call runbooks need to cover incidents that could originate in either cloud, which means your on-call engineers need enough familiarity with both to actually debug something at 2am rather than stare at an unfamiliar console. Your autoscaler configurations need independent tuning per platform - Karpenter on EKS, GKE's cluster autoscaler, and AKS cluster autoscaler all respond differently under load.
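Version drift between clusters is one of the easier things to check automatically. Here is a minimal sketch, assuming you already collect each cluster's server version as a string; the cluster names and version strings are made up, and the parser only handles the common "v1.31.1-suffix" and "1.29" shapes.

```python
def minor_version(version: str) -> int:
    """Extract the minor version from strings like 'v1.31.1-gke.1000' or '1.29'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def minor_skew(cluster_versions: dict[str, str]) -> int:
    """Largest gap in minor versions across clusters (assumes the same major version)."""
    minors = [minor_version(v) for v in cluster_versions.values()]
    return max(minors) - min(minors)


# Hypothetical fleet: one cluster per provider.
fleet = {
    "eks-prod": "1.29",
    "gke-prod": "v1.31.1-gke.1000",
    "aks-prod": "1.30.3",
}
print(minor_skew(fleet))  # 2
```

A check like this could gate a pipeline or feed a dashboard. It tells you the clusters have drifted; it does not do the three different upgrades for you.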

None of this is insurmountable. All of it is real, ongoing work that scales with the number of clouds you're running - and none of it shows up in the initial cost estimate.


When It Actually Makes Sense

To be clear: multi-cloud Kubernetes is the right call in specific situations.

Regulatory and data sovereignty requirements. Some industries and jurisdictions require data to reside in regions where only certain providers have compliant infrastructure. When your compliance framework mandates provider diversity, multi-cloud isn't a strategic choice - it's a requirement.

Post-merger infrastructure. Acquisitions regularly produce organizations with established, entrenched infrastructure on two different clouds. The migration cost may genuinely exceed the cost of operating both indefinitely - especially if the acquired entity has deep cloud-native dependencies that can't be lifted cleanly.

Genuine DR with tested failover. If your SLA requires sub-hour recovery and multi-region within one cloud is genuinely insufficient, multi-cloud active-passive DR is defensible. The key word is tested - a DR cluster that has never been failed over under realistic conditions is a hypothesis, not a capability. Run the drill.

Specialized compute with no equivalent elsewhere. Specific GPU or accelerator types, TPU access, managed AI services with a real capability gap compared to your primary provider. A specific workload, a specific technical reason - not a general philosophy.


The Organizational Cost Nobody Puts in the Deck

Every multi-cloud architecture diagram shows the topology. Almost none include the staffing model.

Running multi-cloud Kubernetes well requires engineers who understand both clouds deeply enough to diagnose incidents on either, at whatever hour those incidents prefer to arrive. It requires runbooks and on-call procedures for both platforms. It requires testing in both environments - which means CI/CD complexity and infrastructure spend for non-production clusters on both sides. It requires a framework for deciding where new workloads land, and that decision adds friction to every new service deployment.

For a platform team of five, this is a serious, sustained commitment. For a team of two, it's probably not achievable at the quality level you'd actually want when something catches fire at 5pm on the Friday before a long weekend - which, in defiance of all probability, is exactly when it will happen.

The cloud providers will happily help you get started. They will absolutely not staff your operations team.


A Practical Framework Before You Commit

Five questions worth answering honestly before the decision is made:

  1. What specific failure scenario are we actually defending against? Multi-region within one provider handles most real ones. Name the scenario that genuinely requires a second cloud.

  2. What is the egress cost model? Estimate actual data transfer between clouds based on real workload patterns. Put a monthly dollar figure on it before the architecture review, not after.

  3. Which workloads are actually portable? Audit your stateful dependencies. If your database isn't multi-cloud, your recovery story has a gap in it.

  4. Do we have the team to operate this long-term? Not the team to build it - the team to run it 18 months from now when the engineers who designed it have moved on to other things.

  5. Have we done multi-region first? Multi-cloud without solid multi-region is backwards. If a single regional outage breaks you today, the multi-cloud resilience story is premature.

If the honest answers are: a specific regulatory requirement, we've modeled the costs and they work, we've audited our state, yes, and yes - multi-cloud Kubernetes is probably the right call. Most teams can't answer all five that way, and knowing that before you build it is genuinely useful.


The Bottom Line

Multi-cloud Kubernetes isn't an architecture you adopt - it's a capability you invest in continuously. The organizations that do it well treat it as a first-class operational commitment with dedicated team capacity, tested runbooks, and clear ownership. The ones that do it poorly end up with the worst of both worlds: the complexity of two clouds, with the operational maturity of neither.

Running Kubernetes well on one cloud, across multiple regions, with solid GitOps practices and a team that genuinely knows the environment - that's already a high bar worth being proud of. It's also the right bar for most organizations. Get there before expanding the surface area.

Multi-cloud is not the next level of Kubernetes maturity. It's an orthogonal decision with its own cost structure, team requirements, and risk profile. Treat it like one - and if someone brings it up in your next architecture review without clear answers to those five questions, you now have something to hand them.