Spot instance economics: when interruption is fine

Spot at 60-90% off On-Demand looks irresistible. Interruptions aren't random: they track how much spare capacity each pool has, so if you diversify across instance types the math works for far more workloads than people think.

cloudprice editorial

Spot Instances (AWS), Spot VMs (Azure), and Preemptible / Spot VMs (GCP) all do the same thing: sell you unused capacity at a discount of roughly 60-90%, with the catch that the provider can take it back with little warning. AWS gives you a 2-minute notice; GCP and Azure give you 30 seconds.

The instinctive reaction is "too risky for production". The actual data — once you understand how interruption works — is more nuanced.

The headline numbers

A spot snapshot from us-east-1 in 2026:

Instance                On-Demand     Spot                 Discount
m6i.large               $0.096/hr     $0.029-0.041/hr      ~60-70%
c6i.xlarge              $0.17/hr      $0.051-0.077/hr      ~55-70%
m7g.xlarge              $0.1632/hr    $0.049-0.082/hr      ~50-70%
g5.xlarge (GPU)         $1.006/hr     $0.30-0.50/hr        ~50-70%
r6i.4xlarge (memory)    $1.008/hr     $0.28-0.45/hr        ~55-72%

The discount is biggest on less popular instance generations and on the largest sizes. m4.16xlarge spot is often 85-90% off. m6i.large spot is more like 60% off because everyone wants the small modern instance.
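The discount column is just arithmetic on the two price columns. A minimal sketch that recomputes the bands from the table's own numbers (the dictionary layout and the function name are illustrative, not part of any pricing API):

```python
# On-Demand price and Spot price band (USD/hr), copied from the table above.
PRICES = {
    "m6i.large": (0.096, 0.029, 0.041),
    "c6i.xlarge": (0.17, 0.051, 0.077),
    "m7g.xlarge": (0.1632, 0.049, 0.082),
    "g5.xlarge": (1.006, 0.30, 0.50),
    "r6i.4xlarge": (1.008, 0.28, 0.45),
}


def discount_band(on_demand, spot_low, spot_high):
    """Return (smallest, largest) discount as fractions of the On-Demand price."""
    # The highest Spot price gives the smallest discount, and vice versa.
    return (1 - spot_high / on_demand, 1 - spot_low / on_demand)


for name, (on_demand, spot_low, spot_high) in PRICES.items():
    smallest, largest = discount_band(on_demand, spot_low, spot_high)
    print(f"{name:12s} {smallest:.0%} to {largest:.0%} off On-Demand")
```

Running the same calculation against your own region's prices is a quick way to check whether a quoted "up to 90% off" actually applies to the instance types you use.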

Interruption rates — actually measured

AWS publishes the Spot Instance Advisor, which buckets every (instance type, region) pair by how often it was interrupted over the trailing month: <5%, 5-10%, 10-15%, 15-20%, or >20%.

From recent snapshots:

  • Modern, popular general-purpose families in major regions: m6i.large in us-east-1 is <5% interruption rate. c6i.xlarge is the same. m7g.large <5%.
  • Older generations: m4.large in us-east-1 often <5%, but in smaller regions can hit 10-15%.
  • GPU instances: Highly variable. g5.xlarge sometimes <5%, sometimes 15-20%, depending on overall demand at that moment.
  • Newest generations on launch: Often 20%+ for the first few months, because there isn't much spare capacity yet.
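To get a feel for what a monthly bucket means for a single job, here is a rough sketch that converts a monthly interruption rate into the chance that one instance is reclaimed during a job of a given length. It assumes a constant hazard rate (interruptions spread evenly through the month) and a 730-hour month. Both are simplifying assumptions of this sketch, not something the Spot Instance Advisor states, and real interruptions tend to cluster around capacity crunches.

```python
import math

HOURS_PER_MONTH = 730  # average month length; an assumption of this sketch


def hourly_hazard(monthly_rate):
    """Convert a monthly interruption probability into a constant hourly hazard."""
    return -math.log(1.0 - monthly_rate) / HOURS_PER_MONTH


def p_interrupted(monthly_rate, job_hours):
    """Probability that one instance is reclaimed at least once during the job."""
    return 1.0 - math.exp(-hourly_hazard(monthly_rate) * job_hours)


for rate in (0.05, 0.10, 0.20):
    print(f"{rate:.0%} monthly -> {p_interrupted(rate, 6):.2%} for a 6-hour job")
```

Under these assumptions even the 20% bucket works out to well under 1% per 6-hour job on a single instance. The risk adds up across a large fleet, which is why the retry and checkpointing questions still matter.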

The diversification strategy

A single instance type in a single AZ is a single Spot pool, and that is a bad idea. Mix 5-10 instance types across 2-3 AZs, use the EC2 Auto Scaling group "capacity-optimized" allocation strategy, and the effective interruption rate drops sharply. The ASG launches from whichever pools have the most spare capacity at that moment.

A worked example: an ASG running stateless workers, mixed across m6i.large, m6a.large, m6i.xlarge (with weights so a larger instance handles 2 jobs), and m7g.large (for ARM-compatible builds), across 3 AZs. Effective interruption rate: well under 1%. Effective discount versus on-demand: about 65%.
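One way the worked example could be expressed is as the MixedInstancesPolicy argument that boto3's `create_auto_scaling_group` accepts. This is a sketch, not a tested deployment: the launch template names are hypothetical, and it assumes a separate ARM launch template because m7g (Graviton) needs a different AMI from the x86 types. The three AZs would come from the subnets passed as VPCZoneIdentifier, which is not shown here.

```python
def mixed_instances_policy(x86_template, arm_template):
    """Build a MixedInstancesPolicy dict for the worked example (all Spot)."""
    overrides = [
        {"InstanceType": "m6i.large", "WeightedCapacity": "1"},
        {"InstanceType": "m6a.large", "WeightedCapacity": "1"},
        # The xlarge counts as 2 capacity units because it handles 2 jobs.
        {"InstanceType": "m6i.xlarge", "WeightedCapacity": "2"},
        {
            "InstanceType": "m7g.large",
            "WeightedCapacity": "1",
            # Per-override launch template so the ARM type gets an ARM AMI.
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": arm_template,
                "Version": "$Latest",
            },
        },
    ]
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": x86_template,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,  # 0 means everything on Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Hypothetical template names; pass the result as MixedInstancesPolicy=...
policy = mixed_instances_policy("workers-x86", "workers-arm64")
print([o["InstanceType"] for o in policy["LaunchTemplate"]["Overrides"]])
```

Building the policy as plain data keeps it easy to review and diff before anything is sent to AWS.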

Workloads where Spot is the right answer

  • Stateless web workers / API workers behind a load balancer. Worker disappears, load balancer routes around it, ASG replaces it.
  • CI/CD runners. A self-hosted GitHub Actions or GitLab Runner on Spot is the cheapest possible build farm. If a job fails due to interruption, just retry. Most CI systems handle this natively.
  • Batch processing. ETL jobs, video encoding, ML training (with checkpointing). Spot was literally designed for this. Some Karpenter / Kubernetes setups run 80-90% of their capacity on Spot while the application sees an effective interruption rate below 1%.
  • Big-data / Spark / EMR. EMR has first-class Spot support including capacity-optimized allocation and "task" nodes that aren't on the critical path.
  • Dev / staging environments. Even databases — if the dev DB dies, restart from a backup. It's dev.

Workloads where Spot is wrong

  • Stateful primary databases. An RDS-style primary failover is operationally painful and customer-visible. Run primaries on On-Demand or RIs.
  • Message brokers with persistent state. Running Kafka brokers on Spot is asking for split-brain.
  • Long-running compute that doesn't checkpoint. A 6-hour training job that can't resume from a checkpoint is going to lose a lot of work to a single interruption.
  • Real-time / WebSocket servers with sticky sessions. Customer-visible disconnects on every Spot reclaim.
  • Anything that bakes IP allocation into application logic. Spot replacements get new IPs.
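The difference between a batch job that suits Spot and one that doesn't is mostly whether it can resume. A minimal checkpoint-and-resume sketch, assuming the work is an ordered list of items and the checkpoint file lives on storage that survives the instance (a network volume or a synced object store, for example). The file layout and function names are illustrative only.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a reclaim mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def run_job(items, process, path, checkpoint_every=100):
    """Process items in order, resuming from the last checkpoint if one exists."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(items))
```

After an interruption, the replacement instance calls `run_job` again with the same path and picks up from the last saved index. Items processed after the last checkpoint are processed again, so `process` needs to be safe to repeat.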

Karpenter is the right tool

For Kubernetes, Karpenter is the answer. It can mix Spot and On-Demand, respect PodDisruptionBudgets, drain nodes when AWS sends the interruption notice, and pick instance types from a large pool. Production EKS clusters using Karpenter commonly run 60-80% Spot with little operational pain.
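Karpenter handles the interruption notice for you. For workers outside Kubernetes, the same signal is available on the instance itself: to my understanding, the instance metadata path `/latest/meta-data/spot/instance-action` returns 404 until a reclaim is scheduled, then a small JSON document with an `action` and a UTC `time`. The sketch below only parses that document and works out how long is left. Fetching it (including the IMDSv2 token step) is left out, and the document shape should be checked against the current AWS documentation before relying on it.

```python
import json
from datetime import datetime, timezone


def seconds_until_reclaim(instance_action_body, now=None):
    """Return seconds left before reclaim, or None if no interruption is scheduled.

    instance_action_body is the response body from the spot/instance-action
    metadata path; pass an empty string or None when the request returned 404.
    """
    if not instance_action_body:
        return None
    doc = json.loads(instance_action_body)
    # Assumed timestamp format: ISO 8601 in UTC with a trailing "Z".
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


# Example with a hand-written body in the assumed shape.
body = '{"action": "terminate", "time": "2026-01-01T00:02:00Z"}'
print(seconds_until_reclaim(body, now=datetime(2026, 1, 1, tzinfo=timezone.utc)))
```

A worker would poll every few seconds and, once this returns a number, stop taking new work, flush a checkpoint, and deregister from the load balancer within the time left.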

What about GCP and Azure?

GCP Spot VMs are discounted 60-91% off the standard price, depending on machine type. Preemption can happen at any time, and the 30-second notice is much shorter than AWS's 2 minutes, so design accordingly.

Azure Spot VMs have similar discounts. Eviction can be triggered by capacity only, or by price or capacity; you pick the policy when you create the VM. Notice is also 30 seconds.

The hidden risk: capacity events

Once or twice a year, AWS has a regional capacity event (a huge AI launch, recovery from a big regional outage) and Spot prices spike to On-Demand levels or close to them. ASGs that use the "lowest-price" allocation strategy silently fall over. The fix is to set a max-price ceiling and to keep an On-Demand backstop in the same ASG.
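In an ASG, the ceiling and the backstop both live in the InstancesDistribution part of the mixed instances policy. The values below are hypothetical, and the helper shows roughly how a desired capacity splits between On-Demand and Spot. It assumes the On-Demand share above the base is rounded up, which is an assumption of this sketch and not confirmed ASG behaviour.

```python
import math

# Hypothetical values: 2 On-Demand units always, then 25% On-Demand above that,
# and a Spot price ceiling in USD/hr (the API takes it as a string).
BACKSTOP_DISTRIBUTION = {
    "OnDemandBaseCapacity": 2,
    "OnDemandPercentageAboveBaseCapacity": 25,
    "SpotAllocationStrategy": "capacity-optimized",
    "SpotMaxPrice": "0.06",
}


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand_units, spot_units) for a desired capacity."""
    base = min(desired, on_demand_base)
    above_base = max(0, desired - on_demand_base)
    # Rounding up is an assumption; the real ASG rounding may differ.
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


print(split_capacity(
    22,
    BACKSTOP_DISTRIBUTION["OnDemandBaseCapacity"],
    BACKSTOP_DISTRIBUTION["OnDemandPercentageAboveBaseCapacity"],
))
```

With these numbers, a desired capacity of 22 keeps 7 units on On-Demand, which is the floor the service can fall back to if every Spot pool dries up at once.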

Effective savings, honest

A well-diversified Spot ASG covering 60-80% of a stateless compute footprint, with the rest on Compute Savings Plans, delivers an effective compute discount of 50-65% versus pure On-Demand. That's about as good as any commitment strategy gets.
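The blended figure is a weighted average. A quick sketch, taking the 65% Spot discount from the worked example earlier and assuming a 30% Savings Plan discount on the remainder. The 30% is an illustrative assumption, not a quoted rate, and the shares are weighted by On-Demand-equivalent cost.

```python
def blended_discount(spot_share, spot_discount, committed_discount):
    """Weighted-average discount when the non-Spot share is on a commitment plan."""
    return spot_share * spot_discount + (1 - spot_share) * committed_discount


for share in (0.6, 0.8):
    total = blended_discount(share, spot_discount=0.65, committed_discount=0.30)
    print(f"{share:.0%} on Spot -> {total:.0%} effective discount")
```

Under these assumptions, 60-80% Spot coverage lands at roughly 51-58%, inside the 50-65% range above. Getting to the top of that range takes a deeper Spot discount or a better commitment rate.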

For instance-level pricing including Spot bands, see the cloudprice catalogue — every AWS row notes the spot range. Cross-check against AWS vs GCP if you're choosing between hyperscaler spot offerings.
