Spot instance economics: when interruption is fine
Spot at 70-90% off On-Demand looks irresistible. The interruption rate isn't random — diversify across instance types and the math works for far more workloads than people think.
Spot Instances (AWS), Spot VMs (Azure), and Preemptible / Spot VMs (GCP) all do the same thing: sell you unused capacity at a 60-90% discount, with the catch that the provider can take it back with little warning. AWS gives you 2 minutes of notice; GCP and Azure each give you 30 seconds.
The instinctive reaction is "too risky for production". The actual data — once you understand how interruption works — is more nuanced.
The headline numbers
A spot snapshot from us-east-1 in 2026:
| Instance | On-demand | Spot | Discount |
|---|---|---|---|
| m6i.large | $0.096/hr | $0.029-0.041/hr | ~60-70% |
| c6i.xlarge | $0.17/hr | $0.051-0.077/hr | ~55-70% |
| m7g.xlarge | $0.1632/hr | $0.049-0.082/hr | ~50-70% |
| g5.xlarge (GPU) | $1.006/hr | $0.30-0.50/hr | ~50-70% |
| r6i.4xlarge (memory) | $1.008/hr | $0.28-0.45/hr | ~55-72% |
The discount is biggest on less-popular instance generations and on the largest sizes. m4.16xlarge spot is often 85-90% off. m6i.large spot is more like 60% off because everyone wants the small modern instance.
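The discount column can be rechecked from the two price columns. A minimal sketch, using only the figures in the table above; the function names are illustrative, not part of any pricing API:

```python
# On-Demand price and (low, high) Spot band in $/hr, copied from the table above.
PRICES = {
    "m6i.large": (0.096, (0.029, 0.041)),
    "c6i.xlarge": (0.17, (0.051, 0.077)),
    "m7g.xlarge": (0.1632, (0.049, 0.082)),
    "g5.xlarge": (1.006, (0.30, 0.50)),
    "r6i.4xlarge": (1.008, (0.28, 0.45)),
}


def discount_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by paying `spot` instead of `on_demand`."""
    return (1 - spot / on_demand) * 100


def discount_band(instance: str) -> tuple:
    """(smallest, largest) discount across the Spot band for one instance."""
    on_demand, (spot_low, spot_high) = PRICES[instance]
    # The highest Spot price gives the smallest discount, and vice versa.
    return discount_pct(on_demand, spot_high), discount_pct(on_demand, spot_low)


for name in PRICES:
    low, high = discount_band(name)
    print(f"{name:12s} {low:.0f}% to {high:.0f}% off")
```

The same two functions work on any price snapshot; only the `PRICES` dictionary needs replacing.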
Interruption rates — actually measured
AWS publishes the Spot Instance Advisor, which buckets every (instance, region) pair into 0-5%, 5-10%, 10-15%, 15-20%, >20% monthly interruption rates.
From recent snapshots:
- Modern, popular general-purpose families in major regions: m6i.large in us-east-1 is <5% interruption rate. c6i.xlarge is the same. m7g.large <5%.
- Older generations: m4.large in us-east-1 often <5%, but in smaller regions can hit 10-15%.
- GPU instances: Highly variable. g5.xlarge sometimes <5%, sometimes 15-20%, depending on overall demand at that moment.
- Newest generations on launch: Often 20%+ for the first few months, because not much spare capacity yet.
The diversification strategy
A single instance type in a single AZ means a single spot pool, which is a bad idea. Mix across 5-10 instance types in 2-3 AZs, use the EC2 Auto Scaling group "capacity-optimized" allocation strategy, and the effective interruption rate drops sharply. With that strategy the ASG launches from the pools with the most spare capacity at that moment.
A worked example: an ASG running stateless workers, mixed across m6i.large, m6a.large, m6i.xlarge (with weights so a larger instance handles 2 jobs), and m7g.large (for ARM-compatible builds), across 3 AZs. Effective interruption rate: well under 1%. Effective discount versus on-demand: about 65%.
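The worked example maps fairly directly onto the `MixedInstancesPolicy` argument of the EC2 Auto Scaling `CreateAutoScalingGroup` API. The sketch below only builds that argument as a plain dictionary. The launch template names are hypothetical, and the separate ARM template is an assumption: Graviton instances need an arm64 AMI, so m7g.large gets its own launch template override here.

```python
def mixed_instances_policy(x86_template: str, arm_template: str) -> dict:
    """Build a MixedInstancesPolicy dict for the worked example above."""
    overrides = [
        {"InstanceType": "m6i.large", "WeightedCapacity": "1"},
        {"InstanceType": "m6a.large", "WeightedCapacity": "1"},
        # Weight 2: one xlarge counts as two units of capacity (two jobs).
        {"InstanceType": "m6i.xlarge", "WeightedCapacity": "2"},
        # ARM instances need an arm64 AMI, hence a separate launch template.
        {
            "InstanceType": "m7g.large",
            "WeightedCapacity": "1",
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": arm_template,
                "Version": "$Latest",
            },
        },
    ]
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": x86_template,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        },
        "InstancesDistribution": {
            # 0 means everything above the On-Demand base runs on Spot.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = mixed_instances_policy("workers-x86", "workers-arm")
```

With boto3, this dictionary would be passed as `MixedInstancesPolicy=policy` to the Auto Scaling client's `create_auto_scaling_group`, together with a `VPCZoneIdentifier` listing subnets in the three AZs. Once weights are set, the group's desired capacity is counted in weight units rather than instances.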
Workloads where Spot is the right answer
- Stateless web workers / API workers behind a load balancer. Worker disappears, load balancer routes around it, ASG replaces it.
- CI/CD runners. A self-hosted GitHub Actions or GitLab Runner on Spot is the cheapest possible build farm. If a job fails due to interruption, just retry. Most CI systems handle this natively.
- Batch processing. ETL jobs, video encoding, ML training (with checkpointing). Spot was literally designed for this. Some Karpenter / Kubernetes setups achieve 80-90% Spot usage with an application-visible interruption rate under 1%.
- Big-data / Spark / EMR. EMR has first-class Spot support including capacity-optimized allocation and "task" nodes that aren't on the critical path.
- Dev / staging environments. Even databases — if the dev DB dies, restart from a backup. It's dev.
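All of these workloads behave better if the instance notices the 2-minute warning and stops taking new work. On AWS the notice appears in the instance metadata service at `spot/instance-action`, which returns 404 until an interruption is scheduled. A minimal sketch using IMDSv2 and only the standard library; the function names and the injectable `fetch` parameter are illustrative choices, not an AWS SDK interface:

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw instance-action document, or None if no notice is pending."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is scheduled
            return None
        raise


def interruption_pending(
    fetch: Callable[[], Optional[str]] = fetch_instance_action,
) -> bool:
    """True once the provider has scheduled a stop, terminate, or hibernate."""
    body = fetch()
    if body is None:
        return False
    return json.loads(body).get("action") in {"terminate", "stop", "hibernate"}
```

A worker would call `interruption_pending()` every few seconds and, once it returns `True`, deregister from the load balancer or finish its current job. Passing `fetch` as a parameter keeps the check testable off EC2.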
Workloads where Spot is wrong
- Stateful primary databases. An RDS-style primary failover is operationally painful and customer-visible. Run primaries on On-Demand or RIs.
- Message brokers with persistent state. Running Kafka brokers on Spot is asking for split-brain.
- Long-running compute that doesn't checkpoint. A 6-hour training job that can't resume from a checkpoint is going to lose a lot of work to a single interruption.
- Real-time / WebSocket servers with sticky sessions. Customer-visible disconnects on every Spot reclaim.
- Anything that bakes IP allocation into application logic. Spot replacements get new IPs.
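The long-running compute case in the list above stops being a problem once the job can resume. A minimal checkpoint-and-resume sketch, assuming the checkpoint path sits on storage that outlives the instance (a network filesystem, or a file synced to object storage); `run` and its parameters are hypothetical names, and the loop body stands in for real work:

```python
import json
import os
import tempfile
from typing import Callable


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename within the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh one if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0}


def run(
    path: str,
    total_steps: int,
    checkpoint_every: int = 10,
    should_stop: Callable[[], bool] = lambda: False,
) -> int:
    """Resume from the last checkpoint and return the step reached."""
    step = load_checkpoint(path)["next_step"]
    while step < total_steps:
        if should_stop():
            # Interruption notice received: persist progress and exit.
            save_checkpoint(path, {"next_step": step})
            break
        # ... one unit of real work would go here ...
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, {"next_step": step})
    return step
```

With periodic checkpoints, a hard kill costs at most `checkpoint_every` steps; with a notice-driven `should_stop`, it costs none.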
Karpenter is the right tool
For Kubernetes, Karpenter is the answer. It can mix Spot and On-Demand, respect pod-disruption budgets, drain nodes when AWS sends the interruption notice, and pick instance types from a large pool. Most production EKS clusters using Karpenter run 60-80% Spot with no operational pain.
What about GCP and Azure?
GCP Spot VMs are discounted 60-91%, depending on machine type. Preemption can happen at any time, and the 30-second notice is much shorter than AWS's 2 minutes, so design accordingly.
Azure Spot VMs have similar discounts. Eviction can be based on price or capacity; you pick the policy. Notice is also 30 seconds.
The hidden risk: capacity events
Once or twice a year, AWS has a regional capacity event (huge AI launch, big regional outage recovery) and Spot prices spike to On-Demand or near it. ASGs configured with the "lowest-price" allocation strategy silently fall over. The fix is to set a max-price ceiling and to have an On-Demand backstop in the same ASG.
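In EC2 Auto Scaling terms, both parts of that fix live in the `InstancesDistribution` block of a mixed instances policy. A hedged sketch that builds just that block; the helper name and the example numbers are assumptions, not recommendations:

```python
from typing import Optional


def instances_distribution(
    on_demand_base: int,
    on_demand_pct_above_base: int,
    spot_max_price: Optional[float] = None,
) -> dict:
    """Build an InstancesDistribution dict with an On-Demand floor."""
    dist = {
        # Capacity units that are always On-Demand, whatever Spot is doing.
        "OnDemandBaseCapacity": on_demand_base,
        # Share of capacity above the base that is also On-Demand.
        "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
        "SpotAllocationStrategy": "capacity-optimized",
    }
    if spot_max_price is not None:
        # The API takes the ceiling as a string, in dollars per hour.
        dist["SpotMaxPrice"] = f"{spot_max_price:.4f}"
    return dist


backstop = instances_distribution(on_demand_base=2, on_demand_pct_above_base=20,
                                  spot_max_price=0.05)
```

Note that this keeps a fixed On-Demand floor; it does not by itself move Spot capacity over to On-Demand when Spot pools run dry, so the floor should be sized to carry the minimum acceptable load.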
Effective savings, honest
A well-diversified Spot ASG covering 60-80% of a stateless compute footprint, with the rest on Compute Savings Plans, delivers an effective compute discount of 50-65% versus pure On-Demand. That's about as good as any commitment strategy gets.
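The blended figure is a weighted average of the two discounts. A small sketch of the arithmetic: the 65% Spot discount comes from the worked example earlier, while the 30% Savings Plan discount is an assumed placeholder to be replaced with your own rate.

```python
def blended_discount(spot_share: float, spot_discount: float,
                     committed_discount: float) -> float:
    """Effective discount versus pure On-Demand for a Spot + commitment mix."""
    return spot_share * spot_discount + (1 - spot_share) * committed_discount


SPOT_DISCOUNT = 0.65       # from the diversified ASG worked example
COMMITTED_DISCOUNT = 0.30  # assumption: illustrative Savings Plan rate

for share in (0.6, 0.8):
    total = blended_discount(share, SPOT_DISCOUNT, COMMITTED_DISCOUNT)
    print(f"{share:.0%} Spot -> {total:.0%} effective discount")
```

Under those assumptions the result is roughly 51% at 60% Spot coverage and 58% at 80%, which sits inside the 50-65% range quoted above.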
For instance-level pricing including Spot bands, see the cloudprice catalogue — every AWS row notes the spot range. Cross-check against AWS vs GCP if you're choosing between hyperscaler spot offerings.