BearPlex Arsenal · deep research

Cloud Cost Optimisation Playbook.

Fifty checks across the six levers that actually move five-to-seven-figure cloud bills: commitments, rightsizing, spot, storage tiering, egress, and the FinOps practice that keeps the savings from eroding.

50 checks · 6 levers · 30 primary sources

Self-reported cloud waste hit 29 percent of spend in 2026, the first rise in five years, and it is not because teams lack dashboards. Waste regenerates wherever rate levers are mismanaged, workloads are overprovisioned, and nobody owns the bill. The median effective savings rate on AWS compute sits at 15 percent across roughly $3B of analyzed spend, against published commitment discounts of up to 72 percent.

This playbook compresses the six levers that actually move five-to-seven-figure bills into 50 concrete checks, each paired with the published evidence for why it matters.

Savings are not taken once; they are kept. The last lever is the operating practice that stops waste from regenerating.

Pillar 01

Commitments and rate optimization.

Reserved instances and savings plans are the cheapest savings available because they require no engineering work. Yet against published AWS discounts of up to 72 percent, the median organization realizes only a 15 percent effective savings rate on AWS compute, and the worst performers go negative, paying more than on-demand (ProsperOps, 2025). The trap is managing to coverage and utilization, which are input metrics that can both read 100 percent while you lose money. Manage to effective savings rate, sequence commitments after rightsizing, and commit to the floor, never the average.
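
A minimal sketch of the metric, assuming the common definition of effective savings rate as one minus actual spend divided by the on-demand equivalent of the same usage. The function name and the dollar figures are hypothetical and only illustrate how the number can go negative.

```python
def effective_savings_rate(on_demand_equivalent: float, actual_spend: float) -> float:
    """ESR = 1 - actual spend / on-demand-equivalent spend.

    actual_spend is assumed to include amortized upfront fees and the cost
    of unused commitments. A negative result means paying more than on-demand.
    """
    if on_demand_equivalent <= 0:
        raise ValueError("on_demand_equivalent must be positive")
    return 1.0 - actual_spend / on_demand_equivalent


# Hypothetical month: usage worth $100,000 at on-demand rates.
print(f"{effective_savings_rate(100_000, 85_000):.0%}")   # 15%
print(f"{effective_savings_rate(100_000, 116_000):.0%}")  # -16%
```

Coverage and utilization do not appear in the formula at all, which is why both can read 100 percent while the result is negative.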

  • 01

    Track effective savings rate as your single commitment KPI, not coverage or utilization. ProsperOps documented an environment at 100 percent RI coverage and 100 percent utilization that still cost about 16 percent more than running everything on-demand, so the dashboard metrics most teams celebrate can hide outright losses.

  • 02

    Rightsize the fleet before buying any commitment. A three-year reservation against an oversized fleet locks the inefficiency in for the full term, because nobody downsizes an instance they have already paid for.

  • 03

    Analyze at least 60 days of usage history before any purchase. Commitments bought against a short or unrepresentative window inherit seasonal peaks as permanent fixed cost (OneUptime engineering guidance).

  • 04

    Commit to roughly 80 percent of the observed usage floor, never the average. OneUptime's worked guidance is to find the minimum hourly rate and commit to 80 percent of it, keeping the rest on-demand, because committing to the average converts elastic spend into a fixed liability that bills whether or not the workload exists.

  • 05

    Layer flexible compute savings plans over the stable base before any instance-family commitments. Instance-specific commitments bought first strand you when teams migrate families or regions, while the flexible layer follows the workload.

  • 06

    Buy on a rolling ladder of smaller tranches instead of one annual purchase. A single large tranche freezes your discount posture for one to three years and removes your ability to respond to rightsizing wins or workload churn.

  • 07

    Review the commitment portfolio against actual usage monthly. Datadog found only 29 percent of organizations buy enough discounts to cover even half of eligible spend, and drift in either direction (under-coverage or stranded reservations) burns money silently every hour.

  • 08

    Benchmark your ESR against the published distribution and set a target. The median sits at 15 percent and the 98th percentile at 47 percent (ProsperOps), and the median was 0 percent as recently as 2023, so a team with no target is most likely saving close to nothing.
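
The floor-based sizing in check 04 can be sketched as follows. This is an illustration under stated assumptions: the input is hourly on-demand-equivalent spend, the function name is hypothetical, and the sample history is invented (a real run would use at least the 60 days of history from check 03).

```python
def commitment_from_floor(hourly_spend: list[float], fraction: float = 0.8) -> float:
    """Return an hourly commitment sized at `fraction` of the observed floor."""
    if not hourly_spend:
        raise ValueError("need at least one hourly sample")
    return fraction * min(hourly_spend)


# Hypothetical hourly on-demand-equivalent spend in dollars per hour.
history = [42.0, 40.0, 55.0, 73.0, 61.0, 40.5]

floor_commit = commitment_from_floor(history)
average_spend = sum(history) / len(history)

print(round(floor_commit, 2))   # 32.0
print(round(average_spend, 2))  # 51.92
```

Committing at the average here would bill about $52 per hour even in the hours when usage is only $40; the floor-based figure stays below every observed hour.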

Pillar 02

Rightsizing and workload efficiency.

Utilization data says overprovisioning is getting worse, not better: average Kubernetes CPU utilization fell to 10 percent in Cast AI's 2025 benchmark, and Datadog attributes 83 percent of container spend to idle resources. Engineers overprovision because availability incidents are career events and waste is not, so rightsizing only sticks when recommendations are trustworthy and acting on them is safe. That starts with real telemetry, especially memory.

  • 09

    Install memory telemetry on every instance before acting on any rightsizing recommendation. AWS Compute Optimizer sees no memory data without the CloudWatch agent, and CPU-only downsizing pushes memory-bound services into OOM kills that poison engineering trust in all future recommendations.

  • 10

    Require at least two weeks of full telemetry before any downsize action. Resizing on a partial window misses weekly batch jobs and month-end peaks, and one avoidable incident costs more goodwill than the instance saved.

  • 11

    Compare container resource requests against measured usage every sprint. Datadog found 29 percent of container spend is workload idle, meaning containers reserving CPU and memory they never touch.

  • 12

    Set Kubernetes requests from observed peak percentiles, not developer guesses. Cast AI measured average cluster CPU utilization at 10 percent, meaning the default guess reserves roughly ten times the compute the workload actually uses.

  • 13

    Attack cluster idle by tightening node autoscaling and bin packing. Cluster idle (nodes overprovisioned beyond what workloads request) is the single largest slice of container waste at 54 percent of container spend (Datadog).

  • 14

    Sweep for zombie resources monthly: unattached volumes, idle load balancers, orphaned IPs, stale snapshots. These bill at full rate forever, attached to nothing, and no utilization dashboard ever flags them.

  • 15

    Schedule non-production environments off outside working hours. A dev environment running 168 hours a week to serve 50 hours of work is billed for more than three times the hours anyone uses.

  • 16

    Migrate stateless workloads to ARM instances where supported. AWS prices Graviton instances up to 20 percent below comparable x86 instances, and organizations already on ARM route 18 percent of their EC2 compute spend there (Datadog), a rare same-workload, cheaper-silicon win.

  • 17

    Treat your largest line items as engineering investigations, not procurement problems. Segment's deep dive into a runaway AWS bill yielded about $1M in annual savings, and a major culprit was a single hot DynamoDB partition key created by leftover test code, a bug no discount lever could fix (InformationWeek).
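
One way to turn check 12 into a number is sketched below: take a high percentile of measured usage and add headroom. The nearest-rank percentile, the 95th-percentile default, and the 15 percent headroom are assumptions chosen for illustration, not values from the cited benchmarks, and the usage samples are invented.

```python
def nearest_rank_percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list, pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, integer only
    return ordered[rank - 1]


def suggest_cpu_request(usage_millicores: list[int], pct: int = 95,
                        headroom_pct: int = 15) -> int:
    """Suggest a CPU request: observed percentile plus headroom, rounded up."""
    peak = nearest_rank_percentile(usage_millicores, pct)
    return (peak * (100 + headroom_pct) + 99) // 100


# Hypothetical per-minute CPU usage for one container, in millicores.
usage = [120, 135, 110, 180, 150, 140, 400, 130, 125, 160]
print(suggest_cpu_request(usage))  # 460
```

Against a guessed request of, say, 1000m, the measured figure frees more than half of the reservation while still covering the observed spike.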

Pillar 03

Spot strategy.

Spot pricing offers up to 90 percent off on-demand, and realistic production programs land at 59 to 77 percent savings depending on the mix (Cast AI, 2025). The discipline is in the failure modes: AWS gives two minutes of warning, ignores PodDisruptionBudgets when reclaiming nodes, and guarantees nothing about replacement capacity. Spot is an architecture decision, not a checkbox.

  • 18

    Use the price-capacity-optimized allocation strategy as your default. AWS names it the first preference and default for most spot workloads; in AWS's own comparison the lowest-price strategy hit a 20 percent interruption rate versus 3 percent for price-capacity-optimized.

  • 19

    Diversify each spot fleet across multiple instance types, sizes, and every availability zone. A fleet drawing from one or two pools turns a routine capacity reclaim into an outage instead of a transparent replacement.

  • 20

    Enable capacity rebalancing so replacements launch on rebalance recommendations. Waiting for the two-minute interruption notice means racing the clock, while rebalance signals usually arrive earlier and let replacements warm up first.

  • 21

    Engineer for the two-minute notice with checkpointing, lifecycle hooks, and fast drains. The warning window is two minutes and replacements are not guaranteed: in One2N's October 2024 production incident, replacement capacity that normally arrives within about three minutes failed to show up, turning a routine reclaim into extended downtime.

  • 22

    Keep single-replica and stateful critical services off spot entirely. AWS does not respect PodDisruptionBudgets when reclaiming spot capacity, so the Kubernetes safety rails you rely on simply do not apply (One2N).

  • 23

    Monitor your spot-to-on-demand ratio and alert when fallback drifts. Autoscalers quietly fall back to on-demand when spot pools dry up, and the savings you budgeted for evaporate without anyone noticing.

  • 24

    Check Spot Instance Advisor interruption bands before selecting instance types. AWS publishes per-type reclaim frequency in monthly bands from under 5 percent to over 20 percent, and picking blind means inheriting the worst pools.

  • 25

    Budget spot savings at 59 to 77 percent, not the 90 percent marketing ceiling. Cast AI's benchmark measured 59 percent average savings for mixed clusters and 77 percent for spot-only, and forecasting the ceiling sets your program up to look like a failure.
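
The fallback alert in check 23 can be sketched as a simple ratio test. The target share and tolerance below are hypothetical policy values, and the instance-hour inputs are assumed to come from whatever billing or metrics export you already have.

```python
def spot_share(spot_hours: float, on_demand_hours: float) -> float:
    """Fraction of instance-hours served by spot capacity."""
    total = spot_hours + on_demand_hours
    if total <= 0:
        raise ValueError("no instance-hours recorded")
    return spot_hours / total


def fallback_drifted(spot_hours: float, on_demand_hours: float,
                     target_share: float = 0.70, tolerance: float = 0.10) -> bool:
    """True when the spot share has fallen more than `tolerance` below target."""
    return spot_share(spot_hours, on_demand_hours) < target_share - tolerance


print(fallback_drifted(700, 300))  # False: 70 percent of hours on spot
print(fallback_drifted(450, 550))  # True: fallback has taken over
```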

Pillar 04

Storage tiering and lifecycle.

Storage tiering is real money: S3 Intelligent-Tiering has saved customers more than $6 billion since 2018, and its archive tiers save up to 95 percent for rarely accessed objects (AWS). The fine print flips the math on the wrong bucket, though: a 128 KB auto-tiering minimum, per-object monitoring fees, 90-day minimum duration charges, and 40 KB of metadata overhead per archived object. Tier by access pattern, and read the pricing page before the policy ships.
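
To make the monitoring-fee trade-off concrete, here is a rough sketch that compares the monthly fee with the storage saved by tiering. The fee of $0.0025 per 1,000 objects is the figure quoted in this playbook; the price gap between tiers, the share of objects that go cold, and the bucket itself are assumptions to replace with values from the current pricing page and your own access logs.

```python
MONITORING_FEE_PER_1000 = 0.0025  # USD per 1,000 objects per month (quoted in the text)
KB_PER_GB = 1024 * 1024


def monthly_monitoring_fee(object_count: int) -> float:
    """Monthly Intelligent-Tiering monitoring fee for a bucket."""
    return object_count / 1000 * MONITORING_FEE_PER_1000


def monthly_tiering_savings(object_count: int, avg_object_kb: float,
                            cold_fraction: float, price_delta_per_gb: float) -> float:
    """Monthly storage saved when `cold_fraction` of objects move one tier down."""
    cold_gb = object_count * cold_fraction * avg_object_kb / KB_PER_GB
    return cold_gb * price_delta_per_gb


# Hypothetical bucket: 300 million objects of 150 KB, half of them going cold,
# with an assumed $0.0105 per GB-month gap between the two tiers.
fee = monthly_monitoring_fee(300_000_000)
savings = monthly_tiering_savings(300_000_000, 150, 0.5, 0.0105)
print(round(fee, 2), round(savings, 2))  # 750.0 225.31
```

In this invented case the fee is more than three times the savings, so an explicit lifecycle rule (or no tiering at all) would be the cheaper choice.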

  • 26

    Use explicit lifecycle rules for predictable access patterns and Intelligent-Tiering only for unknown ones. Logs, temp files, and compliance archives have known lifetimes, so paying Intelligent-Tiering monitoring fees to rediscover what you already know is pure overhead.

  • 27

    Check object size distribution before enabling Intelligent-Tiering on any bucket. Objects under 128 KB are never auto-tiered and always bill at frequent-access rates (AWS), so a small-object bucket gets nothing from the storage class.

  • 28

    Price the per-object monitoring fee against projected savings for buckets with millions of objects. At $0.0025 per 1,000 objects monthly, a bucket of hundreds of millions of tiny objects can pay more in monitoring than tiering ever saves.

  • 29

    Verify objects will live past 90 days before archiving to Glacier tiers. Objects archived to Glacier Instant Retrieval and Glacier Flexible Retrieval are charged a 90-day minimum storage duration, so archiving short-lived data means paying for storage you already deleted (AWS S3 pricing).

  • 30

    Account for 40 KB of per-object metadata overhead in any archive migration plan. Each object archived to Glacier Flexible Retrieval or Deep Archive carries 8 KB charged at S3 Standard rates plus 32 KB at Glacier rates, which quietly erases the savings on small objects (AWS S3 pricing).

  • 31

    Set expiration rules for data nobody will ever read again. The cheapest storage class is deletion, and unexpired build artifacts, debug logs, and intermediate datasets compound monthly forever.

  • 32

    Delete unattached volumes and prune snapshot chains on a schedule. Block storage detached from any instance bills at full rate, and snapshot sprawl is the most common storage line item nobody owns.

  • 33

    Review the storage class mix quarterly against actual access logs. Access patterns drift as products age, and AWS prices untouched objects up to 68 percent lower one tier down, so data parked one tier too high is a standing overpayment (AWS Intelligent-Tiering pricing).
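
The archive fine print in checks 29 and 30 reduces to a little arithmetic, sketched below. The 8 KB, 32 KB, and 90-day figures are the ones quoted above; the function names and the example object counts are hypothetical.

```python
KB_PER_GB = 1024 * 1024


def archive_overhead_gb(object_count: int) -> tuple[float, float]:
    """Return (GB billed at S3 Standard rates, GB billed at Glacier rates)."""
    return object_count * 8 / KB_PER_GB, object_count * 32 / KB_PER_GB


def overhead_ratio(avg_object_kb: float) -> float:
    """The 40 KB of metadata overhead as a fraction of the object's own size."""
    return 40 / avg_object_kb


def billable_days(days_stored: int, minimum_days: int = 90) -> int:
    """Days charged for an archived object under a minimum storage duration."""
    return max(days_stored, minimum_days)


standard_gb, glacier_gb = archive_overhead_gb(100_000_000)
print(round(standard_gb, 1), round(glacier_gb, 1))  # 762.9 3051.8
print(overhead_ratio(16))    # 2.5
print(overhead_ratio(4096))  # 0.009765625
print(billable_days(30))     # 90
```

For 16 KB objects the overhead is two and a half times the payload, while for 4 MB objects it is under one percent, which is why the object size distribution decides whether an archive migration pays.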

Pillar 05

Egress and data transfer traps.

Data transfer is where architecture decisions become invoices. Cross-AZ charges hit 98 percent of organizations and make up nearly half of transfer costs (Datadog), and the default routing in a private subnet sends same-region S3 traffic through a billable NAT gateway when a free endpoint exists. The 2024 exit-fee waivers and the EU Data Act killed most of the leaving tax, but day-to-day operational egress is untouched and still the real money.

  • 34

    Add VPC gateway endpoints for S3 and DynamoDB in every VPC. Geocodio paid $907 in one day routing same-region S3 traffic through a NAT gateway at $0.045 per GB, when the gateway endpoint that fixes it costs nothing.

  • 35

    Break NAT gateway data processing out as its own monitored line item. It hides inside the EC2-other bucket on most bills, so a five-figure routing mistake can sit invisible inside an aggregate line nobody questions.

  • 36

    Map cross-AZ traffic between chatty services and enable topology-aware routing. Cross-AZ transfer hits 98 percent of organizations and is largely self-inflicted architecture, which means it is also largely fixable (Datadog).

  • 37

    Offload internet-facing traffic to a CDN before it leaves the cloud at retail rates. Raw internet egress is among the highest-margin items on the bill: Cloudflare estimates AWS marks up US and EU egress bandwidth by roughly 80x its cost, a competitor's estimate but a useful anchor on egress margins.

  • 38

    Enable cost anomaly detection on day one. It is free, and it caught Geocodio's NAT spike within days while teams without it learn about transfer surprises on the monthly invoice.

  • 39

    Price data gravity before committing to any multi-region or multi-cloud design. Replication and inter-region transfer fees can dominate the compute savings the design was supposed to deliver, and they recur forever.

  • 40

    Use the exit-fee waivers as negotiating leverage at renewal. AWS, Google, and Microsoft all waived switching egress in early 2024 (InfoQ), and the EU Data Act prohibits all switching charges including egress from January 12, 2027 (Kemp IT Law), so the lock-in premium baked into your current discount is negotiable.

  • 41

    Colocate chatty service pairs in the same zone where the availability budget allows. Paying per-gigabyte rates for internal microservice chatter is an architecture tax with no resilience benefit when both services share a failure domain anyway.
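
As a quick sanity check on check 34, the arithmetic behind the Geocodio figure can be sketched as follows. The $0.045 per GB rate and the $907 day are from the text above; the 30-day projection is a hypothetical extrapolation, not something Geocodio reported.

```python
NAT_PROCESSING_PER_GB = 0.045  # USD per GB, the rate quoted in the text


def nat_processing_cost(gigabytes: float) -> float:
    """Data processing charge for traffic routed through a NAT gateway."""
    return gigabytes * NAT_PROCESSING_PER_GB


def implied_gigabytes(cost: float) -> float:
    """Traffic volume implied by a given NAT data processing charge."""
    return cost / NAT_PROCESSING_PER_GB


daily_gb = implied_gigabytes(907)
print(round(daily_gb))                             # 20156
print(round(nat_processing_cost(daily_gb) * 30))   # 27210
```

Roughly 20 TB in a day is enough to produce the charge, and left unnoticed for a month the same traffic would cost about $27,000 for a route that a gateway endpoint serves at no charge.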

Pillar 06

FinOps practice that holds the gains.

Tactics decay: a decade of tooling later, self-reported waste still sits at 29 percent and just rose for the first time in five years (Flexera), because savings regenerate as waste unless ownership and cadence change. The evidence points one direction: where engineering has ownership of cloud costs, 81 percent of organizations say spend is about where it should be (CloudZero). Build the smallest practice that fits your bill, and measure value with unit economics rather than absolute spend.
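
A minimal sketch of the unit-economics view, using invented monthly figures: total spend rises while cost per transaction falls. The function name and the data are hypothetical.

```python
def cost_per_unit(total_cost: float, units: int) -> float:
    """Cloud cost divided by a business unit such as transactions or customers."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical months: (label, cloud spend in USD, transactions served).
months = [("Jan", 100_000, 2_000_000), ("Feb", 120_000, 3_000_000)]

for label, spend, transactions in months:
    print(label, spend, round(cost_per_unit(spend, transactions), 4))
# Jan 100000 0.05
# Feb 120000 0.04
```

Spend grew 20 percent, yet each transaction got 20 percent cheaper, which is the distinction an absolute-spend report cannot show.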

  • 42

    Make engineering teams own their cloud costs, with finance as partner rather than police. Only about a third of organizations report an exact understanding of where cloud spend goes, while 81 percent of those with engineering ownership say spend is about where it should be (CloudZero).

  • 43

    Allocate 100 percent of spend to owning teams through tags, accounts, or namespaces. Only one in four organizations achieve full cost allocation (CloudZero), and shared-bucket costs are where waste hides because cutting them is nobody's job.

  • 44

    Report unit economics (cost per transaction, per customer, per request) alongside totals. Absolute spend cannot answer the CFO's real question; rising spend with falling cost per unit is growth, and only the unit number proves it.

  • 45

    Set hard quota caps on serverless and autoscaling services, because budgets are alarms, not brakes. Milkie Way burned $72,000 in a few hours on a $7 budget when Cloud Run defaults fanned out a recursive job, and GCP billing data ran at least a day behind actual usage, so the alert could only arrive after the damage was done.

  • 46

    Run a weekly cost review with same-week anomaly response. A monthly cadence gives every leak up to 30 days of free runway; a weekly review caps the blast radius of any pricing change, misconfiguration, or runaway autoscaler at seven days.

  • 47

    Pull observability and SaaS spend into scope, not just IaaS. Monitoring scales with cardinality rather than value, and one Datadog customer, identified by The Pragmatic Engineer as Coinbase, paid roughly $65M for a single year of Datadog usage.

  • 48

    Adopt the FOCUS billing standard if you run more than one cloud. Normalized billing columns are the prerequisite plumbing for any allocation or unit-cost effort, and AWS, Azure, and Google Cloud all now ship native FOCUS exports (FinOps Foundation).

  • 49

    Size the practice to the bill: a disciplined quarterly cadence can beat a standing FinOps team at lower spend. The FinOps Foundation's own maturity model says the goal should never be to reach the highest maturity in every capability; mature only what returns business value, which for smaller or stable bills can be a lightweight recurring review.

  • 50

    Instrument cost per deploy decision: make engineers see the price of what they ship. 69 percent of IT leaders report cloud budget overruns, and the minority who stay on budget credit accurate forecasting (66 percent) and proactive spend monitoring (61 percent), not after-the-fact cleanup (Gartner Peer Community).
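
Check 43's allocation target can be measured with a few lines, sketched here under assumptions: billing line items are plain dictionaries with a `cost` field and a `tags` mapping, and `team` is the tag key that identifies an owner. Both the record shape and the tag key are hypothetical; adapt them to your billing export.

```python
def allocation_coverage(line_items: list[dict], owner_key: str = "team") -> float:
    """Share of total spend carried by line items that have a non-empty owner tag."""
    total = sum(item["cost"] for item in line_items)
    if total <= 0:
        raise ValueError("no spend to allocate")
    allocated = sum(
        item["cost"]
        for item in line_items
        if item.get("tags", {}).get(owner_key)
    )
    return allocated / total


# Hypothetical line items from a billing export.
items = [
    {"cost": 600.0, "tags": {"team": "payments"}},
    {"cost": 250.0, "tags": {"team": "search"}},
    {"cost": 150.0, "tags": {}},  # unowned: nobody's job to cut
]
print(f"{allocation_coverage(items):.0%}")  # 85%
```

Tracking this percentage in the weekly review gives the 100 percent allocation goal a visible gap to close.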

Sources

Every statistic in this playbook was re-verified against its primary source in June 2026. The receipts ship with the page.

What now

Use it. Then bring us the bill.

If the kit shows red flags you can't fix in a quarter, that's the conversation we're built for. Cost work pays for itself or it should not happen; the playbook shows where to look, and our infrastructure pod can run the levers with you.
