BearPlex Arsenal · deep research

The SaaS Scalability Blueprint.

Forty-nine checks across six disciplines for scaling from 1K to 1M users, distilled from what Notion, Figma, Shopify, Slack, and Amazon's own builders actually did.

49 checks · 6 disciplines · 38 primary sources

Most scaling failures are self-inflicted: teams adopt Google-shaped architecture before exhausting boring levers, then the database, the queue, and the cloud bill break in predictable ways. Across production Kubernetes clusters, average CPU utilization sits at 8 percent before optimization, and Datadog ties 83 percent of container spend to idle resources.

This blueprint distills what Notion, Figma, Shopify, Slack, and Amazon's own builders actually did into 49 checks across six disciplines: database sharding, caching, queues, auto-scaling, multi-tenancy, and cost-aware scaling.

Run it as an audit: every check you fail has a named company that already paid for the lesson.

Pillar 01

Database scaling and sharding.

The database fails first and most expensively, but rarely for the reason teams expect: Notion was forced to shard by VACUUM stalls and transaction ID wraparound risk, not slow queries, while Figma bought years of runway with caching, read replicas, and vertical partitioning before sharding at all. The AKF Scale Cube gives the order of operations: exhaust cloning (x-axis), partition by tenant (z-axis), and treat functional decomposition (y-axis) as the expensive last resort, which is also Martin Fowler's monolith-first observation applied to data. Pick your shard key once and early, because everything downstream inherits it.
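
The wraparound risk described above can be turned into an alert with very little code. The following is a minimal sketch, assuming you already collect `age(datfrozenxid)` per database with the query shown. The function name and both alert thresholds are illustrative assumptions, not values from Notion's account.

```python
# Standard Postgres query: one row per database with its transaction ID age.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database;"

# Transaction IDs are 32-bit, so about 2**31 IDs separate "now" from wraparound.
WRAPAROUND_LIMIT = 2**31
# Postgres default for autovacuum_freeze_max_age: past this age an
# anti-wraparound vacuum is forced.
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000


def classify_xid_age(
    age: int,
    warn_age: int = 2 * AUTOVACUUM_FREEZE_MAX_AGE,  # assumption: vacuum is falling behind
    page_age: int = 1_000_000_000,                  # assumption: roughly half the runway gone
) -> str:
    """Map a database's XID age to an alert level (thresholds are hypothetical)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= page_age:
        return "page"
    if age >= warn_age:
        return "warn"
    return "ok"
```

A database that hovers near 200 million is behaving normally under default settings. One that keeps climbing past that point suggests VACUUM is no longer keeping up, which is the signal check 02 asks you to alert on.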

  • 01

    Exhaust vertical scaling, read replicas, and vertical partitioning before you shard. Figma's database stack has grown almost 100x since 2020, and the team carried that growth on caching, read replicas, and a dozen vertically partitioned databases before horizontally sharding at all, so sharding early burns months you do not need to spend.

  • 02

    Alert on Postgres VACUUM health and transaction ID age as first-class signals. Notion's forcing function was VACUUM stalling consistently and TXID wraparound posing an existential threat, a failure mode that halts all writes and arrives suddenly rather than gradually.

  • 03

    Choose the tenant or workspace ID as your shard key now and include it in every table and composite primary key. Notion sharded by workspace ID because every block belongs to exactly one workspace, and retrofitting a partition key into a live schema is the most expensive part of every documented sharding migration.

  • 04

    Pick a highly divisible logical shard count and reject power-of-two schemes. Notion chose 480 logical shards explicitly because 480 is divisible by a lot of numbers, contrasting it with 512, whose power-of-two factors force you to double hardware at every step; their own advice is to pick values with a lot of factors.

  • 05

    Migrate with double writes, an audited backfill, and dark-read verification before cutover. Notion's re-shard cost users, at worst, about a second of a saving spinner, because dark reads compared the old and new databases before switchover rather than after.

  • 06

    Skip index creation during bulk data copies and rebuild indexes afterward. This one change cut Notion's shard sync time from 3 days to 12 hours, which is the difference between a weekend migration and a week of exposure.

  • 07

    Restore a backup on a schedule and measure the wall-clock time it takes. GitLab lost six hours of production data when its backup and replication mechanisms turned out to be broken, misconfigured, or never enabled. A backup that has never been restored does not exist.

  • 08

    Write your cross-region failover and split-brain policy by hand before automating it. GitHub's 43-second network partition became 24 hours of degraded service when automated failover promoted database servers that were missing a brief period of unreplicated writes.

  • 09

    Use proven sharding middleware where your engine has one, Vitess for MySQL or Citus for Postgres, instead of inventing query routing. Slack runs 2.3 million queries per second at peak with 2ms median and 11ms p99 latency on Vitess, and a hand-rolled router will re-discover the failure modes those projects solved years ago.
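
The shard-key and shard-count advice in checks 03 and 04 can be sketched in a few lines. This is an illustration under stated assumptions, not Notion's implementation: the SHA-256 hashing scheme, the contiguous range assignment, and the function names are all hypothetical. Only the figure of 480 logical shards and the workspace ID key come from the text.

```python
import hashlib

LOGICAL_SHARDS = 480  # the count Notion chose, per the text above


def divisors(n: int) -> list[int]:
    """Every host count that splits n logical shards evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def logical_shard(workspace_id: str, logical_shards: int = LOGICAL_SHARDS) -> int:
    """Stable hash of the shard key to a logical shard number (assumed scheme)."""
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % logical_shards


def physical_host(shard: int, host_count: int, logical_shards: int = LOGICAL_SHARDS) -> int:
    """Assign contiguous ranges of logical shards to hosts; requires an even split."""
    if logical_shards % host_count != 0:
        raise ValueError("host_count must divide the logical shard count")
    return shard // (logical_shards // host_count)
```

Under these assumptions, `divisors(480)` yields 24 possible host counts (32, 40, 48, and 60 among them), while `divisors(512)` yields only the ten powers of two, so every growth step is a doubling. Because the logical shard for a workspace never changes, growing the fleet only changes the second mapping.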

Pillar 02

Caching strategies.

Caching is the cheapest 10x in this playbook and the easiest one to get burned by. The working canon is small: cache-aside with TTLs for almost everything, write-through where read-after-write correctness matters, and stampede protection the moment a cache fronts an expensive query. The real trap is a service that only works cache-warm: that is a hidden hard dependency, which is why the Amazon Builders' Library tells you to run load tests with caches disabled.
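
The default named above, cache-aside with a TTL, looks roughly like this. It is a minimal sketch that uses an in-process dictionary as a stand-in for Redis, Valkey, or memcached. The class name, the injectable clock, and the 10 percent jitter default are assumptions for illustration.

```python
import random
import time
from typing import Any, Callable, Optional


class CacheAside:
    """In-process stand-in for a real cache; illustrative only."""

    def __init__(
        self,
        base_ttl: float,
        jitter: float = 0.1,
        clock: Callable[[], float] = time.monotonic,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.base_ttl = base_ttl
        self.jitter = jitter
        self.clock = clock
        self.rng = rng or random.Random()
        self._store: dict = {}  # key -> (expires_at, value)

    def _ttl(self) -> float:
        # Spread expiry over base_ttl * (1 +/- jitter) so keys written
        # together do not all expire together.
        return self.base_ttl * (1 + self.rng.uniform(-self.jitter, self.jitter))

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]                      # hit
        value = load(key)                        # miss: read the source of truth
        self._store[key] = (now + self._ttl(), value)
        return value
```

The jitter is what check 10 is about: without it, keys populated in the same burst expire in the same instant and the misses arrive at the database together.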

  • 10

    Default to cache-aside with a TTL on every key, and add jitter to expiry times. Synchronized expiry turns one hot key into a thundering herd against the database, the failure mode the Amazon Builders' Library catalogs in detail.

  • 11

    Add request coalescing or leases before any cache fronts an expensive query. Facebook's memcache fleet regulates regeneration to one lease token per key every 10 seconds so a single client rebuilds the value, a mechanism the NSDI paper credits with cutting peak database query rate from 17K to 1.3K per second.

  • 12

    Run load tests with caching disabled or fully cold. The Amazon Builders' Library says to run load tests with caches disabled to validate resilience, and the first cold restart after an outage is the worst possible time to learn how your service behaves cache-cold.

  • 13

    Use soft TTLs so you can serve stale data when the origin is degraded. The Builders' Library pattern pairs a soft TTL with a hard TTL so existing cache data keeps serving while refreshes fail, instead of converting an origin brownout into a full outage as misses pile up.

  • 14

    Cache negative results with their own shorter TTL. Amazon's builders use a negative cache with a different TTL than positive entries, because uncached error and not-found responses let a single bad identifier hammer the database at full request rate.

  • 15

    Combine write-through with lazy loading on paths where users read their own writes. Cache-aside alone serves stale data right after a write, which users perceive as data loss even when it is not.

  • 16

    Move ElastiCache workloads from Redis OSS to Valkey. AWS prices ElastiCache for Valkey 33 percent lower on serverless and 20 percent lower on node-based clusters than other supported engines, with zero-downtime upgrades from Redis OSS, one of the cheapest wins available.

  • 17

    Set p99 latency budgets per endpoint and treat cache hit rate as an input to them. Deloitte measured an 8.4 percent retail conversion lift from a 0.1 second mobile speed improvement, so latency regressions are revenue regressions.
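
The soft and hard TTL pairing from check 13 can be sketched as follows. This is one possible reading of the pattern, not code from the Amazon Builders' Library. The class name and the choice to refresh synchronously on the request path are assumptions.

```python
import time
from typing import Any, Callable


class SoftTTLCache:
    """Serve fresh data when possible and stale data while the origin is failing."""

    def __init__(
        self,
        soft_ttl: float,
        hard_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if soft_ttl > hard_ttl:
            raise ValueError("soft_ttl must not exceed hard_ttl")
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.clock = clock
        self._store: dict = {}  # key -> (stored_at, value)

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.soft_ttl:
            return entry[1]                      # still fresh
        try:
            value = load(key)                    # soft-expired or missing: refresh
        except Exception:
            if entry is not None and now - entry[0] < self.hard_ttl:
                return entry[1]                  # origin degraded: serve stale
            raise                                # past the hard TTL: nothing usable
        self._store[key] = (now, value)
        return value
```

Between the two TTLs a failed refresh costs nothing visible to the caller. Only past the hard TTL does the origin failure surface, which is the window that keeps a brownout from becoming a full outage.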

Pillar 03

Queue architecture.

Production queue design composes two named patterns: queue-based load leveling absorbs the peak, and competing consumers scales the drain side. The contrarian correction matters just as much: a queue absorbs variance, not sustained overload, and if arrival rate exceeds service rate on average, a bigger queue only schedules a bigger crash. Start with Postgres-backed jobs and graduate to Kafka on symptoms, not vibes.
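
The difference between variance and sustained overload is plain arithmetic. The sketch below tracks queue depth tick by tick. The function name and the sample numbers are illustrative assumptions.

```python
def backlog(arrivals: list[int], service_per_tick: int) -> list[int]:
    """Queue depth after each tick, given arrivals per tick and a fixed drain rate."""
    depth = 0
    trace = []
    for arrived in arrivals:
        # Depth grows by what arrives and shrinks by what the consumers finish.
        depth = max(0, depth + arrived - service_per_tick)
        trace.append(depth)
    return trace


# A burst: one tick at 3x capacity, then traffic below capacity.
burst = backlog([300, 50, 50, 50, 50], service_per_tick=100)

# Sustained overload: only 10 percent over capacity, but on every tick.
overload = backlog([110] * 5, service_per_tick=100)
```

Here `burst` drains back to zero (200, 150, 100, 50, 0), while `overload` climbs without bound (10, 20, 30, 40, 50). A larger buffer changes when the second case fails, not whether it does.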

  • 18

    Put a queue between bursty producers and any service that falls over at peak. Queue-based load leveling is the documented difference between a traffic spike and an outage, because the queue absorbs what the service cannot.

  • 19

    Scale consumer count, not queue capacity, when depth grows. Competing consumers is how throughput actually scales, while deeper buffers just add latency and defer a larger failure.

  • 20

    Enforce idempotent consumers and a dead-letter policy before going to production. At-least-once delivery is the default everywhere, and one poison message without a dead-letter path can wedge an entire worker fleet.

  • 21

    Build explicit backpressure and load shedding at the front door instead of buffering everything. As Fred Hebert puts it, queues do not fix overload: slow systems are the canary in the overload coal mine, and a blindly applied buffer just accumulates in-flight data to lose sooner or later.

  • 22

    Alert on queue depth and message age, and feed them to your autoscaler. Backlog is the truest demand signal you have, and CPU on workers lies about how far behind you are.

  • 23

    Start with Postgres-backed jobs and write down the graduation symptoms in advance. Gunnar Morling names the cues: long-running consumer transactions cause MVCC bloat and WAL pile-up, and vacuum failing to keep up with the change rate is the sign Postgres-as-queue is out of runway, so run sustained performance tests rather than brief ones.

  • 24

    Batch producers aggressively before adding brokers. Google Cloud measured that raising Kafka producer batch size from 1KB to 10KB more than tripled throughput and cut latency from 608ms to 171ms, making batch size the highest-leverage tuning knob.

  • 25

    Reject queue-per-tenant and service-per-integration topologies. Segment's 140+ microservices with a queue each destroyed velocity and paged the on-call engineer for routine load spikes until they consolidated back into a single service fed by one system.
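
Check 20 pairs idempotent consumers with a dead-letter policy. A minimal sketch of both follows, using an in-memory `deque` as a stand-in for a real broker. The function name, the in-memory dedupe set, and the three-attempt limit are assumptions for illustration.

```python
from collections import deque
from typing import Any, Callable


def drain(queue: deque, handler: Callable[[Any], None], max_attempts: int = 3):
    """Process (message_id, payload) pairs under at-least-once delivery."""
    processed = set()      # in production: a durable store keyed by message_id
    attempts = {}          # message_id -> failed attempts so far
    dead_letters = []      # poison messages parked for inspection
    while queue:
        message_id, payload = queue.popleft()
        if message_id in processed:
            continue                                   # duplicate delivery: skip
        try:
            handler(payload)
            processed.add(message_id)
        except Exception:
            attempts[message_id] = attempts.get(message_id, 0) + 1
            if attempts[message_id] >= max_attempts:
                dead_letters.append((message_id, payload))   # stop retrying
            else:
                queue.append((message_id, payload))          # redeliver later
    return processed, dead_letters
```

In a real system the handler's side effect and the write to the processed set would need to commit together, or the handler itself would need to be safe to repeat. The sketch only shows the control flow: duplicates are skipped, and a message that keeps failing leaves the main queue instead of blocking it.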

Pillar 04

Auto-scaling and capacity.

Autoscaling is a thermostat (AWS's own analogy for target tracking), not a defense against spikes: every reactive policy lags by minutes. The policy hierarchy is settled: target tracking by default, step scaling only when you need explicit control, predictive scaling for known daily and weekly cycles, and pre-scaling plus load shedding for anything truly sudden. Calibrate your ambitions first: Stack Overflow served roughly 6,000 requests per second from nine on-prem web servers running a tuned monolith, so know what one well-run box can do before architecting for a fleet.
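
The thermostat idea reduces to a proportional rule. The sketch below uses the formula the Kubernetes Horizontal Pod Autoscaler documentation gives, desired = ceil(current × metric / target), as a stand-in. It is not AWS's internal target tracking algorithm, and the function name and bounds are assumptions.

```python
import math


def desired_replicas(
    current: int,
    metric: float,
    target: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
    tolerance: float = 0.1,
) -> int:
    """Proportional scaling: move capacity so the metric returns to its target."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be positive")
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current                       # inside the deadband: do nothing
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With CPU at 90 percent against a 60 percent target, 10 replicas become 15. The same rule works on backlog: 4 workers each facing 1,000 queued messages against a target of 250 per worker become 16. Note what the rule cannot do: it reacts only after the metric has already moved, which is why it lags a sudden spike.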

  • 26

    Instrument latency, traffic, errors, and saturation per service before enabling any autoscaler. Google's four golden signals, with 99th percentile latency over a small window such as one minute as a very early signal of saturation, are the minimum telemetry an autoscaler can safely act on.

  • 27

    Default to target tracking and add predictive scaling only for proven recurring cycles. AWS describes target tracking as a thermostat, positions step scaling as the option when you want greater control, and aims predictive scaling at cyclical traffic whose daily and weekly patterns it mines from your history.

  • 28

    Never key scaling decisions solely on CPU for IO-bound or network-bound services. Slack's January 2021 outage scaled the web tier DOWN during overload because network problems made threads wait and CPU drop, then tried to add 1,200 servers in 14 minutes to compensate.

  • 29

    Load-test the provisioning path itself, including OS limits and cloud quotas. Slack's emergency scale-up hit the Linux open-files limit and exceeded an AWS quota in the middle of the incident it was meant to fix.

  • 30

    Keep monitoring and alerting outside the failure domain they watch. Slack debugged that outage partially blind because its dashboarding service depended on the same overloaded transit gateways as production.

  • 31

    Scale workers on queue backlog with KEDA or your platform's equivalent, with scale-to-zero for bursty jobs. KEDA is a CNCF graduated project with a catalog of 70+ scalers and genuine scale-to-zero, tying worker autoscaling directly to backlog and cost instead of CPU proxies.

  • 32

    Pre-scale for known events and ship load shedding for unknown ones. Reactive policies lag by minutes, so a launch or flash sale can peak before the autoscaler catches up, and at that point shedding is the only graceful failure left.

  • 33

    Fit load-test data to the Universal Scalability Law before buying hardware. Gunther's model decomposes throughput limits into contention and coherency, and a non-zero coherency term means throughput peaks and then goes retrograde as you add capacity, a lesson you want to buy for thousands of dollars instead of millions.
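
The Universal Scalability Law in check 33 is short enough to evaluate directly: X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ is contention and κ is coherency. The sketch below only evaluates the model. Fitting σ and κ to your load-test data is left out, and the coefficient values in the usage note are made up for illustration.

```python
import math


def usl_throughput(n: float, lam: float, sigma: float, kappa: float) -> float:
    """Throughput at n units of capacity under the Universal Scalability Law."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak(sigma: float, kappa: float) -> float:
    """Capacity at which throughput peaks; infinite when there is no coherency cost."""
    if kappa <= 0:
        return math.inf
    return math.sqrt((1 - sigma) / kappa)
```

With hypothetical coefficients σ = 0.05 and κ = 0.0005, the peak lands between 43 and 44 nodes, and doubling from 44 to 88 nodes lowers throughput. With κ = 0 the curve flattens but never turns down, which is why the coherency term is the one to watch.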

Pillar 05

Multi-tenancy and isolation.

The shared vocabulary is the AWS SaaS Lens: silo, pool, and bridge models, chosen per service behind one control plane. Shopify's hardest-won lesson is that sharding alone is not enough, because without full isolation a failure in one shard can spiral into a platform-wide outage, which is what their pods architecture fixed. Noisy neighbors arrive through predictable channels: CPU-heavy tenant queries, memory pressure, IO saturation from heavy writes, lock contention, and connection pool exhaustion.

  • 34

    Put tenant_id on every table and in every query from day one. It is simultaneously your shard key, your cost meter, and your isolation boundary, and retrofitting it is the most expensive migration in SaaS.

  • 35

    Choose silo, pool, or bridge deliberately per service, all behind one shared control plane. The SaaS Lens is explicit that even a silo environment relies on shared identity, onboarding, and operations; without those it is not multi-tenancy, it is N single-tenant deployments you cannot afford to operate.

  • 36

    Emit tenant-dimensioned metrics alongside the golden signals. Service-level telemetry cannot see fairness, so without per-tenant breakdowns you find the noisy neighbor by customer complaint.

  • 37

    Guard the noisy-neighbor channels with query resource limits, throttled bulk writes, and per-tenant connection pooling. Neon's catalog shows how one tenant's complex queries can monopolize CPU, heavy unindexed writes can saturate disk IO, and clients without pooling can exhaust max_connections for everyone.

  • 38

    Fully isolate shards into pods or cells so one unit's failure cannot spread. Shopify found that simply sharding the databases was not enough; each set of shops had to live on a fully isolated set of datastores so a failure could not spiral into a platform outage, with tooling like Pod Mover to relocate pods between data centers.

  • 39

    Adopt cell-based architecture once you can capacity-test one complete cell. AWS's guidance is concrete: with 10 cells, a failure in one leaves 90 percent of requests unaffected, and capacity planning becomes stamping out a known unit instead of guessing.

  • 40

    Keep the cell or pod router brutally thin. AWS calls the cell router the thinnest possible layer, responsible for routing requests to the right cell and only that, because any logic in it folds the whole platform back into a single blast radius.

  • 41

    Enforce per-tenant rate limits and quotas at the API edge. Without explicit quotas, your fairness policy is whichever tenant retries hardest.
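
Per-tenant limits from check 41 are commonly built as one token bucket per tenant. The following is a minimal single-process sketch. The class name, the in-memory dictionary, and the injectable clock are assumptions, and a real API edge would keep this state in a shared store.

```python
import time
from typing import Callable


class TenantRateLimiter:
    """One token bucket per tenant_id, refilled at a steady rate up to a burst cap."""

    def __init__(
        self,
        rate_per_s: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self._buckets: dict = {}  # tenant_id -> (tokens, last_seen)

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last_seen = self._buckets.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last_seen) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant draws from its own bucket, one tenant exhausting its quota has no effect on the next tenant's requests, which is the fairness property the check is after.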

Pillar 06

Cost-aware scaling.

Cost is an architecture property, not a finance afterthought: choices made at 1K users set the gross-margin curve at 1M, and a16z put the stakes at an estimated 100 billion dollars of market value lost across 50 top public software companies due to cloud impact on margins. The data says the first lever is rightsizing, not repatriation, since production clusters average 8 percent CPU utilization and 83 percent of container spend is tied to idle resources. Even Amazon's Prime Video team cut one workload's infrastructure cost 90 percent by collapsing a serverless pipeline back into a monolith, which is the FinOps optimize phase done with an engineering hat on.

  • 42

    Rightsize requests from observed usage before adding any capacity. Datadog finds most container and serverless workloads use less than 25 percent of requested CPU and less than half of requested memory, so buying more of what already sits idle is the default failure.

  • 43

    Track idle spend share as a first-class KPI next to uptime. With 83 percent of container spend tied to idle resources (Datadog), an autoscaler sitting on generous requests just scales the waste.

  • 44

    Meter cost per tenant, even coarsely. 22 percent of survey respondents have no idea what their unit costs are (CloudZero), and per-tenant cost is the only number that reveals which customers are unprofitable.

  • 45

    Buy reserved instances and savings plans only after rightsizing and idle cleanup. The FinOps optimize phase spans usage optimization (rightsizing, idle cleanup) and rate optimization, and a commitment locked onto an oversized fleet locks the waste in for one to three years.

  • 46

    Move steady workloads to Arm and cheaper managed engines. The share of AWS Lambda functions on Arm doubled from 9 to 19 percent in two years (Datadog) and Valkey undercuts other ElastiCache engines by 20 to 33 percent, both near-zero-effort margin wins.

  • 47

    Pay for elasticity only where load is spiky, and reserve or own capacity where it is steady. 37signals projects well over 10 million dollars saved across five years after replacing a 3.2 million dollar annual cloud bill with about 700K dollars of owned hardware, and the principle generalizes even where the exit does not.

  • 48

    Model infrastructure spend as a gross-margin lever in board materials. a16z's analysis shows the cost curve compounds into valuation, so the architecture team is managing a P&L line whether it admits it or not.

  • 49

    Alert on cost anomalies the way you alert on errors. 84 percent of organizations call managing cloud spend their top challenge and budgets already run 17 percent over (Flexera), so a surprise bill is an incident discovered a month late.
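
Checks 42 through 44 come down to three small calculations: idle share, a rightsized request, and a per-tenant cost split. The sketch below shows one way to compute each. The function names, the nearest-rank 95th percentile, the 20 percent headroom, and the usage-proportional allocation are all assumptions, not a method taken from Datadog or the FinOps framework.

```python
import math


def idle_share(requested: float, used: float) -> float:
    """Fraction of a resource request that is paid for but unused."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return max(0.0, requested - used) / requested


def rightsized_request(usage_samples: list[float], percentile: int = 95,
                       headroom: float = 0.2) -> float:
    """Nearest-rank percentile of observed usage, plus headroom."""
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (1 + headroom)


def cost_per_tenant(total_cost: float, usage_by_tenant: dict[str, float]) -> dict[str, float]:
    """Split a shared bill in proportion to each tenant's metered usage."""
    total_usage = sum(usage_by_tenant.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {tenant: total_cost * usage / total_usage
            for tenant, usage in usage_by_tenant.items()}
```

A container requesting 4 CPUs and using 1 has an idle share of 0.75. Sizing from a high percentile rather than the maximum keeps a single outlier sample from setting the request, at the cost of occasional throttling that the headroom is meant to absorb.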

Sources

Every pattern in this blueprint traces to a named engineering team's published account or a primary benchmark, re-verified in June 2026.

What now

Use it. Then bring us the bill.

If the kit shows red flags you can't fix in a quarter, that's the conversation we're built for. These are the patterns our enterprise-platform pods reach for, and the blueprint doubles as the audit we run when a stack starts creaking.
