What Still Needs Load Testing
This post closes the Black Skies architecture series; the series hub is the thesis post, which claimed a 10,000-player battle can play like a 100-player skirmish by keeping CP small and making the AP seams part of the game. The posts since built the machinery, and every one of them ended with a list of unmeasured numbers. This post is the ledger where those debts are collected, because a credible engineering blog should not say "we solved it." It should say exactly what would prove the claim and what happens if the proof fails.
The components are proven. The combination is the hypothesis.
Nothing in this architecture is individually novel, and that is by design. H3 spatial indexing runs in production at Uber, handling billions of geospatial operations a day. Redis Streams as a durable, replayable event log follows Redis's own documented patterns at scales of millions of events per second. Redis 7 sharded pub/sub does real-time fan-out at the scale Discord's infrastructure team has written about publicly. Aurora PostgreSQL Limitless provides conditional-update CAS without the single-leader bottleneck of consensus systems. SWIM gossip is a published protocol with a decade of production lineage through HashiCorp's memberlist, the library underneath Serf, Consul, and Nomad today. Relay-based fan-out with interest management is how Figma, Discord, and Riot all decouple simulation from delivery.
What is novel is the combination, and specifically the claim that a card-driven, discrete-cell, 2 Hz tactical game can assemble these components into one contiguous 10,000-player battle space with no time dilation. That claim is a design hypothesis. The proof comes from benchmarks, not architecture diagrams.
Where the line currently sits
One layer has been run for real and one has been built for real, and the difference is worth a sentence. Single-tile load testing at full saturation has held its tick budget. The coordination machinery (fencing, promotion, trim safety, transfer, admission) sits behind more than a thousand test methods and a compose-based integration stack, with two named gates open before production trust: live Redis Cluster across masters, and live Aurora Limitless with EXPLAIN-verified single-shard ownership writes. The full AP mesh at massive scale has not been run, pending cloud credit allocation, and one caveat applies even once it has: bot-driven load is not a substitute for real users. Bots reconnect politely, click predictably, and never do the one weird thing ten thousand humans reliably do. The synthetic run can validate the thresholds. Only a live event validates the game.
The benchmark table
Each row is a workload, a threshold, and a pre-committed action if the threshold fails. The action column is the important one. It was written before the tests, so a failure triggers a design revision that was already reasoned through, not a rationalization invented under deadline pressure.
The rows now split into two tables, because they fail in two different ways. Performance rows have thresholds and pre-committed degradations. Correctness rows have no thresholds at all, only zeros and exactly-ones, and a miss there is not a tuning problem.
Performance rows
| Test | Workload | Success threshold | Failure action |
|---|---|---|---|
| Saturated tile tick | 2,401 occupants + 800 objects, 20% active | p99 < 200ms | Tile subdivision or reduced action complexity |
| Hot tile fan-out | 10K players viewing same 7 tiles | Publish-to-client p99 < 100ms | Broker tier or relay rebalance |
| Boundary storm | 500 crossings/sec across 7 tiles | Handoff p99 < 100ms | Transfer batching or edge hysteresis |
| Relay reconnect storm | 5K reconnects, 7-tile viewport | Snapshot p99 < 250ms | Jitter, singleflight, cache, or admission control |
| Redis tick deadline | Hot shard under concentrated battle | Stream XADD p99 < 100ms | Dedicated hot-tile shards |
| Wire format validation | Real card-event schema, 100 active entities | Per-client delta < 1 KB/tick | Schema redesign or bit-packing |
| Mass warp cold landing | 1,000-ship fleet to empty tile, queue draining at capacity-fill | Per-pull admission p99 < 500ms | Clearance tuning, capacity reserve, Warden k-throttle last |
| Capacity-fill drain | Queue holds more ships than the tile fits; one tick elapses | Every clear placement filled that tick inside the 500ms budget | Clearance-search cost reduction |
| Warp enqueue storm | Half the CCU enqueues in one tick (5,000 at the 10K tier) | Enqueue commit p99 < 100ms; payload within stream entry limits | Fleet-manifest byte compression; enqueue sharding |
| Spawn mint burst | The entire CCU spawns inside a 2-second event start | 5,000 epoch mints/sec sustained at the 10K tier (scales with the cap) | Pre-warm from enqueue watermarks; staged join queue |
| Departure WAIT cost | Sustained border combat; every commit carries a departure | At most 1 extra RTT per departure-bearing commit; others unaffected | Selective-WAIT scope audit; replica placement |
| Probe respawn | Tile killed; consumer probes trigger rebirth | Cold probe to first tick p99 < 2s | Checkpoint cadence or warm-pool tuning |
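
As a rough illustration of how a p99 row could be scored, the sketch below computes a nearest-rank 99th percentile and hands back the row's pre-committed action on a miss. The names `PerfRow`, `p99`, and `evaluate` are hypothetical and not taken from the Black Skies codebase, and the nearest-rank method is an assumption; the series does not say how percentiles are computed. Rows whose threshold is not a p99 latency (wire format, capacity-fill, spawn mint) would need their own checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfRow:
    """One performance row: a p99 limit and the action committed in advance."""
    name: str
    p99_limit_ms: float
    failure_action: str


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no samples recorded")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate(row, samples_ms):
    """Return (passed, observed_p99_ms, action); action is None on a pass."""
    observed = p99(samples_ms)
    passed = observed < row.p99_limit_ms
    return passed, observed, None if passed else row.failure_action


# Row copied from the table; the sample data below is invented for illustration.
BOUNDARY_STORM = PerfRow(
    name="Boundary storm",
    p99_limit_ms=100.0,
    failure_action="Transfer batching or edge hysteresis",
)

if __name__ == "__main__":
    handoffs_ms = [40.0] * 98 + [130.0, 300.0]
    print(evaluate(BOUNDARY_STORM, handoffs_ms))
```

The point of returning the action alongside the verdict is that a failing run reports the revision that was already reasoned through, so nothing has to be decided at the moment of failure.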
Correctness rows
| Test | Workload | Required result | If it fails |
|---|---|---|---|
| Split-brain chaos | Kill -9 primary, partition old owner | Stale write accepts = 0 | Fix fencing before production |
| Old epoch stream leak | Stale primary writes after promotion | Relays consume stale events = 0 | Epoch subscription fix |
| Kill in every window | Ship destroyed in each crossing phase, including the in-flight tick | Exactly one death event, attributed to that tick's record holder | Damage-routing fix; the rule itself is fixed by design |
| Trim floor safety | Six consumer classes at worst lag; aggressive trimming | Zero reads behind the trim point, ever | Watermark accounting fix; release blocker |
| Sequence-vector repair | Owner failover with 1,000 attached clients mid-battle | Client-perceived event gap = 0 after repair | Repair protocol fix; widen replay window |
| Structural duplicates | Full chaos suite: crossings, failovers, retries combined | Entities in two committed states = 0 | Release blocker; no public player until fixed |
| GC false positive | Idle tiles holding viewport refs and NPC jobs, under churn | Wrongful collection = 0 | Conjunction term audit |
| Duplicate admission | Same client action ID retried inside and after the hot window | One ledger write; byte-identical terminal result returned | Idempotency store fix |
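
The required-result column reduces to exact values, which makes the check simpler than the performance one. A minimal sketch follows, assuming each chaos run reports one counter per row. The metric names are hypothetical labels invented here for illustration; only the required values (zeros and exactly-ones) come from the table.

```python
# Required value per correctness row. Keys are hypothetical metric names;
# values are the zeros and exactly-ones from the table above.
REQUIRED = {
    "stale_write_accepts": 0,
    "relays_consumed_stale_events": 0,
    "death_events_per_kill": 1,
    "reads_behind_trim_point": 0,
    "client_event_gap": 0,
    "entities_in_two_committed_states": 0,
    "wrongful_collections": 0,
    "ledger_writes_per_action_id": 1,
}


def correctness_misses(observed):
    """Return {metric: (required, actual)} for every row that missed.

    Equality is the only comparison: there is no tolerance to tune.
    A metric absent from `observed` counts as a miss, so an unmeasured
    row can never pass silently.
    """
    misses = {}
    for name, required in REQUIRED.items():
        actual = observed.get(name)
        if actual != required:
            misses[name] = (required, actual)
    return misses


if __name__ == "__main__":
    run = dict(REQUIRED)
    run["death_events_per_kill"] = 2  # invented failure for illustration
    print(correctness_misses(run))
```

An empty result is the only passing outcome. Anything else is a fix to make before the next run, not a number to negotiate.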
Several correctness rows already run in miniature inside the suite today: trim-floor accounting, fencing rejection, duplicate-admission replay, and sequence-vector repair. That is the layer the previous section described as built but not yet run at scale. Every population number in the performance table waits for the mesh.
Any row failing its requirement invalidates a core assumption. The performance rows degrade into their pre-committed actions. The correctness rows do not degrade at all, which is what the next section is about.
The two numbers that define working
The performance rows measure speed. The correctness rows measure truth, and two SLIs sit above them all, pointing in opposite directions.
fencing_token_reject_rate should be non-zero. Every rejection means a stale owner tried to write and the fencing from the ownership post stopped it. A healthy system under churn rejects things constantly. Zero rejections during a chaos run does not mean the system is clean; it means the test failed to create the conflicts it was supposed to create. The reject rate is a first-class SLI, and its job is to prove the dangerous path is being exercised. Spikes still get investigated, because each one has a source.
stale_write_accept_count must be zero, always, and a single non-zero observation is a release blocker. There is no threshold to tune and no graceful degradation. One stale write accepted means an entity duplicated or a battle forked, and no performance number anywhere else in the table buys that back.
The pair encodes the whole testing philosophy: maximize how often the safety mechanisms fire, and prove they never fail when they do.
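To make the opposite directions concrete, here is a minimal in-memory sketch of an epoch-fenced write and a verdict over the two SLIs. `FencedStore` and `chaos_run_verdict` are hypothetical names, and the dictionary stands in for the conditional-update check the real system performs in the database. Note the assumption that `stale_write_accept_count` is measured by an independent audit of committed state: a store cannot count the stale writes it failed to catch.

```python
class FencedStore:
    """In-memory stand-in for an ownership table with epoch fencing."""

    def __init__(self):
        self.epoch = {}    # tile_id -> epoch of the current owner
        self.state = {}    # tile_id -> last accepted value
        self.rejects = 0   # numerator of fencing_token_reject_rate

    def promote(self, tile_id):
        """Install a new owner by bumping the epoch; return the new token."""
        self.epoch[tile_id] = self.epoch.get(tile_id, 0) + 1
        return self.epoch[tile_id]

    def write(self, tile_id, writer_epoch, value):
        """Accept the write only if the writer holds the current epoch."""
        if writer_epoch != self.epoch.get(tile_id):
            self.rejects += 1
            return False
        self.state[tile_id] = value
        return True


def chaos_run_verdict(reject_count, stale_write_accept_count):
    """Combine the two SLIs into a single verdict for a chaos run."""
    if stale_write_accept_count != 0:
        return "release blocker"
    if reject_count == 0:
        # Nothing was rejected, so the dangerous path was never exercised.
        return "inconclusive"
    return "pass"


if __name__ == "__main__":
    store = FencedStore()
    old_token = store.promote("tile-a")
    new_token = store.promote("tile-a")           # failover: old owner is now stale
    store.write("tile-a", old_token, "stale")     # rejected
    store.write("tile-a", new_token, "current")   # accepted
    print(chaos_run_verdict(store.rejects, 0))
```

The middle branch is the one that is easy to forget: a run with zero rejections and zero stale accepts looks clean but has proven nothing.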
Failure is in the plan
The benchmark plan assumes some rows will miss. Each round of testing will surface consequences the design did not anticipate, and that is expected and intentional. The architecture is built to be tested and revised, not to be right on the first pass. This series has already shown that loop running before any massive-scale test existed: etcd was removed when writing the test plan surfaced its throughput ceiling, the owner cache was deleted when the routing path made it redundant, and tombstones left the GC design when convergence turned out not to need them. Each simplification came from adversarial thinking about how the system would be proven. The load tests are the same process with electricity.
What the big run will not demonstrate
Honesty also means scoping the claim. When the massive-scale run happens, a passing table demonstrates the architecture under synthetic load. It says nothing about several problems that are deliberately out of scope for this series: anti-cheat beyond server-authoritative card validation, client prediction between 2 Hz ticks, matchmaking and social systems, persistence beyond DynamoDB checkpoints, and observability. One scoping decision is worth naming because it is load-bearing: relays must be regional. A Tokyo client hitting a us-east-1 relay pays 150 to 200ms of round trip, which blows the 100ms delivery budget before the architecture does anything at all. The single-region benchmark proves the design, not the deployment.
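The regional-relay arithmetic is small enough to write down. The sketch below uses only the figures quoted above (a 100ms delivery budget and a 150 to 200ms round trip) and, like the paragraph, charges the whole round trip against the budget; the function name is hypothetical.

```python
DELIVERY_BUDGET_MS = 100  # publish-to-client budget from the performance table


def budget_overshoot_ms(network_rtt_ms, budget_ms=DELIVERY_BUDGET_MS):
    """Milliseconds by which network latency alone exceeds the budget (0 if it fits)."""
    return max(0, network_rtt_ms - budget_ms)


if __name__ == "__main__":
    for rtt_ms in (150, 200):  # Tokyo to us-east-1 range quoted in the text
        print(f"{rtt_ms} ms RTT overshoots by {budget_overshoot_ms(rtt_ms)} ms")
    # prints:
    # 150 ms RTT overshoots by 50 ms
    # 200 ms RTT overshoots by 100 ms
```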
Closing the series
The thesis deserves its final form, with its qualifier attached. If the stress tests in this plan hold, a player should never feel an infrastructure limitation, because the limitation is the game: discrete cells, card-driven actions, bounded targets, cooldown-gated tempo. Those are the rules, not the workarounds. The strongest part of this architecture is the game-design constraints, and the hardest remaining work is proving the numbers. That is the right order. Design the constraints first, then measure.
When the run happens, the findings, the surprises, and the design changes they force will be published in a follow-up, whichever way the numbers go. If the table holds, that post is a victory lap with graphs. If it does not, it will be the more interesting post.