BLACK SKIES ARCHITECTURE SERIES · PART 8 OF 8 · 8 MIN

What Still Needs Load Testing

This post closes the Black Skies architecture series; the series hub is the thesis post, which claimed a 10,000-player battle can play like a 100-player skirmish by keeping CP small and making the AP seams part of the game. The posts since built the machinery, and every one of them ended with a list of unmeasured numbers. This post is the ledger where those debts are collected, because a credible engineering blog should not say "we solved it." It should say exactly what would prove the claim and what happens if the proof fails.

The components are proven. The combination is the hypothesis.

Nothing in this architecture is individually novel, and that is by design. H3 spatial indexing runs in production at Uber, handling billions of geospatial operations a day. Redis Streams as a durable, replayable event log follows Redis's own documented patterns at scales of millions of events per second. Redis 7 sharded pub/sub does real-time fan-out at the scale Discord's infrastructure team has written about publicly. Aurora PostgreSQL Limitless provides conditional-update CAS without the single-leader bottleneck of consensus systems. SWIM gossip is a published protocol with a decade of production lineage through HashiCorp's memberlist library and Serf, which is built on it and runs inside Consul and Nomad today. Relay-based fan-out with interest management is how Figma, Discord, and Riot all decouple simulation from delivery.

What is novel is the combination, and specifically the claim that a card-driven, discrete-cell, 2 Hz tactical game can assemble these components into one contiguous 10,000-player battle space with no time dilation. That claim is a design hypothesis. The proof comes from benchmarks, not architecture diagrams.

Where the line currently sits

One layer has been run for real and one has been built for real, and the difference is worth a sentence. Single-tile load testing at full saturation has held its tick budget. The coordination machinery (fencing, promotion, trim safety, transfer, admission) sits behind more than a thousand test methods and a compose-based integration stack, with two named gates still open before production trust: live Redis Cluster across masters, and live Aurora Limitless with EXPLAIN-verified single-shard ownership writes. The full AP mesh at massive scale has not been run, pending cloud credit allocation, and one caveat will apply even after it has: bot-driven load is not a substitute for real users. Bots reconnect politely, click predictably, and never do the one weird thing ten thousand humans reliably do. The synthetic run can validate the thresholds. Only a live event validates the game.

The benchmark table

Each row is a workload, a threshold, and a pre-committed action if the threshold fails. The action column is the important one. It was written before the tests, so a failure triggers a design revision that was already reasoned through, not a rationalization invented under deadline pressure.

The rows now split into two tables, because they fail in two different ways. Performance rows have thresholds and pre-committed degradations. Correctness rows have no thresholds at all, only zeros and exactly-ones, and a miss there is not a tuning problem.

Performance rows

| Test | Workload | Success threshold | Failure action |
| --- | --- | --- | --- |
| Saturated tile tick | 2,401 occupants + 800 objects, 20% active | p99 < 200ms | Tile subdivision or reduce action complexity |
| Hot tile fan-out | 10K players viewing same 7 tiles | Publish-to-client p99 < 100ms | Broker tier or relay rebalance |
| Boundary storm | 500 crossings/sec across 7 tiles | Handoff p99 < 100ms | Transfer batching or edge hysteresis |
| Relay reconnect storm | 5K reconnects, 7-tile viewport | Snapshot p99 < 250ms | Jitter, singleflight, cache, or admission control |
| Redis tick deadline | Hot shard under concentrated battle | Stream XADD p99 < 100ms | Dedicated hot-tile shards |
| Wire format validation | Real card-event schema, 100 active entities | Per-client delta < 1 KB/tick | Schema redesign or bit-packing |
| Mass warp cold landing | 1,000-ship fleet to empty tile, queue draining at capacity-fill | Per-pull admission p99 < 500ms | Clearance tuning, capacity reserve, Warden k-throttle last |
| Capacity-fill drain | Queue holds more ships than the tile fits; one tick elapses | Every clear placement filled that tick inside the 500ms budget | Clearance-search cost reduction |
| Warp enqueue storm | Half the CCU enqueues in one tick (5,000 at the 10K tier) | Enqueue commit p99 < 100ms; payload within stream entry limits | Fleet-manifest byte compression; enqueue sharding |
| Spawn mint burst | The entire CCU spawns inside a 2-second event start | 5,000 epoch mints/sec sustained at the 10K tier (scales with the cap) | Pre-warm from enqueue watermarks; staged join queue |
| Departure WAIT cost | Sustained border combat; every commit carries a departure | At most 1 extra RTT per departure-bearing commit; others unaffected | Selective-WAIT scope audit; replica placement |
| Probe respawn | Tile killed; consumer probes trigger rebirth | Cold probe to first tick p99 < 2s | Checkpoint cadence or warm-pool tuning |
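
A performance row can be treated as data: a threshold plus the action committed to before the run. The snippet below is a minimal sketch of that idea, not the project's harness. `PerfRow`, `p99`, and `evaluate` are hypothetical names, the latency samples are invented for illustration, and nearest-rank is only one of several percentile definitions a real harness might use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfRow:
    name: str
    threshold_ms: float   # success threshold from the table
    failure_action: str   # written down before the test runs


def p99(samples_ms):
    """Nearest-rank 99th percentile (an assumed definition)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def evaluate(row, samples_ms):
    """Return (passed, observed p99, pre-committed action or None)."""
    observed = p99(samples_ms)
    passed = observed < row.threshold_ms
    return passed, observed, None if passed else row.failure_action


row = PerfRow("Boundary storm", 100.0, "Transfer batching or edge hysteresis")
# Illustrative samples only: 98 fast handoffs and 2 slow ones.
samples = [40.0] * 98 + [180.0] * 2
passed, observed, action = evaluate(row, samples)
print(passed, observed, action)
# False 180.0 Transfer batching or edge hysteresis
```

The point of returning the action alongside the verdict is that a failing run reports what was already decided, so nothing has to be improvised after the numbers arrive.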

Correctness rows

| Test | Workload | Required result | If it fails |
| --- | --- | --- | --- |
| Split-brain chaos | Kill -9 primary, partition old owner | Stale write accepts = 0 | Fix fencing before production |
| Old epoch stream leak | Stale primary writes after promotion | Relays consume stale events = 0 | Epoch subscription fix |
| Kill in every window | Ship destroyed in each crossing phase, including the in-flight tick | Exactly one death event, attributed to that tick's record holder | Damage-routing fix; the rule itself is fixed by design |
| Trim floor safety | Six consumer classes at worst lag; aggressive trimming | Zero reads behind the trim point, ever | Watermark accounting fix; release blocker |
| Sequence-vector repair | Owner failover with 1,000 attached clients mid-battle | Client-perceived event gap = 0 after repair | Repair protocol fix; widen replay window |
| Structural duplicates | Full chaos suite: crossings, failovers, retries combined | Entities in two committed states = 0 | Release blocker; no public player until fixed |
| GC false positive | Idle tiles holding viewport refs and NPC jobs, under churn | Wrongful collection = 0 | Conjunction term audit |
| Duplicate admission | Same client action ID retried inside and after the hot window | One ledger write; byte-identical terminal result returned | Idempotency store fix |
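
The duplicate-admission row amounts to an idempotency contract, and a toy version makes the required result concrete. This is a sketch under assumptions: `AdmissionLedger`, `admit`, and the in-memory dict are hypothetical stand-ins for whatever idempotency store the system actually uses, and the hot window is not modeled at all.

```python
import json


class AdmissionLedger:
    """Toy idempotency store: one ledger write per client action ID."""

    def __init__(self):
        self._results = {}       # action_id -> stored terminal result (bytes)
        self.ledger_writes = 0   # how many times an action was really applied

    def admit(self, action_id, payload):
        # A retry gets the stored bytes back verbatim, never a recomputation.
        if action_id in self._results:
            return self._results[action_id]
        self.ledger_writes += 1
        result = json.dumps(
            {"action_id": action_id, "payload": payload,
             "seq": self.ledger_writes, "status": "admitted"},
            sort_keys=True,
        ).encode("utf-8")
        self._results[action_id] = result
        return result


ledger = AdmissionLedger()
first = ledger.admit("act-42", {"card": "warp"})
retry = ledger.admit("act-42", {"card": "warp"})
print(ledger.ledger_writes, first == retry)
# 1 True
```

Storing the serialized result, instead of rebuilding it on retry, is what makes "byte-identical" checkable with a plain equality comparison.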

Several correctness rows already run in miniature inside the suite today: trim-floor accounting, fencing rejection, duplicate-admission replay, and sequence-vector repair. That is the built-for-real layer described under "Where the line currently sits". Every population number in the performance table waits for the mesh.
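
Trim-floor accounting is the easiest of these to picture in miniature. The sketch below assumes the safe trim point is the lowest position any consumer class still needs, which is a common watermark scheme but an assumption about this system. `safe_trim_point`, `trim`, and the three consumer class names are hypothetical (the real suite covers six classes), and a plain list stands in for the stream.

```python
def safe_trim_point(watermarks):
    """Lowest sequence number that any consumer class still needs."""
    if not watermarks:
        raise ValueError("no consumers registered; refusing to trim")
    return min(watermarks.values())


def trim(log, requested, watermarks):
    """Drop entries below the floor; the floor never passes a consumer."""
    floor = min(requested, safe_trim_point(watermarks))
    kept = [(seq, event) for seq, event in log if seq >= floor]
    return kept, floor


log = [(seq, f"event-{seq}") for seq in range(1, 11)]
# Hypothetical consumer classes and the next sequence each still needs.
watermarks = {"relay": 9, "checkpoint": 4, "audit": 7}

trimmed, floor = trim(log, requested=8, watermarks=watermarks)
# The aggressive request (8) is clamped to the slowest consumer (4).
reads_behind_trim = sum(1 for pos in watermarks.values() if pos < floor)
print(floor, len(trimmed), reads_behind_trim)
# 4 7 0
```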

Any row failing its requirement invalidates a core assumption. The performance rows degrade into their pre-committed actions. The correctness rows do not degrade at all, which is what the next section is about.

The two numbers that define working

The performance rows measure speed. The correctness rows measure truth, and two SLIs sit above them all, pointing in opposite directions.

[Figure: the two numbers that define working, pointing in opposite directions. fencing_token_reject_rate > 0: the fence must fire constantly. stale_write_accept_count = 0: it must never fail when it does.]

fencing_token_reject_rate should be non-zero. Every rejection means a stale owner tried to write and the fencing from the ownership post stopped it. A healthy system under churn rejects things constantly. Zero rejections during a chaos run does not mean the system is clean; it means the test failed to create the conflicts it was supposed to create. The reject rate is a first-class SLI, and its job is to prove the dangerous path is being exercised. Spikes still get investigated, because each one has a source.

stale_write_accept_count must be zero, always, and a single non-zero observation is a release blocker. There is no threshold to tune and no graceful degradation. One stale write accepted means an entity duplicated or a battle forked, and no performance number anywhere else in the table buys that back.

The pair encodes the whole testing philosophy: maximize how often the safety mechanisms fire, and prove they never fail when they do.
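
The pair can be sketched as a verdict over a chaos run. The toy store below fences writes with a monotonically increasing epoch, which is an assumption about how the fencing from the ownership post works. `FencedStore`, `promote`, `write`, `chaos_verdict`, and the `fence_enabled` switch are hypothetical names introduced for illustration; only the two metric names come from this post.

```python
class FencedStore:
    """Toy fenced store: a write is accepted only at the current epoch."""

    def __init__(self, fence_enabled=True):
        self.fence_enabled = fence_enabled   # False simulates a broken fence
        self.epoch = 1
        self.writes_attempted = 0
        self.fencing_token_rejects = 0
        self.stale_write_accept_count = 0

    def promote(self):
        """A new owner takes over; every older token becomes stale."""
        self.epoch += 1
        return self.epoch

    def write(self, token):
        self.writes_attempted += 1
        stale = token < self.epoch
        if stale and self.fence_enabled:
            self.fencing_token_rejects += 1
            return False
        if stale:
            self.stale_write_accept_count += 1   # the number that must stay 0
        return True

    @property
    def fencing_token_reject_rate(self):
        if self.writes_attempted == 0:
            return 0.0
        return self.fencing_token_rejects / self.writes_attempted


def chaos_verdict(store):
    if store.stale_write_accept_count != 0:
        return "release blocker: stale write accepted"
    if store.fencing_token_rejects == 0:
        return "invalid run: no conflicts were created"
    return "pass"


store = FencedStore()
old_token = store.epoch
new_token = store.promote()
store.write(new_token)   # current owner: accepted
store.write(old_token)   # stale owner: rejected by the fence
print(store.fencing_token_reject_rate, store.stale_write_accept_count,
      chaos_verdict(store))
# 0.5 0 pass
```

Note the ordering inside `chaos_verdict`: an accepted stale write is checked first, so no amount of healthy rejection can mask it, and a run with no rejections is reported as invalid instead of clean.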

Failure is in the plan

The benchmark plan assumes some rows will miss. Each round of testing will surface consequences the design did not anticipate, and that is expected and intentional. The architecture is built to be tested and revised, not to be right on the first pass. This series has already shown that loop running before any massive-scale test existed: etcd was removed when writing the test plan surfaced its throughput ceiling, the owner cache was deleted when the routing path made it redundant, and tombstones left the GC design when convergence turned out not to need them. Each simplification came from adversarial thinking about how the system would be proven. The load tests are the same process with electricity.

What the big run will not demonstrate

Honesty also means scoping the claim. When the massive-scale run happens, a passing table demonstrates the architecture under synthetic load. It says nothing about several problems that are deliberately out of scope for this series: anti-cheat beyond server-authoritative card validation, client prediction between 2 Hz ticks, matchmaking and social systems, persistence beyond DynamoDB checkpoints, and observability. One scoping decision is worth naming because it is load-bearing: relays must be regional. A Tokyo client hitting a us-east-1 relay pays 150 to 200ms of round trip, which blows the 100ms delivery budget before the architecture does anything at all. The single-region benchmark proves the design, not the deployment.

Closing the series

The thesis deserves its final form, with its qualifier attached. If the stress tests in this plan hold, a player should never feel an infrastructure limitation, because the limitation is the game: discrete cells, card-driven actions, bounded targets, cooldown-gated tempo. Those are the rules, not the workarounds. The strongest part of this architecture is the game-design constraints, and the hardest remaining work is proving the numbers. That is the right order. Design the constraints first, then measure.

When the run happens, the findings, the surprises, and the design changes they force will be published in a follow-up, whichever way the numbers go. If the table holds, that post is a victory lap with graphs. If it does not, it will be the more interesting post.