BLACK SKIES ARCHITECTURE SERIES · PART 8 OF 8 · 8 MIN

What Still Needs Load Testing

This post closes the Black Skies architecture series; the series hub is the thesis post, which claimed a 10,000-player battle can play like a 100-player skirmish by keeping CP small and making the AP seams part of the game. The posts since built the machinery, and every one of them ended with a list of unmeasured numbers. This post is the ledger where those debts are collected, because a credible engineering blog should not say "we solved it." It should say exactly what would prove the claim and what happens if the proof fails.

The components are proven. The combination is the hypothesis.

Nothing in this architecture is individually novel, and that is by design. H3 spatial indexing runs in production at Uber, handling billions of geospatial operations a day. Redis Streams as a durable, replayable event log follows Redis's own documented patterns at scales of millions of events per second. Redis 7 sharded pub/sub does real-time fan-out at the scale Discord's infrastructure team has written about publicly. Aurora PostgreSQL Limitless provides conditional-update CAS without the single-leader bottleneck of consensus systems. SWIM gossip is a published protocol with a decade of production lineage through HashiCorp's memberlist, the library underneath Serf and, through Serf, inside Consul and Nomad today. Relay-based fan-out with interest management is how Figma, Discord, and Riot all decouple simulation from delivery.

What is novel is the combination, and specifically the claim that a card-driven, discrete-cell, 2 Hz tactical game can assemble these components into one contiguous 10,000-player battle space with no time dilation. That claim is a design hypothesis. The proof comes from benchmarks, not architecture diagrams.

Where the line currently sits

One layer has been run for real and one has been built for real, and the difference is worth a sentence. Single-tile load testing at full saturation has held its tick budget. The coordination machinery (fencing, promotion, trim safety, transfer, admission) sits behind more than a thousand test methods and a compose-based integration stack, with two named gates open before production trust: live Redis Cluster across masters, and live Aurora Limitless with EXPLAIN-verified single-shard ownership writes. The full AP mesh at massive scale has not been run, pending cloud credit allocation. One caveat applies even once it has: bot-driven load is not a substitute for real users. Bots reconnect politely, click predictably, and never do the one weird thing ten thousand humans reliably do. The synthetic run can validate the thresholds. Only a live event validates the game.

The benchmark table

Each row is a workload, a threshold, and a pre-committed action if the threshold fails. The action column is the important one. It was written before the tests, so a failure triggers a design revision that was already reasoned through, not a rationalization invented under deadline pressure.

The rows now split into two tables, because they fail in two different ways. Performance rows have thresholds and pre-committed degradations. Correctness rows have no thresholds at all, only zeros and exactly-ones, and a miss there is not a tuning problem.

Performance rows

Test	Workload	Success threshold	Failure action
Saturated tile tick	2,401 occupants + 800 objects, 20% active	p99 < 200ms	Tile subdivision or reduce action complexity
Hot tile fan-out	10K players viewing same 7 tiles	Publish-to-client p99 < 100ms	Broker tier or relay rebalance
Boundary storm	500 crossings/sec across 7 tiles	Handoff p99 < 100ms	Transfer batching or edge hysteresis
Relay reconnect storm	5K reconnects, 7-tile viewport	Snapshot p99 < 250ms	Jitter, singleflight, cache, or admission control
Redis tick deadline	Hot shard under concentrated battle	Stream XADD p99 < 100ms	Dedicated hot-tile shards
Wire format validation	Real card-event schema, 100 active entities	Per-client delta < 1 KB/tick	Schema redesign or bit-packing
Mass warp cold landing	1,000-ship fleet to empty tile, queue draining at capacity-fill	Per-pull admission p99 < 500ms	Clearance tuning, capacity reserve, Warden k-throttle last
Capacity-fill drain	Queue holds more ships than the tile fits; one tick elapses	Every clear placement filled that tick inside the 500ms budget	Clearance-search cost reduction
Warp enqueue storm	Half the CCU enqueues in one tick (5,000 at the 10K tier)	Enqueue commit p99 < 100ms; payload within stream entry limits	Fleet-manifest byte compression; enqueue sharding
Spawn mint burst	The entire CCU spawns inside a 2-second event start	5,000 epoch mints/sec sustained at the 10K tier (scales with the cap)	Pre-warm from enqueue watermarks; staged join queue
Departure WAIT cost	Sustained border combat; every commit carries a departure	At most 1 extra RTT per departure-bearing commit; others unaffected	Selective-WAIT scope audit; replica placement
Probe respawn	Tile killed; consumer probes trigger rebirth	Cold probe to first tick p99 < 2s	Checkpoint cadence or warm-pool tuning
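
A performance row can be held as data, so the failure action is fixed before any sample exists. The following is a minimal sketch, not the project's harness: the PerfRow type, the nearest-rank p99 helper, and the sample latencies are all hypothetical, and only the row's name, threshold, and action come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfRow:
    """One performance row: a threshold plus the action chosen before the run."""
    test: str
    p99_limit_ms: float
    failure_action: str


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def evaluate(row, samples_ms):
    """Return (passed, observed p99, pre-committed action or None)."""
    observed = p99(samples_ms)
    passed = observed < row.p99_limit_ms
    return passed, observed, (None if passed else row.failure_action)


SATURATED_TILE = PerfRow(
    test="Saturated tile tick",
    p99_limit_ms=200.0,
    failure_action="Tile subdivision or reduce action complexity",
)

# Hypothetical samples: 98 fast ticks and 2 slow ones, so the p99 lands on a slow tick.
samples = [120.0] * 98 + [260.0] * 2
passed, observed, action = evaluate(SATURATED_TILE, samples)
```

With these made-up samples the row misses, and the only thing the evaluator can return is the action that was written down in advance. That is the property the table is after: the code path for failure has no room for a new rationalization.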

Correctness rows

Test	Workload	Required result	If it fails
Split-brain chaos	Kill -9 primary, partition old owner	Stale write accepts = 0	Fix fencing before production
Old epoch stream leak	Stale primary writes after promotion	Relays consume stale events = 0	Epoch subscription fix
Kill in every window	Ship destroyed in each crossing phase, including the in-flight tick	Exactly one death event, attributed to that tick's record holder	Damage-routing fix; the rule itself is fixed by design
Trim floor safety	Six consumer classes at worst lag; aggressive trimming	Zero reads behind the trim point, ever	Watermark accounting fix; release blocker
Sequence-vector repair	Owner failover with 1,000 attached clients mid-battle	Client-perceived event gap = 0 after repair	Repair protocol fix; widen replay window
Structural duplicates	Full chaos suite: crossings, failovers, retries combined	Entities in two committed states = 0	Release blocker; no public player until fixed
GC false positive	Idle tiles holding viewport refs and NPC jobs, under churn	Wrongful collection = 0	Conjunction term audit
Duplicate admission	Same client action ID retried inside and after the hot window	One ledger write; byte-identical terminal result returned	Idempotency store fix
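
The Duplicate admission row can be read as a small contract: the first attempt writes, every retry replays. A minimal in-memory sketch of that contract follows. The AdmissionLedger class, its method names, and the placeholder result bytes are assumptions made for illustration, and the sketch ignores the distinction between retries inside and after the hot window that the real idempotency store has to handle.

```python
class AdmissionLedger:
    """In-memory stand-in for an idempotency store (hypothetical)."""

    def __init__(self):
        self._results = {}      # action_id -> stored terminal result bytes
        self.ledger_writes = 0  # how many times an action was actually applied

    def admit(self, action_id: str, payload: bytes) -> bytes:
        cached = self._results.get(action_id)
        if cached is not None:
            # Retry of a known action: replay the stored bytes, write nothing.
            return cached
        # First sighting: apply once and remember the terminal result.
        result = b"accepted:" + payload  # placeholder for real action handling
        self._results[action_id] = result
        self.ledger_writes += 1
        return result


ledger = AdmissionLedger()
first = ledger.admit("action-42", b"fire-card")
retry = ledger.admit("action-42", b"fire-card")
```

The test for this row then reduces to two equalities: first and retry are the same bytes, and ledger_writes is exactly one.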

Several correctness rows already run in miniature inside the suite today (trim-floor accounting, fencing rejection, duplicate-admission replay, sequence-vector repair), which is what the earlier section on where the line sits was describing. Every population number in the performance table waits for the mesh.

Any row failing its requirement invalidates a core assumption. The performance rows degrade into their pre-committed actions. The correctness rows do not degrade at all, which is what the next section is about.

The two numbers that define working

The performance rows measure speed. The correctness rows measure truth, and two SLIs sit above them all, pointing in opposite directions.

The two numbers that define working, pointing in opposite directions: the fence must fire constantly, and it must never fail when it does.

fencing_token_reject_rate should be non-zero. Every rejection means a stale owner tried to write and the fencing from the ownership post stopped it. A healthy system under churn rejects things constantly. Zero rejections during a chaos run does not mean the system is clean; it means the test failed to create the conflicts it was supposed to create. The reject rate is a first-class SLI, and its job is to prove the dangerous path is being exercised. Spikes still get investigated, because each one has a source.

stale_write_accept_count must be zero, always, and a single non-zero observation is a release blocker. There is no threshold to tune and no graceful degradation. One stale write accepted means an entity duplicated or a battle forked, and no performance number anywhere else in the table buys that back.

The pair encodes the whole testing philosophy: maximize how often the safety mechanisms fire, and prove they never fail when they do.
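
The pair can be sketched as a toy fence plus a verdict function. This is an illustration under stated assumptions, not the fencing from the ownership post: the FencedTile class, the integer epoch comparison, and chaos_verdict are hypothetical names, and only the two SLI meanings come from the text above.

```python
class FencedTile:
    """Toy tile whose writes are guarded by an ownership epoch (assumed design)."""

    def __init__(self):
        self.epoch = 0
        self.attempts = 0
        self.rejects = 0
        self.accepted = []  # (token, epoch at accept time, event)

    def promote(self) -> int:
        """A new owner takes over and receives the next epoch as its token."""
        self.epoch += 1
        return self.epoch

    def write(self, token: int, event: str) -> bool:
        self.attempts += 1
        if token != self.epoch:  # writer holds an out-of-date token
            self.rejects += 1
            return False
        self.accepted.append((token, self.epoch, event))
        return True

    def reject_rate(self) -> float:
        return self.rejects / self.attempts if self.attempts else 0.0

    def stale_accept_count(self) -> int:
        return sum(1 for token, current, _ in self.accepted if token < current)


def chaos_verdict(reject_rate: float, stale_accept_count: int) -> str:
    if stale_accept_count > 0:
        return "release blocker"  # no threshold, no degradation
    if reject_rate == 0:
        return "inconclusive: no conflicts were created"
    return "pass"


tile = FencedTile()
old_token = tile.promote()
tile.write(old_token, "move")   # accepted: the token is current
new_token = tile.promote()      # failover: the old owner is now stale
tile.write(old_token, "fire")   # rejected by the fence
tile.write(new_token, "fire")   # accepted
verdict = chaos_verdict(tile.reject_rate(), tile.stale_accept_count())
```

In this toy the audit and the fence share one code path, so the stale count cannot disagree with the fence. A real chaos run would presumably compute stale_write_accept_count from an independent record of what was committed, which is what makes a zero meaningful.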

Failure is in the plan

The benchmark plan assumes some rows will miss. Each round of testing will surface consequences the design did not anticipate, and that is expected and intentional. The architecture is built to be tested and revised, not to be right on the first pass. This series has already shown that loop running before any massive-scale test existed: etcd was removed when writing the test plan surfaced its throughput ceiling, the owner cache was deleted when the routing path made it redundant, and tombstones left the GC design when convergence turned out not to need them. Each simplification came from adversarial thinking about how the system would be proven. The load tests are the same process with electricity.

What the big run will not demonstrate

Honesty also means scoping the claim. When the massive-scale run happens, a passing table demonstrates the architecture under synthetic load. It says nothing about several problems that are deliberately out of scope for this series: anti-cheat beyond server-authoritative card validation, client prediction between 2 Hz ticks, matchmaking and social systems, persistence beyond DynamoDB checkpoints, and observability. One scoping decision is worth naming because it is load-bearing: relays must be regional. A Tokyo client hitting a us-east-1 relay pays 150 to 200ms of round trip, which blows the 100ms delivery budget before the architecture does anything at all. The single-region benchmark proves the design, not the deployment.
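
The regional-relay argument is plain subtraction, sketched below. The function name and constant are hypothetical; the 100ms budget and the 150 to 200ms round trip are the figures quoted above, and charging the full round trip against the delivery budget follows the article's own framing.

```python
DELIVERY_BUDGET_MS = 100.0  # publish-to-client p99 budget from the benchmark table


def remaining_budget_ms(network_rtt_ms: float,
                        budget_ms: float = DELIVERY_BUDGET_MS) -> float:
    """Budget left for the architecture after the network has taken its share."""
    return budget_ms - network_rtt_ms


# Tokyo client on a us-east-1 relay, at both ends of the quoted range.
tokyo_remaining = [remaining_budget_ms(rtt) for rtt in (150.0, 200.0)]
```

Both values come out negative, 50ms to 100ms over budget, before a single tick has been processed or a single event published.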

Closing the series

The thesis deserves its final form, with its qualifier attached. If the stress tests in this plan hold, a player should never feel an infrastructure limitation, because the limitation is the game: discrete cells, card-driven actions, bounded targets, cooldown-gated tempo. Those are the rules, not the workarounds. The strongest part of this architecture is the game-design constraints, and the hardest remaining work is proving the numbers. That is the right order. Design the constraints first, then measure.

When the run happens, the findings, the surprises, and the design changes they force will be published in a follow-up, whichever way the numbers go. If the table holds, that post is a victory lap with graphs. If it does not, it will be the more interesting post.