The Substrate Tiles Never Think About
This post is part of the Black Skies architecture series; the series hub is the thesis post. The lifecycle post ended on a handoff: a tile's master submits a DAG of tasks and collects results, and everything between submission and result is the pool's concern. This post is the pool. It is the layer the game never sees, and the layer where most of the money and most of the latency live.
Capacity with known item sizes
Every capacity system on earth wants the same impossible thing: to know what its workloads will cost before they cost it. EVE Online cannot predict what a fight will demand, because player count, fitting choices, and module spam compose into an open-ended bill. Black Skies can, and the reason is the thesis paying out one more time. The game rules that bound a tile (the ingress ceiling, the cell count, the card costs and cooldowns) also bound the plan a tile's master can possibly submit. Every tile has a printed worst case, computable before the tile ever heats up.
That single property changes the character of everything below it. Capacity planning is usually forecasting; here it is bin-packing with known item sizes. A scheduler that knows the ceiling of every item it places gets to do things schedulers normally cannot do safely, and most of this post is a tour of those things.
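To make "bin-packing with known item sizes" concrete, here is a minimal sketch using first-fit decreasing, a standard packing heuristic chosen for illustration, not necessarily what the pool runs. The `Cluster` type, the slot unit, and every number are hypothetical, and the sketch packs strictly by worst case, so it does not model the oversubscription of quiet tiles discussed later.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity: int  # hypothetical unit: executor slots
    placed: dict[str, int] = field(default_factory=dict)

    @property
    def free(self) -> int:
        return self.capacity - sum(self.placed.values())


def pack_worst_cases(ceilings: dict[str, int], clusters: list[Cluster]) -> list[str]:
    """First-fit decreasing: place the largest ceilings first.

    Returns the tiles whose worst case fits in no cluster.
    """
    unplaced = []
    # Sort by ceiling descending, then by name so the result is deterministic.
    for tile, ceiling in sorted(ceilings.items(), key=lambda kv: (-kv[1], kv[0])):
        target = next((c for c in clusters if c.free >= ceiling), None)
        if target is None:
            unplaced.append(tile)
        else:
            target.placed[tile] = ceiling
    return unplaced


clusters = [Cluster("a", 10), Cluster("b", 10)]
ceilings = {"t1": 6, "t2": 5, "t3": 4, "t4": 3}  # printed worst cases, made up
leftover = pack_worst_cases(ceilings, clusters)
```

Because every item size is a ceiling rather than a forecast, a placement that succeeds here cannot be invalidated later by a tile behaving worse than expected.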
Stateless by decree
The compute layer is built from TileHosts, and the first rule is that a TileHost is capacity, not a home. No durable state lives on one, ever. Tile state is authoritative in the owning master's memory, committed to Redis every tick, and checkpointed to DynamoDB; the executors that do the grunt work hold nothing past the task in their hands. This is one principle applied without exception: put durable state in systems designed for durability, put compute in systems designed for replaceability, and never mix the two.
Statelessness is what makes everything else in this post cheap. A worker that holds nothing can be replaced by anything, anywhere, including, as we will get to, by something that is not a pod at all.
The dispatch budget
At 2 Hz, a tick completes when the slowest task on the DAG's critical path returns, so the substrate's first job is making task dispatch nearly free. The enemy has two faces: fixed per-task overhead, which multiplies by the depth of the plan rather than its width, and tail latency, because a wave of a hundred parallel tasks finishes at roughly its own p99.
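The depth-versus-width point can be checked with a few lines of arithmetic. The sketch below assumes a plan expressed as a mapping from each task to the set of tasks it depends on, with a fixed dispatch overhead charged once per task; the function name and all costs are illustrative. It ignores tail latency, which is the other face of the problem.

```python
from graphlib import TopologicalSorter


def critical_path_ms(deps: dict[str, set[str]], cost_ms: dict[str, float],
                     dispatch_overhead_ms: float) -> float:
    """Longest path through the DAG, charging the overhead once per task on it."""
    finish: dict[str, float] = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + dispatch_overhead_ms + cost_ms[task]
    return max(finish.values(), default=0.0)


# A wide plan: 100 independent tasks of 2 ms each.
wide = {f"w{i}": set() for i in range(100)}
wide_ms = critical_path_ms(wide, {t: 2.0 for t in wide}, 0.5)

# A deep plan: 10 tasks of 2 ms each, chained end to end.
deep = {f"d{i}": ({f"d{i - 1}"} if i else set()) for i in range(10)}
deep_ms = critical_path_ms(deep, {t: 2.0 for t in deep}, 0.5)
```

With these made-up numbers the wide plan pays the overhead once and the deep plan pays it ten times, even though the wide plan contains ten times as many tasks.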
The literature here is unusually direct, because Ray hit this exact problem and wrote it down. Centralized schedulers in the Spark lineage take tens of milliseconds per placement decision, which is disqualifying inside a 500ms tick. Ray's answers map almost one to one onto this design: schedule locally first and escalate only under pressure, decouple dispatch from scheduling entirely, lease workers instead of placing tasks, and favor placements where the task's inputs already live, with a zero-copy shared-memory store making same-node inputs free.
Translated into this system, the moves are these. The master holds a leased working set of executors sized to its recent plan width, so on the tick path there is no scheduling event at all, only a message on an already-established gRPC stream; placement decisions happen off-tick, at lease churn.
The DAG condensation from the lifecycle post is the first-order latency tool, not an optimization: collapse the plan to minimize critical-path depth, and inline any task smaller than roughly ten times the dispatch overhead instead of shipping it.
The locality ladder runs in-master, then same-node over shared memory, then same-cluster, then cross-cluster as spillover only, and tasks carry references rather than state, because leased executors keep warm slices of tile state delta-updated across ticks. Cross-cluster dispatch inside an AZ is nearly free at the wire, well under a millisecond, so it rides flat L3/L4 pod networking directly, with no kube-proxy or service mesh hop allowed on the tick path; the latency in multi-cluster systems lives in the routing layers, not the distance.
And because tick time equals max over tasks, the critical path gets the Tail at Scale treatment: hedged requests fired after a percentile delay, safe because tasks are pure functions over tile state, with chronically slow executors rotated out at lease renewal.
One entry condition belongs in print before anyone builds this: parallel dispatch inside a tick preserves replay determinism only if every cross-task write is commutative or accumulator-buffered and merged in declared order, because leasing, stealing, and hedging reorder execution by design. The single-threaded loop running today gets determinism for free; the planner has to buy it back, and that purchase is a standing rule about how systems write state, not an implementation detail.
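The entry condition is easier to hold onto with a small example. The sketch below shows the accumulator-buffered half of the rule: each task's writes are buffered as they arrive and applied in the plan's declared order, never in completion order. The task names, keys, and the `merge_in_declared_order` helper are all hypothetical.

```python
import random


def merge_in_declared_order(state: dict, declared_order: list[str],
                            buffered: dict[str, list[tuple]]) -> dict:
    """Apply each task's buffered writes in plan order, not completion order."""
    merged = dict(state)
    for task_id in declared_order:
        for key, value in buffered[task_id]:
            merged[key] = value
    return merged


declared = ["move", "damage", "repair"]  # order fixed by the plan
writes = {
    "move": [("ship.pos", (3, 4))],
    "damage": [("ship.hp", 40)],
    "repair": [("ship.hp", 55)],  # conflicts with "damage" on the same key
}


def run_tick(completion_order: list[str]) -> dict:
    buffered = {}
    # Arrival order varies with leasing, stealing, and hedging.
    for task_id in completion_order:
        buffered[task_id] = writes[task_id]
    return merge_in_declared_order({"ship.hp": 100}, declared, buffered)


results = [run_tick(random.sample(declared, k=len(declared))) for _ in range(20)]
```

Every shuffled run lands on the same state, which is the property replay depends on. Applying writes as they arrive would let `ship.hp` end at either 40 or 55 depending on which executor answered last.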
Homing leases by arithmetic
Where a tile's lease lives is a decision the substrate can make unusually well, again because of the ceilings. At lease churn, placement weighs three inputs: the tile's current operating point against its known worst case, the live utilization of each candidate cluster, and adjacency, because a tile's boundary traffic wants its neighbors' leases nearby.
The ceilings also size the warm reserves. A cluster's ready-pool depth is derived from the printed worst cases of the hot tiles homed there, not from a utilization guess. Quiet tiles, which are most of the galaxy, are statistically multiplexed against the shared pool, and the multiplexing is safe oversubscription rather than hopeful oversubscription, for the same reason airline overbooking works when you know the seat count: every participant has a hard cap. A tile trending toward hot gets its lease grown ahead of its arrival at the ceiling, which sounds like prediction but is arithmetic: the current trajectory projected against a bound that already exists.
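As a rough illustration of that arithmetic, the sketch below projects a tile's load forward by a lead time and clamps the result at the tile's ceiling, then sizes a reserve from the gap between current leases and ceilings. The linear projection, the lead time, and both function names are assumptions made for the example; the post does not specify the trajectory model.

```python
import math


def target_lease(current_load: float, slope_per_tick: float,
                 ceiling: int, lead_ticks: int) -> int:
    """Executors to hold now so the lease is already in place lead_ticks from now."""
    projected = current_load + max(slope_per_tick, 0.0) * lead_ticks
    # The projection can never exceed the printed worst case.
    return min(math.ceil(projected), ceiling)


def reserve_depth(hot_tiles: list[tuple[int, int]]) -> int:
    """Warm slots needed if every hot tile climbed from its lease to its ceiling.

    Each entry is (current_lease, ceiling).
    """
    return sum(ceiling - lease for lease, ceiling in hot_tiles)


warming = target_lease(40, 2.0, 120, 20)    # 20 ticks is 10 s at 2 Hz
near_cap = target_lease(100, 2.0, 120, 20)  # projection overshoots, ceiling wins
depth = reserve_depth([(80, 120), (120, 120)])
```

The clamp is the whole point: without a ceiling the projection is a forecast that can be arbitrarily wrong, and with one its error is bounded.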
The honest risk in this scheme is correlation. Oversubscription math assumes tiles heat independently, and a galaxy-scale event heats a neighborhood together. That failure mode is named in the benchmark plan below rather than waved away.
Lambda is a bridge, not a tier
When a spike outruns the warm reserves, the pool does not make the game wait for nodes. It overflows tasks to Lambda while real capacity is provisioned behind it, so the function tier covers exactly the gap between a spike's arrival and a node's, and tasks come home when the pods land.
This is legal by construction. The lifecycle post defined the contract: a worker is fungible compute applied to a task, with no durable state and no name worth remembering. A Lambda invocation satisfies that contract as fully as a warm pod does, which is the swappability rule doing load-bearing work for the second time, after the etcd-to-Aurora swap.
Two constraints keep the bridge honest. First, Lambda has no shared-memory locality and meaningful cold-start variance, so only tasks the DAG can tag as offloadable qualify: those with state-light inputs that sit off the tick's critical path, such as NPC brains, background simulation, and spawn tasks. The critical path never leaves leased pods, where hedging protects it. Second, making Lambda useful mid-spike means keeping a small provisioned-concurrency buffer warm, and that is reservation creeping back into an allocation story. It is worth paying as insurance, but it should be named as the tradeoff it is rather than hidden inside the word serverless.
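The first constraint reduces to a predicate on the task plus one routing decision. The sketch below is one way it might look; the `Task` fields, the 64 KiB cutoff for "state-light", and the tier names are all hypothetical.

```python
from dataclasses import dataclass

MAX_OFFLOAD_INPUT_BYTES = 64 * 1024  # hypothetical cutoff for "state-light"


@dataclass(frozen=True)
class Task:
    name: str
    on_critical_path: bool
    input_bytes: int


def offloadable(task: Task) -> bool:
    """Off the critical path and light enough to ship without shared memory."""
    return not task.on_critical_path and task.input_bytes <= MAX_OFFLOAD_INPUT_BYTES


def route(task: Task, pool_has_headroom: bool) -> str:
    # Lambda is a bridge: use it only when the pool is short AND the task qualifies.
    if pool_has_headroom or not offloadable(task):
        return "leased_pod"
    return "lambda"


npc = Task("npc_brain", on_critical_path=False, input_bytes=4096)
combat = Task("combat_resolve", on_critical_path=True, input_bytes=4096)
```

Note the asymmetry in `route`: headroom sends everything to pods, while a shortage sends only the offloadable tasks away, so the critical path never has a Lambda branch to take.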
Three speeds underneath
Beneath the leases and the bridge, the pool itself is refilled by a pipeline with three speeds, and the pool manager's trigger is reserve depth, not CPU, replenishing before the pool empties rather than after.
The fast path claims a warm slot from the ready queue in under a second, modeled on Agones' atomic Ready-to-Allocated flip for game servers. The medium path is negative-priority placeholder pods that reserve real capacity on existing nodes and are instantly preempted when real work needs the space, which conveniently triggers the slow path on its own. The slow path is Karpenter provisioning a right-sized node in 45 to 60 seconds by calling the cloud API directly. The Lambda bridge from the previous section exists precisely to cover that last number. Stack the layers and the design goal of the whole pipeline falls out: gameplay never blocks on pod scheduling, at any speed of demand.
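A toy model of the three speeds, to show how the paths fall through to one another and why the refill trigger is reserve depth. Nothing here talks to Agones, Kubernetes, or Karpenter; the `Pool` class, its counters, and the thresholds are hypothetical stand-ins for those systems.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    ready: int            # warm slots in the ready queue
    placeholders: int     # negative-priority pods holding real capacity
    target_ready: int     # reserve depth the pool manager defends
    nodes_requested: int = 0

    @property
    def needs_refill(self) -> bool:
        # Trigger on reserve depth, so refill starts before the pool is empty.
        return self.ready < self.target_ready

    def claim(self) -> str:
        if self.ready > 0:
            self.ready -= 1
            return "fast"
        if self.placeholders > 0:
            self.placeholders -= 1
            # The evicted placeholder goes pending, which asks for a new node.
            self.nodes_requested += 1
            return "medium"
        self.nodes_requested += 1
        return "slow"  # the Lambda bridge covers the wait for this node


pool = Pool(ready=2, placeholders=1, target_ready=2)
refill_before = pool.needs_refill
first = pool.claim()
refill_after_first = pool.needs_refill
remaining = [pool.claim() for _ in range(3)]
```

The first claim already flips `needs_refill`, with a warm slot still in the queue, which is the difference between replenishing before the pool empties and after.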
Clusters are failure domains
Each Kubernetes cluster is treated as a single failure domain, sized and operated so that losing one whole cluster is a capacity event, not a data event, because nothing durable lived there. Warm reserves in each cluster are sized to absorb the failover of at least one peer, and multi-cluster allocation routes new leases and spawns toward healthy clusters while redirecting away from a failed one. A node loss, in these terms, is many simultaneous follower promotions; an AZ loss is many cold starts against checkpoints; neither introduces machinery the lifecycle post did not already describe, only multiplicity.
One honest asymmetry belongs here too: this section's machinery governs the plane that runs tiles, on Kubernetes. The gateway tier runs on ECS Fargate and the relays scale as their own fleet, so the substrate is really two compute planes with different failure modes, and the claims in this post are scoped to the one that ticks.
The precedent at scale is Uber's Federator, which manages fleets of fifty-plus clusters at 5,000 to 7,500 hosts each, launching over a million pods a day across them.
Placement inside the hierarchy follows the locality ladder in reverse: an executor working a tile's tasks wants the same node as that tile's master for the shared-memory tier, and a tile's lease wants the same AZ as its neighbors for the boundary streams. The failure math and the latency math turn out to want the same topology, which is the kind of coincidence worth designing for.
A scrubber, not a sheriff
The lifecycle post committed this layer to one auditing process per cluster and promised the mechanics here. The scrubber fleet divides its work by Redis slot range and Aurora shard, not by the tiles homed in each cluster, because leases migrate at churn and tile-keyed coverage would develop gaps and double-coverage as they wander. Auditing the storage planes instead of the tenants makes coverage total by construction, regardless of where any tile currently lives.
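Dividing by storage plane is simple enough to show. The sketch below splits the Redis keyspace into contiguous slot ranges, one per scrubber, assuming the Redis tier is a Redis Cluster with its standard 16384 hash slots; the function name and the choice of three scrubbers are illustrative, and the Aurora shard split is not shown.

```python
REDIS_CLUSTER_SLOTS = 16384  # fixed hash-slot count in Redis Cluster


def slot_ranges(n_scrubbers: int) -> list[tuple[int, int]]:
    """Split the slot space into n contiguous, inclusive (start, end) ranges."""
    base, extra = divmod(REDIS_CLUSTER_SLOTS, n_scrubbers)
    ranges, start = [], 0
    for i in range(n_scrubbers):
        # The first `extra` scrubbers take one more slot to absorb the remainder.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = slot_ranges(3)
```

Coverage is total by construction because the ranges partition a fixed space: every key hashes to exactly one slot and every slot belongs to exactly one scrubber, wherever the tile that owns the key is currently homed.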
What it checks is bookkeeping, never semantics: that the owner hash, the Aurora row, and gossip agree about each tile for longer than one convergence window, that no stream lacks a live owner, that no lease has leaked past its tile, that sampled entities' live owners agree with their registry record versions, and that sampled checkpoints actually deserialize, validated offline against DynamoDB so real corruption is caught without touching a live tile. What it does about findings is raise alarms and hand them to the convergence machinery that already exists. It is read-only by construction, which is what keeps a fleet of cluster-wide auditors from quietly becoming the centralized manager the lifecycle post spent its closing section refusing to build.
What is proven and what is not
One status line belongs above the obligations. The tile plane that runs today is a Kubernetes tick-loop service with distributed presence indexing behind tests; the leased per-task substrate this post specifies is the target, reachable by extracting the planner interface first. The IAM and Redis ACL boundaries and the Karpenter warm-pool behavior are likewise named external gates in the build plan rather than finished claims.
Beyond that, this layer assembles patterns with long production pedigrees (Ray's leasing model, Agones' allocation flip, Karpenter's provisioning, Federator's multi-cluster shape), but the assembly is the least exercised part of the architecture, and the new model adds obligations the ledger does not yet carry as rows: per-task dispatch overhead p99 at each tier of the locality ladder, the saturated tile's DAG critical path against the 200ms tick threshold, the effect of hedging on tick p99.9, oversubscription safety under correlated heating, offload latency by task class on the Lambda bridge, the spike-to-backfill handoff time from first overflow to last task coming home, and the cost of lease churn itself. Each needs a threshold and a pre-committed failure action before the massive-scale run, in keeping with the rule the rest of the series follows: the constraints were designed first, and the numbers decide what survives.
The point of being invisible
Nothing in this post is visible from inside a battle, and that is the success criterion. The lifecycle post earned the line that a worker is an invocation, not a server. The substrate's entire job is to keep that sentence true while ten thousand players arrive at once: leases instead of scheduling, arithmetic instead of forecasting, a bridge instead of a wait, and three speeds of refill underneath. The game asks for work to be done. Everything else is someone else's problem, by design, and this post was the someone else.