BLACK SKIES ARCHITECTURE SERIES · PART 2 OF 8 · 13 MIN

The Life of a Tile

This post is part of the Black Skies architecture series. The series hub, which argues that a 10,000-player battle can play like a 100-player skirmish if consistency is kept small and the seams are designed into the game, is the place to start. This post opens the black box that argument stood on: the tile itself, from birth to death.

The thesis post described a tile from the outside: one H3 resolution-11 hexagon, 2,401 tactical cells, one authority, a fixed 2 Hz tick, bounded in everything it accepts and emits. That exterior is deliberately boring. From the outside, a tile is one epoch and one stream. Inside, it is a small distributed system with a birth, a working life, several ways to survive dying, and a way to die on purpose.

One epoch outside, a planner inside

A tile is not a single process, and it is not really a cluster either. It is a master in front of a shared pool of task executors, and the relationship is closer to Spark's driver and executors than to a spatial sharding scheme. Each tick, the master compiles the work in front of it (queued intents, neighbor inputs, simulation steps) into a DAG of tasks, condenses what can be condensed the way Spark collapses narrow transformations into stages, and submits the result for execution. The master owns tick synchronicity, the ownership epoch, and the plan, and nothing else. It does not allocate executors, count them, or scale them. It asks for work to be done and collects the answers, and everything between submission and result is the pool's concern, not the tile's. The bounded ingress queue from the thesis post is purely backpressure and a tripwire; nothing scales off it, because at the tile level there is nothing left to scale. The pool itself, and what makes per-task compute cheap enough to lean on, gets its own post.

The DAG is not limited to simulation. Lifecycle work rides the same rails, so spawning a neighboring tile is just another task on the plan, executed by whichever worker picks it up. One scheduling model covers everything a tile does.

Inside a tile: the master compiles each tick into a DAG of tasks and submits it against a shared executor pool. Only the master has an identity; the workers are invocations.
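To make the planner shape concrete, here is a minimal sketch of one tick as a DAG submitted against a pool. Everything in it is an assumption made for illustration: the task names, the `run_tick` helper, and the use of a local thread pool as a stand-in for the shared executor pool. It also skips the stage-condensing step entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_tick(plan, pool):
    """Run one tick's plan. plan maps task name -> (function, dependency names)."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in plan.items()})
    sorter.prepare()
    results = {}
    while sorter.is_active():
        # Submit every task whose inputs are complete. The master never
        # learns which worker ran which task; it only collects answers.
        futures = {
            name: pool.submit(plan[name][0], *(results[d] for d in plan[name][1]))
            for name in sorter.get_ready()
        }
        for name, future in futures.items():
            results[name] = future.result()
            sorter.done(name)
    return results


# Hypothetical plan: simulation work and lifecycle work share one DAG.
plan = {
    "drain_intents": (lambda: ["move", "fire"], []),
    "neighbor_inputs": (lambda: ["edge_update"], []),
    "simulate": (
        lambda intents, edges: len(intents) + len(edges),
        ["drain_intents", "neighbor_inputs"],
    ),
    "spawn_neighbor": (lambda: "claimed", []),  # lifecycle task on the same rails
}

with ThreadPoolExecutor(max_workers=4) as pool:
    results = run_tick(plan, pool)
```

A quiet tile in this sketch is simply a smaller `plan` dictionary; nothing about `run_tick` changes with load.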

A useful way to hold this: from the tile's point of view, everything except coordination is serverless. The master is the only component with an identity worth keeping, because it holds the epoch and the tick. The grunt work is fungible compute, engaged per task and gone when the task returns, with no durable state and no name the tile needs to remember. The substrate underneath is warm-pooled Kubernetes pods rather than a cloud FaaS, but the consumption model is the one serverless actually sells: capacity decoupled from any fixed resource, allocation instead of reservation. A worker is an invocation, not a server.

That framing is one instance of a rule the whole architecture follows: every component is defined as a concept with a contract, and the implementation behind the contract can be swapped without the rest of the system noticing. "Worker" means fungible compute applied to a task, and warm-pooled pods are merely today's answer to it. The same rule is why epoch minting could move from etcd to Aurora when the test plan exposed etcd's throughput ceiling: the concept was "mint a fenced ownership epoch," nothing outside the authority layer consumed anything more specific than that, and so nothing outside the authority layer changed. Swappability here is not an aspiration in a design doc. It has already been exercised once, on the most correctness-critical component in the system.

Under this model, elasticity stops being a mechanism and becomes a side effect. A saturated tile submits a heavy plan, a quiet tile submits almost nothing, and the cost of each tracks the size of its plan with no mode switch, no upgrade path, and no scaling decision anyone has to make. An earlier design ran quiet tiles as a single process and promoted them to a full cluster under load, which meant detecting the threshold and surviving the transition. The task model deleted that machinery the same way the entity rule deleted the garbage collection condition list. The galaxy is mostly quiet tiles, and a quiet tile is a master with a nearly empty plan, which is nearly free. None of this is visible past the boundary. Workers come and go per task; the epoch and the stream do not.

How a tile is born

Every spawn starts the same way: something needs a tile that does not exist. A ship flies toward empty space. A mass warp targets an unspawned tile. A neighbor notices a hole where a tile used to be. In all three cases the needing party runs the same mechanism: a deterministic spiral search anchored at the location in question, which settles on the tile to create, followed by the Aurora conditional update from the ownership post to claim creation, followed by pulling a warm process from the ready pool so booting never waits on a scheduler.

The deterministic spiral is also the deduplication mechanism, and this is the part worth slowing down for. When a fleet of two hundred ships converges on the same coordinates, two hundred actors all need a tile to exist. Because the spiral is deterministic, all two hundred compute the same answer from the same anchor. Their creation attempts collide on one Aurora row, exactly one wins, and the other hundred and ninety-nine fail into a pointer at the winner. Convergence is not coordinated; it falls out of everyone running the same pure function over the same geometry.

Two hundred actors, one deterministic spiral, one Aurora row, one winner. The losers get a pointer, not an error.
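The dedup argument can be sketched in a few lines. This is an illustration under loud assumptions: axial hex coordinates stand in for H3 indexes, `spiral_target` assumes the spiral settles on the first candidate that is not blocked, and `OwnershipTable` is an in-memory stand-in for the Aurora conditional update, not its real schema.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# The six axial-coordinate hex directions, in ring-walk order.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def spiral(anchor, max_radius):
    """Yield the anchor, then each ring outward, always in the same order."""
    yield anchor
    for radius in range(1, max_radius + 1):
        q = anchor[0] + DIRECTIONS[4][0] * radius
        r = anchor[1] + DIRECTIONS[4][1] * radius
        for dq, dr in DIRECTIONS:
            for _ in range(radius):
                yield (q, r)
                q, r = q + dq, r + dr


def spiral_target(anchor, blocked, max_radius=3):
    """Pure function: same anchor and same blocked set give the same tile."""
    for cell in spiral(anchor, max_radius):
        if cell not in blocked:
            return cell
    raise LookupError("no spawnable tile within search radius")


class OwnershipTable:
    """Hypothetical stand-in for one conditional-update row per tile."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def claim(self, tile, actor):
        # First writer wins; everyone else reads back the winner.
        with self._lock:
            owner = self._rows.setdefault(tile, actor)
        return owner == actor, owner


table = OwnershipTable()
anchor, blocked = (10, -4), {(10, -4)}


def arrive(ship_id):
    return table.claim(spiral_target(anchor, blocked), ship_id)


with ThreadPoolExecutor(max_workers=16) as pool:
    outcomes = list(pool.map(arrive, range(200)))

winners = [owner for won, owner in outcomes if won]
pointers = {owner for won, owner in outcomes if not won}
```

Which ship wins depends on thread timing, but the shape of the result does not: one winner, and every loser holding a pointer at it.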

The genuinely adversarial cases, where gossip is stale or a partition hides the winner, are handled by the same split-brain machinery the ownership post covers, because a contested spawn is just a contested ownership claim with no incumbent.

Six streams that are more than plumbing

A live tile maintains bidirectional RPC streams to each of its six neighbors, carrying boundary state as coalesced Protobuf with low-watermark propagation, the same fan-in discipline Discord's Manifold uses to keep chatty neighbors from driving up costs. The geometry caps this channel absolutely: six neighbors, fixed shared edges, no exceptions.

Readers who know H3 will object on cue: an icosahedron-based grid cannot be all hexagons, and every H3 resolution carries exactly twelve pentagons, cells with five neighbors instead of six. The gameplay answer is simpler than the special case it replaces: those twelve tiles are gravitic anomalies, inaccessible to any entity, full stop. The cost is small at both scales. Twelve of the 237,279,209,162 resolution-11 cells on the sphere are lost, leaving 237,279,209,150 fully hexagonal tiles. Inside the forbidden twelve sit 24,012 resolution-15 cells (2,001 per pentagon, against 2,401 per hexagon), and removing them leaves 569,707,381,169,150 accessible cells. The benefit is that the six-neighbor invariant becomes unconditional for every tile a ship can ever occupy, which means the boundary transfer protocol, the spiral search, and these streams never carry a five-neighbor branch. It is the thesis in miniature one more time: a gravitic anomaly is a more interesting answer than an if-statement, and it deletes the if-statement.

Six neighbors, six bidirectional streams, no exceptions: the twelve five-neighbor pentagons every H3 grid carries are gravitic anomalies no ship can enter, so the invariant holds for every reachable tile.
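The cell counts above can be checked with a few lines of arithmetic. The sketch relies on two properties of H3: a resolution r holds 2 + 120 · 7^r cells, and a pentagon subdivides into one pentagon plus five hexagons while a hexagon subdivides into seven hexagons. The function names are illustrative, not part of any H3 library.

```python
def h3_cell_count(resolution):
    """Total H3 cells at a resolution, including the twelve pentagons."""
    return 2 + 120 * 7 ** resolution


def pentagon_descendants(levels):
    """Cells under one pentagon after descending `levels` resolutions.

    Each level keeps one pentagon and adds five hexagons, and every
    hexagon has seven children, giving 1 + 5 * (7**levels - 1) / 6.
    """
    return 1 + 5 * (7 ** levels - 1) // 6


TILE_RES, CELL_RES, PENTAGONS = 11, 15, 12
LEVELS = CELL_RES - TILE_RES

tiles_total = h3_cell_count(TILE_RES)
tiles_accessible = tiles_total - PENTAGONS
cells_per_hex_tile = 7 ** LEVELS
forbidden_cells = PENTAGONS * pentagon_descendants(LEVELS)
cells_accessible = h3_cell_count(CELL_RES) - forbidden_cells
```

The two routes to the accessible-cell figure agree: subtracting the forbidden cells from the resolution-15 total gives the same number as multiplying the accessible tiles by 2,401 cells each.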

But the streams earn their keep twice, because they double as liveness and coordination. A healthy stream is a heartbeat. A broken one is a signal that something on the other end needs attention, and that signal is load-bearing for both failover and garbage collection below. There is no separate health-check infrastructure for tiles watching tiles. The data path is the health check.
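A minimal sketch of "the data path is the health check", assuming a tile only needs to remember when each neighbor stream last delivered anything. The class name, the neighbor labels, and the 1.5-second silence limit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NeighborStreams:
    """Tracks the last time each neighbor stream carried any message."""

    silence_limit_s: float
    last_seen: dict = field(default_factory=dict)

    def on_message(self, neighbor, now):
        # Any boundary update doubles as a heartbeat; there is no ping type.
        self.last_seen[neighbor] = now

    def silent_neighbors(self, now):
        """Neighbors whose stream has been quiet longer than the limit."""
        return sorted(
            n for n, seen in self.last_seen.items()
            if now - seen > self.silence_limit_s
        )


streams = NeighborStreams(silence_limit_s=1.5)
for name in ("N", "NE", "SE", "S", "SW", "NW"):
    streams.on_message(name, now=0.0)
for name in ("N", "NE", "SE", "S", "SW"):
    streams.on_message(name, now=2.0)

# NW has said nothing since t=0, so it is the one that needs attention.
needs_attention = streams.silent_neighbors(now=2.5)
```

Silence here is only a signal to act on, for example by probing; it is not by itself a verdict that the neighbor is gone.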

Three failures, three answers

Failures inside a tile resolve in escalating tiers, each more expensive and rarer than the last.

Three failure tiers, three answers, escalating cost. Nothing above a lost task needs external help until the whole tile is gone, and even then the consumers rebuild it.

A task fails to return. The master resubmits it, the same way Spark retries a task from a lost executor. Whether the executor died, stalled, or vanished is the pool's problem; the tile never knew its name to begin with. No epoch change, no external visibility. This is the cheapest failure in the system.

The master dies. A warm follower, a standby master that has been tailing the committed stream all along, promotes itself by minting the next epoch through the Aurora conditional update. At the 2 Hz tick, the warm path costs one to two ticks (500 to 1,000 ms) of delayed promotion, not lost state.
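The promotion step is a compare-and-set on the epoch, which can be sketched with any database that supports a conditional `UPDATE`. The example below uses in-memory SQLite purely as a stand-in for Aurora; the `ownership` table, its columns, and the `promote` helper are assumptions for illustration, not the real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ownership (tile TEXT PRIMARY KEY, epoch INTEGER, owner TEXT)"
)
db.execute("INSERT INTO ownership VALUES ('tile-a', 7, 'master-1')")
db.commit()


def promote(db, tile, seen_epoch, candidate):
    """Mint epoch + 1 only if the row still holds the epoch we last saw."""
    cursor = db.execute(
        "UPDATE ownership SET epoch = epoch + 1, owner = ? "
        "WHERE tile = ? AND epoch = ?",
        (candidate, tile, seen_epoch),
    )
    db.commit()
    # Zero rows changed means someone else already minted the next epoch.
    return cursor.rowcount == 1


# Two followers both saw epoch 7 and both try to promote.
first = promote(db, "tile-a", 7, "follower-1")
second = promote(db, "tile-a", 7, "follower-2")
row = db.execute(
    "SELECT epoch, owner FROM ownership WHERE tile = 'tile-a'"
).fetchone()
```

The second attempt fails without any coordination between the followers, because its precondition (epoch 7) no longer holds once the first write lands.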

Everything dies. With no valid follower, recovery falls to whoever needed the tile, and someone always does if the tile matters. Every consumer of a tile already talks to it constantly: players hold live connections and send actions, NPC brains run as continuous long-lived scheduled jobs issuing intents, neighbors push boundary updates over their streams. When normal traffic stops getting answers, the requester escalates to an empty request, a probe carrying nothing but the question of whether anyone is home. An unanswered probe triggers a respawn through the exact same spiral-search birth path used for any new tile, booting from the latest DynamoDB checkpoint plus the surviving stream replay window. The same probe path re-homes individual ships after a recovery: the tile reads the entity's location record in DynamoDB and places it as close to the recorded cell as the clearance search allows.
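The consumer side of that escalation is small enough to sketch. `FakeTile` and `send_with_escalation` are hypothetical names, and the respawn here is a single method call standing in for the full spiral-search birth path and its ownership claim.

```python
class FakeTile:
    """Stand-in for a tile endpoint; starts out dead."""

    def __init__(self):
        self.alive = False
        self.respawns = 0

    def handle(self, payload):
        # A dead tile answers nothing, which callers observe as None.
        return f"ack:{payload}" if self.alive else None

    def respawn(self):
        # In the real design this is the same birth path as any new tile.
        self.alive = True
        self.respawns += 1


def send_with_escalation(tile, payload):
    """Normal traffic first, then an empty probe, then a respawn."""
    reply = tile.handle(payload)
    if reply is not None:
        return reply
    if tile.handle("") is None:  # empty probe: is anyone home?
        tile.respawn()
    return tile.handle(payload)


tile = FakeTile()
first_reply = send_with_escalation(tile, "move")
second_reply = send_with_escalation(tile, "fire")
```

Only the first request pays for the respawn; once the tile answers, later traffic never reaches the probe.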

There is no health-check infrastructure watching tiles, because there does not need to be. The trigger set is the GC condition inverted: a tile holding a non-ephemeral entity has, by definition, either an NPC job or a connected player actively poking it, and the neighbor streams cover the rest. A tile that nothing is poking has no one to miss it, and a tile in that state would have been garbage collecting anyway. Detection and deletion agree about what aliveness means, because both are computed from the same fact.

How a tile dies

Garbage collection is self-driven, and its core condition collapsed into one sentence as the design matured: no non-ephemeral entities exist on the tile. Asteroids and environmental effects are ephemeral; they keep nothing alive. Player ships, NPC ships, stations, and deployed structures are non-ephemeral; any one of them holds the tile open. Permanent landmarks prevent collection automatically, not by a special flag, but because a station is non-ephemeral by definition. An earlier design enumerated five separate conditions with a permanent-tile exemption list; the entity classification absorbed most of them.

Beyond the entity check, the audit is a conjunction, and every clause is consumer-derived: no live viewport references, all inbound neighbor streams closed for longer than a grace period, no transfers or reservations pending, and the tile's epoch still matching the current ownership record. The viewport clause looks like a manager and is the opposite: watchers hold references with TTLs, so a tile of empty space that sits inside someone's viewport stays alive to serve frames, while a tile holding a station stays alive with no watchers at all. What is there and who is watching are different questions, and collection requires both answers to be nobody. The grace period matters because of the principle above: an RPC disconnect is a liveness signal, not proof that it is safe to delete.
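Written out, the audit is a single boolean expression. The sketch below assumes the tile can summarize its state into a handful of counters; the field names and the 10-second grace period are illustrative, not taken from the implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AuditInput:
    non_ephemeral_entities: int  # ships, stations, deployed structures
    live_viewport_refs: int      # unexpired watcher references
    inbound_streams_closed_for_s: Optional[float]  # None = a stream is open
    pending_transfers_or_reservations: int
    tile_epoch: int
    record_epoch: int


def can_collect(audit, grace_s):
    """True only when every clause of the conjunction holds."""
    return (
        audit.non_ephemeral_entities == 0
        and audit.live_viewport_refs == 0
        and audit.inbound_streams_closed_for_s is not None
        and audit.inbound_streams_closed_for_s > grace_s
        and audit.pending_transfers_or_reservations == 0
        and audit.tile_epoch == audit.record_epoch
    )


empty = AuditInput(0, 0, 30.0, 0, 4, 4)
verdicts = {
    "empty and unwatched": can_collect(empty, grace_s=10),
    "holds a station": can_collect(
        replace(empty, non_ephemeral_entities=1), grace_s=10
    ),
    "empty but watched": can_collect(
        replace(empty, live_viewport_refs=1), grace_s=10
    ),
    "inside grace period": can_collect(
        replace(empty, inbound_streams_closed_for_s=3.0), grace_s=10
    ),
    "stale epoch": can_collect(replace(empty, record_epoch=5), grace_s=10),
}
```

Only the first case collects. A station with no watchers and empty space with a watcher both survive, which is the "different questions" point in code form.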

Then the tile simply stands down. An earlier design ended GC with a CAS tombstone to mark the death; the current design writes nothing, because every mechanism that needs to learn about the death already has a way to learn it. The owner hash TTL lapses, gossip stops vouching, the ownership row ages out. Death, like birth, requires no announcement.

What checks for corruption

Death by emptiness is handled. The sharper question is a tile that is alive and wrong: still ticking, still answering probes, producing garbage. The tempting answer is a watcher, one pod per cluster that inspects every tile's state and acts on what it finds. That answer fails this post's own test three ways. A component that enumerates tiles, holds a view of them, and carries authority over them is a Tile Manager in a trench coat. Judging corruption requires understanding game semantics, which makes the watcher a second implementation of the rules engine, guaranteed to drift from the first. And anything with the power to condemn a tile is a new source of split brain, because a checker that wrongly kills a healthy tile is an adversarial owner with a badge.

So the design splits the concern along its natural seam. Semantic wrongness is caught by consumers, who are already positioned to notice: relays detect sequence gaps, neighbors validate boundary state at the seams, and a tile emitting nonsense escalates through the same probe-and-respawn path as a tile emitting silence, because wrongness and silence are the same event to a consumer that cannot use the output. Bookkeeping drift is caught by a per-cluster scrubber in the tradition of S3's background auditors and ZFS scrubs: a read-only process that continuously verifies that the planes agree. It checks the owner hash against the Aurora row against gossip, flagging disagreement that outlasts a convergence window; streams against live owners; checkpoints against deserializability; and sampled entities' live owners against their registry records, so even the rarest duplicate has a detector with an alarm instead of waiting for a reload. It emits alarms and feeds the convergence machinery that already exists. It has no authority to kill anything, and that is the design rather than a limitation: a scrubber that can only raise its hand cannot become the thing this post exists to avoid. How it is sharded and run belongs to the substrate post.
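One of the scrubber's checks, the three-plane ownership comparison, might look like the sketch below. It assumes the scrubber is handed the three answers as plain values and a clock reading; the `Scrubber` class, its method names, and the 5-second window are hypothetical. The point it illustrates is that the only output is an alarm.

```python
class Scrubber:
    """Read-only comparison of ownership planes; it can only raise alarms."""

    def __init__(self, convergence_window_s):
        self.window = convergence_window_s
        self.disagreeing_since = {}  # tile -> time disagreement was first seen

    def observe(self, tile, owner_hash, aurora_row, gossip, now):
        if owner_hash == aurora_row == gossip:
            # Planes agree again; forget any disagreement in progress.
            self.disagreeing_since.pop(tile, None)
            return None
        started = self.disagreeing_since.setdefault(tile, now)
        if now - started > self.window:
            return (
                f"ALARM {tile}: hash={owner_hash} "
                f"row={aurora_row} gossip={gossip}"
            )
        # Still inside the window: gossip may simply not have caught up.
        return None


scrubber = Scrubber(convergence_window_s=5.0)
at_0 = scrubber.observe("tile-a", "n1", "n1", "n2", now=0.0)
at_3 = scrubber.observe("tile-a", "n1", "n1", "n2", now=3.0)
at_6 = scrubber.observe("tile-a", "n1", "n1", "n2", now=6.0)
healed = scrubber.observe("tile-a", "n1", "n1", "n1", now=7.0)
```

Note what is absent: there is no method that writes to any plane or stops a tile. Whatever consumes the alarm decides what happens next.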

The pattern has one named special case: the frozen tile, alive enough to renew its lease, too wedged to tick. It passes the question of existence and fails the question of progress, and consumers catch it the way they catch everything, because to a relay a stalled sequence is silence with a pulse. A watchdog can be pointed at exactly this state, and in this design it may only raise its hand: detection feeds the same probe-and-respawn path, and nothing with a watch list ever carries a kill switch.

How a tile is found

Routing deserves one paragraph here and a full post elsewhere. The lookup path is SWIM gossip first, an in-memory answer that is usually right; the Redis owner hash second, fresh within one tick; and a redirect from whoever answered last, carrying the true owner's address. The ownership post covers why that path never touches the database and why stale answers are harmless rather than prevented.
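As a rough sketch of that lookup order, assume gossip and the owner hash are both simple mappings from tile to address, and that any node contacted either confirms ownership or redirects. The `resolve` and `contact` functions and the redirect cap are illustrative assumptions.

```python
def resolve(tile, gossip, owner_hash, contact, max_redirects=2):
    """Gossip first, owner hash second, then follow redirects to the owner."""
    address = gossip.get(tile) or owner_hash.get(tile)
    hops = 0
    while address is not None and hops <= max_redirects:
        status, value = contact(address, tile)
        if status == "owner":
            return value
        # A stale answer costs one extra hop rather than a wrong result.
        address, hops = value, hops + 1
    return None


# Hypothetical cluster state: gossip is stale, the owner hash is fresh.
true_owners = {"tile-a": "node-new"}


def contact(address, tile):
    owner = true_owners[tile]
    if address == owner:
        return ("owner", address)
    return ("redirect", owner)


via_stale_gossip = resolve(
    "tile-a", {"tile-a": "node-old"}, {"tile-a": "node-new"}, contact
)
via_owner_hash = resolve("tile-a", {}, {"tile-a": "node-new"}, contact)
unknown = resolve("tile-z", {}, {}, contact)
```

The stale gossip entry still reaches the right node, which is the sense in which stale answers are harmless rather than prevented.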

No one is in charge, on purpose

Walk back through the lifecycle and notice who initiated each event. Birth: whoever needed the tile. Capacity: no one at all, because per-task execution made scaling a side effect instead of a decision. Failover: the tile's own follower, or failing that, whoever was talking to it, a connected player, an NPC job, a neighbor stream. Death: the tile's own audit of its contents. There is no centralized Tile Manager. Tiles manage their own lifecycle.

That absence is a design decision with teeth. A spawner service, a reaper cron, or a lifecycle manager would be a component that has to scale with the galaxy, fail independently of it, and hold a globally consistent view of millions of tiles, which is precisely the kind of component the thesis post promised to avoid. Instead, every lifecycle event is initiated by the party with firsthand knowledge that it is needed, and the authority check is the same per-tile conditional update everything else uses. The galaxy has no shepherd. It has tiles that know how to be born, how to notice their neighbors dying, and how to leave without a note.

What is proven and what is not

The lifecycle splits cleanly into what runs and what is specified. Running with tests: the claim-and-promotion path that this post's births and failovers ride, and the presence model (source-aware viewport references with TTLs, plus the active-tile index), which is the implemented half of the GC conjunction. Specified but not yet code: the probe-triggered respawn path, the non-ephemeral entity clause of the audit, the scrubber, and the watchdog's detector-only framing, all planned to the slice level. What this post adds to the unmeasured list is specific to detection and judgment: how long consumer-probe respawn takes from first unanswered probe to first tick, and whether the GC conditions ever collect a tile that something still needed, which must be zero for the same reason a stale write accept is zero. Both belong to the chaos run, and the chaos and mass-warp rows in the series ledger cover contested spawns and spawn bursts.

What comes next

A tile that lives also ticks. What one of those ticks actually does, the admission ledger upstream of it, the single write that commits it, and the machinery that delivers it without coupling are the subject of the commit path post.