BLACK SKIES ARCHITECTURE SERIES · PART 4 OF 8 · 11 MIN

Split-Brain Resolution Without a Global Coordinator

This post is part of the Black Skies architecture series; the series hub is the thesis post. The commit path post covered how a tile's committed state gets delivered and made durable, and it leaned on one assumption: at any instant there is a clean answer to "who owns tile X." This post is about the moments when that assumption breaks, and why the system is built so that breaking it is boring.

Split-brain is a workload, not an anomaly

In a galaxy of millions of tiles, ownership conflicts are not rare failures to be engineered around the edges. They are routine traffic. Two neighbors discover the same unowned tile in the same tick and both try to spawn it. A network partition isolates a healthy primary mid-battle, and a follower promotes itself while the old primary keeps ticking on the far side. SWIM gossip routes an action to a processor that lost ownership half a second ago.

The classic answer is a global coordination service, and the classic implementation is etcd. That was the original design here, and the thesis post already gave away the ending: etcd is a CP system with global scope. One Raft leader serializes every mint, for every tile in the galaxy, and writing the test plan surfaced that ceiling under exactly the burst this game produces. A wave of tile spawns, or 10,000 players accepting missions in the same window, arrives at the leader as a wave of serialized writes. The design requirement penciled against burst scenarios runs to a million epoch mints per second. No single Raft group gets there.

The replacement is not a cleverer coordinator. It is the absence of one.

The conditional update is the coordinator

Tile ownership lives in Aurora PostgreSQL Limitless, sharded directly by tile H3 ID. Every ownership transition, whether a promotion or a re-spawn of a released tile, is one conditional update against the tile's row:

UPDATE tiles
SET epoch = epoch + 1, owner = @owner
WHERE h3_id = @tileId AND epoch = @expectedEpoch
RETURNING epoch

A tile no one has ever owned may have no row yet, so the very first claim is the same bet in upsert form: INSERT ... ON CONFLICT (h3_id) DO UPDATE ... WHERE tiles.epoch = 0. If the row is missing, the statement inserts it at epoch 1. If a never-owned row already exists, the statement claims it through the same conditional check. Either way, the database resolves the race to exactly one winner.

Zero rows affected or returned means someone else won. One row means you own the tile, and the returned epoch is your fencing token. There is no coordinator in the middle, no leader election, no lock service. The conditional update is the coordinator. Two processors racing to spawn the same tile both run the same statement, and the database's own concurrency control yields exactly one winner. The loser learns it lost in the same round trip it would have used to win.
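To make the race concrete, here is a minimal sketch of the conditional update in Python. It uses an in-memory SQLite table as a stand-in for Aurora, so it illustrates the compare-and-set shape only, not the sharding or the production schema. The function name try_claim, the tile and processor names, and the pre-seeded epoch-0 row are all assumptions made for the example.

```python
import sqlite3


def try_claim(conn, tile_id, owner, expected_epoch):
    """Attempt one ownership transition.

    Returns the new epoch (the fencing token) on a win, or None on a loss.
    """
    cur = conn.execute(
        "UPDATE tiles SET epoch = epoch + 1, owner = ? "
        "WHERE h3_id = ? AND epoch = ?",
        (owner, tile_id, expected_epoch),
    )
    conn.commit()
    if cur.rowcount == 0:
        # Zero rows affected: the epoch moved, so someone else won.
        return None
    return expected_epoch + 1


# SQLite stands in for Aurora here; the row starts never-owned at epoch 0.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiles (h3_id TEXT PRIMARY KEY, epoch INTEGER NOT NULL, owner TEXT)"
)
conn.execute("INSERT INTO tiles (h3_id, epoch, owner) VALUES ('tile-a', 0, NULL)")
conn.commit()

# Two hypothetical processors place the same bet against epoch 0.
first = try_claim(conn, "tile-a", "proc-1", 0)
second = try_claim(conn, "tile-a", "proc-2", 0)
final_owner = conn.execute(
    "SELECT owner FROM tiles WHERE h3_id = 'tile-a'"
).fetchone()[0]
```

Both callers run the identical statement. The second one matches no row because the epoch has already advanced, and that empty result is the whole notification.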

Two processors race the same conditional update. The database yields exactly one winner, and the loser learns it lost in the same round trip.

Sharding by tile ID is the same move the whole architecture makes, applied to its own authority layer. Each mint is still strongly consistent, but the consistency domain is one tile's row, not a galaxy-wide log. Epochs are likewise per-tile sequences, not a global counter, so mint capacity scales with the shard count instead of through one number everyone increments. CP, kept small. The database world runs the same play under different names: CockroachDB fences every range behind epoch-based leases and moves leadership the way tiles move owners, which is comforting precedent for the one part of this design that is supposed to be boring.

A companion detail makes the downstream arithmetic work: the event sequence is tile-lifetime monotonic, stored beside the epoch in the owner hash and never reset by a promotion. Epochs change; the sequence only climbs. That is what lets any consumer detect a gap with one comparison, and it is why a relay can repair across a failover without ever knowing a failover happened.
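A small sketch of what that comparison might look like on the consumer side. The GapDetector class and its return labels are hypothetical. The point it illustrates is that the check reads only the sequence, so an epoch change between two events is invisible to it.

```python
from dataclasses import dataclass


@dataclass
class GapDetector:
    """Tracks the highest contiguous tile-lifetime sequence a consumer has seen."""

    last_seq: int = 0

    def observe(self, epoch: int, seq: int) -> str:
        # The epoch is carried on the event but deliberately ignored here:
        # the sequence never resets on promotion, so it alone reveals a gap.
        if seq <= self.last_seq:
            return "duplicate"
        if seq == self.last_seq + 1:
            self.last_seq = seq
            return "ok"
        return "gap"


detector = GapDetector(last_seq=40)
outcomes = [
    detector.observe(epoch=7, seq=41),  # last event from the old owner
    detector.observe(epoch=8, seq=42),  # first event after a failover
    detector.observe(epoch=8, seq=44),  # 43 is missing, so repair is needed
]
```

The failover between sequence 41 and 42 produces no special case. Only the missing 43 does.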

Why not just keep ownership in Redis, which is already on the per-tick path? Because Redis does not mint epochs. Aurora does. Redis replication is asynchronous, so during a Redis failover a processor could complete an ownership write against an instance that is about to lose that write. An authority layer that can silently forget who it crowned is not an authority layer. Redis holds a fast copy of the answer. It never holds the pen.

Epochs make staleness harmless

The epoch from that RETURNING clause is a fencing token in the CockroachDB range-lease tradition: a monotonically increasing number that every committed write carries, so any component can compare two claims of ownership and pick the survivor without asking anyone.

Fencing only works if it is enforced everywhere a stale owner could do damage, so the enforcement has three parts:

Aurora refuses the acquisition. A stale processor trying to retake or extend ownership runs the conditional update with an old epoch and affects zero rows. It cannot become the owner again by accident.

The Redis function refuses the commit, and installs the replacement. Every per-tick commit from the commit path post presents its epoch against the {tile:X}:owner hash. A higher epoch installs itself, address and all, in the same atomic call as its first append, so a new owner's takeover and first write are indivisible. A lower epoch has its entire batch rejected before a single event touches the stream.

The rejection carries the handoff. The failure reply includes the current epoch and the SWIM contact of the replacement owner. The superseded processor redirects every client it is holding to the new owner, corrects its own gossip state, and disconnects, and that disconnection triggers garbage collection of the obsolete processor. The loser does not just stop doing damage. It repairs the routing around itself on the way out.
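The second and third parts can be modeled in a few lines. This is an in-process Python sketch of the rule the Redis function is described as enforcing, not the function itself: the real call is atomic inside Redis, and the names TileState, CommitResult, and commit are assumptions. What happens when two different addresses present the same epoch is not covered here, because the conditional update is what prevents that case.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TileState:
    """Stand-in for the {tile:X}:owner hash plus the tile's stream."""

    epoch: int = 0
    address: str = ""
    stream: List[str] = field(default_factory=list)


@dataclass
class CommitResult:
    accepted: bool
    current_epoch: int
    owner_address: str  # on rejection, where the replacement owner lives


def commit(tile: TileState, epoch: int, address: str, events: List[str]) -> CommitResult:
    if epoch < tile.epoch:
        # Stale owner: reject the whole batch before anything is appended,
        # and hand back the epoch and contact that superseded it.
        return CommitResult(False, tile.epoch, tile.address)
    if epoch > tile.epoch:
        # New owner: install epoch and address together with the first append.
        tile.epoch, tile.address = epoch, address
    tile.stream.extend(events)
    return CommitResult(True, tile.epoch, tile.address)


tile = TileState(epoch=7, address="old-owner:7000")
takeover = commit(tile, 8, "new-owner:7000", ["e1"])
stale = commit(tile, 7, "old-owner:7000", ["e2", "e3"])
```

The stale result is all the superseded processor needs to redirect its clients: the epoch that beat it and the address to send them to.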

One consequence is worth underlining because it shapes the entire layer described in the commit path post: nothing downstream of the stream ever arbitrates ownership. The commit function guarantees nothing enters a stream that the owner hash did not vouch for at the moment of the write, so the bridge and the relays never consult Aurora and never decide who owns anything. Each keeps a single current-epoch comparison as a last fence, measured by its stale-drop counter. Split brain is resolved at the door, and the rooms behind it stay simple because their entire share of the problem is one integer.
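That one integer might look like the sketch below. The EpochFence class is hypothetical, and it assumes a downstream component learns the current epoch from the highest epoch it has seen on the stream. The post does not say how the bridge and relays actually track it.

```python
class EpochFence:
    """Last-fence check for a downstream consumer: one integer, one counter."""

    def __init__(self) -> None:
        self.current_epoch = 0
        self.stale_drops = 0  # exported as the stale-drop metric

    def admit(self, event_epoch: int) -> bool:
        if event_epoch < self.current_epoch:
            # An event from a superseded owner: drop it and count the drop.
            self.stale_drops += 1
            return False
        # Equal or higher epoch: accept, and remember the highest one seen.
        self.current_epoch = event_epoch
        return True


fence = EpochFence()
admitted = [fence.admit(e) for e in [7, 7, 8, 7, 8]]
```

The fourth event arrives at epoch 7 after epoch 8 has been seen, so it is the only one dropped.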

The read path never touches the database

Minting is rare. Routing is constant. Every player action needs an answer to "which processor owns this tile right now," and putting Aurora on that path would make the authority layer a per-action dependency, which is the etcd mistake wearing a different logo.

So the routing read path has three layers, each faster than a database query, and Aurora is on none of them:

The routing ladder: gossip answers instantly and is usually right, the owner hash is fresh within one tick, and a redirect costs one bounded round trip. The authority database is on none of these paths.

SWIM first. Cluster membership metadata, gossiped SWIM-style, carries tile ownership, so the default lookup is an in-memory read against local state. Instant, and usually right.

Redis second. Gossip is eventually consistent, and at this scale it will sometimes be wrong. That is a known property of the protocol, not a bug, so the architecture's job is to make the staleness harmless. The {tile:X}:owner hash is refreshed by the Redis function every tick, which means it reflects the current owner within 500ms. One HGET resolves a stale SWIM answer.

Redirect last. If both are stale, the receiving processor rejects the request and returns the current owner's address, which it has from the same owner hash. The misrouted client or stale primary repoints immediately. One wasted round trip, bounded. And the redirect carries the in-flight action's fate with it: the gateway resubmits the action once, in-band, against the named owner, so it lands a round trip late instead of dying. A routing miss is the system's miss. Fizzle is reserved for the fiction, a dodge, a blockade, interference, never for plumbing.
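The ladder can be sketched as a toy gateway in Python. Everything here is illustrative: the Processor class, the submit function, and the dictionaries standing in for gossip state and the owner hash are assumptions. In particular, falling back to the owner hash only when gossip has no entry is a guess at when the second rung is consulted.

```python
from typing import Dict, Set, Tuple


class Processor:
    def __init__(self, address: str, owned: Set[str], owner_hash: Dict[str, str]):
        self.address = address
        self.owned = owned
        self.owner_hash = owner_hash  # stand-in for the {tile:X}:owner hash

    def handle(self, tile_id: str, action: str) -> Tuple[str, str]:
        if tile_id in self.owned:
            return ("accepted", self.address)
        # Not the owner: reject and name the current owner from the owner hash.
        return ("redirect", self.owner_hash[tile_id])


def submit(tile_id, action, swim_view, owner_hash, processors):
    """Route one action; returns (status, address, round_trips)."""
    # Rung 1: local gossip state. Rung 2 (assumed trigger): the owner hash.
    address = swim_view.get(tile_id) or owner_hash[tile_id]
    status, detail = processors[address].handle(tile_id, action)
    trips = 1
    if status == "redirect":
        # Rung 3: resubmit once, in-band, against the named owner.
        status, detail = processors[detail].handle(tile_id, action)
        trips += 1
    return status, detail, trips


owner_hash = {"tile-a": "new-owner:7000"}
processors = {
    "old-owner:7000": Processor("old-owner:7000", set(), owner_hash),
    "new-owner:7000": Processor("new-owner:7000", {"tile-a"}, owner_hash),
}
stale_gossip = {"tile-a": "old-owner:7000"}
result = submit("tile-a", "fire", stale_gossip, owner_hash, processors)
```

With stale gossip the action still lands, one round trip late, and no call in the path touches the authority database.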

The winner of an Aurora CAS never has to announce itself through a notification system, because the two channels that routing already reads are updated as a side effect of simply being the owner: SWIM gossips the new membership metadata, and the first tick commit refreshes the owner hash. An earlier design had a dedicated owner cache fed by etcd watches. It was deleted, not migrated. SWIM is the cache.

One rule keeps this ladder honest as the cluster grows: warm followers tail their tile's stream to be ready for promotion, and for nothing else. They never answer a read. An asynchronously replicated copy that answers authoritative questions is not a cache, it is an exploit generator, and every read class in this system already has a better home: frames come from the fan-out plane, locations from the registry, truth from the owner in process. The follower's only job is to become the owner. Until that moment it is silent.

Dying without tombstones

Garbage collection is the quiet half of ownership, and an earlier design over-engineered it with tombstone records to mark dead tiles. The current design has no tombstones, because every mechanism that needs to learn about a dead tile already has a way to learn it. A superseded owner does not even wait: the redirect flow ends with it disconnecting, and the disconnect triggers its collection on the spot. Convergence handles the tiles that simply go quiet. The tile stops writing. The owner hash's TTL, roughly 30 seconds and refreshed every tick while alive, expires on about the same clock SWIM takes to notice the silence. The Aurora row ages out under retention. Nothing has to be told the tile died. Everything converges on it.

The TTL alignment is deliberate: there is no useful state where Redis still vouches for an owner that gossip has already buried, or the reverse, for longer than one convergence window.
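As a rough illustration of the clock involved, here is the refresh-or-expire behavior in Python. The 30-second TTL and the 2 Hz tick come from this series. The OwnerHashEntry class is a hypothetical stand-in for a Redis key with an expiry, with time passed in explicitly so the example is deterministic.

```python
TICK_SECONDS = 0.5         # 2 Hz tick
OWNER_TTL_SECONDS = 30.0   # "roughly 30 seconds" in the text


class OwnerHashEntry:
    """Stand-in for a Redis key whose TTL is reset on every tick commit."""

    def __init__(self, now: float) -> None:
        self.expires_at = now + OWNER_TTL_SECONDS

    def refresh(self, now: float) -> None:
        self.expires_at = now + OWNER_TTL_SECONDS

    def is_live(self, now: float) -> bool:
        return now < self.expires_at


entry = OwnerHashEntry(now=0.0)
# The tile ticks for ten seconds, then goes quiet. Its last tick is at t=10.0.
for i in range(21):
    entry.refresh(now=i * TICK_SECONDS)

live_before = entry.is_live(now=39.5)   # still vouched for
live_after = entry.is_live(now=40.0)    # TTL elapsed, nobody had to be told
missed_ticks = int(OWNER_TTL_SECONDS / TICK_SECONDS)
```

Under these numbers a tile has to miss 60 consecutive ticks before the owner hash forgets it.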

What a player actually sees

The headline scenario on a clock: partition, promotion at epoch n+1, the old owner's rejection carrying the handoff, redirect, and self-collection. The battle pauses about one tick.

Walk the headline scenario through the machinery. A partition isolates a primary mid-battle. Its follower detects the silence, wins epoch n+1 through the conditional update, replays the stream tail, and makes its first commit, which atomically installs n+1 and its own address in the owner hash. The old primary, healthy and oblivious on its side of the partition, finishes a tick and presents a commit at epoch n. The function rejects the batch, and the failure reply tells it everything it needs: the epoch that beat it and where the new owner lives. It redirects its connected clients to the new owner, corrects its gossip state, and disconnects, which triggers its own garbage collection. Nothing it simulated after the takeover survives, because none of it committed past the install.

One edge needs naming. In the gap between the Aurora win and the install, the old primary's last tick or two can still commit at epoch n, because the owner hash has not yet heard about n+1. Those deltas do not get to redefine reality, but nothing has to repair them either, because the remediation is replay. The writes were committed through the function while the owner hash still vouched for their author, so they live in the tile-lifetime stream like everything else; the new owner's boot replay incorporates them into its authoritative state, and the bridge's current-epoch filter stops them from animating live. A few hundred milliseconds of a superseded owner's work reaches players as state instead of frames. A watching client may see the world look slightly off for a fraction of a second until the new epoch's first frames land, which at 2 Hz is on the order of one tick, the same magnitude as ordinary network jitter. The blip is the entire cost, and nothing was broken in the making of it.

From inside the battle, the rest of the cost is a pause bounded by failure detection plus one promotion CAS plus one redirect, after which everyone is replaying against the same committed stream from the commit path post. The game rules absorb the rest: at 2 Hz with cooldown-gated card plays, a sub-second ownership transfer fits inside the rhythm players already experience as normal. That is the thesis again. The seam exists. The design keeps it somewhere the game never looks.

What is proven and what is not

The CAS mechanics are exercised by a test suite now, not by scale, and the difference is the point of this section. Claim semantics, promotion with tile-lifetime sequence continuity, stale-write rejection at the function, and stale-epoch drops downstream all have source and contract coverage, including SQL-contract tests for the Aurora conditional update, with a compose stack behind the integration set.

Two gates stand between that and trusting it in production, both named in the build plan: the same paths against a live Redis Cluster across masters, and live Aurora Limitless with EXPLAIN proof that the ownership DML stays single-shard.

What remains pure benchmark territory, and what the plan obligates before the 10,000-player claim graduates from argument to report, is four measurements: sustained epoch mint throughput on sharded Aurora against the burst requirement, split-brain recovery time from partition to client redirect at p99, routing staleness rates under heavy membership churn, and whether superseded-tail suppression plus replay converge every watching client after a contested takeover. Each row has a failure threshold and a fallback. If Aurora minting falls short under burst, spawn admission gets rate-shaped upstream before the authority model itself is touched.

What comes next

Ownership answers who runs a tile. It says nothing about the entities that refuse to stay inside one. The motion post covers the five-phase boundary transfer, collisions at the seam, what happens to a ship crossing into a tile at the exact moment that tile dies, and mass warp landings that span multiple tiles at once.