BLACK SKIES ARCHITECTURE SERIES · PART 3 OF 8 · 13 MIN

Redis Coordination for Real-Time Tile State

This post is part of the Black Skies architecture series; the series hub is the thesis post, which argued that a 10,000-player battle can play the same as a 100-player skirmish because consistency versus availability is a design choice. This post is about the machinery inside a single tile's commit path: how authoritative state gets committed, delivered, and made durable every 500 milliseconds, with none of those three jobs allowed to interfere with the others.

The shape of the problem

A tile is a CP island: a bounded hex region of 2,401 H3 cells, processed by exactly one owner, ticking at a fixed 2 Hz. Every 500ms the tile processor (the tile's master, in the lifecycle post's terms) resolves the queued intents, applies the game rules, and produces a batch of committed events.

Three things have to happen to that batch, and they have wildly different requirements.

It has to commit atomically under the tile's current ownership epoch, so a stale processor that lost a partition fight can never corrupt state.

It has to reach every watching client fast. The budget is 100ms from commit to client, with roughly 600 bytes per client per tick on the wire. This is the highest-volume traffic in the system and it tolerates loss, because the next tick is 500ms away and gaps are repairable.
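The delivery budget turns into bandwidth with simple arithmetic. The sketch below uses the 2 Hz tick rate and the roughly 600 bytes per client per tick from the text; the client count passed in is a hypothetical input, and whether 10,000 clients would ever watch the same feed is an assumption made only to show the scale.

```python
TICK_HZ = 2                       # fixed tick rate from the text
BYTES_PER_CLIENT_PER_TICK = 600   # approximate wire budget from the text


def per_client_bytes_per_second() -> int:
    """Steady-state delivery rate for one watching client."""
    return BYTES_PER_CLIENT_PER_TICK * TICK_HZ


def aggregate_megabits_per_second(clients: int) -> float:
    """Total delivery rate for a hypothetical number of watching clients."""
    return per_client_bytes_per_second() * clients * 8 / 1_000_000


print(per_client_bytes_per_second())           # 1200
print(aggregate_megabits_per_second(10_000))   # 96.0
```

Per client the number is small. The aggregate is what makes delivery the highest-volume traffic in the system.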

It has to survive losing Redis entirely. Durability is a slow-path concern measured in seconds of recovery window, not milliseconds of delivery latency.

The naive design is one Redis cluster doing all three, with the tile processor publishing its own events. That design couples everything to everything: a throttled durability write slows delivery, a delivery storm starves the commit path, and a crash between commit and publish silently drops events. The actual design splits the work across two Redis clusters and three small services, each with exactly one job.

The whole layer on one chart: six actors, one job each. Read any failure's blast radius straight off the arrows.

Three keys, two clusters

The coordination cluster holds three keys per tile:

{tile:X}:owner
{tile:X}:stream
{tile:X}:snapshot

The hash tag keeps all three on one slot, which is what makes an atomic server-side commit possible.
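The slot rule can be checked offline. Redis Cluster hashes a key with CRC16 (the XMODEM variant) modulo 16384, and when the key contains a non-empty {...} section, only the text inside the first such section is hashed. The sketch below reimplements that rule in Python for illustration; hash_slot is a hypothetical helper, not a client library call, and it relies on binascii.crc_hqx with an initial value of 0 to compute the same CRC.

```python
import binascii


def hash_slot(key: str) -> int:
    """Compute the Redis Cluster slot for a key, honoring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag narrows what gets hashed.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


keys = [f"{{tile:X}}:{suffix}" for suffix in ("owner", "stream", "snapshot")]
slots = {hash_slot(k) for k in keys}
print(len(slots))  # 1: all three keys hash only "tile:X"
```

Because all three keys land on one slot, a single server-side function can read the owner hash and write the stream in one atomic call.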

The key names are deliberately boring. An earlier draft baked the ownership epoch into the stream name as a safety mechanism, so a stale primary would be writing to a stream nobody read. That safety now lives in the events themselves: every committed event carries the epoch that committed it, and every consumer inspects it. Names stay stable, discovery stays trivial, and the fencing guarantee moves to where it can actually be enforced.

The owner key is a hash holding the current owner's address and epoch, with a TTL of roughly 30 seconds. An actively ticking tile refreshes it every 500ms, so for a healthy tile it never expires. If the tile dies, the entry evaporates on about the same clock that SWIM takes to detect the failure. The TTL is a convergence mechanism, not an authority mechanism. Redis does not mint epochs. Aurora does, and that path belongs to the ownership post.

The stream is the append-only log of committed events, one batch per tick.

The snapshot is the full serialized tile state, written by a second epoch-fenced function on the same slot every sixty ticks, which is once every thirty seconds at the 2 Hz tick rate. The snapshot function enforces its own invariants atomically: it rejects sequence regressions, rejects snapshots claiming ticks that were never committed, and stamps the blob with a checksum and the owner's contact. Replay cost on recovery is bounded by the trim threshold, not by how long the battle has been running.
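The snapshot invariants can be modeled in a few lines. This is a minimal in-memory sketch, not the Redis function itself. The names (Snapshot, SnapshotRejected, write_snapshot), the use of CRC32 as the checksum, the exact-epoch-match fence, and the choice to allow a rewrite at an equal sequence are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


class SnapshotRejected(Exception):
    """Raised when a snapshot write violates an invariant."""


@dataclass
class Snapshot:
    sequence: int
    checksum: int
    owner_contact: str
    blob: bytes


def write_snapshot(current: Optional[Snapshot], *, owner_epoch: int,
                   owner_contact: str, last_committed_seq: int,
                   epoch: int, sequence: int, blob: bytes) -> Snapshot:
    # Epoch fence (assumed here to require an exact match with the owner hash).
    if epoch != owner_epoch:
        raise SnapshotRejected("epoch does not match current owner")
    # Reject sequence regressions against the stored snapshot.
    if current is not None and sequence < current.sequence:
        raise SnapshotRejected("sequence regression")
    # Reject snapshots that claim ticks the stream never committed.
    if sequence > last_committed_seq:
        raise SnapshotRejected("snapshot claims an uncommitted tick")
    # Stamp the blob with a checksum and the owner's contact.
    return Snapshot(sequence, zlib.crc32(blob), owner_contact, blob)
```

In the real system these checks run inside one server-side call, so no writer can observe a state between the check and the write.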

Trimming is where a lazy design loses data politely, so it is not keyed to the snapshot alone. The stream is trimmed behind the minimum of six watermarks: the snapshot's sequence, the bridge's acknowledged entry, the relay tier's replay floor, the bridge's pending backlog, any consumer group's pending entries, and the checkpointer's last durable write. Whichever consumer is furthest behind defines the tail, so no reader, live or recovering, can ever need an entry that has already been deleted. The floor costs a few hash reads per trim and buys the sentence every recovery path in this series leans on: the stream still has it.
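The trim floor is easy to state as code. The sketch below treats each watermark as a plain integer sequence number, which is an assumption (real stream entry IDs are not bare integers), and the watermark names are hypothetical labels for the six listed above. It also fails closed when a watermark is missing, which is a design choice of this sketch rather than something the text specifies.

```python
from typing import Dict

# Hypothetical names for the six watermarks described in the text.
REQUIRED_WATERMARKS = (
    "snapshot_seq",
    "bridge_acked",
    "relay_replay_floor",
    "bridge_pending",
    "consumer_group_pending",
    "checkpointer_durable",
)


def trim_floor(watermarks: Dict[str, int]) -> int:
    """Return the highest sequence that is safe to trim behind."""
    missing = [name for name in REQUIRED_WATERMARKS if name not in watermarks]
    if missing:
        # Refuse to trim rather than guess what an unknown reader needs.
        raise KeyError(f"missing watermarks: {missing}")
    return min(watermarks[name] for name in REQUIRED_WATERMARKS)


floor = trim_floor({
    "snapshot_seq": 1200,
    "bridge_acked": 1258,
    "relay_replay_floor": 1240,
    "bridge_pending": 1255,
    "consumer_group_pending": 1250,
    "checkpointer_durable": 1140,
})
print(floor)  # 1140: the checkpointer is furthest behind, so it defines the tail
```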

The trim floor is the minimum of six consumer watermarks. The slowest reader defines the tail, so nothing that anyone still needs can be deleted.

The second cluster is the fan-out cluster, running Redis 7 sharded pub/sub. It exists because delivery is the highest-volume traffic in the system and it is at-most-once by design. Putting it on separate hardware means a battle's delivery load can never starve the coordination functions that matter most under stress. SPUBLISH hashes the channel to a slot, so broadcast traffic stays on the nodes that own it instead of flooding the whole cluster, which is the property that made sharded pub/sub viable here where classic cluster-wide PUBLISH would not be. Sharded pub/sub is the normal case; a tile drawing extreme spectator load escalates to a dedicated broker tier, the same move Discord makes for its largest servers, and that escalation is already written into the benchmark plan as a failure action rather than improvised when a battle makes the news.

One function per tile per tick

The commit itself is a Redis function, not an ad hoc Lua script shipped with every call. Once per tick, the tile processor invokes it with the tick's deltas, its epoch, and its SWIM contact information. The function compares the presented epoch against the owner hash and takes one of three paths.

One atomic server-side function, three paths. Ownership enforcement, owner-hash maintenance, and the append share one slot, so no window exists between them.

Equal epoch is the normal case: refresh the TTL, append the batch. A higher epoch is a new owner's first commit, so the function installs the new epoch and SWIM contact into the owner hash and then appends, which makes takeover and first write one atomic step with no window between them. A lower epoch means the writer is stale, and the entire batch is rejected.

The rejection is not a bare error. The failure reply carries the current epoch and the SWIM contact of the owner that replaced the writer, which is what lets a superseded processor redirect its clients and stand down instead of retrying into a wall. That handoff is the ownership post's subject.

The important property is that ownership enforcement, owner-hash maintenance, and the append all live inside one atomic server-side operation on one slot. There is no window where the stream holds events the owner hash did not vouch for, which is also why everything downstream of the stream gets to be simple.
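The three paths and the rejection reply can be modeled in plain Python. This is an in-memory sketch of the semantics described above, not the Redis function: TileSlot and CommitResult are hypothetical names, a counter stands in for the TTL refresh, and the case where no owner record exists yet is left out. Rejecting a higher-epoch writer that presents no contact follows the series' description of failing closed on an anonymous takeover.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CommitResult:
    ok: bool
    epoch: int      # current owner epoch after the call
    contact: str    # current owner's SWIM contact after the call


@dataclass
class TileSlot:
    owner_epoch: int
    owner_contact: str
    stream: List[Tuple[int, list]] = field(default_factory=list)
    ttl_refreshes: int = 0  # stands in for refreshing the owner key's TTL

    def commit(self, epoch: int, contact: str, batch: list) -> CommitResult:
        if epoch < self.owner_epoch:
            # Stale writer: reject the whole batch and say who replaced it.
            return CommitResult(False, self.owner_epoch, self.owner_contact)
        if epoch > self.owner_epoch:
            if not contact:
                # Anonymous takeover: fail closed.
                return CommitResult(False, self.owner_epoch, self.owner_contact)
            # New owner's first commit: install, then append, in one step.
            self.owner_epoch = epoch
            self.owner_contact = contact
        self.ttl_refreshes += 1
        # Every committed entry carries the epoch that committed it.
        self.stream.append((epoch, batch))
        return CommitResult(True, self.owner_epoch, self.owner_contact)
```

A short usage run shows all three paths:

```python
slot = TileSlot(owner_epoch=7, owner_contact="node-a:7946")
same = slot.commit(7, "node-a:7946", ["move-1"])    # equal epoch: append
takeover = slot.commit(8, "node-b:7946", ["move-2"])  # higher epoch: install + append
stale = slot.commit(7, "node-a:7946", ["move-3"])   # lower epoch: rejected
print(same.ok, takeover.ok, stale.ok, stale.contact)
```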

The bridge

The processor does not publish its own events, because committing and publishing from a single process has an unfixable ordering problem. Publish before commit, and a crash delivers events that never became real. Commit before publish, and a crash drops delivery with no record that anything was missed.

The bridge dissolves the problem instead of handling it. It is a small service that tails the committed streams and republishes each batch to the fan-out cluster with SPUBLISH. Its polling is adaptive: it re-reads within ten milliseconds while events are flowing and decays toward a hundred-millisecond ceiling when a shard goes quiet. It can only forward what is already committed, so the race does not exist by construction. This is the same instinct behind log-first architectures like Kafka: make the durable log the source of truth and let delivery be a consumer of it, never a sibling racing it.
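The adaptive polling rule fits in one function. The ten-millisecond floor and the hundred-millisecond ceiling come from the text; the doubling step used to decay between them is an assumption, as is the function name.

```python
FLOOR_MS = 10      # re-read interval while events are flowing
CEILING_MS = 100   # slowest interval for a quiet shard


def next_poll_delay_ms(previous_ms: int, saw_events: bool) -> int:
    """Pick the next poll delay from the last delay and what the read returned."""
    if saw_events:
        # Activity snaps the bridge back to the fast interval.
        return FLOOR_MS
    # Quiet shard: back off (doubling is assumed), capped at the ceiling.
    return min(CEILING_MS, max(FLOOR_MS, previous_ms) * 2)


delay = FLOOR_MS
idle_delays = []
for _ in range(5):
    delay = next_poll_delay_ms(delay, saw_events=False)
    idle_delays.append(delay)
print(idle_delays)                                 # [20, 40, 80, 100, 100]
print(next_poll_delay_ms(delay, saw_events=True))  # 10
```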

The bridge never arbitrates ownership, but it is not blind either. The commit function guarantees that nothing enters a stream without the owner hash vouching for it at the moment of the write, so the hard question is settled at the door; the bridge then keeps one integer comparison as the last fence, publishing only events whose payload epoch matches the current owner's epoch and counting what it drops. The drop case is narrow and deliberate: a superseded owner's final committed events, written legitimately in the instant before a promotion, leave the live feed as events and re-enter as state through the new owner's replay. Fencing at the door buys simplicity in every room behind it; the one-comparison filter is defense in depth with a metric on it.
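The last fence is small enough to show whole. In this sketch events are plain dictionaries with an "epoch" field and the drop metric is an integer counter; both are illustrative assumptions, and EpochFilter is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpochFilter:
    dropped: int = 0  # the metric: events withheld from the live feed

    def forward(self, events: List[Dict], current_epoch: int) -> List[Dict]:
        """Return only the events committed under the current owner's epoch."""
        publishable = []
        for event in events:
            if event["epoch"] == current_epoch:
                publishable.append(event)
            else:
                self.dropped += 1
        return publishable


bridge_filter = EpochFilter()
batch = [{"epoch": 7, "id": "a"}, {"epoch": 8, "id": "b"}, {"epoch": 8, "id": "c"}]
live = bridge_filter.forward(batch, current_epoch=8)
print([e["id"] for e in live], bridge_filter.dropped)  # ['b', 'c'] 1
```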

The bridge is read-only against coordination and write-only against fan-out. It holds no game state and makes no decisions at all.

The checkpointer

Durability is DynamoDB's job, and the service that feeds it is deliberately not the bridge. Coupling DynamoDB write latency to fan-out delivery would let one throttled or cross-AZ write slow down how fast events reach relays, which is exactly the coupling this whole design exists to prevent.

The checkpointer is the dumbest service in the system, on purpose. On a cadence, it reads {tile:X}:snapshot from the coordination cluster and writes the blob to DynamoDB. It does not reconstruct state from the stream. It does not understand the game. If it falls behind or restarts, the worst case is a slightly older checkpoint, which means a slightly longer stream replay on cold-start recovery. No data loss, just a wider recovery window. That is the correct failure mode for a durability path: degrade the recovery clock, never the gameplay clock.

One number this design owes the reader plainly: losing the coordination cluster outright loses every tick since the last checkpoint, because the stream dies with it. That is the recovery point objective. A checkpoint can be no fresher than the snapshot it copies, so at 2 Hz with a snapshot every sixty ticks the window is up to about thirty seconds of game time, plus however far the checkpointer has fallen behind. Whether that loss is acceptable depends on what the ticks contained. For a prototype and for load testing, yes. For competitive combat, probably not. For anything touching the economy, no. The upgrade paths are known and priced: shorter checkpoint cadence, a durable log such as Kafka or LogDevice behind the stream, or write-through to DynamoDB before acknowledging a tick. The current architecture deliberately targets the first case and says so, and the one economy-adjacent write already lives outside the blast radius, because card exhaustion commits to DynamoDB before any tile ever sees the action.

The relay's two paths

Relays are the processes holding client WebSockets, and they connect to both clusters, for different reasons.

The hot path is sharded pub/sub from the fan-out cluster, every tick, at-most-once. Signal runs its online message delivery on the same bet: Redis pub/sub for the fast path, with durability handled entirely elsewhere.

The cold path is a direct read of snapshot plus stream tail from the coordination cluster. It fires when a player opens a new viewport, when a relay detects a sequence gap, or when a relay restarts. It is on-demand, not per-tick.
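Gap detection is the trigger that connects the two paths. The sketch below assumes each tick's batch carries a monotonically increasing sequence number and that the cold path can be represented as a callable returning the committed sequences after a given point; TileFeed and read_stream_after are hypothetical names, and a real repair would apply snapshot state and event payloads rather than bare integers.

```python
from typing import Callable, List


class TileFeed:
    """Tracks one tile's feed on a relay and repairs gaps from the stream."""

    def __init__(self, last_seq: int,
                 read_stream_after: Callable[[int], List[int]]):
        self.last_seq = last_seq
        self.read_stream_after = read_stream_after  # stands in for the cold path
        self.cold_reads = 0

    def on_hot_message(self, seq: int) -> List[int]:
        """Handle a pub/sub batch; return the sequences applied, in order."""
        if seq <= self.last_seq:
            return []  # duplicate, or already covered by an earlier repair
        applied = []
        if seq > self.last_seq + 1:
            # Sequence gap: fetch what pub/sub lost from the committed stream.
            self.cold_reads += 1
            for missed in self.read_stream_after(self.last_seq):
                if missed < seq:
                    applied.append(missed)
        applied.append(seq)
        self.last_seq = seq
        return applied


committed = [1, 2, 3, 4, 5, 6]
feed = TileFeed(2, lambda after: [n for n in committed if n > after])
print(feed.on_hot_message(3))  # [3]
print(feed.on_hot_message(6))  # [4, 5, 6]
print(feed.cold_reads)         # 1
```

The counter matters as much as the repair: it is the per-relay measure of cold-path load on the coordination cluster.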

An earlier version of this design claimed relays only ever touch the fan-out cluster. That was too strong, and pretending otherwise would have hidden a real load source on the coordination cluster: gap repair during reconnect storms is precisely when coordination is already under the most pressure. Naming the cold path is what makes it measurable.

Before the tile: the admission ledger

Everything above assumed actions arrive at a tile as queued intents. What happens to them first depends on which of two classes each belongs to, because the system deliberately runs two admission doors. Economic actions (card plays and anything else that spends or mints) are admitted at the gateway against a durable ledger in DynamoDB: the card or cooldown is spent there, in a conditional, versioned write, before any tile sees the intent, and the admission writes an artifact keyed by the client's action ID plus an admission ID. From that moment the action can reach exactly one terminal result (applied, fizzled, rejected, or redirected), and that result is replayable: a client that reconnects and asks about an action it sent gets the same answer every time, because the answer derives from the artifact, not from whichever process happens to remember. A card that moves a ship is still a card; it takes this door.

Plain movement takes the cheaper door on purpose. A travel intent is validated and rate-shaped at the gateway but never written to the ledger, because it has no economy to protect and its durable record already exists one layer down: an applied move is a committed event in the tile's replicated stream, and a blocked one resolves in the tile's arbitration as an in-fiction miss. For cards, the ledger is the record; for movement, the stream is the ledger. The split is also the cost model: durable admission scales with card plays, which hand size and cooldowns bound tightly, not with movement, which dominates intent volume in any game played at this rhythm. Card and cooldown authority living in the ledger rather than in tile state is why a tile crash can never refund or double-spend a card, and why this post's recovery story never has to mention the economy. Two doors, one shape: every action of either class still reaches exactly one outcome.

Two admission doors, one shape: economic actions spend durably before any tile sees them; movement's durable record is the committed stream itself.

The economic door follows a pattern payments infrastructure settled years ago: Stripe's idempotency keys made ambiguous retries safe by recording the result the first time and replaying that answer forever after, and the admission artifact is the same move with a card instead of a charge. The failure it prevents has a long name in this industry too: duplication exploits, the item dupes that have plagued MMOs whenever a crash or a retry window let the same spend land twice. Here the spend is durable before the intent exists anywhere else, the terminal result is cached against the action ID, and a retry can only ever fetch the answer that was already recorded.
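The replay invariant is the part worth seeing in code. This is a minimal in-memory sketch in which a dictionary stands in for the conditional write to DynamoDB; AdmissionLedger and its methods are hypothetical names, and the four terminal results are the ones named earlier in this section.

```python
from typing import Dict, Optional

TERMINAL_RESULTS = {"applied", "fizzled", "rejected", "redirected"}


class AdmissionLedger:
    """Records one terminal result per action ID and replays it on retry."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def record(self, action_id: str, result: str) -> str:
        if result not in TERMINAL_RESULTS:
            raise ValueError(f"not a terminal result: {result}")
        # First write wins; later writes return the stored answer unchanged.
        return self._results.setdefault(action_id, result)

    def lookup(self, action_id: str) -> Optional[str]:
        """What a reconnecting client is told about an action it sent."""
        return self._results.get(action_id)


ledger = AdmissionLedger()
print(ledger.record("action-42", "applied"))   # applied
print(ledger.record("action-42", "rejected"))  # applied (the retry cannot change it)
print(ledger.lookup("action-42"))              # applied
```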

Six actors, one job each

The whole layer reduces to a short table of who touches what:

Gateway: validates every intent, admits the economic ones against the durable ledger, and forwards the rest. It owns the economy and never touches tile state.
Tile processor: writes to coordination, through the Redis function, once per tick.
Bridge: reads streams from coordination and publishes to fan-out.
Checkpointer: reads snapshots from coordination and writes to DynamoDB.
Relay: consumes fan-out on the hot path and reads coordination on the cold path.
Aurora: sits off to the side as the ownership authority and never appears on the per-tick path at all.

None of them does two jobs. None of them couples fast-path latency to slow-path durability. When something breaks, the blast radius reads directly off the table. DynamoDB throttles: the checkpointer lags and nobody else notices. A fan-out node dies: a tick of notifications is lost and relays repair from the stream. A reconnect storm hits: the cold path loads the coordination cluster, and that load is isolated from the fan-out traffic serving everyone who stayed connected.

Proven versus hypothesized

The same honesty rule from the thesis post applies, and the ground truth under it has moved, so the split has three tiers now.

Exercised by the test suite: the function's fencing semantics (same-epoch append, install-on-greater takeover inside the commit call, fail-closed on a missing owner or an anonymous takeover, and rejections that return the current owner's contact); the six-watermark trim floor; the bridge's watermarking, dead-lettering, and drops of events outside the current epoch; the checkpointer's contracts and checksum detection; and the admission ledger's terminal-result and replay invariants. More than a thousand test methods across the HEX and gateway suites sit behind this layer, with a compose-based integration stack underneath the unit tests.

Implemented but gated on live infrastructure, named as external proof gates in the build plan: the same function path against a real Redis Cluster across masters, and Aurora ownership with EXPLAIN-verified single-shard DML on live Limitless.

Hypothesis until the benchmark, unchanged: bridge throughput at saturated-tile output without falling behind the tick, fan-out pressure per relay against the 600 bytes per client per tick figure, cold-path read load during a reconnect storm, snapshot-and-trim cost inside the tick budget on a hot tile, and end-to-end commit-to-client p99 against the 100ms budget. Each has a failure threshold and a fallback. If the bridge cannot keep up, it shards by tile range. If cold-path reads contend with commits, recovery reads move to replicas. The architecture survives individual numbers failing. It does not survive pretending they have been measured.

What comes next

Everything in this post assumed one clean answer to a dirty question: who owns the tile right now? The ownership post is about the cases where that question briefly has two answers: Aurora epoch minting, the fencing model in full, and split-brain resolution without a global coordinator on the hot path.