Orchestrating 5000 Workers Without Distributed Locks: Rediscovering TDMA

3 points, posted a month ago
by Horos

Item id: 46386053

4 Comments

Horos

a month ago

Thanks for the pointer to asyncmachine! Let me clarify the HOROS architecture, since there's some confusion.

HOROS uses time slots for orchestrator clones on a SINGLE machine by default. Not distributed - 5 Go processes share the same kernel clock:

Runner-0: T=0s, 10s, 20s...  (slot 0)
Runner-1: T=2s, 12s, 22s...  (slot 1)
Runner-2: T=4s, 14s, 24s...  (slot 2)

Zero network, zero clock drift. Just local time.Sleep().
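
In Go, the slot logic is basically just this (a rough sketch with illustrative names, not the actual HOROS source):

    package main

    import (
        "fmt"
        "time"
    )

    // runSlot wakes this runner only during its own TDMA slot.
    // slot is the runner's index, n the number of runners, period the full cycle.
    func runSlot(slot, n int, period time.Duration, check func()) {
        slotWidth := period / time.Duration(n)
        for {
            now := time.Now()
            cycleStart := now.Truncate(period)
            target := cycleStart.Add(time.Duration(slot) * slotWidth)
            if !target.After(now) {
                target = target.Add(period) // slot already passed this cycle; wait for the next one
            }
            time.Sleep(time.Until(target))
            check() // lightweight orchestrator check (~10ms)
        }
    }

    func main() {
        // Runner-1 of 5 with a 10s cycle: fires at T=2s, 12s, 22s, ...
        runSlot(1, 5, 10*time.Second, func() { fmt.Println("runner-1 checking at", time.Now()) })
    }

Since all runners truncate against the same local clock, the slots never collide and no coordination message is ever sent.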

Your approach (logical clocks) solves event ordering in distributed systems. HOROS solves periodic polling - workers can be idle for hours with no events to increment a logical clock. A wall-clock timer fires regardless.

Different primitives:

- Logical clocks: "Event A before Event B?" (causality)
- TDMA timers: "Is it your turn?" (time-slicing)

For cross-machine workflows, we use SQLite state bridges:

Machine-Paris             Machine-Virginia
┌─────────────┐           ┌──────────────┐
│ Worker-StepA│           │ Worker-StepC │
│ completes   │           │ waits        │
│      ↓      │           │      ↑       │
│ output.db   │           │ input.db     │
└──────┬──────┘           └──────▲───────┘
       │                         │
       └──→ bridge.db ←──────────┘
            (Litestream replication)

bridge.db = shared SQLite with state transitions
StepBridger daemon polls bridge.db, moves data between steps

State machines communicate through data writes, not RPC. Each node stays single-machine internally (local TDMA).
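
The StepBridger itself is just a poll loop over SQLite. Roughly like this (the transitions table, its columns, and the mattn/go-sqlite3 driver are illustrative assumptions, not the exact HOROS schema):

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/mattn/go-sqlite3" // any database/sql SQLite driver works
    )

    // pollBridge copies completed StepA transitions to StepC's inbox and
    // marks them consumed. Table and column names are illustrative.
    func pollBridge(db *sql.DB) error {
        rows, err := db.Query(
            `SELECT id, payload FROM transitions WHERE step = 'StepA' AND consumed = 0`)
        if err != nil {
            return err
        }
        type pending struct {
            id      int64
            payload []byte
        }
        var batch []pending
        for rows.Next() {
            var p pending
            if err := rows.Scan(&p.id, &p.payload); err != nil {
                rows.Close()
                return err
            }
            batch = append(batch, p)
        }
        if err := rows.Err(); err != nil {
            rows.Close()
            return err
        }
        rows.Close()

        for _, p := range batch {
            // Hand off through data writes, not RPC.
            if _, err := db.Exec(
                `INSERT INTO transitions (step, payload, consumed) VALUES ('StepC', ?, 0)`,
                p.payload); err != nil {
                return err
            }
            if _, err := db.Exec(
                `UPDATE transitions SET consumed = 1 WHERE id = ?`, p.id); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        db, err := sql.Open("sqlite3", "bridge.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        for {
            if err := pollBridge(db); err != nil {
                log.Println("bridge poll:", err)
            }
            time.Sleep(5 * time.Second)
        }
    }

Litestream handles getting bridge.db onto both machines; the bridger only ever reads and writes local files.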

Re: formatting - which results were unclear? Happy to improve.

pancsta

a month ago

I do a lot of logical-clock based synchronization using asyncmachine.dev (also in Go), you may want to check it out as "human time" can be error-prone and not "tight". It does involve forming a network of state machines, but connections can be partial and nested.

Your results are very hard to read due to formatting, but the idea is interesting.

wazokazi

a month ago

The workers sit idle for n-1 out of n time slices. As n gets larger, the amount of work being done approaches zero.

Horos

a month ago

TDMA schedules the orchestrators (lightweight checks), not the workers (heavy jobs).

Orchestrators: active 1/n of the time (~10ms to check state)
Workers: run continuously for hours once started

T=0s:  Orchestrator-0 checks → starts job (runs 2 hours)
T=2s:  Orchestrator-1 checks → job still running
T=10s: Orchestrator-0 checks again → job still running

Think: traffic lights (TDMA) vs cars (drive continuously).

Work throughput is unchanged. TDMA only coordinates who checks when.
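
In Go terms, the orchestrator side is basically this (illustrative sketch, not the actual HOROS code):

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    var jobRunning atomic.Bool

    // check is the lightweight per-slot step: it runs for milliseconds,
    // starts a heavy job if none is running, and returns.
    func check() {
        if jobRunning.Load() {
            fmt.Println("job still running")
            return
        }
        jobRunning.Store(true)
        go func() {
            defer jobRunning.Store(false)
            time.Sleep(2 * time.Hour) // placeholder for a worker job that runs for hours
        }()
        fmt.Println("started job")
    }

    func main() {
        for {
            check()                      // active for milliseconds
            time.Sleep(10 * time.Second) // idle until this orchestrator's next TDMA slot
        }
    }

The job keeps running between checks, so the 1/n duty cycle applies only to the check() call, not to the work itself.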