Building Zue Part 3: Herding Cats (Distributed Consensus)
The “Herding Cats” Problem
Writing to a file is easy (Part 1). Sending that file over a network is annoying (Part 2). But getting three computers to agree on what that file contains? That is pure chaos.
In Zue, I implemented Raft (or at least, a student’s interpretation of it).
The rules are simple:
- One Leader: The boss. Clients only talk to the boss.
- Quorum: The boss can’t promise anything until a majority (2 out of 3) agree.
- No Rollback: Once something is committed, it stays committed. Like a bad tattoo.
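Concretely, the quorum size is just integer arithmetic on the cluster size. A minimal sketch (in Python for brevity; Zue itself is Zig):

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1

# A 3-node cluster needs 2 acks; a 5-node cluster needs 3.
```

This is also why odd cluster sizes are the norm: a 4-node cluster needs 3 acks, so it tolerates no more failures than a 3-node one while costing an extra machine.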
The Happy Path (When Nothing Breaks)
In a perfect world, this happens:
The Leader writes locally (fast!), sends the entry to its followers (fast!), and waits. As soon as one follower says “Got it!”, we have a majority (Leader + 1 Follower = 2 of 3). We commit.
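The happy path fits in a few lines. This is an illustrative Python sketch, not Zue's actual Zig code; the class and method names are assumptions made up for the example:

```python
class Follower:
    """Toy follower: acknowledges replication when it is online."""
    def __init__(self, online: bool = True):
        self.online = online
        self.log = []

    def replicate(self, record) -> bool:
        if self.online:
            self.log.append(record)
            return True  # "Got it!"
        return False


class Leader:
    """Toy leader: commits once a majority (itself + acks) agrees."""
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.committed = []
        # majority of the whole cluster (leader + followers)
        self.quorum = (1 + len(followers)) // 2 + 1

    def handle_write(self, record) -> bool:
        self.log.append(record)   # 1. write locally (fast!)
        acks = 1                  # the leader's own copy counts as a vote
        for f in self.followers:  # 2. send to followers
            if f.replicate(record):
                acks += 1
            if acks >= self.quorum:
                self.committed.append(record)  # 3. majority reached: commit
                return True
        return False              # no quorum: the write is not committed
```

With one leader and two followers, a single “Got it!” is enough: 1 + 1 = 2 of 3, even if the other follower is down.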
The “Oh No” Path (When Everything Breaks)
What happens if a Follower goes offline, comes back 10 minutes later, and tries to join?
It’s missing 10,000 records.
If I sent them all at once, the network would choke. If I blocked the Leader to help it, the whole cluster would die.
The Repair State Machine
I built a background process called tickRepair. It runs every 2 seconds and checks for stragglers. If it finds one, it feeds it small batches of data (100 records at a time) until it catches up.
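A sketch of that repair loop, again in Python rather than Zig; `match_index` and the helper names are assumptions, but the 100-record batch size and the periodic tick are the ones described above:

```python
BATCH_SIZE = 100  # small enough not to choke the network

class Straggler:
    """Toy follower that tracks how far it has caught up."""
    def __init__(self):
        self.match_index = 0  # number of records it already has
        self.log = []

    def append(self, batch):
        self.log.extend(batch)
        self.match_index += len(batch)


def tick_repair(leader_log, followers):
    """Called periodically (every 2s in Zue): feed each straggler one batch."""
    for f in followers:
        if f.match_index < len(leader_log):
            batch = leader_log[f.match_index : f.match_index + BATCH_SIZE]
            f.append(batch)
```

Because each tick ships at most one small batch per follower, the Leader never blocks on a straggler, and a node missing 10,000 records catches up over roughly 100 ticks instead of swallowing everything in one gulp.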
This allows Zue to self-heal. I can kill a node, restart it, and watch the logs as it frantically eats data until it’s back in sync. It’s oddly satisfying.
Not-so-Final Final Thoughts (The Metrics)
After 3,400+ lines of Zig and too much caffeine:
- 200+ tests (unit tests, integration tests, and stateful replication tests run over the loopback interface using separate processes).
- Sub-millisecond writes thanks to the mmap + append-only design.
- Zero locks in the hot path. Single-threaded event loops ftw.
Was it harder than using SQLite? Yes. Did I learn more? Absolutely.
If you want to read the code (or just roast it), it’s all open source.
GitHub: github.com/lostcache/zue