Building Zue Part 1: Files Are Just Arrays (If You're Brave)
The “Oh No” Moment
It was 3 AM. My OS kernel assignment was “working.” I had processes switching context. I had a networking stack that… technically networked.
But I was bored.
For this project I wanted to learn how computers talk to other computers without setting the server room on fire. I wanted to know why Kafka is so fast and why S3 never loses my cat photos.
So I did the logical thing: I threw away my kernel (I had a ton of fun, definitely picking it up later) and decided to build a distributed storage engine in Zig.
I call it Zue. This is Part 1 of my journey, where I learn that files are actually just arrays, and the OS is hiding things from me.
1. The Philosophy: Append-Only
The first rule of high-performance storage: Random I/O is the enemy.
Spinning disks hate jumping around. Even NVMe drives prefer a straight line. So Zue is an Append-Only Log. We never overwrite data. We just add it to the pile.
When you “update” a key in Zue, we don’t go back and change the old value. We just write a new record at the end. The old one is dead to us (until compaction runs, which is a problem for Future Me).
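Here is a minimal sketch of the append-only idea in Python (not Zue's actual Zig code). The `AppendOnlyLog` class and its `put`/`get` methods are names made up for illustration, and the "log" is just an in-memory list standing in for a file.

```python
class AppendOnlyLog:
    """Toy append-only store: records are only ever added, never changed."""

    def __init__(self):
        self.records = []  # (key, value) pairs in write order

    def put(self, key, value):
        # An "update" is just another append; the old record stays where it is.
        self.records.append((key, value))
        return len(self.records) - 1  # position of the new record

    def get(self, key):
        # The newest record for a key wins, so scan from the end.
        for k, v in reversed(self.records):
            if k == key:
                return v
        return None


log = AppendOnlyLog()
log.put("cat", "photo_v1.jpg")
log.put("cat", "photo_v2.jpg")  # does not overwrite the first record
latest = log.get("cat")
```

After the second `put`, both records are still in `log.records`. The stale one only goes away when something like compaction rewrites the log.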
2. The Log Structure
Here is how Zue actually organizes bytes on disk. It’s not just one giant file; it’s a series of Segments, each made of a log file and an index file.
%%{init: {'theme':'base', 'themeVariables': {'fontSize':'23px'}, 'flowchart':{'nodeSpacing': 40, 'rankSpacing': 40}}}%%
flowchart LR
subgraph Log [Log]
direction TB
S1[Segment 0]
S2[Segment 1]
S3[Segment ...]
end
subgraph Segment [Segment]
direction TB
IndexFile[Index File]
LogFile[Log File]
end
subgraph Index ["Index Entry (12B)"]
direction TB
I_CRC["CRC (4B)"]
I_Off["Rel Offset (4B)"]
I_Pos["Position (4B)"]
end
subgraph Record ["Log Record"]
direction TB
R_CRC["CRC (4B)"]
R_TS["Timestamp (8B)"]
R_Key_Len["Key len (4B)"]
R_Key["Key"]
R_Val_Len["Value len (4B)"]
R_Val["Value"]
end
S2 --> Segment
IndexFile --> Index
LogFile --> Record
style Log fill:transparent,stroke:#6366f1,stroke-width:3px
style Segment fill:transparent,stroke:#6366f1,stroke-width:3px
style Index fill:transparent,stroke:#6366f1,stroke-width:3px
style Record fill:transparent,stroke:#6366f1,stroke-width:3px
style S1 fill:transparent,stroke:#10b981,stroke-width:3px
style S2 fill:transparent,stroke:#10b981,stroke-width:4px
style S3 fill:transparent,stroke:#10b981,stroke-width:3px
style IndexFile fill:transparent,stroke:#3b82f6,stroke-width:3px
style LogFile fill:transparent,stroke:#3b82f6,stroke-width:3px
style I_Off fill:transparent,stroke:#ef4444,stroke-width:3px
style I_Pos fill:transparent,stroke:#ef4444,stroke-width:3px
style I_CRC fill:transparent,stroke:#ef4444,stroke-width:3px
style R_CRC fill:transparent,stroke:#f59e0b,stroke-width:3px
style R_TS fill:transparent,stroke:#f59e0b,stroke-width:3px
style R_Key_Len fill:transparent,stroke:#f59e0b,stroke-width:3px
style R_Key fill:transparent,stroke:#f59e0b,stroke-width:3px
style R_Val_Len fill:transparent,stroke:#f59e0b,stroke-width:3px
style R_Val fill:transparent,stroke:#f59e0b,stroke-width:3px
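To make the diagram concrete, here is a rough Python sketch of the two layouts. The field order and sizes come from the diagram. Everything else is an assumption for illustration: little-endian integers, zlib's CRC-32, a CRC that covers every byte after the CRC field, and the function names themselves.

```python
import struct
import zlib


def encode_record(timestamp, key, value):
    """Layout: CRC (4B) | timestamp (8B) | key len (4B) | key | value len (4B) | value."""
    body = (struct.pack("<qI", timestamp, len(key)) + key
            + struct.pack("<I", len(value)) + value)
    # Assumption: the CRC protects everything that follows it.
    return struct.pack("<I", zlib.crc32(body)) + body


def decode_record(buf, pos=0):
    """Return (timestamp, key, value, end_position) or raise ValueError on a bad CRC."""
    (crc,) = struct.unpack_from("<I", buf, pos)
    timestamp, key_len = struct.unpack_from("<qI", buf, pos + 4)
    key_start = pos + 16                      # 4 (CRC) + 8 (timestamp) + 4 (key len)
    key = bytes(buf[key_start:key_start + key_len])
    (val_len,) = struct.unpack_from("<I", buf, key_start + key_len)
    val_start = key_start + key_len + 4
    end = val_start + val_len
    value = bytes(buf[val_start:end])
    if zlib.crc32(buf[pos + 4:end]) != crc:
        raise ValueError("CRC mismatch: record is corrupt or truncated")
    return timestamp, key, value, end


def encode_index_entry(rel_offset, position):
    """Layout: CRC (4B) | relative offset (4B) | position (4B) = 12 bytes."""
    body = struct.pack("<II", rel_offset, position)
    return struct.pack("<I", zlib.crc32(body)) + body
```

Because `decode_record` returns the position where the record ends, a reader can walk a log file record by record by feeding that value back in as `pos`.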
Why segments? Because you can’t cheaply delete just the old part of a single 10TB file to free up space. Deleting a 1GB segment file that contains only old data? Instant.
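A small Python sketch of that trade-off: roll to a new segment file when the active one is full, and reclaim space by unlinking the oldest file. The `SegmentedLog` class, the zero-padded file names, and the tiny size limit are all assumptions for illustration, not Zue's actual layout.

```python
import os


class SegmentedLog:
    """Toy log split into segment files; names and sizes are illustrative."""

    def __init__(self, directory, max_segment_bytes):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.base_offsets = []  # first record offset of each segment, oldest first
        self.next_offset = 0    # offset the next appended record will get
        self._roll()

    def _path(self, base_offset):
        # Assumed naming scheme: the segment's base offset, zero-padded.
        return os.path.join(self.directory, f"{base_offset:020d}.log")

    def _roll(self):
        """Start a new, empty active segment."""
        self.base_offsets.append(self.next_offset)
        open(self._path(self.next_offset), "wb").close()

    def append(self, payload):
        active = self._path(self.base_offsets[-1])
        size = os.path.getsize(active)
        # Roll only if the segment already holds data and this write would overflow it.
        if size > 0 and size + len(payload) > self.max_segment_bytes:
            self._roll()
            active = self._path(self.base_offsets[-1])
        with open(active, "ab") as f:
            f.write(payload)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def drop_oldest(self):
        """Free space by deleting the oldest segment file (never the active one)."""
        if len(self.base_offsets) > 1:
            os.remove(self._path(self.base_offsets.pop(0)))
```

`drop_oldest` never rewrites or shifts any data. It removes one file and forgets its base offset, which is why segment deletion is cheap.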
3. The Sparse Index
If you have a 10GB log file, how do you find key “user:123”?
Scan everything? O(N). Too slow.
Index every key? O(1). Too much RAM.
Zue uses a Sparse Index. Instead of indexing every record, we only write down the location of roughly one record per 4KB of log data.
To find “Banana”:
Check the index. “Banana” is between “Apple” (Offset 1000) and “Cat” (Offset 5000).
Jump to Offset 1000.
Scan forward linearly until you hit “Banana”.
We trade a tiny bit of CPU (linear scan) for massive RAM savings. It’s the “good enough” engineering principle in action.
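The lookup steps above can be sketched in Python like this. It follows the index entry from the diagram (record offset mapped to byte position) rather than string keys, and `build_sparse_index`, `locate`, and the `record_sizes` list are hypothetical stand-ins. A real reader would get each record's size from its header while scanning.

```python
import bisect


def build_sparse_index(record_sizes, interval=4096):
    """Emit one (record_offset, byte_position) entry per `interval` bytes of log."""
    index = []
    position = 0
    since_last = interval  # forces an entry for the very first record
    for offset, size in enumerate(record_sizes):
        if since_last >= interval:
            index.append((offset, position))
            since_last = 0
        position += size
        since_last += size
    return index


def locate(index, record_sizes, target):
    """Find the byte position of record `target`: index jump, then linear scan."""
    if not 0 <= target < len(record_sizes):
        raise IndexError(target)
    # Step 1: binary-search for the last index entry at or before the target.
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target) - 1
    offset, position = index[i]
    # Step 2: walk forward record by record until we reach the target.
    while offset < target:
        position += record_sizes[offset]
        offset += 1
    return position


sizes = [1500] * 10                  # ten records of 1500 bytes each
index = build_sparse_index(sizes)    # 4 entries instead of 10
where = locate(index, sizes, 5)      # jumps to record 3, then scans 2 records
```

The scan is bounded by the index interval, so the worst case is reading about 4KB of records past the jump target.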
4. The Cheat Code: mmap
After some research, I found that ordinary file I/O needs a read() or write() syscall for every operation. This meant every request went through the kernel, forcing it to do its thing each time: checking permissions, copying buffers between kernel and user space, and generally context-switching my CPU cycles away.
It’s slow.
If I already proved to the kernel that “I am not the bad guy,” it’s just really dumb to keep doing that again and again for every single request.
I wanted speed.
Enter mmap (memory-mapped I/O, aka the “Trust Me Bro” card). It lets you pretend a file on disk is just an array in RAM. You want to read byte 4000? data[4000]. You want to write? data[5000] = 0xFF. The OS handles the dirty work of flushing pages to disk.
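For a feel of what that looks like, here is a small sketch using Python's standard `mmap` module (Zue itself does this in Zig). The `poke_and_peek` function name and the 8192-byte file size are made up for the example.

```python
import mmap


def poke_and_peek(path):
    """Map a file, treat it like an array, and confirm the write reached the file."""
    with open(path, "wb") as f:
        f.truncate(8192)  # an empty file cannot be mapped, so give it a size first

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as data:  # length 0 maps the whole file
            data[5000] = 0xFF    # write: plain index assignment, no write() call
            byte = data[4000]    # read: plain indexing, no read() call
            data.flush()         # ask the OS to write dirty pages back now

    # Read the same byte back through ordinary file I/O.
    with open(path, "rb") as f:
        f.seek(5000)
        return byte, f.read(1)
```

The mapping has a fixed length, so indexing past the end fails. Growing the file means growing the mapping too.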
On Linux, when your file grows, you extend the file and call mremap. The kernel says “Sure thing,” and seamlessly expands your virtual memory mapping. It’s fast, atomic, and beautiful.
On macOS? mremap does not exist.
I found this out the hard way: segfault.
To make Zue work on my MacBook, I had to implement a workaround: