
blob storage for large atoms #985

Draft

matthew-levan wants to merge 29 commits into ml/64 from ml/bob

Conversation

matthew-levan (Contributor) commented Mar 28, 2026

Blob Store for Very Large Atoms

Summary

Introduces a blob store that lets Urbit handle files larger than 32 MiB without
ever loading them fully into the loom. Large atoms are stored as ordinary files
under $pier/.urb/bob/ and referenced on the loom by a new "bob atom" — a
24-byte stub that carries a mug (content hash) and a sequence number in place of
the actual bytes. The rest of the runtime — serialization, event log, IPC, Clay
sync — handles bob atoms transparently.

Motivation

The loom is a fixed-size memory-mapped region. Before this change, committing a
2 GiB file via |commit %base required allocating a 2 GiB atom on the loom,
jamming it for the event log, and sending the jammed bytes over the king/serf
pipe — more than tripling peak memory usage and making large-file support
practically impossible.

Design

Bob atoms (pkg/noun/allocate.h, imprison.c, retrieve.c)

A new indirect atom variant. The MSB of u3a_atom.len_w (u3a_blob_flag,
0x80000000) marks an atom as a bob. The remaining fields store:

  • mug_h — MurmurHash3 of the file content (used as the loom noun mug and as
    the blob bucket directory name)
  • buf_w[0] — sequence number within the mug bucket

u3i_blob(mug, seq) allocates a bob atom. All retrieve functions (u3r_met,
u3r_bytes, u3r_bit, u3r_half, u3r_halfs, u3r_chubs, u3r_sing,
u3r_mug) detect bob atoms and materialize them on demand by loading from disk,
operating on the result, then freeing it. The mug is cached in mug_h after
first computation so u3r_mug only reads from disk once per atom lifetime.

The allocator (allocate.c) calls a registered bob_free_f callback when a bob
atom's refcount reaches zero, deleting the blob file. The callback is registered
by disk.c after the pier is opened.

_ca_take_atom masks len_w with u3a_blob_mask when computing the allocation
size, so bob atoms are copied correctly across road transitions.

Blob store (pkg/vere/blob.c, blob.h)

Files live at $pier/.urb/bob/<mug>/<seq>. A per-bucket lockfile
(<mug>/lock) holds the next available sequence number as an ASCII decimal,
protected by fcntl advisory locking. On write, a byte-for-byte dedup scan
checks existing files in the bucket before allocating a new sequence slot.

Earth (the king process) is the sole writer. Mars (the serf) is read-only.

Key functions:

  • u3_blob_save() — write from a byte buffer, with dedup
  • u3_blob_save_fd() — write from an open file descriptor via mmap, avoiding
    a full heap allocation for large files
  • u3_blob_load() / u3r_blob_load() — load a blob into a loom atom via
    mmap + u3i_bytes, releasing the mapping immediately after the copy

Ram/Tap serialization (pkg/noun/serial.c, serial.h)

A new wire format that extends jam with a 2-bit fixed tag scheme:

00 = normal atom   (mat-encoded value)
01 = blob ref      (mat(mug) + mat(seq))
10 = cell
11 = backref

Wire format: [magic "RAM\0" 4B][version 0x01 1B][ram bits...]

u3s_ram_xeno() encodes a noun tree, emitting bob atoms as compact ~10-byte
blob refs. u3s_tap_xeno() decodes ram bytes, reconstructing bob atoms from
mug+seq pairs.

Ram replaces jam/cue for all event log entries and IPC messages. Old jam-encoded
data (VER1/VER2 epochs) is still readable via fallback to u3s_cue_xeno.

IPC (pkg/vere/newt.c, lord.c, mars.c, vere.h)

The newt wire protocol gains a version byte: 0x00 = jam (legacy), 0x01 = ram.
u3_newt_send_vers() selects the encoding. All send paths use ram (0x01); all
receive paths try tap first, falling back to cue for backward compatibility.
u3_meat gains a ver_y field to carry the version through the receive path.

Epoch version (pkg/noun/version.h, disk.c)

U3E_VER3 (epoch version 3) marks epochs that use ram-encoded event log entries.
_disk_epoc_load() handles VER2→VER3 migration by rolling over to a new epoch.
u3_blob_init() is called on epoch creation and load to ensure .urb/bob/
exists.

Unix Clay sync (pkg/vere/io/unix.c)

_unix_update_file() and _unix_initial_update_file() detect files larger than
U3_BLOB_THRESH (32 MiB), stream them into the blob store via
u3_blob_save_fd(), and send a bob atom to Clay instead of loading the bytes
into the loom.

_unix_write_file_hard() handles bob atoms in %ergo write-back by streaming
directly from the blob store file to the destination in 64 KiB chunks, without
allocating a heap buffer. _unix_write_file_soft() short-circuits the
"has-it-changed?" disk read for bob atoms by comparing mugs directly.

Mesa (pkg/vere/io/mesa.c)

Reassembled packets larger than U3_BLOB_THRESH are blobified before being
handed to Arvo, keeping large network payloads out of the loom.

What this does not change

  • Jam/cue format is unchanged. u3s_jam_xeno and u3s_cue_xeno* are
    unmodified.
  • No Hoon or Arvo changes. Bob atoms are a pure runtime optimization; Nock
    semantics are preserved.
  • No protocol-level changes visible outside the king/serf boundary.

Known limitations

  • u3r_mug_bytes takes a c3_h (uint32_t) length parameter, so the mug is
    unreliable for files larger than 4 GiB. This bounds the reliable operating
    range.
  • u3i_bytes similarly takes c3_w (uint32_t), capping loom atom size at 4 GiB.
  • Each retrieve call on a bob atom loads the file from disk. If the same bob atom
    is accessed many times during one Nock event, the file is re-read each time.
    A per-event materialization cache is a possible follow-on improvement.
  • The 2× peak loom pressure during |commit (Clay stores the file atom in both
    lat.ran and mim.dom) is an architectural property of Clay, not addressable
    at this layer. See 2x.md for analysis and options.
