blob storage for large atoms #985
Draft
matthew-levan wants to merge 29 commits into ml/64 from
Blob Store for Very Large Atoms
Summary
Introduces a blob store that lets Urbit handle files larger than 32 MiB without
ever loading them fully into the loom. Large atoms are stored as ordinary files
under `$pier/.urb/bob/` and referenced on the loom by a new "bob atom": a
24-byte stub that carries a mug (content hash) and a sequence number in place of
the actual bytes. The rest of the runtime (serialization, event log, IPC, Clay
sync) handles bob atoms transparently.
Motivation
The loom is a fixed-size memory-mapped region. Before this change, committing a
2 GiB file via `|commit %base` required allocating a 2 GiB atom on the loom,
jamming it for the event log, and sending the jammed bytes over the king/serf
pipe. That at least tripled peak memory usage and made large file support
practically impossible.
Design
Bob atoms (`pkg/noun/allocate.h`, `imprison.c`, `retrieve.c`)
A new indirect atom variant. The MSB of `u3a_atom.len_w` (`u3a_blob_flag`,
`0x80000000`) marks an atom as a bob. The remaining fields store:

- `mug_h`: MurmurHash3 of the file content (used as the loom noun mug and as
  the blob bucket directory name)
- `buf_w[0]`: sequence number within the mug bucket

`u3i_blob(mug, seq)` allocates a bob atom. All retrieve functions (`u3r_met`,
`u3r_bytes`, `u3r_bit`, `u3r_half`, `u3r_halfs`, `u3r_chubs`, `u3r_sing`,
`u3r_mug`) detect bob atoms and materialize them on demand by loading from disk,
operating on the result, then freeing it. The mug is cached in `mug_h` after
first computation, so `u3r_mug` only reads from disk once per atom lifetime.
The allocator (`allocate.c`) calls a registered `bob_free_f` callback when a bob
atom's refcount reaches zero, deleting the blob file. The callback is registered
by `disk.c` after the pier is opened.
`_ca_take_atom` masks `len_w` with `u3a_blob_mask` when computing the allocation
size, so bob atoms are copied correctly across road transitions.
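
The flag-bit scheme described above can be illustrated with a minimal sketch. It is not the code from this PR: `demo_atom` is a hypothetical stand-in for the relevant `u3a_atom` fields, the mask value is assumed to be the complement of the `0x80000000` flag, and storing a word length of 1 in a bob is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Flag value is taken from the description; the mask value is an
   assumption (complement of the flag). */
#define DEMO_BLOB_FLAG 0x80000000u
#define DEMO_BLOB_MASK 0x7fffffffu

/* Hypothetical stand-in for the relevant fields of u3a_atom. */
typedef struct {
  uint32_t mug_h;     /* content hash, also names the bucket  */
  uint32_t len_w;     /* word length; MSB set marks a bob     */
  uint32_t buf_w[1];  /* buf_w[0] holds the sequence number   */
} demo_atom;

/* Nonzero when the atom is a bob (MSB of len_w is set). */
static int demo_is_bob(const demo_atom* a)
{
  return (a->len_w & DEMO_BLOB_FLAG) != 0;
}

/* Word length with the flag bit masked off, as an allocator would
   need when sizing a copy. */
static uint32_t demo_len(const demo_atom* a)
{
  return a->len_w & DEMO_BLOB_MASK;
}

/* Build a bob stub from a mug and a sequence number. */
static demo_atom demo_make_bob(uint32_t mug, uint32_t seq)
{
  demo_atom a;
  a.mug_h    = mug;
  a.len_w    = DEMO_BLOB_FLAG | 1u;  /* assumed: one payload word */
  a.buf_w[0] = seq;
  return a;
}

/* Sequence number of a bob; only meaningful for bob atoms. */
static uint32_t demo_seq(const demo_atom* a)
{
  assert(demo_is_bob(a));
  return a->buf_w[0];
}
```

Masking before sizing matters because an unmasked `len_w` with the MSB set would read as a length of over two billion words.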
Blob store (`pkg/vere/blob.c`, `blob.h`)
Files live at `$pier/.urb/bob/<mug>/<seq>`. A per-bucket lockfile
(`<mug>/lock`) holds the next available sequence number as an ASCII decimal,
protected by `fcntl` advisory locking. On write, a byte-for-byte dedup scan
checks existing files in the bucket before allocating a new sequence slot.
Earth (the king process) is the sole writer. Mars (the serf) is read-only.
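
As a rough illustration of the lockfile scheme, the sketch below claims a sequence number from a file holding an ASCII decimal counter under an `fcntl` write lock. The function name `demo_claim_seq`, the choice to start counting at 0 for an empty lockfile, and the error handling are assumptions for illustration, not details taken from `blob.c`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Claim the next sequence number stored in the lockfile at path_c.
   Returns the claimed number, or -1 on error. An empty or new
   lockfile is treated as holding 0 (an assumption). */
static long demo_claim_seq(const char* path_c)
{
  assert(path_c != NULL);

  int fd = open(path_c, O_RDWR | O_CREAT, 0600);
  if ( fd < 0 ) return -1;

  /* Exclusive advisory lock on the whole file (l_start = l_len = 0). */
  struct flock lok = {0};
  lok.l_type   = F_WRLCK;
  lok.l_whence = SEEK_SET;
  if ( fcntl(fd, F_SETLKW, &lok) < 0 ) {
    close(fd);
    return -1;
  }

  char    buf[32] = {0};
  long    seq     = -1;
  ssize_t got     = pread(fd, buf, sizeof(buf) - 1, 0);

  if ( got >= 0 ) {
    seq = (got > 0) ? strtol(buf, NULL, 10) : 0;

    /* Store seq + 1 as the next available number. */
    int len = snprintf(buf, sizeof(buf), "%ld", seq + 1);
    if (  ftruncate(fd, 0) < 0
       || pwrite(fd, buf, (size_t)len, 0) != (ssize_t)len )
    {
      seq = -1;
    }
  }

  /* Release the lock explicitly; close() would also drop it. */
  lok.l_type = F_UNLCK;
  fcntl(fd, F_SETLK, &lok);
  close(fd);
  return seq;
}
```

Because the locks are advisory, this only serializes writers that also take the lock, which fits a design with a single writing process.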
Key functions:
- `u3_blob_save()`: write from a byte buffer, with dedup
- `u3_blob_save_fd()`: write from an open file descriptor via `mmap`, avoiding
  a full heap allocation for large files
- `u3_blob_load()` / `u3r_blob_load()`: load a blob into a loom atom via
  `mmap` + `u3i_bytes`, releasing the mapping immediately after the copy

Ram/Tap serialization (`pkg/noun/serial.c`, `serial.h`)
A new wire format that extends jam with a 2-bit fixed tag scheme.
Wire format:
`[magic "RAM\0" 4B][version 0x01 1B][ram bits...]`
`u3s_ram_xeno()` encodes a noun tree, emitting bob atoms as compact ~10-byte
blob refs.
`u3s_tap_xeno()` decodes ram bytes, reconstructing bob atoms from
mug+seq pairs.
Ram replaces jam/cue for all event log entries and IPC messages. Old jam-encoded
data (VER1/VER2 epochs) is still readable via fallback to `u3s_cue_xeno`.
IPC (`pkg/vere/newt.c`, `lord.c`, `mars.c`, `vere.h`)
The newt wire protocol gains a version byte: `0x00` = jam (legacy), `0x01` = ram.
`u3_newt_send_vers()` selects the encoding. All send paths use ram (`0x01`); all
receive paths try tap first, falling back to cue for backward compatibility.
`u3_meat` gains a `ver_y` field to carry the version through the receive path.
Epoch version (`pkg/noun/version.h`, `disk.c`)
`U3E_VER3` (epoch version 3) marks epochs that use ram-encoded event log entries.
`_disk_epoc_load()` handles VER2→VER3 migration by rolling over to a new epoch.
`u3_blob_init()` is called on epoch creation and load to ensure `.urb/bob/`
exists.
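
Since VER3 epochs and legacy epochs can coexist, a reader has to tell ram-encoded bytes from older jam data. The sketch below shows one plausible way to do that by checking the 5-byte header from the wire format given earlier (`"RAM\0"` magic plus version `0x01`). All names here are hypothetical, and whether `u3s_tap_xeno` decides its fallback this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RAM_HEAD 5u     /* 4-byte magic + 1-byte version */
#define DEMO_RAM_VERS 0x01u

typedef enum {
  DEMO_ENC_LEGACY = 0,  /* no ram v1 header: hand to the cue fallback */
  DEMO_ENC_RAM    = 1   /* ram v1 header present                      */
} demo_enc;

static const uint8_t demo_magic[4] = { 'R', 'A', 'M', 0 };

/* Write the header into buf; returns the number of bytes written. */
static size_t demo_put_header(uint8_t* buf, size_t cap)
{
  assert(cap >= DEMO_RAM_HEAD);
  memcpy(buf, demo_magic, sizeof(demo_magic));
  buf[4] = DEMO_RAM_VERS;
  return DEMO_RAM_HEAD;
}

/* Classify a buffer by its leading bytes. */
static demo_enc demo_detect(const uint8_t* buf, size_t len)
{
  if (  len >= DEMO_RAM_HEAD
     && 0 == memcmp(buf, demo_magic, sizeof(demo_magic))
     && DEMO_RAM_VERS == buf[4] )
  {
    return DEMO_ENC_RAM;
  }
  return DEMO_ENC_LEGACY;
}
```

A caller would run the tap decoder on `DEMO_ENC_RAM` input and the cue decoder otherwise, mirroring the "try tap first, fall back to cue" behavior described for the receive paths.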
Unix Clay sync (`pkg/vere/io/unix.c`)
`_unix_update_file()` and `_unix_initial_update_file()` detect files larger than
`U3_BLOB_THRESH` (32 MiB), stream them into the blob store via
`u3_blob_save_fd()`, and send a bob atom to Clay instead of loading the bytes
into the loom.
`_unix_write_file_hard()` handles bob atoms in `%ergo` write-back by streaming
directly from the blob store file to the destination in 64 KiB chunks, without
allocating a heap buffer.
`_unix_write_file_soft()` short-circuits the "has-it-changed?" disk read for bob
atoms by comparing mugs directly.
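
The chunked write-back can be sketched as a plain POSIX copy loop over a 64 KiB stack buffer. This is an illustration of the technique only: `demo_copy_fd` and `demo_copy_path` are hypothetical names, and the actual `_unix_write_file_hard()` may differ in error handling and in how it opens the destination.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_CHUNK (64 * 1024)

/* Copy everything readable from src to dst in 64 KiB chunks.
   Returns 0 on success, -1 on error. */
static int demo_copy_fd(int src, int dst)
{
  char    buf[DEMO_CHUNK];  /* stack buffer, no heap allocation */
  ssize_t got;

  assert(src >= 0 && dst >= 0);

  for (;;) {
    got = read(src, buf, sizeof(buf));
    if ( got < 0 ) {
      if ( EINTR == errno ) continue;  /* interrupted: retry read */
      return -1;
    }
    if ( 0 == got ) return 0;          /* end of file */

    /* write() may accept fewer bytes than requested. */
    ssize_t off = 0;
    while ( off < got ) {
      ssize_t put = write(dst, buf + off, (size_t)(got - off));
      if ( put < 0 ) {
        if ( EINTR == errno ) continue;  /* interrupted: retry write */
        return -1;
      }
      off += put;
    }
  }
}

/* Open both paths and stream src_c into dst_c. */
static int demo_copy_path(const char* src_c, const char* dst_c)
{
  int src = open(src_c, O_RDONLY);
  if ( src < 0 ) return -1;

  int dst = open(dst_c, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if ( dst < 0 ) {
    close(src);
    return -1;
  }

  int ret = demo_copy_fd(src, dst);
  close(src);
  if ( close(dst) < 0 ) ret = -1;
  return ret;
}
```

Peak memory stays at one chunk regardless of file size, which is the point of streaming rather than materializing the atom.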
Mesa (`pkg/vere/io/mesa.c`)
Reassembled packets larger than `U3_BLOB_THRESH` are blobified before being
handed to Arvo, keeping large network payloads out of the loom.
What this does not change
`u3s_jam_xeno` and `u3s_cue_xeno*` are unmodified.
semantics are preserved.
Known limitations
- `u3r_mug_bytes` takes a `c3_h` (`uint32_t`) length parameter, so the mug is
  unreliable for files larger than 4 GiB. This bounds the reliable operating
  range.
- `u3i_bytes` similarly takes `c3_w` (`uint32_t`), capping loom atom size at
  4 GiB.
- If a bob atom is accessed many times during one Nock event, the file is
  re-read each time. A per-event materialization cache is a possible follow-on
  improvement.
- The 2x cost of `|commit` (Clay stores the file atom in both `lat.ran` and
  `mim.dom`) is an architectural property of Clay, not addressable at this
  layer. See `2x.md` for analysis and options.
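
The 4 GiB limits follow from narrowing a 64-bit file size to a `uint32_t` length. The short sketch below shows the wraparound and a guard a caller could apply before passing a length on. The helper names are hypothetical and not part of this PR.

```c
#include <assert.h>
#include <stdint.h>

/* What an unchecked cast does: keeps only the low 32 bits. */
static uint32_t demo_truncate_len(uint64_t len)
{
  return (uint32_t)len;
}

/* Nonzero when len survives narrowing to uint32_t unchanged. */
static int demo_fits_u32(uint64_t len)
{
  return len <= UINT32_MAX;
}

/* Narrow a length that the caller has already range-checked. */
static uint32_t demo_narrow_len(uint64_t len)
{
  assert(demo_fits_u32(len));
  return (uint32_t)len;
}
```

For example, a 5 GiB size passed through `demo_truncate_len` comes out as 1 GiB, so a hash computed over that length would cover only part of the file.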