A verified compression library in Lean 4, covering zlib/zstd via C FFI, pure-Lean native implementations, and formal proofs of correctness. Starting from an empty repository, this plan builds the entire library from project scaffolding through verified, optimized compression.
Very provable: these are pure mathematical functions. Tractable theorems (sketched in Lean after the list):
- Equivalence of the table-driven implementation to the bit-by-bit polynomial definition
- Compositionality of incremental computation: `crc32(crc32(init, a), b) = crc32(init, a ++ b)`
- Equivalence to a polynomial specification (CRC32 as division in GF(2)[x])
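A minimal sketch of what the bitwise reference and the compositionality statement could look like; all names (`crc32Step`, `crc32Bitwise`) are provisional, and the conventional initial/final XOR conditioning is omitted so that compositionality holds in exactly the stated form:

```lean
-- Provisional sketch: bit-by-bit CRC32 over the reflected polynomial
-- 0xEDB88320, without the initial/final XOR conditioning.
def crc32Step (crc : UInt32) (byte : UInt8) : UInt32 := Id.run do
  let mut c := crc ^^^ byte.toUInt32
  for _ in [0:8] do
    c := if c &&& 1 == 1 then (c >>> 1) ^^^ 0xEDB88320 else c >>> 1
  return c

def crc32Bitwise (init : UInt32) (data : ByteArray) : UInt32 :=
  data.foldl crc32Step init

-- Compositionality of incremental computation, as stated above:
theorem crc32Bitwise_append (init : UInt32) (a b : ByteArray) :
    crc32Bitwise (crc32Bitwise init a) b = crc32Bitwise init (a ++ b) := by
  sorry
```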
The key structural insight: decompression is deterministic (given a valid DEFLATE stream, there is exactly one output), but compression is not (different compression levels produce different valid bitstreams that all decompress to the same output).
The natural theorem is:
∀ (data : ByteArray) (level : Fin 10), inflate (deflate data level) = data
Proving this requires:
1. Formally specify what constitutes a valid DEFLATE bitstream
2. Prove `inflate` correctly decodes any valid bitstream
3. Prove `deflate` produces a valid bitstream
4. Prove the composition
Step 2 is the hardest implementation work but the most tractable proof. Step 3 is where the real pain lies -- the compression side has complex heuristics (lazy matching, level-dependent strategies) that are correct but messy to reason about.
Significantly harder than DEFLATE. The Zstd format involves:
- Finite State Entropy (FSE) coding (more complex than Huffman)
- Multiple interleaved bitstreams
- A dictionary system
- Repeat offset codes
- Frame/block structure with multiple block types
Estimated at 3-5x the work of DEFLATE.
Start with naive, easy-to-reason-about Lean implementations and prove correctness properties about these first. These initial versions prioritize clarity over performance: simple loops instead of table lookups for CRC32, straightforward linear scans instead of hash chains for LZ77 matching, direct bit-by-bit Huffman decoding instead of table-based multi-bit lookups.
Once the core proofs are established against these simple implementations,
we add progressively more efficient versions and prove they produce
identical output to the unoptimized versions. The roundtrip proofs then
transfer automatically: if inflate_fast is proven equivalent to
inflate_simple, and we have proven
inflate_simple (deflate data level) = data, then
inflate_fast (deflate data level) = data follows immediately.
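As a sketch of the transfer, with placeholder axioms standing in for the real definitions and the already-proven roundtrip:

```lean
-- Placeholders for the real definitions and previously proven theorems.
axiom inflateSimple : ByteArray → ByteArray
axiom inflateFast : ByteArray → ByteArray
axiom deflate : ByteArray → Fin 10 → ByteArray
axiom inflateFast_eq_simple : ∀ s, inflateFast s = inflateSimple s
axiom roundtrip_simple :
  ∀ (data : ByteArray) (level : Fin 10), inflateSimple (deflate data level) = data

-- The roundtrip transfers across the equivalence in two lines:
theorem roundtrip_fast (data : ByteArray) (level : Fin 10) :
    inflateFast (deflate data level) = data := by
  rw [inflateFast_eq_simple]
  exact roundtrip_simple data level
```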
Rather than implementing all nine DEFLATE compression levels at once, we work through them progressively:
- Level 0 (stored blocks, no compression): trivial to implement and prove correct
- Level 1 (greedy LZ77, fixed Huffman codes): the simplest real compression, and the first interesting roundtrip proof
- Levels 2-4 (lazy matching, increasing effort): each level extends the match-finding heuristic while producing equally valid DEFLATE bitstreams
- Levels 5-9 (dynamic Huffman codes, maximum effort): the most complex compression strategies
Each level produces a valid DEFLATE bitstream, so the decompressor proof covers all of them. The per-level work is proving that each compression strategy produces a valid bitstream, not re-proving decompression.
After the initial roundtrip theorem is established, optimization follows a generational approach. The key insight is that optimizing verified code is fundamentally different from optimizing unverified code: you cannot simply swap in a faster algorithm and check if tests pass — you must prove the new version agrees with the old one.
There are two approaches, chosen based on the scope of the change:
Approach 1: In-place modification. For small, local changes (e.g. replacing a linear scan with a hash lookup in one function), modify the implementation directly and update the proofs that it still satisfies its specification. This works well when the proof structure survives the change with modest adjustments.
Approach 2: Generational refinement. For larger changes (restructuring data flow, changing representations, rewriting a pipeline stage), build a new implementation alongside the existing one:
- Write the faster implementation (Gen N+1) as a new definition
- Prove `genN_plus_1 = genN` (or the relevant component equivalence)
- Transfer the roundtrip theorem across the equivalence
- The old implementation (Gen N) remains in the codebase — theorems reference it
This naturally produces a chain: Gen 0 (spec) → Gen 1 (first native) → Gen 2 (optimized) → ... Each generation exists in the codebase because theorems reference it. The chain is the proof trail.
Identifying what to optimize. Benchmarking means profiling, not just timing. Use Lean's profiling tools to identify where time is actually spent — allocation patterns, ByteArray copying, List-to-Array conversions, unnecessary intermediate structures. These are all algorithmic choices that can be improved in the next generation. A List-to-Array conversion that dominates runtime is as much a target for generational refinement as a change from linear to binary search.
Chain compression. Over time, the chain of generations may become long. Periodically, it is worth "compressing the chain" — proving that Gen N+2 directly equals Gen N, eliminating Gen N+1 as an intermediate. This simplifies the codebase and the proof dependencies. But don't do this reflexively: if an intermediate generation captures a genuinely different algorithmic idea, keep it — it serves as documentation of the design evolution. Compress when the intermediate's only purpose is as a proof stepping stone and the direct proof is clearly shorter. Sometimes it may also be possible to permute changes in the chain, bringing related improvements together and enabling further compressions.
When to stop. Optimization is open-ended. A reasonable stopping point: when the native Lean implementation is competitive with or faster than the FFI-wrapped C libraries on representative workloads.
Each native implementation follows a formalization-first cycle:
- Type signature of the new function, implemented as `:= sorry`
- Specification theorems stating the desired properties, with `:= by sorry` proofs
- Implementation of the function body
- Conformance tests verifying agreement with the FFI/library version
- Proofs of the specification theorems
This same pattern applies to optimized versions: the specification theorems are equivalence with the unoptimized version, and the roundtrip/correctness proofs transfer automatically via the equivalence.
Writing specs before implementations ensures the work stays grounded in what we can actually prove, and prevents implementations from getting ahead of the formalization.
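The cycle in miniature, for a hypothetical Adler32 module (every name here is illustrative, not settled API):

```lean
-- Step 1: type signatures first, bodies stubbed out.
def adlerStep (state : UInt32) (byte : UInt8) : UInt32 := sorry
def adler32 (data : ByteArray) : UInt32 := sorry

-- Step 2: specification theorems with deferred proofs.
theorem adler32_empty : adler32 ByteArray.empty = 1 := by sorry
theorem adler32_push (data : ByteArray) (b : UInt8) :
    adler32 (data.push b) = adlerStep (adler32 data) b := by sorry

-- Steps 3–5: implement the bodies, run conformance tests against the
-- FFI version, then discharge the `sorry`s.
```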
The work is organized into five tracks: a bootstrapping A-track, a linear B-track (DEFLATE verification pipeline), and three tracks C, D, and E that can be worked on concurrently once their dependencies are met. Two cross-cutting concerns run throughout.
Dependency graph:
```
A1 → A2 → A3
          ↓
          B1 → B2 → B3 → B4
               │    │
               ↓    ↓
               D    C

E (independent, needs A2)
```
- A1–A3 are sequential: project setup, FFI wrappers, archive formats.
- B1–B4 are sequential: native implementations and verified roundtrip. B1 depends on A3 (needs FFI wrappers and test infrastructure).
- C (size limit elimination) depends on B3: fuel-based proofs must exist before they can be replaced with well-founded recursion.
- D (benchmarking and optimization) depends on B2: native implementations must exist before they can be profiled.
- E (Zstd) is independent of the B-track: it follows the same methodology for a different compression format. It needs A2 (FFI wrappers for Zstd) but can otherwise start at any time.
- Cross-cutting concerns (characterizing properties, ZipForStd) apply from B1 onwards and are not gated on any particular phase.
Planning agents should spread work across tracks wherever dependencies allow, rather than finishing one track before starting another.
Lake project scaffolding, toolchain configuration, and build system.
Deliverables:
- `lakefile.lean` with Lake build configuration
- `lean-toolchain` pinning the Lean version
- `shell.nix` providing zlib, zstd, and pkg-config for NixOS
- C FFI scaffolding: `pkg-config` detection for zlib/zstd link flags, static linking support, platform detection
- `ARCHITECTURE.md` documenting the project structure
- `README.md` with build and usage instructions
C FFI bindings to system compression libraries. These provide the fast path and serve as the reference implementation for conformance testing.
Deliverables:
- Zlib FFI: gzip compress/decompress (whole-buffer and streaming), raw DEFLATE compress/decompress, zlib-format compress/decompress
- Checksum FFI: CRC32 and Adler32 via zlib
- Zstd FFI: compress/decompress (whole-buffer and streaming)
- File helpers: `compressFile`, `decompressFile` for each format
- Streaming API: incremental compression/decompression with bounded memory, stream piping between sources and sinks
- C implementation in `c/` with overflow guards and allocation failure checks
- Basic tests for each FFI function
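A hedged sketch of what one binding might look like; the extern symbol names and the `Except`-returning signatures are assumptions about this project's C shims, not zlib's own API:

```lean
-- Assumed C shim symbols; each shim wraps the corresponding zlib call
-- and reports failures as `Except.error` messages.
@[extern "lean_zlib_compress"]
opaque Zlib.compress (data : @& ByteArray) (level : UInt32) :
  Except String ByteArray

@[extern "lean_zlib_decompress"]
opaque Zlib.decompress (data : @& ByteArray) : Except String ByteArray
```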
Pure Lean archive format implementations and conformance test harness.
Deliverables:
- Binary packing: little-endian/big-endian read/write for multi-byte integers (see the sketch after this list)
- Tar archives: GNU tar format with PAX extended headers, streaming decompression of `.tar.gz` via FFI
- ZIP archives: local file headers, central directory, EOCD record, ZIP64 support, DEFLATE and stored methods, UTF-8 filenames
- Security hardening: path traversal prevention, checksum validation on parse, truncation detection
- Conformance test infrastructure: random input generation, cross-implementation testing (native compress + FFI decompress and vice versa once native implementations exist), edge cases (empty, single byte, large, incompressible)
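The binary packing sketch promised above, assuming `get!`/`push`-based ByteArray access (names illustrative):

```lean
-- Little-endian 32-bit read/write; the big-endian variants mirror these.
def readUInt32LE (b : ByteArray) (i : Nat) : UInt32 :=
  (b.get! i).toUInt32
    ||| (b.get! (i + 1)).toUInt32 <<< 8
    ||| (b.get! (i + 2)).toUInt32 <<< 16
    ||| (b.get! (i + 3)).toUInt32 <<< 24

def writeUInt32LE (b : ByteArray) (v : UInt32) : ByteArray :=
  (((b.push v.toUInt8).push (v >>> 8).toUInt8).push
    (v >>> 16).toUInt8).push (v >>> 24).toUInt8
```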
Pure Lean CRC32 and Adler32 with proofs of equivalence to a formal specification. Conformance tests against the FFI implementations.
Deliverables:
- Native Lean CRC32 (table-driven)
- Native Lean Adler32
- Proof: table-driven CRC32 matches bit-by-bit definition
- Proof: incremental computation composes correctly
- Conformance test harness comparing native vs FFI on random inputs
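Continuing the bitwise sketch from earlier, the table-driven variant and its equivalence obligation might look like this (names provisional):

```lean
-- 256-entry table computed by the same bitwise recurrence.
def crcTable : Array UInt32 :=
  (Array.range 256).map fun n => Id.run do
    let mut c := UInt32.ofNat n
    for _ in [0:8] do
      c := if c &&& 1 == 1 then (c >>> 1) ^^^ 0xEDB88320 else c >>> 1
    return c

-- One table lookup per byte instead of eight conditional shifts.
def crc32Table (init : UInt32) (data : ByteArray) : UInt32 :=
  data.foldl (fun crc b =>
    (crc >>> 8) ^^^ crcTable[((crc ^^^ b.toUInt32) &&& 0xFF).toNat]!) init

-- The B1 proof obligation: table-driven matches bit-by-bit.
theorem crc32Table_eq_bitwise (init : UInt32) (data : ByteArray) :
    crc32Table init data = crc32Bitwise init data := by
  sorry
```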
Pure Lean DEFLATE decompressor (inflate only). No proofs yet.
Deliverables:
- Huffman decoder (fixed and dynamic trees)
- LZ77 back-reference resolution with sliding window
- Gzip/Zlib framing (header/trailer parsing, checksum verification)
- Conformance tests against zlib on a corpus of compressed data
- Integration as an alternative backend for existing ZIP/tar code paths
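For the Huffman decoder deliverable, the "clarity first" bit-by-bit decode described in the native implementation strategy might be as simple as this (types illustrative):

```lean
-- A code tree and direct bit-by-bit decoding of one symbol.
inductive HuffTree where
  | leaf (sym : Nat)
  | node (zero one : HuffTree)

def decodeOne : HuffTree → List Bool → Option (Nat × List Bool)
  | .leaf s,   bits        => some (s, bits)
  | .node z _, false :: bs => decodeOne z bs
  | .node _ o, true :: bs  => decodeOne o bs
  | .node _ _, []          => none
```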
The overarching goal:
∀ (data : ByteArray) (level : Fin 10), inflate (deflate data level) = .ok data
(Both inflate and deflate return Except String ByteArray; the
theorem says compression always succeeds and decompression of
well-formed output never fails.)
The roundtrip decomposes into five building blocks. The decompressor (B3) contributes the "right half" of each invertibility theorem; the compressor (B4) contributes the "left half." Each building block is independently meaningful.
For levels 1–9 (compressed blocks), the data flows through three invertible layers:
```
data → matchLZ77 → symbols → huffEncode → bits → pack → compressed
                                                             │
compressed → unpack → bits → huffDecode → symbols → resolveLZ77 → data
```
Level 0 (stored blocks) bypasses LZ77 and Huffman entirely — the roundtrip reduces to stored-block framing (BB4) alone.
For levels 1–9, the full proof chains three invertibility steps:
```
inflate (deflate data level)
  = resolveLZ77 (huffDecode (unpack (pack (huffEncode (matchLZ77 data)))))
  = resolveLZ77 (huffDecode (huffEncode (matchLZ77 data)))  -- bitstream
  = resolveLZ77 (matchLZ77 data)                            -- Huffman
  = data                                                    -- LZ77
```
(Simplified — actual proof routes through block framing via BB4.)
`resolveLZ77 (matchLZ77 data level) [] = some data.toList`

The LZ77 matcher produces a symbol stream; resolving it reconstructs the original data. This holds for all compression levels because every valid match (greedy, lazy, or literal) describes the same data.
Level 0 (stored blocks, no LZ77): Emits stored blocks with literal bytes. Roundtrip is literal copy-through. Trivial.
Level 1 (greedy matching, fixed Huffman): At position p, the
matcher either emits Literal data[p] or finds the longest match at
distance d and emits BackRef len d. Proof by induction on p with
invariant resolveLZ77 remaining acc = some data.toList where
acc = data[0..p]:
- Literal: extends `acc` by one byte, direct.
- BackRef: matcher verified `data[p..p+len] = data[p-d..p-d+len]`. By `copyLoop_eq_ofFn`, the copy produces `data[p..p+len]`, extending `acc` to `data[0..p+len]`.
Levels 2–4 (lazy matching): At position p, finds match M1, then
checks p+1 for a longer M2. If M2 is longer, emits
Literal data[p]; BackRef M2; otherwise BackRef M1. Same invariant
— proof branches on matcher's decision but data[p..] is produced
either way.
Levels 5–9 (dynamic Huffman, more search): LZ77 validity is identical to levels 2–4; only which matches are found differs, not their correctness. Dynamic Huffman affects the encoding layer (BB2), not the LZ77 layer.
```
-- Given valid canonical code lengths producing prefix-free tables:
decodeSymbols litLens distLens (encodeSymbols litLens distLens syms ++ rest)
  = some (syms, rest)
```

Requires that `litLens` and `distLens` produce valid prefix-free codes (established by `fromLengths` success + `ValidLengths`), and that the symbol list is well-formed (literals < 256, lengths in 3–258, distances in 1–32768, terminated by end-of-block).
The single-symbol case is the core lemma:
decode_prefix_free says that for (cw, sym) in a prefix-free table,
decode table (cw ++ rest) = some (sym, rest). Extension to symbol
sequences is induction on the symbol list.
For length/distance sub-encoding (codes 257–285 + extra bits), the encode side reverses the table lookup. Invertibility follows because the tables are monotone and extra bits fill the gaps exactly.
`bytesToBits (bitsToBytes bits) = bits`

Holds when `bits.length` is a multiple of 8 (byte-aligned). In practice, DEFLATE block encoding pads the final byte, so the theorem applies to the padded bit sequence; the padding itself is handled at the block framing level (BB4).
- Stored: LEN/NLEN header + literal copy. Trivially invertible.
- Fixed Huffman: Shared constant tables. Reduces to BB2.
- Dynamic Huffman: Tree description (code lengths via RLE + code-length Huffman) must round-trip. Mechanically detailed and the hardest framing sub-proof.
- Block sequence: BFINAL (1 bit) + BTYPE (2 bits) are fixed-width. Symbol 256 terminates compressed blocks. BFINAL=1 marks last block.
- Cross-block back-references: LZ77 references can span block
boundaries (RFC 1951 §2). The decompressor passes accumulated output
across blocks. Invariant: after block k, output =
data[0..pₖ].
Chain BB1–BB4 to obtain the full roundtrip theorem.
| Building block | Decompressor half (B3) | Compressor half (B4) |
|---|---|---|
| BB1: LZ77 | `resolveLZ77` correctness, `copyLoop_eq_ofFn` | `matchLZ77` + fundamental theorem |
| BB2: Huffman | `decode_prefix_free`, tree correspondence | `huffEncode` + sequence roundtrip |
| BB3: Bitstream | `readBit_toBits`, `readBits_toBits` | `bitsToBytes` + pack roundtrip |
| BB4: Framing | `decodeStored_correct`, `decodeHuffman_correct`, `inflateLoop_correct`, `decodeDynamicTrees_correct` | Block framing encoder, dynamic tree encoding |
| BB5: Composition | — | Full roundtrip theorem |
The decompressor contributes the "right half" of each building block. The verification has two layers, reflecting the structure of DEFLATE:
Characterizing properties — mathematical content independent of any particular implementation:
- Huffman codes are prefix-free and canonical
- Kraft inequality bounds symbol counts
- Prefix-free codes decode deterministically (`decode_prefix_free`)
- Bitstream operations correspond to LSB-first bit lists
- LZ77 overlap semantics: `copyLoop_eq_ofFn`
- RFC 1951 tables verified by `decide`
- Fuel independence of spec decode
- `fromLengths` success implies `ValidLengths`
Algorithmic correspondence — the native implementation agrees with
a reference decoder that transcribes RFC 1951 §3.2.3 pseudocode into
proof-friendly style (List Bool, Option monad, fuel):
- Stored blocks
- Huffman blocks (literal, end-of-block, length/distance cases)
- Block loop with position invariants
- Dynamic Huffman tree decode
The reference decoder (Deflate.Spec.decode) is not a mathematical
specification — it is the RFC algorithm in proof-friendly dress. At
the block-dispatch level, the RFC IS pseudocode; correspondence with
a faithful transcription is the appropriate verification tool. The
mathematical content lives in the building blocks (Huffman, bitstream,
LZ77), not in the top-level algorithm.
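An assumed shape for `Deflate.Spec.decode`'s block loop, in the fuel/`List Bool`/`Option` style described above (helpers stubbed; everything provisional):

```lean
-- Stubbed helpers standing in for the real spec-level readers.
def readBit : List Bool → Option (Bool × List Bool)
  | b :: rest => some (b, rest)
  | []        => none

def readBits (n : Nat) (bits : List Bool) : Option (Nat × List Bool) := sorry
def decodeBlock (btype : Nat) (bits : List Bool) (out : List UInt8) :
    Option (List UInt8 × List Bool) := sorry

-- The block loop in the RFC's shape: read BFINAL and BTYPE, decode one
-- block, recurse unless BFINAL was set. Fuel guarantees termination.
def decodeBlocks : Nat → List Bool → List UInt8 → Option (List UInt8)
  | 0, _, _ => none
  | fuel + 1, bits, out => do
    let (bfinal, bits) ← readBit bits
    let (btype, bits) ← readBits 2 bits
    let (out, bits) ← decodeBlock btype bits out
    if bfinal then some out else decodeBlocks fuel bits out
```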
The compressor contributes the "left half" of each building block plus the composition into the full roundtrip theorem.
Deliverables (progressive, each level adds to the previous):
- Level 0: stored blocks, literal copy. Roundtrip trivial.
- Level 1: greedy LZ77 matching, fixed Huffman codes. Establishes BB1 for greedy matching; combines with BB2–BB4 for the first non-trivial end-to-end roundtrip.
- Levels 2–4: lazy matching. Same BB1 invariant with branching.
- Levels 5–9: dynamic Huffman codes. Adds dynamic tree encoding roundtrip (BB4) on top of BB1.
- Conformance tests: zlib can decompress our output and vice versa.
- Full roundtrip theorem: `inflate (deflate data level) = .ok data`.
Depends on: B3
The initial roundtrip theorem uses fuel parameters (explicit recursion
bounds) to ensure termination. This imposes a data size limit on the
theorem — e.g. data.size < N for some concrete N. The limit is an
artifact of the proof technique, not a fundamental constraint: DEFLATE
decompression always terminates (each iteration consumes at least one
bit), and compression always terminates (each step advances at least one
byte in the input).
This track has two stages:
C1: Raise bounds to huge. If the fuel is computed from data.size,
the bound may already be large. If it is a fixed constant, raise it to
an astronomically large value (exabyte scale). This should require only
modest proof adjustments and eliminates the bound as a practical concern.
C2: Eliminate fuel entirely. Replace fuel-based recursion with well-founded recursion on a decreasing measure (remaining input length for decompression, remaining unprocessed bytes for compression). This requires restructuring the recursive functions so that Lean's termination checker can verify progress. The result is a roundtrip theorem with no size restriction at all:
∀ (data : ByteArray) (level : Fin 10), inflate (deflate data level) = .ok data
C2 is a significant redesign of the proof-friendly spec functions. The native implementations will likely use structural patterns that terminate naturally — the main work is in the spec layer and the proofs.
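A toy model of the C2 pattern, assuming a fixed-size block header so the recursion visibly consumes bits; the real decoder consumes a variable amount and carries a proof that each block consumes at least one bit:

```lean
-- Recursion on `drop` is not structural, so Lean needs an explicit
-- decreasing measure: the number of remaining bits.
def skipBlocks : List Bool → Nat
  | [] => 0
  | _ :: rest => 1 + skipBlocks (rest.drop 2)
termination_by bits => bits.length
decreasing_by
  simp only [List.length_drop, List.length_cons]
  omega
```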
Depends on: B2
This track is open-ended. It has two dimensions: runtime performance (how fast the code runs) and compression quality (how small the output is). Both matter, and both follow the same generational refinement methodology.
Comparison targets. The native implementations should be compared against multiple reference compressors, not just zlib:
- zlib — the ubiquitous baseline
- miniz_oxide — Rust reimplementation of miniz, widely used
- libdeflate — optimized DEFLATE library, often faster than zlib
- zopfli — Google's maximum-compression DEFLATE encoder (very slow, very small output — the upper bound on DEFLATE quality)
Different targets illuminate different things: zlib and miniz_oxide are the "match this" bar for general use; libdeflate shows what's possible with careful optimization; zopfli shows the theoretical compression quality ceiling within the DEFLATE format.
Reporting tools. Build and maintain dashboards that make it easy to see where we stand:
- Runtime comparison: native vs each reference library, across data patterns (constant, cyclic, text, incompressible), sizes (1KB to 100MB), and compression levels (0–9). Report throughput in MB/s for both compression and decompression.
- Compression quality comparison: output sizes for the same inputs across native levels and each reference library's levels. Report both absolute sizes and ratios relative to the best-known result (zopfli).
- Regression tracking: store benchmark results so that optimization work can be validated against a baseline and regressions caught.
These tools are first-class deliverables, not afterthoughts.
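A minimal sketch of the timing side of such a harness, using core's `IO.monoMsNow`; the compressors passed in are whatever the dashboard is comparing:

```lean
-- Times one whole-buffer run and reports throughput in MB/s.
def bench (label : String) (f : ByteArray → ByteArray) (data : ByteArray) :
    IO Unit := do
  let t0 ← IO.monoMsNow
  let out := f data
  let t1 ← IO.monoMsNow
  let mb := data.size.toFloat / 1048576.0
  let secs := (t1 - t0).toFloat / 1000.0
  IO.println s!"{label}: {data.size} → {out.size} bytes, {mb / secs} MB/s"
```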
Methodology:
- Benchmark runtime and compression quality against reference implementations on representative workloads.
- Profile using Lean's profiling tools to identify where time is spent — not just which functions are slow, but why: allocation patterns, ByteArray copying, intermediate List/Array conversions, redundant traversals. For compression quality, analyze where the native compressor makes worse choices than the reference (shorter matches found, suboptimal block boundaries, less efficient Huffman trees).
- Identify the highest-impact improvement target across both dimensions.
- Implement the optimization using generational refinement (Approach 1 or 2 from the Optimization Strategy section).
- Prove correctness of the optimized version.
- Re-benchmark to measure the improvement.
- Repeat from step 2.
This cycle continues until the native implementations are competitive with or faster than the C libraries on both runtime and compression quality. At that point, the FFI wrappers become optional — the native path is verified, performant, and compresses well.
Independent — needs A2 for FFI wrappers, but can otherwise start at any time.
Native Lean Zstd implementation following the same B1–B4 methodology. This is a separate multi-month to multi-year project.
Deliverables:
- FSE (Finite State Entropy) decoder and encoder
- Zstd frame/block parsing
- Decompressor with proofs (following the B2–B3 pattern)
- Compressor with roundtrip proof (following the B4 pattern)
The same C and D tracks apply to Zstd once its B-track equivalent is sufficiently advanced.
These are not phases — they apply throughout all work from B1 onwards.
Beyond roundtrip correctness, prove mathematical properties that are independent of any particular implementation. These deepen the verification and catch classes of bugs that roundtrip alone cannot.
Examples, to be pursued as the relevant algorithms are implemented:
- Prefix-freeness and Kraft inequality for Huffman codes
- Compression ratio bounds: stored blocks have at most ~0.05% overhead; fixed Huffman has known worst-case expansion; dynamic Huffman is bounded by the entropy of the symbol distribution
- Determinism of decompression: any correct DEFLATE decompressor must produce the same output for the same input (follows from prefix-freeness + deterministic LZ77 resolution)
- CRC32 as polynomial division in GF(2)[x], beyond just table-implementation equivalence
These properties are not gated on optimization or size limit work. They can and should be pursued as soon as the relevant algorithms exist, starting from B1.
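Two of these, stated in core-library terms to show the flavor (names and exact phrasings are assumptions; the Kraft bound is scaled by `2 ^ maxLen` to stay in `Nat`):

```lean
-- Prefix-freeness of a set of codewords.
def PrefixFree (codes : List (List Bool)) : Prop :=
  ∀ c₁ ∈ codes, ∀ c₂ ∈ codes, (∃ t, c₁ ++ t = c₂) → c₁ = c₂

-- Kraft inequality, scaled to naturals: Σ 2^(maxLen - len) ≤ 2^maxLen.
theorem kraft (codes : List (List Bool)) (maxLen : Nat)
    (hnodup : codes.Nodup)
    (hlen : ∀ c ∈ codes, c.length ≤ maxLen)
    (hpf : PrefixFree codes) :
    (codes.map fun c => 2 ^ (maxLen - c.length)).sum ≤ 2 ^ maxLen := by
  sorry
```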
During proof work, lemmas about ByteArray, Array, List, Nat,
and BitVec are routinely developed that have nothing to do with
compression — they are general-purpose facts missing from Lean's
standard library.
These should be:
- Collected in `ZipForStd/` as they are developed
- Kept general (no compression-specific assumptions)
- In a form where collaborators could upstream them to Lean's standard library via PRs
This is part of every phase, not a separate effort. A healthy ZipForStd pipeline demonstrates that the verification methodology scales and contributes back to the ecosystem.
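An example of the flavor (hypothetical; whether this exact lemma is missing upstream is not checked here):

```lean
-- General-purpose ByteArray fact with no compression content: indexing
-- into an extracted slice agrees with indexing into the original.
theorem ByteArray.get!_extract (a : ByteArray) (start stop i : Nat)
    (h : i < (a.extract start stop).size) :
    (a.extract start stop).get! i = a.get! (start + i) := by
  sorry
```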
- Performance: An initial naive Lean DEFLATE implementation will be slower than zlib (which is hand-tuned C with SIMD specializations). The FFI implementations serve as the fast path until Track D closes the gap through iterative profiling and verified optimization.
- Specification gap: RFC 1951 is informal English. Senjak & Hofmann (2016) formalized it in Coq (see References below), but no Lean 4 formalization exists. Formalizing the spec is a prerequisite for proving anything about an implementation.
- Compression level fidelity: Different DEFLATE compressors (zlib, miniz_oxide, libdeflate, etc.) all produce different byte streams — the format only specifies decompression, not compression choices. Byte-for-byte output matching is a non-goal. Compression quality (output sizes), however, is a goal — Track D targets competitive ratios across all levels.
- Sliding window management: Mutable arrays in Lean require careful handling of linearity/borrowing for performance.
- Huffman tree construction proofs: Code length limits, canonical ordering — tedious but not deep.
- LZ77 match-finding: Inherently heuristic. Proving it produces valid output is tractable; proving it produces good output (compression ratio bounds) is a characterizing property target.
- Zstd is a moving target: The format is complex and has evolved; the RFC (3.1) is 100+ pages.
- RFC 1951 (Deutsch, 1996), "DEFLATE Compressed Data Format Specification version 1.3". The primary specification. Available in `references/rfc1951.txt`. Section 3.2.3 contains the decoding algorithm pseudocode; Section 3.2.2 defines canonical Huffman codes; Section 3.2.5 defines the length/distance tables.
- RFC 1950 (Deutsch & Gailly, 1996), "ZLIB Compressed Data Format Specification version 3.3". Zlib framing around DEFLATE. Available in `references/rfc1950.txt`.
- RFC 1952 (Deutsch, 1996), "GZIP file format specification version 4.3". Gzip framing around DEFLATE. Available in `references/rfc1952.txt`.
- Senjak & Hofmann (2016), "An implementation of Deflate in Coq", arXiv:1609.01220. Available in `references/senjak-hofmann-2016.pdf` (LaTeX source in `references/codings.tex`). A complete Coq formalization of DEFLATE with a verified roundtrip. The most relevant prior art is their canonical prefix-free coding library (Sections 1–4): uniqueness from code lengths, constructive existence via Kraft inequality, 4-axiom characterization. Their relational specification framework (Section 5: "strong decidability" / "strong uniqueness") is designed for Coq's extraction mechanism and is not adopted here — Lean 4 is code-first, so we write the implementation and prove characterizing properties about it rather than extracting code from relational specifications.
references/senjak-hofmann-2016.pdf(LaTeX source inreferences/codings.tex). A complete Coq formalization of DEFLATE with a verified roundtrip. The most relevant prior art is their canonical prefix-free coding library (Sections 1–4): uniqueness from code lengths, constructive existence via Kraft inequality, 4-axiom characterization. Their relational specification framework (Section 5: "strong decidability" / "strong uniqueness") is designed for Coq's extraction mechanism and is not adopted here — Lean 4 is code-first, so we write the implementation and prove characterizing properties about it rather than extracting code from relational specifications.