Skip to content

perf: serve cached S3 snapshot on cold start without waiting for mirror#226

Merged
joshfriend merged 4 commits intomainfrom
jfriend/cold-start-snapshot-serve
Mar 26, 2026
Merged

perf: serve cached S3 snapshot on cold start without waiting for mirror#226
joshfriend merged 4 commits intomainfrom
jfriend/cold-start-snapshot-serve

Conversation

@joshfriend
Copy link
Copy Markdown
Contributor

@joshfriend joshfriend commented Mar 24, 2026

When a snapshot request arrives and the local mirror isn't ready, cachew blocks the client behind ensureCloneReady — downloading the mirror snapshot from S3, extracting it to a local bare mirror, running git fetch to freshen, then finally serving a snapshot. For large repos (e.g. an 8 GB mirror), this takes minutes while the client receives zero bytes.

This PR makes three changes:

Serve S3 snapshots on cold start without blocking on mirror warmup. If a cached snapshot exists in S3, it streams directly to the client while the mirror is still cold. Concurrent cold requests are deduped so only one S3 connection is opened; followers wait for the local disk backfill and then serve from NVMe.

Deferred background mirror restore. After the first cold snapshot is served, a background job restores the mirror from its S3 snapshot and freshens it with a git fetch. This closes the gap where repos served only via cached snapshots would never warm their mirror, preventing bundle delta generation. The restore waits for all in-flight cold snapshot serves to finish before starting to avoid disk I/O contention. On failure it resets to StateEmpty rather than falling back to a heavyweight git clone.

Fix stuck StateCloning on clone failure. Previously, if both snapshot restore and git clone failed in startClone, the repo was left in StateCloning permanently. Now it resets to StateEmpty so subsequent requests can retry.

Tested on playpen with android-register (8.3 GB). Cold start drops from 8+ min to ~28s. Mirror becomes hot ~2 min after first serve, enabling bundle deltas for subsequent workstations.

@joshfriend joshfriend requested a review from a team as a code owner March 24, 2026 20:48
@joshfriend joshfriend requested review from js-murph and removed request for a team March 24, 2026 20:48
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f927115388

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +214 to +215
w.Header().Set("Content-Type", "application/zstd")
if _, err := io.Copy(w, reader); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep bundle hint on cold-start snapshot cache hits

Streaming the cached snapshot directly here bypasses serveSnapshotWithBundle, so the response no longer sets X-Cachew-Bundle-Url when the mirror is still warming. In that cold-start path, cachew git restore (see cmd/cachew/git.go, which only applies a delta when that header is present) will stop at the cached snapshot commit instead of catching up to HEAD, which can return stale source trees for builds after new upstream commits land.

Useful? React with 👍 / 👎.

@joshfriend joshfriend marked this pull request as draft March 24, 2026 21:00
@joshfriend joshfriend force-pushed the jfriend/cold-start-snapshot-serve branch 3 times, most recently from 46fe56c to 7d8f0e0 Compare March 25, 2026 17:27
…ror warmup

Avoid blocking cold-start snapshot requests on mirror restore, and decouple tiered-cache backfill from the response path so local disk caching does not throttle the initial client download.
@joshfriend joshfriend force-pushed the jfriend/cold-start-snapshot-serve branch from 7d8f0e0 to 7cc9ada Compare March 25, 2026 17:28
Make concurrent cold snapshot requests for the same repo wait for a single in-flight fill, then serve from the local cache instead of opening competing S3 downloads.
After serving a cached S3 snapshot on cold start, schedule a background
mirror restore so the mirror becomes hot and cachew can generate fresh
bundle deltas. Without this, repos served only via cached snapshots
would never warm their mirror.

The restore waits for all in-flight cold snapshot serves to complete
before starting, avoiding disk I/O contention. On failure it resets
to StateEmpty rather than falling back to a heavyweight git clone.

Also fixes a bug where startClone left the repo stuck in StateCloning
forever if both snapshot restore and git clone failed.
}
return
}
} else {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a few iterations of using the spool system but they were all quite slow. This version just lets the first requestor pull the snapshot from the backend and any requests that come in while that happens block until that one finishes and buffers it to disk. After that the other requests resume from the cached disk copy. Without this I could pretty easily bottleneck simultaneous clients by trying to download the snapshot from the s3 backend at the same time.

@joshfriend joshfriend marked this pull request as ready for review March 26, 2026 18:55
@joshfriend joshfriend merged commit 483697d into main Mar 26, 2026
7 checks passed
@joshfriend joshfriend deleted the jfriend/cold-start-snapshot-serve branch March 26, 2026 18:56
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c963e88313

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +612 to +613
if _, loaded := s.deferredRestoreOnce.LoadOrStore(upstream, true); loaded {
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Retry deferred mirror restore after failed attempt

scheduleDeferredMirrorRestore permanently sets deferredRestoreOnce on the first cold snapshot serve, but the key is never cleared when the deferred restore fails (tryRestoreSnapshot or FetchLenient paths reset the repo to StateEmpty and return). For repos that are mostly accessed via /snapshot.tar.zst, a single transient failure means every later cold request will keep serving cached snapshots while skipping restore forever, so the mirror never becomes ready again on that process.

Useful? React with 👍 / 👎.

Comment on lines +239 to 241
if srcErr != nil || bgErr != nil {
b.cancel()
_ = b.dst.Close()
return errors.WithStack(srcErr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return background close errors from backfill reader

When the async backfill goroutine reports an error (bgErr != nil) but src.Close() succeeds, Close() still returns errors.WithStack(srcErr), which is nil. This hides destination write/close failures that were previously surfaced, making backfill failures silent and preventing callers from observing that the cache write failed.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants