perf: serve cached S3 snapshot on cold start without waiting for mirror#226
perf: serve cached S3 snapshot on cold start without waiting for mirror#226joshfriend merged 4 commits intomainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f927115388
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
internal/strategy/git/snapshot.go
Outdated
| w.Header().Set("Content-Type", "application/zstd") | ||
| if _, err := io.Copy(w, reader); err != nil { |
There was a problem hiding this comment.
Keep bundle hint on cold-start snapshot cache hits
Streaming the cached snapshot directly here bypasses serveSnapshotWithBundle, so the response no longer sets X-Cachew-Bundle-Url when the mirror is still warming. In that cold-start path, cachew git restore (see cmd/cachew/git.go, which only applies a delta when that header is present) will stop at the cached snapshot commit instead of catching up to HEAD, which can return stale source trees for builds after new upstream commits land.
Useful? React with 👍 / 👎.
46fe56c to
7d8f0e0
Compare
…ror warmup Avoid blocking cold-start snapshot requests on mirror restore, and decouple tiered-cache backfill from the response path so local disk caching does not throttle the initial client download.
7d8f0e0 to
7cc9ada
Compare
Make concurrent cold snapshot requests for the same repo wait for a single in-flight fill, then serve from the local cache instead of opening competing S3 downloads.
After serving a cached S3 snapshot on cold start, schedule a background mirror restore so the mirror becomes hot and cachew can generate fresh bundle deltas. Without this, repos served only via cached snapshots would never warm their mirror. The restore waits for all in-flight cold snapshot serves to complete before starting, avoiding disk I/O contention. On failure it resets to StateEmpty rather than falling back to a heavyweight git clone. Also fixes a bug where startClone left the repo stuck in StateCloning forever if both snapshot restore and git clone failed.
| } | ||
| return | ||
| } | ||
| } else { |
There was a problem hiding this comment.
I tried a few iterations of using the spool system but they were all quite slow. This version just lets the first requestor pull the snapshot from the backend and any requests that come in while that happens block until that one finishes and buffers it to disk. After that the other requests resume from the cached disk copy. Without this I could pretty easily bottleneck simultaneous clients by trying to download the snapshot from the s3 backend at the same time.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c963e88313
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _, loaded := s.deferredRestoreOnce.LoadOrStore(upstream, true); loaded { | ||
| return |
There was a problem hiding this comment.
Retry deferred mirror restore after failed attempt
scheduleDeferredMirrorRestore permanently sets deferredRestoreOnce on the first cold snapshot serve, but the key is never cleared when the deferred restore fails (tryRestoreSnapshot or FetchLenient paths reset the repo to StateEmpty and return). For repos that are mostly accessed via /snapshot.tar.zst, a single transient failure means every later cold request will keep serving cached snapshots while skipping restore forever, so the mirror never becomes ready again on that process.
Useful? React with 👍 / 👎.
| if srcErr != nil || bgErr != nil { | ||
| b.cancel() | ||
| _ = b.dst.Close() | ||
| return errors.WithStack(srcErr) |
There was a problem hiding this comment.
Return background close errors from backfill reader
When the async backfill goroutine reports an error (bgErr != nil) but src.Close() succeeds, Close() still returns errors.WithStack(srcErr), which is nil. This hides destination write/close failures that were previously surfaced, making backfill failures silent and preventing callers from observing that the cache write failed.
Useful? React with 👍 / 👎.
When a snapshot request arrives and the local mirror isn't ready, cachew blocks the client behind ensureCloneReady — downloading the mirror snapshot from S3, extracting it to a local bare mirror, running git fetch to freshen, then finally serving a snapshot. For large repos (e.g. an 8 GB mirror), this takes minutes while the client receives zero bytes.
This PR makes three changes:
Serve S3 snapshots on cold start without blocking on mirror warmup. If a cached snapshot exists in S3, it streams directly to the client while the mirror is still cold. Concurrent cold requests are deduped so only one S3 connection is opened; followers wait for the local disk backfill and then serve from NVMe.
Deferred background mirror restore. After the first cold snapshot is served, a background job restores the mirror from its S3 snapshot and freshens it with a git fetch. This closes the gap where repos served only via cached snapshots would never warm their mirror, preventing bundle delta generation. The restore waits for all in-flight cold snapshot serves to finish before starting to avoid disk I/O contention. On failure it resets to StateEmpty rather than falling back to a heavyweight git clone.
Fix stuck StateCloning on clone failure. Previously, if both snapshot restore and git clone failed in startClone, the repo was left in StateCloning permanently. Now it resets to StateEmpty so subsequent requests can retry.
Tested on playpen with android-register (8.3 GB). Cold start drops from 8+ min to ~28s. Mirror becomes hot ~2 min after first serve, enabling bundle deltas for subsequent workstations.