
block 3: resource safety & correctness (SQLite hardening, HTTP/LLM timeouts, backpressure, ctx audit, graceful shutdown) #73

Merged
aksOps merged 8 commits into main from feat/block3-resource-safety
Apr 24, 2026

Conversation


@aksOps aksOps commented Apr 24, 2026

Summary

Block 3 ships seven landable commits that make docsiq resilient to misbehaving upstreams, slow disks, abrupt shutdowns, and runaway requests. Every outbound call now has a timeout; every HTTP provider shares a pooled client; SQLite runs with explicit PRAGMAs and a raised connection pool; EmbedBatch chunks against each provider's declared ceiling with backpressure; the HTTP surface is ringed by http.TimeoutHandler; and the graceful-shutdown + panic-log paths carry the context an operator needs.

Commits

  1. feat(store) — SQLite hardening: explicit PRAGMAs (journal_mode=WAL, busy_timeout=5000, synchronous=NORMAL, foreign_keys=ON); raised pool (MaxOpenConns=4, MaxIdleConns=2, ConnMaxLifetime=1h); (*Store).Ping(ctx) for /readyz. [3.6]
  2. feat(llm) — One pooled *http.Client per provider via newHTTPClient() (MaxIdle 100, per-host 10, IdleTimeout 90s, TLSHandshake 10s, ResponseHeader 60s). [3.5]
  3. feat(llm) — Per-call timeout wrapping (default 60s via llm.call_timeout). [3.3]
  4. feat(llm,embedder) — Provider.BatchCeiling() (OpenAI 2048, Azure 16, Ollama 128); lcProvider.EmbedBatch chunks + backpressures via buffered channel; embedder.New clamps batchSize to ceiling. [3.4]
  5. feat(ci,crawler) — scripts/ctx-audit.sh static check; crawler parseSitemap/discoverSitemap/extractLinks now thread ctx through http.NewRequestWithContext. [3.1]
  6. feat(api) — requestTimeoutMiddleware(cfg) wraps the router with 30s default / 10m upload override (POST /api/upload, POST /api/projects/{p}/import); sits inside securityHeadersMiddleware and outside loggingMiddleware. [3.2]
  7. feat(api,serve) — Panic log enriched with req_id, route, method, user, full debug.Stack(); srv.Shutdown deadline raised from 10s to 30s; per-phase log lines (HTTP server shutting down, workq draining). [3.7]

Test plan

  • CGO_ENABLED=1 go test -tags sqlite_fts5 -timeout 300s ./... — all pass (566 tests, 22 packages)
  • CGO_ENABLED=1 go test -tags sqlite_fts5 -race -timeout 300s ./... — all pass, 0 data races (588 tests post-Block 3)
  • CGO_ENABLED=1 go test -tags "sqlite_fts5 integration" -race -timeout 1200s ./... — all pass (after adding database/sql.(*DB).connectionCleaner to the goleak ignore list — spawned by ConnMaxLifetime=1h, exits on DB.Close())
  • go vet -tags sqlite_fts5 ./... — clean
  • bash scripts/ctx-audit.sh — CTX AUDIT OK

Migration / ops notes

  • New Viper keys with env bindings: llm.call_timeout (DOCSIQ_LLM_CALL_TIMEOUT), server.request_timeout (DOCSIQ_SERVER_REQUEST_TIMEOUT), server.upload_timeout (DOCSIQ_SERVER_UPLOAD_TIMEOUT) — a registration sketch follows this list.
  • Provider interface gained BatchCeiling() int — all three test fakes (nopProvider, mockProvider, FakeProvider) updated.
  • MaxOpenConns raised from 1 to 4: SQLite WAL serializes writers internally; this lets readers overlap. Race detector clean.
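
Referring to the Viper keys in the first bullet above, a hedged sketch of how they and their env bindings could be registered; bindBlock3Keys is a hypothetical helper and the real wiring in internal/config may differ:

```go
package config

import (
	"time"

	"github.com/spf13/viper"
)

// bindBlock3Keys registers the Block 3 keys with their defaults and env
// bindings. Defaults follow the PR description (60s / 30s / 10m).
func bindBlock3Keys(v *viper.Viper) error {
	v.SetDefault("llm.call_timeout", 60*time.Second)
	v.SetDefault("server.request_timeout", 30*time.Second)
	v.SetDefault("server.upload_timeout", 10*time.Minute)

	for key, env := range map[string]string{
		"llm.call_timeout":       "DOCSIQ_LLM_CALL_TIMEOUT",
		"server.request_timeout": "DOCSIQ_SERVER_REQUEST_TIMEOUT",
		"server.upload_timeout":  "DOCSIQ_SERVER_UPLOAD_TIMEOUT",
	} {
		if err := v.BindEnv(key, env); err != nil {
			return err
		}
	}
	return nil
}
```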

Follow-ups (out of scope)

  • CI wiring of scripts/ctx-audit.sh — deferred, does not touch .github/workflows/ (owned by parallel block).
  • internal/loader.LoadURL(url) does not yet accept ctx — outside the Block 3.1 audited packages; filed as follow-up.
  • ctxUserKey is defined but no middleware populates it today; wired in when session auth lands.

Generated with Claude Code

aksOps and others added 8 commits April 24, 2026 02:46
Replaces DSN-embedded PRAGMAs with explicit db.Exec("PRAGMA ...") calls
so the recipe is driver-portable and self-documenting. Adds
PRAGMA synchronous=NORMAL — the WAL-safe default that cuts two fsyncs
per commit to one — which was silently missing before. Raises
MaxOpenConns from 1 to 4 (with MaxIdleConns=2, ConnMaxLifetime=1h) so
concurrent readers overlap under WAL; the writer remains serialized
by SQLite itself.

Adds (*Store).Ping(ctx context.Context) error on the public Store
surface so Block 4's /readyz can distinguish a cancelled caller from
a dead database. Covered by TestStore_PingContext.

Block 3.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
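
For reference, a minimal sketch of the recipe above, assuming the mattn/go-sqlite3 driver (the repo builds with -tags sqlite_fts5) and that open() and Ping live on the store package's Store type; exact names in internal/store may differ:

```go
package store

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/mattn/go-sqlite3" // driver assumed
)

type Store struct {
	db *sql.DB
}

func open(dsn string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", dsn)
	if err != nil {
		return nil, err
	}
	// Explicit PRAGMAs instead of DSN parameters: driver-portable and
	// self-documenting.
	pragmas := []string{
		"PRAGMA journal_mode=WAL",
		"PRAGMA busy_timeout=5000",
		"PRAGMA synchronous=NORMAL",
		"PRAGMA foreign_keys=ON",
	}
	for _, p := range pragmas {
		if _, err := db.Exec(p); err != nil {
			return nil, fmt.Errorf("apply %q: %w", p, err)
		}
	}
	// WAL keeps the single writer serialized inside SQLite; the raised
	// pool lets readers overlap.
	db.SetMaxOpenConns(4)
	db.SetMaxIdleConns(2)
	db.SetConnMaxLifetime(time.Hour)
	return db, nil
}

// Ping lets /readyz distinguish a cancelled caller from a dead database.
func (s *Store) Ping(ctx context.Context) error {
	return s.db.PingContext(ctx)
}
```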
langchaingo allocates a fresh net/http.Transport inside every provider
constructor by default — a new idle-conn pool, TLS session cache, and
DNS bucket per openai.New / ollama.New call. For a long-running server
that talks to the same upstream on every request, that's pure waste
and defeats keep-alive.

Adds internal/llm/httpclient.go with newHTTPClient() returning a
*http.Client backed by a tuned *http.Transport (MaxIdleConns=100,
MaxIdleConnsPerHost=10, IdleConnTimeout=90s, TLSHandshakeTimeout=10s,
ResponseHeaderTimeout=60s). Injects the same client into chat and
embed langchaingo sub-clients via WithHTTPClient options so every
provider uses exactly one pool.

The pooled client's Timeout is deliberately 0 — per-call deadlines
land in Task 3 (3.3) via ctx. Hard-cutting the client would truncate
streaming embedding responses.

Stores the client and provider-specific batchCeiling on lcProvider
so Task 3 can layer ctx timeouts and Task 4 can enforce per-provider
embedding batch limits.

Block 3.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
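
A minimal sketch of newHTTPClient() under those settings; the transport values come from the commit message, while injecting the client into the langchaingo sub-clients via WithHTTPClient options is as described above and not shown here:

```go
package llm

import (
	"net/http"
	"time"
)

// newHTTPClient returns the single pooled client shared by a provider's
// chat and embed sub-clients.
func newHTTPClient() *http.Client {
	return &http.Client{
		// Timeout stays 0 on purpose: per-call deadlines arrive via ctx
		// in 3.3, and a hard client timeout would truncate streaming
		// responses.
		Transport: &http.Transport{
			MaxIdleConns:          100,
			MaxIdleConnsPerHost:   10,
			IdleConnTimeout:       90 * time.Second,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 60 * time.Second,
		},
	}
}
```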
Every Complete / Embed / EmbedBatch call now wraps its parent ctx in
context.WithTimeout(parent, cfg.LLM.CallTimeout). Default is 60s;
callers can disable by setting llm.call_timeout=0, in which case the
parent ctx's deadline is authoritative. The vendored langchaingo has
no internal retry loop, so the spec's "retry counts against total
timeout" clause is trivially satisfied by wrapping once at the
provider entry point.

Adds config.LLMConfig.CallTimeout + DOCSIQ_LLM_CALL_TIMEOUT env
binding + a 60s default. Tests use stub Model/Embedder implementations
that block on ctx.Done, proving DeadlineExceeded surfaces cleanly
within 500ms when callTimeout=50ms.

Block 3.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
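
A minimal sketch of the wrapping, assuming a hypothetical withCallTimeout helper at the provider entry points; the actual method bodies in internal/llm may differ:

```go
package llm

import (
	"context"
	"time"
)

// withCallTimeout derives the per-call context. A zero (or negative)
// timeout disables the wrapper, leaving the parent ctx's deadline
// authoritative — matching llm.call_timeout=0.
func withCallTimeout(parent context.Context, d time.Duration) (context.Context, context.CancelFunc) {
	if d <= 0 {
		return parent, func() {}
	}
	return context.WithTimeout(parent, d)
}

// Example use at an entry point (method and field names assumed):
//
//	func (p *lcProvider) Embed(ctx context.Context, text string) ([]float32, error) {
//		ctx, cancel := withCallTimeout(ctx, p.callTimeout)
//		defer cancel()
//		return p.embed(ctx, text)
//	}
```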
Each provider now declares a BatchCeiling via the Provider interface:
OpenAI 2048, Azure 16 (deployment default), Ollama 128 (heuristic).
lcProvider.EmbedBatch slices input to the ceiling and pushes per-chunk
results through a buffered channel of size 2 (equivalent to 2*ceiling
vector slots) — a slow consumer backpressures the producer once the
buffer fills. Chunks are dispatched serially in the producer goroutine;
multi-chunk concurrency remains the Embedder's responsibility.

The Embedder's New() now clamps caller-supplied batchSize to the
provider's ceiling, so "batch_size: 5000" in a config against an
Azure deployment silently becomes 16 rather than failing with a
provider-side 400.

Also updates the three test fakes (nopProvider, mockProvider,
FakeProvider) for the new interface method.

Block 3.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
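
A minimal sketch of the chunk-and-backpressure shape, assuming a hypothetical embedBatch free function and an embedFunc standing in for the langchaingo call; the real method hangs off lcProvider:

```go
package llm

import "context"

// embedFunc stands in for the underlying provider call.
type embedFunc func(ctx context.Context, texts []string) ([][]float32, error)

// embedBatch slices texts to the provider's ceiling and streams per-chunk
// results through a buffered channel of size 2; once two chunks are
// waiting, a slow consumer backpressures the producer. Chunks are
// dispatched serially in the producer goroutine.
func embedBatch(ctx context.Context, texts []string, ceiling int, embed embedFunc) ([][]float32, error) {
	if ceiling <= 0 {
		ceiling = len(texts) // defensive; real providers declare a positive ceiling
	}
	type chunkResult struct {
		vecs [][]float32
		err  error
	}
	results := make(chan chunkResult, 2)

	go func() {
		defer close(results)
		for start := 0; start < len(texts); start += ceiling {
			end := start + ceiling
			if end > len(texts) {
				end = len(texts)
			}
			vecs, err := embed(ctx, texts[start:end])
			select {
			case results <- chunkResult{vecs, err}:
			case <-ctx.Done():
				return
			}
			if err != nil {
				return
			}
		}
	}()

	out := make([][]float32, 0, len(texts))
	for r := range results {
		if r.err != nil {
			return nil, r.err
		}
		out = append(out, r.vecs...)
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return out, nil
}
```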
Adds scripts/ctx-audit.sh — a static check that flags any HTTP or
DB call in the five ctx-required packages (internal/{llm,embedder,
extractor,crawler,store}) that bypasses ctx. HTTP: http.Get,
http.Post, client.Get, client.Post, http.DefaultClient,
http.NewRequest (non-ctx). DB: .Query, .Exec, .QueryRow (non-Context
variants), with an intentional carve-out for the open()/migrate()
PRAGMA+schema paths that run before a Store is constructed.

The initial audit surfaced three crawler helpers (parseSitemap,
discoverSitemap, extractLinks) that used client.Get(url) instead of
client.Do(http.NewRequestWithContext(ctx, ...)). A ctx.Done while
those were in flight would stall for up to the HTTP client's Timeout
(30s) waiting on the response. Fixed by threading ctx through every
level of the crawl.

The script intentionally scopes to I/O-call checks — signature-based
"exported funcs must take ctx first" in pure bash is brittle against
Go's return-tuple syntax. The Go compiler already enforces ctx
propagation at every call site once the interfaces demand it; the
script's value is the I/O side-channel check vet does not cover.

CI wiring of this script is deferred to a follow-up PR so this block
does not touch .github/workflows/ (owned by a parallel block).

Block 3.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
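
The audit script itself is bash and not reproduced here; the Go-side fix it caught looks roughly like this (fetchSitemap is a hypothetical stand-in for the three crawler helpers):

```go
package crawler

import (
	"context"
	"net/http"
)

// fetchSitemap carries the caller's ctx on the request, so cancellation
// aborts an in-flight fetch immediately instead of waiting out the
// client's Timeout.
func fetchSitemap(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
	// Before: resp, err := client.Get(url) — ctx.Done while in flight
	// stalled for up to 30s.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```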
Wraps the router in http.TimeoutHandler(cfg.Server.RequestTimeout,
"request timeout") — default 30s — so a misbehaving handler cannot
hold a goroutine indefinitely. POST /api/upload and
POST /api/projects/{p}/import route through a longer-timeout branch
bounded by cfg.Server.UploadTimeout (default 10m) so large indexing
kickoffs aren't killed mid-stream.

Layering is deliberate:

  securityHeadersMiddleware   (outermost, Block 2)
    requestTimeoutMiddleware  (new, Block 3.2)
      loggingMiddleware
        recoveryMiddleware
          bearerAuthMiddleware
            projectMiddleware
              mux

A 503 timeout response carries CSP/security headers; the timeout is
still logged.

Adds server.request_timeout + server.upload_timeout Viper keys with
env bindings (DOCSIQ_SERVER_REQUEST_TIMEOUT / _UPLOAD_TIMEOUT).

Block 3.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
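
A minimal sketch of the middleware; the durations come from the new config keys, while the path matching for the upload/import routes is an assumption about how the real router exposes them:

```go
package api

import (
	"net/http"
	"strings"
	"time"
)

// requestTimeoutMiddleware rings the router with http.TimeoutHandler,
// which returns 503 with the given message once the budget expires.
func requestTimeoutMiddleware(requestTimeout, uploadTimeout time.Duration) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		std := http.TimeoutHandler(next, requestTimeout, "request timeout") // default 30s
		long := http.TimeoutHandler(next, uploadTimeout, "request timeout") // default 10m
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// POST /api/upload and POST /api/projects/{p}/import get the
			// longer budget so large indexing kickoffs survive.
			if r.Method == http.MethodPost &&
				(r.URL.Path == "/api/upload" ||
					(strings.HasPrefix(r.URL.Path, "/api/projects/") && strings.HasSuffix(r.URL.Path, "/import"))) {
				long.ServeHTTP(w, r)
				return
			}
			std.ServeHTTP(w, r)
		})
	}
}
```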
recoveryMiddleware now logs req_id, route, method, user (if authed),
and the full debug.Stack() on panic — everything needed to
reconstruct a request from the log alone, no stderr tail required.
Backed by panic_enrichment_test.go which asserts on the attribute
surface via a TextHandler buffer capture.

cmd/serve.go: raises srv.Shutdown deadline from 10s to 30s so it
matches the workq drain phase; splits the single shutdown start/end
pair into per-phase log lines so operators can see exactly where a
stuck shutdown is hung.

Adds ctxUserKey in request_id.go — reserved for future auth
integration; today only recoveryMiddleware reads it (empty string
means unauthenticated).

TestShutdown_NoGoroutineLeaks: adds database/sql.(*DB).connectionCleaner
to the ignore list — Block 3.6's ConnMaxLifetime=1h spawns this stdlib
goroutine which only exits on DB.Close(). Same post-Close window as
the already-tolerated connectionOpener.

Block 1 already shipped signal handling + workq.Close(drainCtx);
this task closes the two remaining 3.7 gaps.

Block 3.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
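
A minimal sketch of the enriched recovery path; attribute keys follow the commit message, and the context-key helpers are redefined locally (in the repo they live in request_id.go) so the sketch is self-contained:

```go
package api

import (
	"context"
	"log/slog"
	"net/http"
	"runtime/debug"
)

type ctxKey string

const (
	ctxRequestIDKey ctxKey = "request_id"
	ctxUserKey      ctxKey = "user"
)

func ctxString(ctx context.Context, k ctxKey) string {
	s, _ := ctx.Value(k).(string)
	return s // empty string means unset / unauthenticated
}

// recoveryMiddleware logs everything needed to reconstruct the request
// from the log alone, then answers 500.
func recoveryMiddleware(log *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Error("panic recovered",
					"req_id", ctxString(r.Context(), ctxRequestIDKey),
					"route", r.URL.Path,
					"method", r.Method,
					"user", ctxString(r.Context(), ctxUserKey),
					"panic", rec,
					"stack", string(debug.Stack()),
				)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```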
Re-aligns the BatchCeiling() accessor rows on lcProvider, nopProvider,
and mockProvider to match gofmt's spacing. Limited to lines added in
this branch; pre-existing trailing-whitespace and map-key alignment
issues on main are left untouched per brownfield discipline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aksOps enabled auto-merge (squash) April 24, 2026 02:47
aksOps merged commit 3a61148 into main Apr 24, 2026
11 checks passed
aksOps deleted the feat/block3-resource-safety branch April 24, 2026 02:53