fix(robustness): P1+P2 follow-ups from brainstorm (#61)
* fix(robustness): P1+P2 follow-ups from brainstorm
Closes the four actionable follow-ups deferred from the codex round-2
brainstorm. The fifth ("DLQ replay idempotency for spans/logs") is still
deferred — it requires a schema migration to add unique indexes plus a
pre-migration dedup pass, not a code-only fix.
P1 — tenant pipeline starvation fairness (`internal/ingest/pipeline.go`,
`config.go`, `main.go`).
- Added per-tenant in-flight cap to Pipeline. When set, healthy
submissions from a tenant already at the cap are dropped at Submit()
with reason "tenant_backpressure" (separate counter from
DroppedHealthy / RejectedFull). Priority batches (errors / slow
traces) bypass the cap — diagnostic data must always land.
- Slot is reserved at Submit and released in process()'s deferred
cleanup, so panics still release. Failed channel sends in Submit
undo the reservation before returning ErrQueueFull.
- Wired via INGEST_PIPELINE_PER_TENANT_CAP (default 0 = disabled, no
behavior change for single-tenant deployments). Multi-tenant
deployments should set this to roughly 2×(Capacity/N), where N
is the number of concurrently-active tenants.
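The reserve-at-Submit / release-in-process contract above can be sketched as follows. Only the Submit/process split, the reservation and release points, and the "tenant_backpressure" drop reason come from this change; `Batch`, the field names, and `releaseSlot` are illustrative stand-ins, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrQueueFull = errors.New("pipeline queue full")

// Batch is a simplified stand-in for the real type.
type Batch struct {
	Tenant   string
	Priority bool // error / slow-trace batches bypass the per-tenant cap
}

type Pipeline struct {
	mu                 sync.Mutex
	perTenantCap       int            // 0 = disabled (single-tenant default)
	inFlight           map[string]int // tenant -> reserved slots
	queue              chan *Batch
	tenantBackpressure int // drops with reason "tenant_backpressure"
}

func (p *Pipeline) Submit(b *Batch) error {
	if p.perTenantCap > 0 && !b.Priority {
		p.mu.Lock()
		if p.inFlight[b.Tenant] >= p.perTenantCap {
			p.tenantBackpressure++ // separate counter from DroppedHealthy / RejectedFull
			p.mu.Unlock()
			return nil // healthy batch dropped at the cap
		}
		p.inFlight[b.Tenant]++ // slot reserved at Submit time
		p.mu.Unlock()
	}
	select {
	case p.queue <- b:
		return nil
	default:
		p.releaseSlot(b) // failed channel send undoes the reservation
		return ErrQueueFull
	}
}

func (p *Pipeline) releaseSlot(b *Batch) {
	if p.perTenantCap > 0 && !b.Priority {
		p.mu.Lock()
		p.inFlight[b.Tenant]--
		p.mu.Unlock()
	}
}

func (p *Pipeline) process(b *Batch) {
	defer p.releaseSlot(b) // deferred cleanup, so a panic still frees the slot
	// ... handle the batch ...
}

func main() {
	p := &Pipeline{perTenantCap: 1, inFlight: map[string]int{}, queue: make(chan *Batch, 8)}
	p.Submit(&Batch{Tenant: "a"})                 // accepted, slot reserved
	p.Submit(&Batch{Tenant: "a"})                 // dropped: tenant already at cap
	p.Submit(&Batch{Tenant: "a", Priority: true}) // accepted: priority bypasses
	fmt.Println(len(p.queue), p.tenantBackpressure)
}
```

Reserving before the channel send (rather than inside process) is what makes the cap a real admission control: a tenant cannot flood the queue itself, only its reserved share.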
P1 — pipeline drain-path metric blind spot: skipped. The brainstorm's
framing was that Stop()'s drain bypasses observeDrop accounting, but
verification shows the drain path PROCESSES (not drops) batches via
process(), and there are no shutdown-phase drops to account for.
P2 — /ready saturation probes (`internal/api/health_handlers.go`,
`server.go`, `main.go`).
- Server gains two callback-shaped probes (DLQ disk fullness, ingest
pipeline fullness). Probes return a fraction in [0,1]; /ready flips
to 503 above 0.95 so orchestrators stop sending traffic before the
pipeline starts hard-rejecting (gRPC RESOURCE_EXHAUSTED / HTTP 429)
or DLQ starts FIFO-evicting.
- Wired in main.go using closures that capture dlq + ingestPipeline.
Probes are nil-tolerant on the server side; nil = "skipped" rather
than fatal.
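A minimal sketch of the probe shape and the nil-tolerant readiness decision described above. The `SaturationProbe` type name, the `Server` fields, and `readyStatus` are assumptions for illustration; the 0.95 threshold and the fraction-in-[0,1] contract are from the commit.

```go
package main

import (
	"fmt"
	"net/http"
)

// SaturationProbe reports a component's fullness as a fraction in [0,1].
type SaturationProbe func() float64

const readySaturationThreshold = 0.95

type Server struct {
	dlqDiskFullness  SaturationProbe // nil = probe skipped, not fatal
	pipelineFullness SaturationProbe
}

// readyStatus returns the HTTP status /ready should report.
func (s *Server) readyStatus() int {
	for _, probe := range []SaturationProbe{s.dlqDiskFullness, s.pipelineFullness} {
		if probe == nil {
			continue // nil-tolerant: an unwired probe is skipped
		}
		if probe() > readySaturationThreshold {
			// Shed traffic before hard rejects (RESOURCE_EXHAUSTED / 429)
			// or DLQ FIFO eviction kick in.
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusOK
}

func (s *Server) handleReady(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(s.readyStatus())
}

func main() {
	saturated := &Server{pipelineFullness: func() float64 { return 0.99 }}
	fmt.Println(saturated.readyStatus())   // 503
	fmt.Println((&Server{}).readyStatus()) // 200: both probes nil, both skipped
}
```

In main.go the probes would be closures over the dlq and ingestPipeline instances, so the server package never imports either.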
P2 — retention adaptive pacing (`internal/storage/retention.go`).
- New adaptPurgeSleep on RetentionScheduler. After each purge pass,
measures wall-clock time vs purgeInterval. >50% → double the
inter-batch sleep (capped at 100ms). <10% → halve it (floored at
1ms). Single-writer (the retention loop), so no synchronization
needed; purge methods read the value once at the call boundary.
P2 — GraphSnapshot row cap (`internal/graphrag/snapshot.go`).
- pruneOldSnapshots now enforces a 100k row backstop after the by-age
delete. Steady-state at 15-min cadence × 100 tenants is ~67k rows;
100k is enough headroom that the backstop never triggers under
normal operation but bounds disk if write rate spikes.
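For reference, the ~67k steady-state figure works out if snapshots are retained for roughly a week — that retention window is an assumption on my part (it is not stated in this message; it is simply the value that reproduces the number):

```go
package main

import "fmt"

func main() {
	// Steady-state GraphSnapshot rows at 15-min cadence × 100 tenants,
	// assuming a 7-day retention window (assumed — not stated above).
	tenants := 100
	snapshotsPerHour := 4 // one every 15 minutes
	retentionHours := 7 * 24
	fmt.Println(tenants * snapshotsPerHour * retentionHours) // 67200
}
```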
Tests: three new pipeline tests lock in the per-tenant cap contract
(drops past cap, priority bypasses, slot released after process). Full
suite: 407 tests pass under `-race -count=1`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robustness): clamp pipeline capacity/workers (CodeQL)
CodeQL flagged make(chan *Batch, cfg.Capacity) at pipeline.go:171
as go/uncontrolled-allocation-size — cfg.Capacity flows from the
INGEST_PIPELINE_QUEUE_SIZE env var without an upper bound, so a
typo like 10_000_000_000 would OOM the process at startup.
Add defensive caps in NewPipeline (maxPipelineCapacity = 1M,
maxPipelineWorkers = 256). Clamping at the constructor satisfies
the CodeQL taint-tracking and is the right defense regardless —
both ceilings are well above any reasonable deployment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robustness): use min() sanitizer for CodeQL allocation-size
In-place reassignment of cfg.Capacity/Workers wasn't recognized by
CodeQL as a sanitizer for go/uncontrolled-allocation-size. Reroute
through local variables clamped via min(); make() consumes the
local, which CodeQL's taint-tracking accepts as bounded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robustness): replace min() with explicit comparison guard for CodeQL
CodeQL's BoundedFlowSource taint analysis recognizes explicit if/else
comparison guards as sanitizers but not the min() builtin (Go 1.21+).
Switching to `if x > MAX { x = MAX }` is the canonical pattern from
go/uncontrolled-allocation-size docs and should clear the alert without
changing runtime behavior.
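The final form of the sanitizer, sketched below. The constants are the ones named in the earlier commit (maxPipelineCapacity = 1M, maxPipelineWorkers = 256); the function wrapper is illustrative — in NewPipeline the guards sit inline before the make() calls.

```go
package main

import "fmt"

const (
	maxPipelineCapacity = 1_000_000
	maxPipelineWorkers  = 256
)

// clampPipelineConfig bounds env-derived sizes before they reach make().
func clampPipelineConfig(capacity, workers int) (int, int) {
	// Explicit comparison guards — the pattern CodeQL's
	// go/uncontrolled-allocation-size query accepts as a sanitizer,
	// unlike the min() builtin or in-place struct-field reassignment.
	if capacity > maxPipelineCapacity {
		capacity = maxPipelineCapacity
	}
	if workers > maxPipelineWorkers {
		workers = maxPipelineWorkers
	}
	return capacity, workers
}

func main() {
	// The fat-finger scenario from the first CodeQL commit: a 10-billion
	// queue size is clamped instead of OOMing the process at startup.
	c, w := clampPipelineConfig(10_000_000_000, 1024)
	fmt.Println(c, w) // 1000000 256
}
```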
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>