block 4: observability & ops (version, health, metrics, access-log, log-format)#74
block 4: observability & ops (version, health, metrics, access-log, log-format)#74
Conversation
Moves the shared version-resolution logic out of cmd into a new
internal/buildinfo package so internal/api can serve the same data
without a cmd import cycle. Adds a public GET /api/version endpoint
that returns {version, commit, build_date, go_version, dirty, deps}
as JSON. Makefile LDFLAGS retargeted at the new package path.
bearerAuthMiddleware public bypass list extended with /healthz,
/readyz, and /api/version so upcoming probes and version endpoint
remain public.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/healthz is dependency-free liveness. /readyz checks SQLite PingContext and an LLM provider Complete(maxTokens=1) reach, caching the verdict for 10 seconds to absorb Prometheus + Kubernetes probe loops. Nil provider (config provider: none) reports llm.status=skipped and keeps readiness green. Legacy /health route remains as a 200-returning alias for older clients; /healthz is the canonical probe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the ad-hoc text-format collector in internal/api/metrics.go with the official prometheus/client_golang v1.20.5. Adds a new internal/obs package hosting the Default registry and per-subject collectors (HTTP, pipeline stages, embed latency, LLM tokens, workq depth + rejections, build info). Workq gains a Pool.Stats() snapshot accessor with rejectedTotal counter; cmd/serve.go initialises obs and binds the live pool stats provider. Pipeline IndexPath/IndexURL/Finalize are wrapped with TimeStage for granular stage timings via a nil-safe helper. Embedder observes per-batch provider latency; LLM Complete records an approximate token count (bytes/4) until langchaingo usage data is threaded through the Provider interface (tracked as follow-up). HTTP recording moved out of loggingMiddleware's bespoke collector into obs.HTTP.Observe; uses the Go 1.22 r.Pattern route to bound label cardinality. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loggingMiddleware now emits one JSON/text slog line per request with
{req_id, method, path, route, status, duration_ms, bytes_out, auth,
project, panic}. The emission is deferred so a panic escaping
recoveryMiddleware still produces an access-log entry. auth is a
coarse label (bearer|cookie|anon) because docsiq uses a single shared
API key; there is no real user identity. responseWriter now tracks
bytes via an overridden Write and also proxies Flush for SSE/
streaming handlers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New log.format config key (default "text"; DOCSIQ_LOG_FORMAT env). Precedence is --log-format > env > config > default. The json handler is wrapped in obs.NewProductionHandler, which strips a leading emoji from slog Record.Message so log aggregators do not have to special- case multi-byte sequences. The text handler keeps emoji for human readers. Adds config-level defaults + env binding and three load- level tests covering default, YAML, and env-var precedence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erface Block 3 (merged to main) added BatchCeiling() int to llm.Provider. After rebasing feat/block4-observability onto main, the mock provider needed the new method to compile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9fabea4 to
234bcf1
Compare
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
@codex review |
http.TimeoutHandler buffers the response body and does not implement http.Flusher. Block 4's responseWriter wrapper grew a Flush() method, so SSE handlers now pass the Flusher type-assertion and enter their streaming loop — but every Flush is a no-op because the underlying timeoutWriter absorbs writes until the request completes. The client times out reading the body at 30s (matches cfg.Server.RequestTimeout). Carve GET /api/upload/progress and GET /mcp out of the timeout wrapper so Flush propagates to the real net/http writer. SSE teardown still runs via r.Context() cancellation on client disconnect or shutdown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 234bcf1e34
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… lazy sq pinger Three fixes bundled, all flagged by post-rebase CI or Codex review: 1. waitUploadDone now consumes the /api/upload/progress SSE stream incrementally via bufio.Scanner instead of ReadAll. Prior code worked on main only because Flush() was missing on responseWriter, which made the handler return 500 immediately; with Block 4's Flusher now wired through, ReadAll blocks until the handler closes the stream, tripping the client's 30s timeout before "done" arrives. 2. /readyz SQLite probe no longer installs a no-op success fallback when stores.Get fails at router build time. A lazy pinger resolves the default store at probe time, so a genuine open failure (permissions, corruption, disk) surfaces as 503 — and a store that becomes available later flips readiness green without a restart. 3. readyzCache.check decouples probe context from the incoming request context via context.WithoutCancel. Previously, a probing client (Kubernetes, Prometheus, curl) disconnecting mid-probe would return context.Canceled and pollute the 10-second cache for every subsequent caller. Coverage: new TestReadyz_ProbeCtxDecoupledFromRequestCtx exercises (3). TestIsStreamingRoute_Classification and TestRequestTimeoutMiddleware_StreamingRouteBypassesTimeout already cover the Block-4-aware SSE bypass added in 8f9595b. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6947da5321
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| resp, err := llms.GenerateFromSinglePrompt(ctx, p.llm, prompt, callOpts...) | ||
| if obs.LLM != nil { | ||
| // Approximation: 1 token ~= 4 bytes of UTF-8 for English prose. | ||
| // This is a coarse fallback until langchaingo's GenerationInfo | ||
| // usage data is threaded through the Provider interface |
There was a problem hiding this comment.
Skip token accounting when completion fails
The token metric is updated unconditionally after GenerateFromSinglePrompt, so failed calls (for example context deadline, network errors, or upstream 5xx) still increment docsiq_llm_tokens_total using prompt length. That inflates usage/cost dashboards during outages and retries; only record tokens when the completion succeeds (or when real usage metadata is available).
Useful? React with 👍 / 👎.
| _ *project.Registry, | ||
| _ *projectStores, | ||
| _ *config.Config, |
There was a problem hiding this comment.
Preserve project and note gauges in metrics handler
This rewrite ignores registry and stores and serves only the shared Prometheus registry, but no collector now emits docsiq_projects_total or docsiq_notes_total. Those gauges were previously exported from /metrics, so this change removes per-project visibility and can break existing dashboards/alerts that depend on those series.
Useful? React with 👍 / 👎.
Summary
/api/version(public JSON with version/commit/build_date/go_version/deps) backed by a newinternal/buildinfopackage; Makefile LDFLAGS retargeted atinternal/buildinfo./healthz(always-200 liveness) and/readyz(SQLite ping + LLM reach, 10 s in-memory cache, 503 when SQLite or LLM fails).client_golangbackend via newinternal/obspackage:docsiq_http_requests_total,docsiq_http_request_duration_seconds,docsiq_pipeline_stage_duration_seconds,docsiq_embed_latency_seconds,docsiq_llm_tokens_total,docsiq_workq_depth,docsiq_workq_rejected_total,docsiq_build_info, plus client_golang defaults. Workq gainsPool.Stats()+ arejectedTotalcounter.loggingMiddlewareto emit one structured access-log line per request withreq_id, method, path, route, status, duration_ms, bytes_out, auth (bearer|cookie|anon), project, panic. Emission is deferred so panics escapingrecoveryMiddlewarestill produce an entry.log.format=text|jsonconfig (defaulttext;DOCSIQ_LOG_FORMATenv). JSON handler is wrapped byobs.NewProductionHandlerwhich strips leading emoji fromRecord.Message./healthz,/readyz,/metrics,/api/version.Test plan
CGO_ENABLED=1 go test -tags sqlite_fts5 -timeout 300s ./...all passCGO_ENABLED=1 go test -tags sqlite_fts5 -race -timeout 300s ./...all passCGO_ENABLED=1 go vet -tags sqlite_fts5 ./...clean/api/version,/healthz,/readyz,/metricsall return expected shapes;/readyzcorrectly reports 503 when LLM endpoint unreachabledocsiq_http_request_duration_secondsuses boundedroute="GET /healthz"labels (Go 1.22r.Pattern) rather than raw pathsNotes / follow-ups
kind="total"); threading real usage from langchaingo'sGenerationInfois a future follow-up. Flagged in commit message + source comment.IndexPath/IndexURL/Finalizevia a nil-safetimeStagehelper. Fine-grained per-phase (load/chunk/embed/extract/community) wrapping would require restructuringindexFile's nested Phase 1a/1b/1c/2 logic; deferred as a follow-up.🤖 Generated with Claude Code