Skip to content

Commit e8eda53

Browse files
aksOpsclaude
andcommitted
feat(observability): production-readiness PR 4 — request tracing + JSON logs + structured errors
Fourth of 5 production-readiness PRs. Closes the missing-MDC, hot-path health probe, MCP error leak, and structured-logging gaps. Why --- Pre-PR-4 every `MDC.get("request_id")` call across BearerAuthFilter, RateLimitFilter, GraphController, and GlobalExceptionHandler returned null — the four consumers all generated synthetic UUIDs that never correlated to the same request. The /actuator/health/readiness probe ran a Cypher count() on every probe (k8s default ~1Hz). MCP tools returned flat `{error: "..."}` strings with no correlation field. Logs were plaintext `%msg%n` — unparseable by Loki/Splunk. Changes ------- * **`RequestIdFilter` (new)** — outermost in the security chain. Populates `MDC.request_id` per request, echoes back in `X-Request-Id` response header, allow-list validates inbound ([A-Za-z0-9_-]{8,64}), clears MDC in finally to prevent leak across pooled threads (Tomcat platform + virtual-thread carriers). * **JSON-structured logging** in serving profile via `logstash-logback-encoder` 9.0 (MIT). One JSON event per log line with ts/level/logger/thread/msg/stack + all MDC entries + `application: codeiq` tag. Indexing/CLI profiles keep plaintext. * **`GraphHealthIndicator` 30s TTL cache** via `AtomicReference<CachedHealth>` (lock-free). One underlying count() per 30s regardless of probe rate. Error response sanitized — `e.getMessage()` no longer surfaces to the permitAll endpoint (CodeQL CWE-209 again); only `error_class` + log line. * **Liveness/readiness groups** — `graphHealthIndicator` on readiness only. Pre-PR-4 it flapped the pod (k8s killing) on graph-down instead of just routing away. * **`/actuator/prometheus`** — `micrometer-registry-prometheus` added; exposed under bearer auth (NOT permitAll — full metrics tree is reconnaissance). Application tag `codeiq` for multi-pod scraping. Step 10s. * **Structured MCP error envelope** — `errorEnvelope(code, e)` helper returns `{code, message, request_id, error}` (legacy `error` preserved for backwards-compat). Codes: INTERNAL_ERROR, INVALID_INPUT, FILE_READ_FAILED, SERIALIZATION_FAILED. Full exception logged server-side; sanitized envelope to client. `readFile` no longer concatenates `e.getMessage()` (CWE-209). Test coverage ------------- * New `RequestIdFilterTest` — 7 cases (UUID generation, header pass-through, control-char rejection, length bounds, MDC clear-in-finally including throw path). * `GraphHealthIndicatorTest` — added cache-hit assertion (3 calls → 1 underlying `count()`); updated for sanitized error fields. * `McpToolsTest#readFileShouldHandleMissingFile` — updated for new envelope contract (asserts `code: FILE_READ_FAILED`). * Full suite: 3680 / 0F / 0E / 32S. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e40350f commit e8eda53

12 files changed

Lines changed: 580 additions & 63 deletions

File tree

CHANGELOG.md

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -413,6 +413,65 @@ for that specific tag for the per-commit details.
413413
exists, format-conforms to `<64-hex> <path>`, and excludes itself.
414414
Full suite: 3672 tests / 0 failures / 0 errors.
415415

416+
- **Production-readiness PR 4 of 5 — observability.** Closes the missing-MDC,
417+
hot-path-health-probe, MCP-error-leak, and structured-logging gaps.
418+
- **`RequestIdFilter` (new).** Populates SLF4J `MDC.request_id` FIRST in
419+
the security chain so every downstream filter, controller, MCP tool,
420+
and exception handler sees the same correlation ID. Strict allow-list
421+
on inbound `X-Request-Id` (8–64 hex/dash/underscore chars) prevents
422+
log-forging; bad inputs are replaced with a generated UUID. Echoes
423+
the ID back to the client in the `X-Request-Id` response header. MDC
424+
is cleared in `finally` to prevent leak across pooled threads (both
425+
Tomcat platform and virtual-thread carriers). Pre-PR-4 every
426+
`MDC.get("request_id")` call returned null; the four downstream
427+
consumers (BearerAuthFilter, RateLimitFilter, GraphController,
428+
GlobalExceptionHandler) all generated synthetic UUIDs that never
429+
correlated.
430+
- **JSON-structured logging** (`logback-spring.xml`). Serving profile
431+
switches the encoder from `%msg%n` plaintext to LogstashEncoder
432+
(`logstash-logback-encoder` 9.0 — MIT). One JSON event per log line
433+
with `ts`, `level`, `logger`, `thread`, `msg`, `stack`, all MDC
434+
entries (`request_id`), and a static `application: codeiq` field for
435+
multi-pod ingestion. Indexing/CLI profiles keep plaintext to avoid
436+
JSON noise leaking into `codeiq index` output.
437+
- **`GraphHealthIndicator` 30s TTL cache.** Pre-PR-4 every readiness
438+
probe (k8s default ~1Hz) ran a `MATCH (n) RETURN count(n)` Cypher
439+
query — wakes the page cache, competes with API traffic.
440+
`AtomicReference<CachedHealth>` lock-free cache absorbs the flood;
441+
one underlying probe per 30s regardless of caller concurrency.
442+
Error response sanitized too: pre-PR-4 the `error` detail
443+
surfaced `e.getMessage()` (CodeQL `java/error-message-exposure`,
444+
permitAll endpoint = anonymous probers). Now only an `error_class`
445+
indicator; full stack is logged at WARN.
446+
- **Liveness/readiness groups** (`application.yml`). Pre-PR-4
447+
`GraphHealthIndicator` contributed to BOTH probes — a graph-down
448+
event would flap the pod (k8s killing it) instead of just routing
449+
away. Pinned to readiness only:
450+
`liveness: livenessState`, `readiness: readinessState,
451+
graphHealthIndicator`.
452+
- **Prometheus metrics** (`/actuator/prometheus`). Added
453+
`micrometer-registry-prometheus` dep. Exposed under the bearer-
454+
authenticated `/actuator/**` rule (NOT permitAll — full metrics
455+
tree is reconnaissance data). Application tag `codeiq` for
456+
multi-pod scraping. Step interval 10s.
457+
- **Structured MCP error envelope.** Pre-PR-4 every MCP tool catch
458+
block returned `toJson(Map.of("error", e.getMessage()))` — flat
459+
string, no correlation. Refactored to a centralized
460+
`errorEnvelope(code, e)` helper that returns
461+
`{code, message, request_id, error}` (legacy `error` field
462+
preserved for backwards-compat). Codes assigned per failure
463+
category: `INTERNAL_ERROR`, `INVALID_INPUT`, `FILE_READ_FAILED`,
464+
`SERIALIZATION_FAILED`. Full exception logged server-side with
465+
request_id; only sanitized envelope reaches the client. `readFile`
466+
no longer concatenates `e.getMessage()` into a string (CWE-209).
467+
- **Tests:** new `RequestIdFilterTest` (7 cases — UUID generation,
468+
header pass-through, control-char rejection, length bounds, MDC
469+
clear-in-finally including throw path). `GraphHealthIndicatorTest`
470+
extended with cache-hit assertion (3 calls → 1 underlying
471+
`count()`) and updated for sanitized error fields.
472+
`McpToolsTest#readFileShouldHandleMissingFile` updated for new
473+
envelope contract. Full suite: 3680 tests / 0 failures / 0 errors.
474+
416475
## [0.1.0] - 2026-03-28
417476

418477
First general-availability cut. See the

CLAUDE.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -452,6 +452,13 @@ bean for code paths that haven't been ported yet.
452452
- **`serve.sh` and `serve.bat` MUST NOT contain network calls.** Audit RAN-46 §3 — air-gapped deploy model. Pre-PR-3 these scripts had `curl -fL https://repo1.maven.org/...` to download the CLI JAR on first run; that's gone. Receivers must `--include-jar` when bundling or stage the JAR from an internal mirror. There's a regression test in `BundleCommandTest#bundleCreatesZipWithCorrectStructure` that asserts `serve.sh` contains neither `curl` nor `maven.org` — keep that test green.
453453
- **`.dockerignore` does NOT inherit `.gitignore`.** Docker resolves COPY against the build context, which includes uncommitted/untracked working-tree files. `.gitignore` only stops things being staged; it has no effect on what `docker build` sees. Mirror the secret-pattern globs explicitly in `.dockerignore` (`.env*`, `*.jks`, `id_rsa`, `credentials.{json,yaml}`, etc.). Pre-PR-3 the `.dockerignore` was 9 lines and would have shipped a `.env.prod` straight into a published image.
454454
- **Semgrep is pinned to `semgrep==1.161.0`** in `.github/workflows/security.yml`. Bumps go through Dependabot's pip ecosystem on a documented cadence — `pip install --upgrade semgrep` (floating) was previously flagged by Scorecard `Pinned-Dependencies`. Don't unpin to "always get latest"; a CI-time auto-bump on a security-scanner can break the build silently when the new release adds rules.
455+
- **`RequestIdFilter` MUST stay outermost in the security chain.** Filter chain order (set in `SecurityConfig#servingFilterChain`): RequestIdFilter → SecurityHeadersFilter → RateLimitFilter → BearerAuthFilter. If you reorder and put RequestIdFilter inside, the rate-limit reject and auth-reject log lines will fire BEFORE MDC is populated, and clients won't get a correlated `X-Request-Id` response header on 401/429. The filter is registered LAST in the `addFilterBefore(..., UPAFilter.class)` sequence because each `addFilterBefore` pushes the previously-inserted filter further from the target — so the LAST registration is the OUTERMOST in the actual chain.
456+
- **Inbound `X-Request-Id` is allow-list validated, NOT escaped.** `RequestIdFilter` accepts only `[A-Za-z0-9_-]{8,64}`; anything else is replaced with a generated UUID. Don't try to "be helpful" by lowercasing/trimming — the allow-list is the defense, not the sanitizer. CWE-117 log-forging via `X-Request-Id: \nINFO: granted access` is impossible because the value never reaches a logger before validation.
457+
- **MDC must be cleared in finally — even on throw.** `RequestIdFilter.doFilterInternal` puts MDC.request_id, runs the chain in a try-block, and clears in finally. There's a regression test (`RequestIdFilterTest#clearsMdcEvenWhenChainThrows`) that asserts the clear runs when the chain throws. Tomcat's NIO worker pool and virtual-thread carriers both reuse threads; a leaked MDC entry from request N is visible to request N+1 if you skip the finally.
458+
- **`GraphHealthIndicator` is on the readiness probe ONLY.** Configured via `management.endpoint.health.group.{liveness,readiness}` in `application.yml`. Liveness = "JVM up; restart on fail". Readiness = "graph loaded; route traffic only when up". Putting `graphHealthIndicator` on the liveness probe means a graph-down event flaps the pod (k8s killing it) instead of just routing away. If you add another HealthIndicator that's about app health (not infra dependencies), it goes on liveness; if it's about a specific dependency (DB, message bus), it goes on readiness.
459+
- **Logback `serving` profile uses JSON, others use plaintext.** `logback-spring.xml` defines a `JSON` appender (LogstashEncoder) used only by the `<springProfile name="serving">` block. CLI / indexing / default profiles use the `CONSOLE` (plaintext `%msg%n`) appender so that `codeiq index` doesn't dump JSON-shaped log lines into the user's terminal. Don't switch the global default appender to JSON — the CLI is interactive and the project deliberately uses `System.out` for status messages alongside the logger.
460+
- **MCP errors return a structured envelope `{code, message, request_id, error}`.** `McpTools#errorEnvelope(code, exception)` is the canonical wrapper. Codes: `INTERNAL_ERROR` (catch-all), `INVALID_INPUT` (IllegalArgumentException), `FILE_READ_FAILED` (file IO), `SERIALIZATION_FAILED` (Jackson). The legacy `error` field is preserved for backwards-compat with older MCP clients reading `error` directly — DON'T remove it without grepping for clients in this org first. Full exception always logged server-side at WARN with request_id; only sanitized envelope reaches the client.
461+
- **`/actuator/prometheus` is bearer-authenticated, NOT permitAll.** It matches the `/actuator/**` rule in `SecurityConfig`. Don't add `prometheus` to the permitAll list "for the scraper" — Prometheus's `bearer_token_file` config field exists for exactly this case. Anonymous metric scraping is reconnaissance data (request rates, error counts, JVM internals). The application tag `codeiq` is set via `management.metrics.tags.application` so multi-pod ingestion is filterable.
455462

456463
## Supply-chain observability (OpenSSF)
457464

pom.xml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -170,6 +170,24 @@
170170
<version>8.18.0</version>
171171
</dependency>
172172

173+
<!-- Logstash JSON encoder (MIT). Drops a structured JSON line per log
174+
event with timestamp, level, logger, thread, message, MDC entries
175+
(request_id, etc.), and optional stack trace. Used in the serving
176+
profile only — the indexing/CLI profiles keep the human-readable
177+
%msg%n encoder. ~250 KB. -->
178+
<dependency>
179+
<groupId>net.logstash.logback</groupId>
180+
<artifactId>logstash-logback-encoder</artifactId>
181+
<version>9.0</version>
182+
</dependency>
183+
184+
<!-- Micrometer Prometheus registry. Exposed at /actuator/prometheus
185+
behind bearer auth. Spring Boot manages the version via its BOM. -->
186+
<dependency>
187+
<groupId>io.micrometer</groupId>
188+
<artifactId>micrometer-registry-prometheus</artifactId>
189+
</dependency>
190+
173191
<!-- Neo4j Embedded (Community Edition) -->
174192
<dependency>
175193
<groupId>org.neo4j</groupId>
Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
package io.github.randomcodespace.iq.config.security;
2+
3+
import jakarta.servlet.FilterChain;
4+
import jakarta.servlet.ServletException;
5+
import jakarta.servlet.http.HttpServletRequest;
6+
import jakarta.servlet.http.HttpServletResponse;
7+
import org.slf4j.MDC;
8+
import org.springframework.context.annotation.Profile;
9+
import org.springframework.stereotype.Component;
10+
import org.springframework.web.filter.OncePerRequestFilter;
11+
12+
import java.io.IOException;
13+
import java.util.UUID;
14+
import java.util.regex.Pattern;
15+
16+
/**
17+
* First filter in the {@code serving} chain — populates {@code request_id} in
18+
* the SLF4J MDC for every request so downstream code (filters, controllers,
19+
* MCP tools, exception handlers) can read it via {@link MDC#get(String)} and
20+
* the JSON log encoder includes it on every emitted line.
21+
*
22+
* <p><b>ID source priority:</b>
23+
* <ol>
24+
* <li>Inbound {@code X-Request-Id} header, if it matches a strict UUID-or-
25+
* hex/dash pattern (defense against header injection — see CWE-117).
26+
* This lets upstream load balancers / API gateways propagate trace IDs.</li>
27+
* <li>Generated {@link UUID#randomUUID()} otherwise.</li>
28+
* </ol>
29+
*
30+
* <p>The same value is echoed back to the client in the {@code X-Request-Id}
31+
* response header so a caller seeing a 401/429/500 can quote it in support
32+
* channels and operators can grep their JSON logs for it.
33+
*
34+
* <p><b>MDC discipline.</b> The MDC is request-scoped via SLF4J's
35+
* {@code ThreadLocal} — the {@code finally} block clears it so the next
36+
* request on the same (virtual or platform) thread doesn't inherit a stale
37+
* ID. Without this, a thread-pool reuse leaks IDs across requests.
38+
*
39+
* <p><b>Filter chain position.</b> Registered FIRST in
40+
* {@code SecurityConfig#servingFilterChain} (before SecurityHeadersFilter,
41+
* RateLimitFilter, BearerAuthFilter) so the rate-limiter and auth-rejection
42+
* log lines all include the same {@code request_id} that the client receives
43+
* back in the response. If you reorder filters, this MUST stay outermost.
44+
*/
45+
@Component
46+
@Profile("serving")
47+
public class RequestIdFilter extends OncePerRequestFilter {
48+
49+
static final String HEADER = "X-Request-Id";
50+
static final String MDC_KEY = "request_id";
51+
52+
/**
53+
* Strict allow-list for inbound {@code X-Request-Id} header values.
54+
* Hex digits, dashes, underscores, and length 8–64 — accommodates UUIDs
55+
* (with or without dashes), Stripe-style {@code req_*} (after stripping
56+
* the prefix in upstream gateways), and short hex correlation IDs.
57+
* Anything else gets replaced with a generated UUID — log forging via
58+
* {@code X-Request-Id: \nINFO: granted access} is impossible.
59+
*/
60+
private static final Pattern VALID_REQUEST_ID = Pattern.compile("^[A-Za-z0-9_-]{8,64}$");
61+
62+
@Override
63+
protected void doFilterInternal(HttpServletRequest request,
64+
HttpServletResponse response,
65+
FilterChain chain) throws ServletException, IOException {
66+
String inbound = request.getHeader(HEADER);
67+
String requestId = (inbound != null && VALID_REQUEST_ID.matcher(inbound).matches())
68+
? inbound
69+
: UUID.randomUUID().toString();
70+
MDC.put(MDC_KEY, requestId);
71+
response.setHeader(HEADER, requestId);
72+
try {
73+
chain.doFilter(request, response);
74+
} finally {
75+
// Always clear — virtual-thread carriers and Tomcat platform
76+
// threads both pool, so a leaked MDC entry from request N is
77+
// visible to request N+1 if we skip this.
78+
MDC.remove(MDC_KEY);
79+
}
80+
}
81+
}

src/main/java/io/github/randomcodespace/iq/config/security/SecurityConfig.java

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,8 @@ public SecurityFilterChain servingFilterChain(
4040
HttpSecurity http,
4141
BearerAuthFilter bearerAuthFilter,
4242
SecurityHeadersFilter securityHeadersFilter,
43-
RateLimitFilter rateLimitFilter) throws Exception {
43+
RateLimitFilter rateLimitFilter,
44+
RequestIdFilter requestIdFilter) throws Exception {
4445
http
4546
// CSRF is suppressed for ALL paths via ignoringRequestMatchers("/**")
4647
// (functionally equivalent to .csrf().disable() but avoids the literal
@@ -69,16 +70,22 @@ public SecurityFilterChain servingFilterChain(
6970
.requestMatchers("/api/**", "/mcp/**", "/actuator/**").authenticated()
7071
.anyRequest().denyAll())
7172
// Filter chain order (outermost → innermost):
72-
// 1. SecurityHeadersFilter — adds defensive response headers always.
73-
// 2. RateLimitFilter — 429 before any auth or DB work; throttles
73+
// 1. RequestIdFilter — populates MDC.request_id FIRST so every
74+
// downstream log line (rate-limit reject,
75+
// auth-reject, error envelope) carries the
76+
// same ID and the client gets it back in
77+
// X-Request-Id.
78+
// 2. SecurityHeadersFilter — adds defensive response headers always.
79+
// 3. RateLimitFilter — 429 before any auth or DB work; throttles
7480
// unauthenticated brute-force too.
75-
// 3. BearerAuthFilter — token validation; 401 if missing/wrong.
81+
// 4. BearerAuthFilter — token validation; 401 if missing/wrong.
7682
// Each addFilterBefore(X, UsernamePasswordAuthenticationFilter.class) inserts
7783
// X immediately before UPAFilter, pushing the previously-inserted filter farther
7884
// from the target — so the registration order here IS the chain order.
7985
.addFilterBefore(securityHeadersFilter, UsernamePasswordAuthenticationFilter.class)
8086
.addFilterBefore(rateLimitFilter, UsernamePasswordAuthenticationFilter.class)
8187
.addFilterBefore(bearerAuthFilter, UsernamePasswordAuthenticationFilter.class)
88+
.addFilterBefore(requestIdFilter, UsernamePasswordAuthenticationFilter.class)
8289
.formLogin(AbstractHttpConfigurer::disable)
8390
.httpBasic(AbstractHttpConfigurer::disable)
8491
.anonymous(AbstractHttpConfigurer::disable);

0 commit comments

Comments
 (0)