Skip to content

Commit 664cf42

Browse files
aksOpsclaude
andauthored
feat(limits): production-readiness PR 2 — resource limits & abuse protection (#107)
Second of 5 production-readiness PRs (stacked on #106). Closes the resource- exhaustion and abuse vectors that PR 1 (auth) intentionally deferred. Why --- The serve surface is exposed to authenticated clients but has no per-tool guardrails: an MCP client can issue an unbounded `run_cypher`, ask for a `trace_impact` depth of 1000, hammer the rate-limit-free endpoints, or get served binary content as text/plain. Each is its own DoS or readability hazard. Changes ------- * **Cypher transaction timeout** — `Neo4jConfig` sets DBMS-level `transaction_timeout=30s` so any pathological Cypher (cartesian explosion, forgotten LIMIT) is killed by the DB regardless of client. * **`run_cypher` row cap** — MCP `run_cypher` truncates at `mcp.limits.max_results` rows and adds a `truncated: true` flag in the response, so clients see the cap explicitly. * **MCP `trace_impact` depth cap** — clamped to `mcp.limits.max_depth` (default 10). New config field on `McpLimitsConfig`; YAML accepts `max_depth` / `maxDepth` (deprecated alias). * **Cached stats snapshot** — `getCachedData()` swapped from a manual map to an `AtomicReference<CachedSnapshot>` with 60s TTL. Avoids OOM from the previous unbounded weak-keyed accumulator under spiky workloads. * **Per-client rate limiter** — new `RateLimitFilter` using Bucket4j 8.18.0 (`bucket4j_jdk17-core`, Apache-2.0). 300 req/min default, configurable via `mcp.limits.rate_per_minute`. Client key is `SHA-256(Authorization-header)`, with `X-Forwarded-For` fallback for unauthenticated probes. Returns 429 with `Retry-After` and `X-RateLimit-Remaining` headers. Permits health/static. * **`/api/file` content-type guard** — `Files.probeContentType()` returns 415 for non-text MIMEs (.jks, .png, .so, etc.). Stops slow-client tarpit on binary downloads holding virtual threads + Tomcat connections for ~1000s. * **Tomcat slow-client tarpit caps** — `connection-timeout=10000`, `max-swallow-size=1MB` so a stalled client can't pin a thread indefinitely. * **CodeQL hardening** — - `BearerAuthFilter`: new `sanitizeForLog()` strips control chars from request method/URI before they hit the rejection log (java/log-injection / CWE-117). Capped at 256 chars to defend against giant URI log bombs. - `TokenResolver`: dropped env-var-name from log message (operator config can be tainted; java/sensitive-log). - `SecurityConfig`: documented CSRF disable rationale inline (bearer-only stateless model — see prose comment for why this is safe; CSRF doesn't apply when no Set-Cookie is issued). Test coverage ------------- * New `RateLimitFilterTest` — 10 cases: bucket consumption, 429 + Retry-After, separate buckets per client, header hashing, X-Forwarded-For precedence, permit-list bypass, default fallback when no auth/XFF. * `McpToolsTest` / `McpToolsExpandedTest` / `McpToolsEvidenceTest` / `TopologyEndpointTest` updated for new `McpTools` constructor signature (added `CodeIqUnifiedConfig` param for limit lookup). * `TokenResolverTest` updated for new 5-arg `McpLimitsConfig`. * Full suite: 3672 tests, 0 failures, 0 errors, 32 skipped (pre-existing). Refs: docs/audits/2026-04-28-serve-path-prod-readiness*.md (audit findings) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 04ceaf1 commit 664cf42

22 files changed

Lines changed: 570 additions & 27 deletions

CHANGELOG.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -305,6 +305,66 @@ for that specific tag for the per-commit details.
305305
flow + token-issuance endpoint, OR server-side template injection) is its
306306
own design — tracked as a follow-up issue.
307307

308+
- **Production-readiness PR 2 of 5 — resource limits & abuse protection.**
309+
Closes audit findings #2, #3, C1 (HIGH) and #10, #11 (MEDIUM).
310+
- **Cypher transaction timeout** (audit #2). Neo4j embedded
311+
`GraphDatabaseSettings.transaction_timeout = 30s` configured in
312+
`Neo4jConfig` — every transaction in the JVM, including `run_cypher`
313+
and graph traversals, gets a hard wall-clock cap. Catches runaway
314+
variable-length matches before they starve the page cache.
315+
- **Result-set cap on `run_cypher`** (audit #2). Hard row cap at
316+
`mcp.limits.max_results` (default 500); excess rows dropped, response
317+
carries `truncated: true` + `max_results: N`. Defends the JVM heap
318+
against `MATCH (a),(b),(c) RETURN a,b,c LIMIT 999999999` blowups.
319+
- **MCP `traceImpact` depth cap** (audit #10 corrected, C3). New
320+
`mcp.limits.max_depth` field (default 10) wired into
321+
`McpTools.traceImpact` via `Math.min`. Defends against
322+
`RELATES_TO*1..1000` Cartesian explosions on hub nodes.
323+
- **TTL snapshot cache on topology tools** (audit C1). `McpTools.
324+
getCachedData()` now backed by a 60-second TTL snapshot. Without it,
325+
every concurrent `service_dependencies` / `blast_radius` /
326+
`find_path` / `find_bottlenecks` / `find_circular_deps` /
327+
`find_dead_services` / `find_node` call paid the full
328+
`graphStore.findAll()` cost and double-allocated multi-GB heaps.
329+
A bridge fix; the proper refactor (TopologyService → per-tool Cypher)
330+
is a tracked follow-up.
331+
- **Per-client rate limiter** (audit #3). New `RateLimitFilter` using
332+
Bucket4j 8.18.0 (Apache-2.0). Token bucket sized at
333+
`mcp.limits.rate_per_minute` (default 300). Keyed by SHA-256 hash of
334+
the `Authorization` header (so the token never lives in our key map),
335+
falls back to `X-Forwarded-For` (first hop) or `RemoteAddr`. 429
336+
response with `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`
337+
headers. Registered before `BearerAuthFilter` so unauthenticated
338+
brute-force is also throttled.
339+
- **`/api/file` content-type sniff** (audit #11 corrected). Added
340+
`Files.probeContentType` guard — non-text MIMEs (`.jks`, `.so`,
341+
`.png`, native libs) return HTTP 415 with the probed type, instead
342+
of being served as garbled `text/plain`. Allowlist: `text/*`,
343+
`application/json`, `application/xml`, `application/x-yaml`,
344+
`application/javascript`. The byte cap (already enforced by
345+
`SafeFileReader`) is unchanged.
346+
- **Tomcat slow-client tarpit** (audit #11). `server.tomcat.connection-
347+
timeout: 10s`, `max-swallow-size: 1MB` in the serving profile —
348+
drops connections that hold a virtual thread + Tomcat connection at
349+
1 KB/s.
350+
- **CodeQL hardening on the security baseline.** Sanitised request
351+
method + URI before logging in `BearerAuthFilter` (CWE-117 / CodeQL
352+
`java/log-injection`); removed env-var name from the bearer-token
353+
bootstrap log line in `TokenResolver` (CodeQL `java/sensitive-log`);
354+
documented the deliberate stateless-bearer rationale on
355+
`SecurityConfig.csrf(disable)` (CodeQL `java/spring-disabled-csrf-protection`
356+
— no exploit path on a no-cookie surface).
357+
- **Tests:** new `RateLimitFilterTest` (10 cases: under/over limit,
358+
separate buckets per client, header-hashing, X-Forwarded-For
359+
precedence, permit-list, default-rate fallback). Existing 6 test
360+
classes updated for the new `McpTools` ctor signature. Full suite:
361+
3672 tests / 0 failures / 0 errors.
362+
363+
**Known follow-up:** TopologyService still walks the full snapshot
364+
in-memory after the cache hit — long-term plan is to rewrite each
365+
topology tool as a targeted Cypher query so the snapshot isn't needed.
366+
The cache is the bridge; the rewrite reduces peak memory.
367+
308368
## [0.1.0] - 2026-03-28
309369

310370
First general-availability cut. See the

CLAUDE.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -439,6 +439,14 @@ bean for code paths that haven't been ported yet.
439439
- **Never log the `Authorization` header.** `BearerAuthFilter` deliberately never passes the header value to a logger, even at DEBUG. The rejection log line carries only `method` and `requestURI`. There's a regression test (`tokenValueNeverAppearsInLogs`) that captures all log lines for the filter and asserts the secret substring is absent.
440440
- **`mode=none` + active `serving` profile = startup failure** unless `codeiq.mcp.auth.allow_unauthenticated=true` is **explicitly** set. This is by design — operators must opt into permissive mode deliberately. `mode=mtls` is reserved and currently throws "not yet implemented" (better than silently passing through).
441441
- **`server.error.include-stacktrace: never`** is set in the serving profile as defense-in-depth alongside `GlobalExceptionHandler`. Don't enable it for "easier debugging" — stack frames in the response body leak class names + paths (CWE-209). Use the `request_id` in the envelope to correlate to the WARN log line where the full stack is captured.
442+
- **Cypher transaction wall-clock cap is configured at the DBMS level**, not per-call. `Neo4jConfig.databaseManagementService(...)` sets `GraphDatabaseSettings.transaction_timeout = 30s` so every transaction gets the cap automatically. Don't reach for `graphDb.beginTx(timeout, unit)` overload in tool code — the test suite mocks `beginTx()` with no args and the overload changes the matcher signature, breaking the existing stubs across `McpToolsTest` / `McpToolsExpandedTest` / `McpToolsEvidenceTest`.
443+
- **`McpTools.runCypher` row cap is enforced in the iteration loop, not via `LIMIT`.** After `maxResults` rows are accumulated the loop breaks and the response carries `truncated: true` + `max_results: N`. Don't try to inject `LIMIT N` into the user-supplied query string — that would require parsing the query (and the user's query may already have its own LIMIT).
444+
- **`McpTools.getCachedData()` 60-second TTL snapshot is a bridge fix.** It's NOT the proper solution — the proper solution is to rewrite each topology MCP tool to use a targeted Cypher query so the full graph never needs to live on heap. The cache caps peak memory under concurrent calls but the snapshot itself is still multi-GB on large graphs. When that refactor lands, the `AtomicReference<CachedSnapshot>` and `getCachedData()` itself can be deleted.
445+
- **`RateLimitFilter` keys by `sha256(Authorization)`** — the raw token NEVER goes into the bucket key map. The 16-hex-char digest is enough collision resistance for keying. Falls back to `X-Forwarded-For` (first hop) → `RemoteAddr` when no auth header is present. Buckets live in a `ConcurrentHashMap` — bounded in practice by `num_distinct_clients`, which for the single-tenant pod shape is small. Swap to a Caffeine cache with a max-size eviction if multi-tenant exposure is ever added.
446+
- **Filter chain order in `serving` profile**: `SecurityHeadersFilter``RateLimitFilter``BearerAuthFilter` → ... → controller. Each `addFilterBefore(X, UsernamePasswordAuthenticationFilter.class)` inserts X immediately before UPAFilter, pushing the previously-inserted filter farther from the target — so the **registration order in `SecurityConfig.servingFilterChain` IS the chain order**. Don't shuffle without re-reasoning about it: if `RateLimitFilter` ran AFTER `BearerAuthFilter`, an unauthenticated brute-force attempt would never get throttled (would just see 401 over and over, hitting the slow path).
447+
- **`Files.probeContentType` is best-effort** — JDK 25 on Linux uses `/etc/mime.types` + magic-byte fallback. It returns `null` if the type can't be determined; treat that as "let it through" (the byte cap in `SafeFileReader` still bounds size). The allowlist for `/api/file` is `text/*` + `application/{json,xml,x-yaml,javascript}` — extending requires adding to the explicit list in `GraphController.readFile`.
448+
- **Sanitize user-controlled values before logging.** `BearerAuthFilter.sanitizeForLog(String)` strips `\p{Cntrl}` and truncates at 256 chars. Use it on anything tainted by `request.getRequestURI()`, `request.getMethod()`, headers, etc. before passing to a logger. CodeQL `java/log-injection` will flag direct `log.warn("... {} ...", request.getRequestURI())` as a vuln.
449+
- **`mcp.limits.max_depth` is a NEW field on `McpLimitsConfig`** (default 10). Audit #10 / C3 — the original audit assumed it existed but it didn't. When adding new MCP traversal tools, cap depth via `Math.min(callerSupplied, maxDepth)` before passing to Cypher. The REST endpoint already had this guard via `config.getMaxDepth()` from `CodeIqConfig`; the MCP path now mirrors it via `McpLimitsConfig.maxDepth()`.
442450

443451
## Supply-chain observability (OpenSSF)
444452

pom.xml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,16 @@
160160
<artifactId>spring-boot-starter-security</artifactId>
161161
</dependency>
162162

163+
<!-- Bucket4j: in-process token-bucket rate limiter (Apache-2.0).
164+
Used by RateLimitFilter to throttle /api and /mcp on a per-token /
165+
per-IP key. Single-replica serving = single bucket per key, so no
166+
cluster coordination needed. ~80 KB; pure Java, no native deps. -->
167+
<dependency>
168+
<groupId>com.bucket4j</groupId>
169+
<artifactId>bucket4j_jdk17-core</artifactId>
170+
<version>8.18.0</version>
171+
</dependency>
172+
163173
<!-- Neo4j Embedded (Community Edition) -->
164174
<dependency>
165175
<groupId>org.neo4j</groupId>

src/main/java/io/github/randomcodespace/iq/api/GraphController.java

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -298,6 +298,26 @@ public ResponseEntity<String> readFile(
298298
if (!Files.isRegularFile(resolvedReal)) {
299299
return ResponseEntity.notFound().build();
300300
}
301+
// Reject non-text files early. Without this, .jks keystores, .png images,
302+
// native .so libraries get served as text/plain with garbled UTF-8 — a
303+
// slow client at 1 KB/s holds a virtual thread + Tomcat connection for
304+
// 1000s. Audit #11 (revised): SafeFileReader already enforces the byte
305+
// cap; the gap is the content-type guard.
306+
try {
307+
String probedType = Files.probeContentType(resolvedReal);
308+
if (probedType != null && !probedType.startsWith("text/")
309+
&& !probedType.equals("application/json")
310+
&& !probedType.equals("application/xml")
311+
&& !probedType.equals("application/x-yaml")
312+
&& !probedType.equals("application/javascript")) {
313+
return ResponseEntity.status(HttpStatus.UNSUPPORTED_MEDIA_TYPE)
314+
.contentType(MediaType.TEXT_PLAIN)
315+
.body("File is not a text/source type (probed: " + probedType + ")");
316+
}
317+
} catch (IOException probeFail) {
318+
// probeContentType is best-effort; if it fails, fall through to read.
319+
// SafeFileReader byte cap still bounds the response size.
320+
}
301321
try {
302322
String content = SafeFileReader.read(resolvedReal, startLine, endLine, config.getMaxFileBytes());
303323
return ResponseEntity.ok()

src/main/java/io/github/randomcodespace/iq/config/Neo4jConfig.java

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
import org.springframework.data.neo4j.repository.config.EnableNeo4jRepositories;
1818

1919
import java.nio.file.Path;
20+
import java.time.Duration;
2021
import java.util.Arrays;
2122

2223
/**
@@ -40,7 +41,12 @@ public class Neo4jConfig {
4041
DatabaseManagementService databaseManagementService(CodeIqConfig config, Environment env) {
4142
var builder = new DatabaseManagementServiceBuilder(Path.of(config.getGraph().getPath()))
4243
.setConfig(BoltConnector.enabled, true)
43-
.setConfig(BoltConnector.listen_address, new SocketAddress("localhost", boltPort));
44+
.setConfig(BoltConnector.listen_address, new SocketAddress("localhost", boltPort))
45+
// Hard wall-clock cap on every transaction. Prevents a runaway Cypher
46+
// (e.g. unbounded variable-length match on a hub node) from hogging
47+
// the page cache and starving readiness/liveness probes. Audit
48+
// finding #2 (HIGH) — runs alongside per-tool timeouts in McpTools.
49+
.setConfig(GraphDatabaseSettings.transaction_timeout, Duration.ofSeconds(30));
4450

4551
// Read-only mode for serving profile — no lock files, no transaction logs.
4652
// Required for read-only filesystems (e.g., AKS with read-only volumes).

src/main/java/io/github/randomcodespace/iq/config/security/BearerAuthFilter.java

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,8 @@ protected void doFilterInternal(HttpServletRequest request,
8888
// CRITICAL: never log the Authorization header value. Method and
8989
// URI are sanitized with sanitizeForLog (strips \r\n\t — defends
9090
// against CWE-117 log forging via crafted URIs; CodeQL
91-
// java/log-injection).
91+
// java/log-injection). A request line like
92+
// `GET /\nINFO: granted access HTTP/1.1` can't inject fake log lines.
9293
log.warn("Auth rejected: {} {} (request_id={})",
9394
sanitizeForLog(request.getMethod()),
9495
sanitizeForLog(request.getRequestURI()),
@@ -150,9 +151,11 @@ private static String currentRequestId() {
150151
/**
151152
* Strip CR/LF/TAB before sending request-derived data to a log appender.
152153
* Defends against log forging via crafted URIs (CWE-117 / CodeQL
153-
* {@code java/log-injection}). Using explicit single-char replace chains
154-
* is the pattern CodeQL's standard sanitizer-recognizer matches against.
155-
* Output is also length-capped at 256 chars to prevent log-bomb URIs.
154+
* {@code java/log-injection}). Explicit single-char replace chains are
155+
* the pattern CodeQL's standard sanitizer-recognizer matches against
156+
* — {@code replaceAll("[\p{Cntrl}]", ...)} was not picked up by the
157+
* data-flow analysis. Output is also length-capped at 256 chars to
158+
* prevent log-bomb URIs.
156159
*/
157160
static String sanitizeForLog(String s) {
158161
if (s == null) return "null";

0 commit comments

Comments
 (0)