Commit 62806c4

fix(baseline): probe /api/stats for serve-smoke readiness instead of /actuator/health (#45)
The original baseline captured health=fail on both seed repos. Initial hypothesis was that the 8s sleep was too short for Spring Boot + Neo4j cold start. Live probing showed otherwise:

- /actuator/health returns HTTP 503 with body {"groups":["liveness","readiness"],"status":"OUT_OF_SERVICE"} at ALL times, even after the graph is fully loaded.
- /api/stats returns HTTP 200 within ~10-11s on both seeds, populated with real graph data (691/1836 nodes/edges for petclinic, 224/297 for realworld-express).

The real bug is in GraphHealthIndicator, which flags the app as OUT_OF_SERVICE despite a loaded graph. Filed as a separate known gap for a future fix; out of scope for getting the baseline unblocked.

Changes to scripts/baseline/run-pipeline.sh:

- Poll /api/stats (30 x 2s = 60s budget) for readiness. /api/stats is the public REST surface and returns iff the graph is loaded.
- Capture /actuator/health HTTP code + body as a diagnostic; do not gate readiness on it.
- Truncate timings.txt at the start of each run so re-runs don't accumulate stale entries.
- Summary JSON now reports stats_ok (real readiness) and health_raw (diagnostic body) rather than health_ok.

BASELINE.md:

- Marks pipeline serve-smoke gap as RESOLVED with real timings + stats for both seeds.
- Adds a new known gap for the GraphHealthIndicator 503 issue.
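The polling strategy described in the commit message (up to 30 attempts, 2s apart, stop at the first success) can be sketched in Python as follows. This is only an illustration of the retry-budget idea, not the script's actual implementation, which is written in shell. The names `wait_until_ready` and `http_ok` are hypothetical, and the injectable `sleep` and `clock` parameters are an assumption added so the loop can be exercised without a running server.

```python
import time
import urllib.request
from typing import Callable, Tuple


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if the URL answers with a 2xx status.

    Connection errors and HTTP error statuses (raised by urllib as
    URLError/HTTPError, both OSError subclasses) count as "not ready".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, float]:
    """Call probe() up to `attempts` times, sleeping after each failure.

    Returns (ready, elapsed_seconds). As in the shell loop, a sleep also
    follows the final failed attempt.
    """
    start = clock()
    for _ in range(attempts):
        if probe():
            return True, clock() - start
        sleep(interval_s)
    return False, clock() - start
```

A caller would pass something like `lambda: http_ok("http://127.0.0.1:18080/api/stats")` as the probe; the port matches the `PORT=18080` value visible in the diff context.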
1 parent fab34db commit 62806c4

2 files changed

Lines changed: 42 additions & 7 deletions

docs/superpowers/baselines/2026-04-17/BASELINE.md

Lines changed: 10 additions & 0 deletions
@@ -227,6 +227,16 @@ Ordered by severity. Each item cites the raw artifact it was derived from.
 
 - **Pipeline serve-smoke failed on both seed repos** (`health=fail`, `stats=null`). `index` and `enrich` succeeded (petclinic 8+13s, express 5+10s) but the 8-second sleep between starting `serve` and `curl /actuator/health` is at the low end of the documented 8–16s Spring Boot + embedded Neo4j cold-start window (see CLAUDE.md §Gotchas). Fix in Phase F hardening: poll `/actuator/health` with a retry budget instead of a fixed sleep.
 - Raw: `raw/pipeline/spring-petclinic/`, `raw/pipeline/realworld-express/`.
+- **RESOLVED (2026-04-17, branch `phase-a/fixups-pipeline-smoke`)**: patched `run-pipeline.sh` to poll `/api/stats` (up to 60s at 2s interval) as the readiness probe and to capture `/actuator/health` only as a diagnostic. Root cause was *not* a too-short sleep — the server cold-starts in 10–11s on both seeds and `/api/stats` responds with real data, but `/actuator/health` returns HTTP **503 `OUT_OF_SERVICE`** because the `GraphHealthIndicator` reports OUT_OF_SERVICE even after the graph loads. Captured baseline numbers below.
+
+| Seed | index | enrich | ready (stats) | nodes | edges | files | languages | frameworks | health HTTP |
+|---|---:|---:|---:|---:|---:|---:|---|---|---:|
+| spring-petclinic | 4s | 11s | 11s | 691 | 1,836 | 67 | java 18 | spring_boot 24 | 503 |
+| realworld-express | 5s | 10s | 10s | 224 | 297 | 39 | typescript 6 | express 20, prisma 7 | 503 |
+
+Follow-up split out below.
+
+- **`GraphHealthIndicator` reports `OUT_OF_SERVICE` (503) even when the graph is loaded.** Discovered during the pipeline smoke-test fix. `/actuator/health` body: `{"groups":["liveness","readiness"],"status":"OUT_OF_SERVICE"}`. The server is fully functional (`/api/stats` returns real data) but the health indicator makes `/actuator/health` unusable as a readiness probe for orchestrators (K8s, Compose, CI). Fix in `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java`. Low for baseline use; High when we start Dockerizing or targeting K8s.
 
 - **SpotBugs: 8 HIGH-priority findings (priority=1) + 1,484 at priority=2.** Total 1,492. HIGH findings must be triaged individually (read `raw/spotbugs.xml`). Noise-dominant rules (`NM_METHOD_NAMING_CONVENTION`=730, `SF_SWITCH_NO_DEFAULT`=448) should be filtered via a SpotBugs exclude file so real signal surfaces; real-concern patterns that deserve review now: `NP_NULL_ON_SOME_PATH_FROM_RETURN_VALUE` (26), `BC_UNCONFIRMED_CAST` (55), `UL_UNRELEASED_LOCK_EXCEPTION_PATH` (1), `WMI_WRONG_MAP_ITERATOR` (2), `ES_COMPARING_STRINGS_WITH_EQ` (2), `MT_CORRECTNESS` category (1).
 - Raw: `raw/spotbugs.xml`, `raw/spotbugs-summary.json`.
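The `GraphHealthIndicator` entry above treats the `/actuator/health` response as a diagnostic rather than a readiness gate. A minimal sketch of how a captured HTTP code and body could be summarized on those terms is shown below. The function name `describe_health` and the shape of its return value are hypothetical; the only assumptions about the payload are the `status` field quoted in the entry and Spring Boot Actuator's standard `UP` status.

```python
import json


def describe_health(http_code: str, body_text: str) -> dict:
    """Summarize an /actuator/health capture for diagnostics only.

    The result is never used to decide readiness; it just records what
    the endpoint said so the health-indicator fix can be tracked.
    """
    try:
        body = json.loads(body_text)
    except ValueError:
        # Empty or non-JSON body (for example, the server was unreachable).
        body = None
    status = body.get("status") if isinstance(body, dict) else None
    return {
        "http_code": http_code,
        "status": status,
        "reports_up": http_code == "200" and status == "UP",
    }
```

Fed the body quoted above together with code `503`, the sketch reports status `OUT_OF_SERVICE` and `reports_up` as false, which matches the `health HTTP` column in the table.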

scripts/baseline/run-pipeline.sh

Lines changed: 32 additions & 7 deletions
@@ -18,6 +18,8 @@ fi
 
 # Clean any prior state in the seed repo.
 rm -rf "$SEED/.code-intelligence" "$SEED/.osscodeiq"
+# Truncate timings file so re-runs don't append stale entries.
+: > "$OUT/timings.txt"
 
 timer() {
   local label="$1"; shift
@@ -37,13 +39,34 @@ PORT=18080
 java -jar "$JAR" serve "$SEED" --port "$PORT" > "$OUT/serve.log" 2>&1 &
 PID=$!
 trap "kill $PID 2>/dev/null || true" EXIT
-sleep 8
-if curl -sf "http://127.0.0.1:$PORT/actuator/health" > "$OUT/health.json"; then
-  echo "health=ok" >> "$OUT/timings.txt"
+# Poll /api/stats up to 60s (30 x 2s) as the readiness probe. Spring Boot
+# cold-start + embedded Neo4j page-cache warm-up is documented 8-16s (see
+# CLAUDE.md §Gotchas). We deliberately do NOT poll /actuator/health: the
+# GraphHealthIndicator currently reports OUT_OF_SERVICE (503) even after the
+# graph has loaded (tracked as a known gap), so it is not a reliable readiness
+# signal. /api/stats is the public REST surface and returns graph data iff
+# the server has finished starting and loaded the graph.
+ready_t0=$(date +%s)
+ready_ok="no"
+for _ in $(seq 1 30); do
+  if curl -sf "http://127.0.0.1:$PORT/api/stats" > "$OUT/stats.json"; then
+    ready_ok="yes"; break
+  fi
+  sleep 2
+done
+ready_elapsed=$(( $(date +%s) - ready_t0 ))
+if [[ "$ready_ok" == "yes" ]]; then
+  echo "stats=ok ready_after_s=${ready_elapsed}" | tee -a "$OUT/timings.txt"
 else
-  echo "health=fail" >> "$OUT/timings.txt"
+  echo "stats=fail ready_after_s=${ready_elapsed}" | tee -a "$OUT/timings.txt"
+  echo '{"error":"/api/stats never returned 2xx within 60s"}' > "$OUT/stats.json"
 fi
-curl -sf "http://127.0.0.1:$PORT/api/stats" > "$OUT/stats.json" || true
+
+# Capture /actuator/health as a diagnostic snapshot (may be 503 today;
+# still useful for tracking the health-indicator fix over time).
+health_http=$(curl -s -o "$OUT/health.json" -w '%{http_code}' \
+  "http://127.0.0.1:$PORT/actuator/health" 2>/dev/null || echo "000")
+echo "health_http=${health_http}" | tee -a "$OUT/timings.txt"
 kill $PID 2>/dev/null || true
 wait $PID 2>/dev/null || true

@@ -54,11 +77,13 @@ def load(p):
     try: return json.load(open(p))
     except Exception: return None
 t=open("$OUT/timings.txt").read().strip().splitlines()
+stats = load("$OUT/stats.json")
 print(json.dumps({
   "seed": "$NAME",
   "timings": t,
-  "stats": load("$OUT/stats.json"),
-  "health_ok": load("$OUT/health.json") is not None,
+  "stats": stats,
+  "stats_ok": isinstance(stats, dict) and "graph" in stats,
+  "health_raw": load("$OUT/health.json"),
 }, indent=2))
 PY
 cat "$OUT/summary.json"
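The `stats_ok` expression in the summary above can be checked in isolation. The sketch below lifts that one expression into a standalone function (the name `stats_ok` for the function is chosen here for illustration) to show why the error placeholder written on timeout does not count as ready. The contents of the `graph` value in the first example are hypothetical; the diff only shows that a top-level `graph` key is expected.

```python
from typing import Any


def stats_ok(stats: Any) -> bool:
    """Same check as the summary: a JSON object with a top-level "graph" key."""
    return isinstance(stats, dict) and "graph" in stats


# Hypothetical successful payload: only the "graph" key itself is grounded
# in the diff; the inner fields are made up for this example.
loaded = {"graph": {"nodes": 691, "edges": 1836}}

# Placeholder the script writes when /api/stats never returned 2xx.
timed_out = {"error": "/api/stats never returned 2xx within 60s"}

print(stats_ok(loaded))     # a dict with "graph" passes
print(stats_ok(timed_out))  # a dict without "graph" fails
print(stats_ok(None))       # unreadable or missing stats.json fails
```

Keying on the presence of `graph` rather than on "the file parsed as JSON" is what lets the placeholder error object be stored in `stats.json` without being mistaken for a successful run.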
