From af46fca5e1a3b1c06e98372e7d8a82f09bec6e7e Mon Sep 17 00:00:00 2001
From: Amit Kumar
Date: Fri, 17 Apr 2026 10:48:15 +0000
Subject: [PATCH] fix(baseline): probe /api/stats for serve-smoke readiness instead of /actuator/health

The original baseline captured health=fail on both seed repos. Initial
hypothesis was that the 8s sleep was too short for Spring Boot + Neo4j
cold start. Live probing showed otherwise:

- /actuator/health returns HTTP 503 with body
  {"groups":["liveness","readiness"],"status":"OUT_OF_SERVICE"}
  at ALL times, even after the graph is fully loaded.
- /api/stats returns HTTP 200 within ~10-11s on both seeds, populated
  with real graph data (691/1836 nodes/edges for petclinic, 224/297
  for realworld-express).

The real bug is in GraphHealthIndicator, which flags the app as
OUT_OF_SERVICE despite a loaded graph. Filed as a separate known gap
for a future fix; out of scope for getting the baseline unblocked.

Changes to scripts/baseline/run-pipeline.sh:
- Poll /api/stats (30 x 2s = 60s budget) for readiness. /api/stats is
  the public REST surface and returns iff the graph is loaded.
- Capture /actuator/health HTTP code + body as a diagnostic; do not
  gate readiness on it.
- Truncate timings.txt at the start of each run so re-runs don't
  accumulate stale entries.
- Summary JSON now reports stats_ok (real readiness) and health_raw
  (diagnostic body) rather than health_ok.

BASELINE.md:
- Marks pipeline serve-smoke gap as RESOLVED with real timings + stats
  for both seeds.
- Adds a new known gap for the GraphHealthIndicator 503 issue.
---
 .../baselines/2026-04-17/BASELINE.md | 10 +++++
 scripts/baseline/run-pipeline.sh     | 39 +++++++++++++++----
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/docs/superpowers/baselines/2026-04-17/BASELINE.md b/docs/superpowers/baselines/2026-04-17/BASELINE.md
index 60550846..1d0eccfc 100644
--- a/docs/superpowers/baselines/2026-04-17/BASELINE.md
+++ b/docs/superpowers/baselines/2026-04-17/BASELINE.md
@@ -227,6 +227,16 @@ Ordered by severity. Each item cites the raw artifact it was derived from.
 - **Pipeline serve-smoke failed on both seed repos** (`health=fail`, `stats=null`). `index` and `enrich` succeeded (petclinic 8+13s, express 5+10s) but the 8-second sleep between starting `serve` and `curl /actuator/health` is at the low end of the documented 8–16s Spring Boot + embedded Neo4j cold-start window (see CLAUDE.md §Gotchas). Fix in Phase F hardening: poll `/actuator/health` with a retry budget instead of a fixed sleep.
   - Raw: `raw/pipeline/spring-petclinic/`, `raw/pipeline/realworld-express/`.
+  - **RESOLVED (2026-04-17, branch `phase-a/fixups-pipeline-smoke`)**: patched `run-pipeline.sh` to poll `/api/stats` (up to 60s at 2s interval) as the readiness probe and to capture `/actuator/health` only as a diagnostic. Root cause was *not* a too-short sleep — the server cold-starts in 10–11s on both seeds and `/api/stats` responds with real data, but `/actuator/health` returns HTTP **503 `OUT_OF_SERVICE`** because the `GraphHealthIndicator` reports OUT_OF_SERVICE even after the graph loads. Captured baseline numbers below.
+
+  | Seed | index | enrich | ready (stats) | nodes | edges | files | languages | frameworks | health HTTP |
+  |---|---:|---:|---:|---:|---:|---:|---|---|---:|
+  | spring-petclinic | 4s | 11s | 11s | 691 | 1,836 | 67 | java 18 | spring_boot 24 | 503 |
+  | realworld-express | 5s | 10s | 10s | 224 | 297 | 39 | typescript 6 | express 20, prisma 7 | 503 |
+
+  Follow-up split out below.
+
+- **`GraphHealthIndicator` reports `OUT_OF_SERVICE` (503) even when the graph is loaded.** Discovered during the pipeline smoke-test fix. `/actuator/health` body: `{"groups":["liveness","readiness"],"status":"OUT_OF_SERVICE"}`. The server is fully functional (`/api/stats` returns real data) but the health indicator makes `/actuator/health` unusable as a readiness probe for orchestrators (K8s, Compose, CI). Fix in `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java`. Low for baseline use; High when we start Dockerizing or targeting K8s.
 - **SpotBugs: 8 HIGH-priority findings (priority=1) + 1,484 at priority=2.** Total 1,492. HIGH findings must be triaged individually (read `raw/spotbugs.xml`). Noise-dominant rules (`NM_METHOD_NAMING_CONVENTION`=730, `SF_SWITCH_NO_DEFAULT`=448) should be filtered via a SpotBugs exclude file so real signal surfaces; real-concern patterns that deserve review now: `NP_NULL_ON_SOME_PATH_FROM_RETURN_VALUE` (26), `BC_UNCONFIRMED_CAST` (55), `UL_UNRELEASED_LOCK_EXCEPTION_PATH` (1), `WMI_WRONG_MAP_ITERATOR` (2), `ES_COMPARING_STRINGS_WITH_EQ` (2), `MT_CORRECTNESS` category (1).
   - Raw: `raw/spotbugs.xml`, `raw/spotbugs-summary.json`.
diff --git a/scripts/baseline/run-pipeline.sh b/scripts/baseline/run-pipeline.sh
index a5650f8f..97970542 100755
--- a/scripts/baseline/run-pipeline.sh
+++ b/scripts/baseline/run-pipeline.sh
@@ -18,6 +18,8 @@ fi
 # Clean any prior state in the seed repo.
 rm -rf "$SEED/.code-intelligence" "$SEED/.osscodeiq"
+# Truncate timings file so re-runs don't append stale entries.
+: > "$OUT/timings.txt"
 timer() {
   local label="$1"; shift
@@ -37,13 +39,34 @@ PORT=18080
 java -jar "$JAR" serve "$SEED" --port "$PORT" > "$OUT/serve.log" 2>&1 &
 PID=$!
 trap "kill $PID 2>/dev/null || true" EXIT
-sleep 8
-if curl -sf "http://127.0.0.1:$PORT/actuator/health" > "$OUT/health.json"; then
-  echo "health=ok" >> "$OUT/timings.txt"
+# Poll /api/stats up to 60s (30 x 2s) as the readiness probe. Spring Boot
+# cold-start + embedded Neo4j page-cache warm-up is documented 8-16s (see
+# CLAUDE.md §Gotchas). We deliberately do NOT poll /actuator/health: the
+# GraphHealthIndicator currently reports OUT_OF_SERVICE (503) even after the
+# graph has loaded (tracked as a known gap), so it is not a reliable readiness
+# signal. /api/stats is the public REST surface and returns graph data iff
+# the server has finished starting and loaded the graph.
+ready_t0=$(date +%s)
+ready_ok="no"
+for _ in $(seq 1 30); do
+  if curl -sf "http://127.0.0.1:$PORT/api/stats" > "$OUT/stats.json"; then
+    ready_ok="yes"; break
+  fi
+  sleep 2
+done
+ready_elapsed=$(( $(date +%s) - ready_t0 ))
+if [[ "$ready_ok" == "yes" ]]; then
+  echo "stats=ok ready_after_s=${ready_elapsed}" | tee -a "$OUT/timings.txt"
 else
-  echo "health=fail" >> "$OUT/timings.txt"
+  echo "stats=fail ready_after_s=${ready_elapsed}" | tee -a "$OUT/timings.txt"
+  echo '{"error":"/api/stats never returned 2xx within 60s"}' > "$OUT/stats.json"
 fi
-curl -sf "http://127.0.0.1:$PORT/api/stats" > "$OUT/stats.json" || true
+
+# Capture /actuator/health as a diagnostic snapshot (may be 503 today;
+# still useful for tracking the health-indicator fix over time).
+health_http=$(curl -s -o "$OUT/health.json" -w '%{http_code}' \
+  "http://127.0.0.1:$PORT/actuator/health" 2>/dev/null || echo "000")
+echo "health_http=${health_http}" | tee -a "$OUT/timings.txt"
 kill $PID 2>/dev/null || true
 wait $PID 2>/dev/null || true
@@ -54,11 +77,13 @@ def load(p):
     try: return json.load(open(p))
     except Exception: return None
 t=open("$OUT/timings.txt").read().strip().splitlines()
+stats = load("$OUT/stats.json")
 print(json.dumps({
     "seed": "$NAME",
     "timings": t,
-    "stats": load("$OUT/stats.json"),
-    "health_ok": load("$OUT/health.json") is not None,
+    "stats": stats,
+    "stats_ok": isinstance(stats, dict) and "graph" in stats,
+    "health_raw": load("$OUT/health.json"),
 }, indent=2))
 PY
 cat "$OUT/summary.json"
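
Not part of this patch: a minimal sketch of how the readiness loop in run-pipeline.sh could be factored into a reusable retry helper. The name `poll_until` and its argument order are hypothetical, and unlike the loop in the patch it skips the sleep after the final failed attempt, so a 30 x 2s call waits at most 58s between probes rather than 60s.

```shell
# Hypothetical helper (not in run-pipeline.sh): retry a command until it
# succeeds or the attempt budget is exhausted.
# Usage: poll_until <attempts> <interval_s> <command> [args...]
# Returns 0 on the first successful attempt, 1 if every attempt fails.
poll_until() {
  _pu_attempts="$1"
  _pu_interval="$2"
  shift 2
  _pu_i=1
  while [ "$_pu_i" -le "$_pu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # No point sleeping after the last failed attempt.
    if [ "$_pu_i" -lt "$_pu_attempts" ]; then
      sleep "$_pu_interval"
    fi
    _pu_i=$((_pu_i + 1))
  done
  return 1
}

# Example mirroring the 30 x 2s budget; OUT and PORT are assumed to be set by
# the caller, as they are in run-pipeline.sh:
#   poll_until 30 2 curl -sf -o "$OUT/stats.json" "http://127.0.0.1:$PORT/api/stats"
```

Passing the probe as a command keeps the retry policy separate from the endpoint, so the same helper could later gate on /actuator/health once the GraphHealthIndicator gap is closed.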