Summary
record_test_baseline defaults to a 120s subprocess timeout (timeout_seconds: int = 120). On a project whose full test suite runs longer than ~120s (this repo's own suite is ~105s pytest alone, longer with render/lint via make test), the baseline run times out and returns {"status":"baseline_failures","timed_out":true,"returncode":-1,"baseline_failures":[]}. The empty baseline_failures is then indistinguishable from a genuinely-green baseline — so any pre-existing failure is silently treated as "not pre-existing", defeating the regression-vs-pre-existing distinction the baseline exists to provide.
mapify_version: 3.20.0
Observed
During a /map-efficient INIT_STATE pre-flight on this repo:
{"status":"baseline_failures","command":"make test","timed_out":true,
"returncode":-1,"elapsed_seconds":120.01,"baseline_failures":[],...}
The suite did not actually fail — it just exceeded 120s. baseline_failures: [] here means "we never finished", not "nothing was broken".
Expected
A baseline that cannot complete should NOT present as an empty (= clean) baseline. Options (any/all):
- Raise the default timeout_seconds for the full-suite path (the existing flaky-triage/repro-probe 120s caps are per-run, not per-full-suite; a full suite needs more headroom), and/or make it configurable via .map/config.yaml.
- When timed_out is true, mark the baseline result so downstream consumers treat it as unknown, not clean (e.g. a baseline_complete: false flag that later regression checks must honor: fail-safe, not fail-open).
- Surface a loud warning when the baseline times out so operators know the regression-vs-pre-existing signal is degraded.
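The second and third options could look roughly like the sketch below. It is not the actual map_step_runner code: annotate_baseline and classify_failure are hypothetical helper names, and the "unknown" / "pre_existing" / "regression" labels are assumptions made for illustration. Only the timed_out, elapsed_seconds, and baseline_failures keys come from the observed output, and baseline_complete is the flag proposed above.

```python
import warnings


def annotate_baseline(result: dict) -> dict:
    """Return a copy of a baseline result with an explicit completeness flag.

    Hypothetical helper: a timed-out run is marked baseline_complete=False
    so that its empty baseline_failures list cannot be read as "clean".
    """
    annotated = dict(result)
    annotated["baseline_complete"] = not result.get("timed_out", False)
    if not annotated["baseline_complete"]:
        # Loud warning so operators know the baseline signal is degraded.
        warnings.warn(
            "test baseline timed out after "
            f"{result.get('elapsed_seconds', '?')}s; "
            "regression-vs-pre-existing classification is unavailable",
            RuntimeWarning,
        )
    return annotated


def classify_failure(test_id: str, baseline: dict) -> str:
    """Classify a failing test against the baseline, failing safe.

    A baseline without baseline_complete=True is treated as unknown,
    never as clean (the label strings are illustrative assumptions).
    """
    if not baseline.get("baseline_complete", False):
        return "unknown"
    if test_id in baseline.get("baseline_failures", []):
        return "pre_existing"
    return "regression"
```

With the observed payload (timed_out: true, baseline_failures: []), classify_failure returns "unknown" for every test instead of silently reporting a regression. Treating a missing flag as incomplete keeps older baseline files fail-safe as well.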
Affected
src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja:11687 — record_test_baseline(..., timeout_seconds: int = 120) (and the rendered mirrors).
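If the timeout were made configurable, the value passed as timeout_seconds might be resolved along these lines. This is a sketch under stated assumptions: resolve_baseline_timeout is a hypothetical helper, the test_baseline.timeout_seconds key is an invented name rather than an existing .map/config.yaml setting, and the config argument is assumed to be the already-parsed YAML as a dict.

```python
DEFAULT_BASELINE_TIMEOUT_SECONDS = 120  # current default in the signature


def resolve_baseline_timeout(
    config: dict,
    default: int = DEFAULT_BASELINE_TIMEOUT_SECONDS,
) -> int:
    """Pick the baseline timeout from parsed config, else the default.

    Assumes a hypothetical layout:
        test_baseline:
          timeout_seconds: 600
    """
    # "or {}" also covers a key that is present but empty (parsed as None).
    section = config.get("test_baseline") or {}
    raw = section.get("timeout_seconds", default)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        raise ValueError(
            f"timeout_seconds must be a positive integer, got {raw!r}"
        )
    return raw
```

Rejecting invalid values outright, rather than falling back to the default, avoids quietly reintroducing the 120s cap when the config contains a typo.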
Found while shipping #303 Slice 5a (PR #306) — the slice itself was unaffected because the full make check was run independently as the real gate, but the baseline degradation is a general framework correctness gap (fail-open on timeout) for any repo with a >120s suite.