record_test_baseline fail-open on >120s suite: timeout yields empty baseline indistinguishable from clean #307

## Summary

`record_test_baseline` defaults to a 120s subprocess timeout (`timeout_seconds: int = 120`). On a project whose full test suite runs longer than 120s (this repo's own suite takes ~105s for pytest alone, and longer with render/lint via `make test`), the baseline run times out and returns `{"status":"baseline_failures","timed_out":true,"returncode":-1,"baseline_failures":[]}`. The empty `baseline_failures` list is then indistinguishable from a genuinely green baseline, so any pre-existing failure is silently treated as "not pre-existing". That defeats the regression-vs-pre-existing distinction the baseline exists to provide.

`mapify_version`: 3.20.0

## Observed

During a `/map-efficient` INIT_STATE pre-flight on this repo:
```json
{"status":"baseline_failures","command":"make test","timed_out":true,
 "returncode":-1,"elapsed_seconds":120.01,"baseline_failures":[],...}
```
The suite did not actually fail; it just exceeded the 120s timeout. `baseline_failures: []` here means "we never finished", not "nothing was broken".
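
The failure mode can be reproduced in isolation. The sketch below is not the actual `record_test_baseline` implementation: `run_baseline_sketch` and its `FAILED ` line parsing are hypothetical stand-ins, assuming the runner wraps `subprocess.run` with a timeout and can only parse failures from a run that completed.

```python
import subprocess
import sys
import time


def run_baseline_sketch(command, timeout_seconds):
    """Hypothetical stand-in for a baseline runner (not the real record_test_baseline)."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
        timed_out, returncode, output = False, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # The suite never finished, so there is no complete output to parse.
        timed_out, returncode, output = True, -1, ""

    # Assumed output format: one "FAILED <test id>" line per failing test.
    prefix = "FAILED "
    failures = [
        line[len(prefix):] for line in output.splitlines() if line.startswith(prefix)
    ]
    return {
        "timed_out": timed_out,
        "returncode": returncode,
        "elapsed_seconds": round(time.monotonic() - started, 2),
        "baseline_failures": failures,
    }


# A "suite" that outlives the timeout, and one that finishes with no failures.
slow = run_baseline_sketch(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_seconds=0.5
)
clean = run_baseline_sketch(
    [sys.executable, "-c", "print('1 passed')"], timeout_seconds=30
)

# Both report an empty failure list; only timed_out tells them apart.
print(slow["baseline_failures"], slow["timed_out"])
print(clean["baseline_failures"], clean["timed_out"])
```

Any consumer that reads only `baseline_failures` sees the same value for both runs, which is the fail-open behavior described here.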

## Expected

A baseline that cannot complete should NOT present as an empty (= clean) baseline. Options (any/all):
1. Raise the default `timeout_seconds` for the full-suite path, and/or make it configurable via `.map/config.yaml`. The existing flaky-triage/repro-probe 120s caps are per-run, not per-full-suite; a full suite needs more headroom.
2. When `timed_out` is true, mark the baseline result so downstream consumers treat it as **unknown**, not **clean** (e.g. a `baseline_complete: false` flag that later regression checks must honor — fail-safe, not fail-open).
3. Surface a loud warning when the baseline times out so operators know the regression-vs-pre-existing signal is degraded.
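
Options 2 and 3 could look roughly like the sketch below. It is an illustration under assumptions, not a patch against `map_step_runner.py.jinja`: `finalize_baseline` and `classify_failure` are hypothetical helper names, and `baseline_complete` is the flag proposed in option 2, which does not exist in the current result.

```python
import warnings


def finalize_baseline(result):
    """Return a copy of the result with a baseline_complete flag (hypothetical helper)."""
    complete = not result.get("timed_out", False)
    if not complete:
        # Option 3: make the degraded signal visible to the operator.
        warnings.warn(
            f"test baseline timed out after {result.get('elapsed_seconds')}s; "
            "regression-vs-pre-existing signal is degraded",
            RuntimeWarning,
            stacklevel=2,
        )
    return {**result, "baseline_complete": complete}


def classify_failure(test_id, baseline):
    """Classify a post-change failure against the baseline (hypothetical helper)."""
    if not baseline.get("baseline_complete", False):
        # Option 2, fail-safe: a missing or false flag means "unknown", never "clean".
        return "unknown"
    if test_id in baseline["baseline_failures"]:
        return "pre-existing"
    return "regression"


timed_out_baseline = finalize_baseline(
    {"timed_out": True, "returncode": -1, "elapsed_seconds": 120.01, "baseline_failures": []}
)
green_baseline = finalize_baseline(
    {"timed_out": False, "returncode": 0, "elapsed_seconds": 98.4, "baseline_failures": []}
)

print(classify_failure("tests/test_x.py::test_y", timed_out_baseline))
print(classify_failure("tests/test_x.py::test_y", green_baseline))
```

Treating a missing flag as incomplete keeps older baseline files, written before the flag existed, on the fail-safe side as well.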

## Affected

- `src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja:11687` — `record_test_baseline(..., timeout_seconds: int = 120)` (and the rendered mirrors).

Found while shipping #303 Slice 5a (PR #306). The slice itself was unaffected because the full `make check` was run independently as the real gate, but the baseline degradation is a general framework correctness gap (fail-open on timeout) for any repo with a >120s suite.

