record_test_baseline fail-open on >120s suite: timeout yields empty baseline indistinguishable from clean #307

Description

@azalio

Summary

record_test_baseline defaults to a 120s subprocess timeout (timeout_seconds: int = 120). On a project whose full test suite runs longer than ~120s, the baseline run times out and returns {"status":"baseline_failures","timed_out":true,"returncode":-1,"baseline_failures":[]}. (This repo's own suite takes ~105s for pytest alone, and longer with render/lint via make test.) The empty baseline_failures list is then indistinguishable from a genuinely green baseline, so any pre-existing failure is silently treated as "not pre-existing". That defeats the regression-vs-pre-existing distinction the baseline exists to provide.

mapify_version: 3.20.0

Observed

During a /map-efficient INIT_STATE pre-flight on this repo:

{"status":"baseline_failures","command":"make test","timed_out":true,
 "returncode":-1,"elapsed_seconds":120.01,"baseline_failures":[],...}

The suite did not actually fail — it just exceeded 120s. baseline_failures: [] here means "we never finished", not "nothing was broken".
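To make the ambiguity concrete, here is a minimal sketch of a downstream check that looks only at baseline_failures. The classify_naive function and the shape of the "clean" result are assumptions for illustration, not the actual mapify consumer; the timed-out dict copies the fields from the output above.

```python
# Hypothetical consumer: NOT the real mapify code, only an illustration.
def classify_naive(baseline: dict, current_failures: list[str]) -> dict[str, str]:
    """Label each current failure using only the baseline_failures list."""
    known = set(baseline.get("baseline_failures", []))
    return {
        test_id: ("pre-existing" if test_id in known else "regression")
        for test_id in current_failures
    }


# Assumed shape of a baseline that really finished with nothing failing.
clean_baseline = {"timed_out": False, "returncode": 0, "baseline_failures": []}

# Fields taken from the observed timed-out result.
timed_out_baseline = {
    "status": "baseline_failures",
    "command": "make test",
    "timed_out": True,
    "returncode": -1,
    "elapsed_seconds": 120.01,
    "baseline_failures": [],
}

# Hypothetical test ID that was already failing before the change.
current = ["tests/test_example.py::test_already_broken"]

naive_clean = classify_naive(clean_baseline, current)
naive_timed_out = classify_naive(timed_out_baseline, current)
print(naive_clean == naive_timed_out)
```

Both baselines produce the same classification, so a failure that predates the change is reported as a regression whenever the baseline run was cut short.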

Expected

A baseline that cannot complete should NOT present as an empty (= clean) baseline. Options (any/all):

  1. Raise the default timeout_seconds for the full-suite path (the existing flaky-triage/repro-probe 120s caps are per-run, not per-full-suite — a full suite needs more headroom), and/or make it configurable via .map/config.yaml.
  2. When timed_out is true, mark the baseline result so downstream consumers treat it as unknown, not clean (e.g. a baseline_complete: false flag that later regression checks must honor — fail-safe, not fail-open).
  3. Surface a loud warning when the baseline times out so operators know the regression-vs-pre-existing signal is degraded.
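Options 2 and 3 could look roughly like the sketch below. Everything here is an assumption for illustration: record_baseline_sketch, classify_fail_safe, the parse_failures callback, and the baseline_complete key are hypothetical names, and the real record_test_baseline signature beyond timeout_seconds is not shown in this issue.

```python
import subprocess
import sys
import warnings
from typing import Callable


def record_baseline_sketch(
    command: list[str],
    timeout_seconds: float,
    parse_failures: Callable[[str], list[str]],
) -> dict:
    """Run the suite once and never report a timed-out run as complete."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # Option 3: make the degraded signal loud.
        warnings.warn(
            f"baseline timed out after {timeout_seconds}s; "
            "regression-vs-pre-existing signal is degraded"
        )
        # Option 2: mark the result as incomplete instead of clean.
        return {
            "timed_out": True,
            "returncode": -1,
            "baseline_complete": False,
            "baseline_failures": [],
        }
    return {
        "timed_out": False,
        "returncode": proc.returncode,
        "baseline_complete": True,
        "baseline_failures": parse_failures(proc.stdout),
    }


def classify_fail_safe(baseline: dict, current_failures: list[str]) -> dict[str, str]:
    """Return 'unknown' for every failure when the baseline never finished."""
    if not baseline.get("baseline_complete", False):
        return {test_id: "unknown" for test_id in current_failures}
    known = set(baseline["baseline_failures"])
    return {
        test_id: ("pre-existing" if test_id in known else "regression")
        for test_id in current_failures
    }


# Stand-in for a slow suite: a child process that sleeps past the timeout.
slow_suite = [sys.executable, "-c", "import time; time.sleep(5)"]
incomplete = record_baseline_sketch(slow_suite, 0.2, lambda stdout: [])
labels = classify_fail_safe(incomplete, ["tests/test_example.py::test_x"])
print(incomplete["baseline_complete"], labels)
```

Treating a missing baseline_complete key as incomplete keeps older result files fail-safe as well, at the cost of more "unknown" labels until baselines are re-recorded.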

Affected

  • src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja:11687: record_test_baseline(..., timeout_seconds: int = 120) (and the rendered mirrors).

Found while shipping #303 Slice 5a (PR #306). The slice itself was unaffected because the full make check was run independently as the real gate, but the baseline degradation is a general framework correctness gap (fail-open on timeout) for any repo with a >120s suite.

Labels: bug
