Resilience CIA metric + biased red FSM (rebased + cleanups) #14
Merged
…fic red agents and env for train/test
…Also updated env to fixed selection of CIA tied hosts for Dartmouth method
Two API drifts surfaced after rebase, plus lint nits.
- resilience_red_fsm.py: ported from the old sample_red_policy_random + decode-token + use_red_policy_randoms cond branch (deleted in main 10a83b1) to the new sample_red_policy_choice(probs) API. The resilience-specific weighting (eligible * host_weights / sum) is preserved; tape-based parity replay still flows through sample_red_policy_choice.
- resilience_red_env.py: dropped the topology_bank_size + sync_red_policy_bank kwargs that were removed from ScenarioEnv when the topology-bank work didn't land.
- Lint: import sort, unused GLOBAL_MAX_HOSTS, one E501 wrap, two noqa: E741 on C/I/A loop variables (CIA-triad domain notation, intentional).
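A minimal sketch of the weighting mentioned above, in plain Python rather than JAX. The function names, the zero-total fallback, and the `sample_choice` stand-in for sample_red_policy_choice(probs) are assumptions for illustration; the real signatures are not shown in this PR.

```python
import random


def biased_host_probs(eligible, host_weights):
    """Normalise eligible * host_weights into a probability vector."""
    # Ineligible hosts contribute zero mass regardless of their weight.
    raw = [w if e else 0.0 for e, w in zip(eligible, host_weights)]
    total = sum(raw)
    if total <= 0.0:
        # Assumption: with nothing to target, return all zeros and let the caller decide.
        return [0.0] * len(raw)
    return [r / total for r in raw]


def sample_choice(probs, rng):
    """Hypothetical stand-in for sample_red_policy_choice(probs)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


eligible = [True, True, False, True]
host_weights = [1.0, 3.0, 5.0, 1.0]  # host 2 is heavily weighted but not eligible
probs = biased_host_probs(eligible, host_weights)
picked = sample_choice(probs, random.Random(0))
```

Host 2 carries the largest weight but is masked out, so it can never be sampled; the remaining mass is renormalised over hosts 0, 1 and 3.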
…stry
The CIA scorer registry already exists (evaluation/cia/__init__.py), keyed on eval_cfg["cia_metric"]. The script was duplicating that dispatch with its own resilience_mode if-branch. Replace with a single get_cia_scorer call so new metrics register in one place and scripts stay metric-agnostic. No behavioural change: both branches resolve to the same callable they did before, just looked up via the registry instead of inline.
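A rough sketch of the registry pattern this commit relies on. Only the name get_cia_scorer and the eval_cfg["cia_metric"] key come from the commit message; the decorator, the scorer signature, and the placeholder scorer body are assumptions, not the code in evaluation/cia/__init__.py.

```python
from typing import Callable, Dict

Scorer = Callable[[str], float]  # assumed signature: trajectory path -> score

_CIA_SCORERS: Dict[str, Scorer] = {}


def register_cia_scorer(name: str):
    """Hypothetical decorator that adds a scorer under a metric name."""
    def decorator(fn: Scorer) -> Scorer:
        _CIA_SCORERS[name] = fn
        return fn
    return decorator


def get_cia_scorer(eval_cfg: dict) -> Scorer:
    """Look up the scorer named by eval_cfg["cia_metric"]."""
    name = eval_cfg["cia_metric"]
    if name not in _CIA_SCORERS:
        raise KeyError(f"unknown cia_metric {name!r}; registered: {sorted(_CIA_SCORERS)}")
    return _CIA_SCORERS[name]


@register_cia_scorer("resilience")
def score_resilience(trajectory_path: str) -> float:
    # Placeholder body; the real metric lives elsewhere.
    return 0.0


# A script stays metric-agnostic: it never branches on the metric name.
scorer = get_cia_scorer({"cia_metric": "resilience"})
```

With this shape, adding a metric means registering one function; calling scripts do not change.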
Three implementations existed: JAX topology (by host index), trajectory recorder (regex on hostnames), and CybORG-side biased red agent (its own regex + index%3 mapping). With no test pinning them they were free to drift silently, and on inspection the CybORG-side index%3 scheme already disagrees with the JAX/recorder "lowest 3 sorted" assignment, with no docs.
Consolidate the shared bits (role constants, the operational-server regex, and the canonical hostname-list role assignment) into scenarios/cc4/topology_roles.py. JAX topology, trajectory recorder, and the CybORG mirror agents all import from one place. Resilience metric drops its private ROLE_* constants. RESILIENCE_ROLE_* aliases are preserved on the JAX side for backwards compat. The CybORG mirror keeps its own divergent index%3 scheme, but now with a clear module-docstring explanation of why and a TODO(resilience-parity) flagging the score↔bias mismatch this divergence creates.
Tests: tests/test_resilience_roles.py pins the canonical fn against hand-written hostname lists, so future refactors can't silently drift the four call sites.
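To illustrate how a "lowest 3 sorted" rule and an index%3 rule can disagree, here is a hedged sketch. The role constant values, the hostname pattern, the auth/db/web ordering, and both function names are assumptions; the actual definitions in scenarios/cc4/topology_roles.py and the CybORG mirror may differ.

```python
import re

ROLE_NONE, ROLE_AUTH, ROLE_DB, ROLE_WEB = 0, 1, 2, 3  # assumed values
ROLE_ORDER = (ROLE_AUTH, ROLE_DB, ROLE_WEB)           # assumed ordering

# Assumed shape of an operational-zone server hostname.
OP_SERVER_RE = re.compile(r"^operational_zone_[ab]_subnet_server_host_(\d+)$")


def assign_roles_lowest_sorted(hostnames):
    """Tag the three lowest-sorted operational servers; everything else is ROLE_NONE."""
    roles = {h: ROLE_NONE for h in hostnames}
    servers = sorted(h for h in hostnames if OP_SERVER_RE.match(h))
    for host, role in zip(servers, ROLE_ORDER):
        roles[host] = role
    return roles


def assign_roles_index_mod3(hostnames):
    """Tag every operational server by its trailing index modulo 3."""
    roles = {h: ROLE_NONE for h in hostnames}
    for h in hostnames:
        match = OP_SERVER_RE.match(h)
        if match:
            roles[h] = ROLE_ORDER[int(match.group(1)) % 3]
    return roles


hosts = [
    "operational_zone_a_subnet_server_host_0",
    "operational_zone_a_subnet_server_host_1",
    "operational_zone_b_subnet_server_host_0",
    "operational_zone_b_subnet_server_host_1",
    "restricted_zone_a_subnet_server_host_0",
]
canonical = assign_roles_lowest_sorted(hosts)
mirror = assign_roles_index_mod3(hosts)
mismatched = sorted(h for h in hosts if canonical[h] != mirror[h])
```

On this hand-written list the two rules agree on the zone-a servers and disagree on both zone-b servers, which is the kind of drift a pinned test makes visible.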
The earlier consolidation moved ROLE_NONE/AUTH/DB/WEB out of resilience_metric.py into topology_roles.py, but evaluation/cia/__init__.py kept importing from resilience_metric. ROLE_AUTH/DB/WEB happened to keep working because resilience_metric still imports them itself (re-export-by-import) but ROLE_NONE wasn't in that import list, breaking ``from jaxborg.evaluation.cia import score_trajectory_file`` and the entire fast pytest suite. Switch the __init__ re-export to topology_roles for all four constants — that's the canonical home now.
… tests
Two test_recipes_smoke failures, both in recipes/resilience.yaml:
- train.total_timesteps: 10000 → 3000000. With NUM_ENVS=48 * NUM_STEPS=500 =
24000 steps/update, the dev value yielded 0 updates and tripped
test_jax_projection's "updates >= 1" sanity check. The recipe's own
inline comment ("125 updates @ 3M") plus its meta.notes ("Replication
target — 3M steps, mean across 3 seeds") confirm 3M was the intent;
10000 was a leftover dev override that was never reverted.
- cleanrl.num_envs: 1 → 48. The old value made the batch (1*500*1=500) not divisible by
num_minibatches=16, tripping test_minibatch_divides_batch. 48 matches
the inline comment ("48*500=24000 steps/update") and recipes/default.yaml.
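The arithmetic behind both failures can be checked with a small helper. The function name and return shape are hypothetical, and the batch is computed as num_envs * num_steps (the third factor of 1 in the commit message is omitted here).

```python
def ppo_schedule(total_timesteps, num_envs, num_steps, num_minibatches):
    """Return batch size, number of updates, and whether minibatches divide the batch."""
    batch_size = num_envs * num_steps
    return {
        "batch_size": batch_size,
        "num_updates": total_timesteps // batch_size,
        "minibatch_divides": batch_size % num_minibatches == 0,
    }


dev_timesteps = ppo_schedule(10_000, num_envs=48, num_steps=500, num_minibatches=16)
fixed = ppo_schedule(3_000_000, num_envs=48, num_steps=500, num_minibatches=16)
dev_envs = ppo_schedule(3_000_000, num_envs=1, num_steps=500, num_minibatches=16)
```

Here `dev_timesteps` yields 0 updates (10000 is less than one 24000-step batch), `fixed` yields 125 updates with a batch that splits evenly into 16 minibatches, and `dev_envs` has a 500-step batch that 16 does not divide.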
Per discussion: the cc4 metric — a CC4-port of CC2's CIATriadMetric with
zone-weighted host scoring and per-event-type CIA mapping — is not what the
resilience paper specifies. The resilience metric is. Keeping both invites
confusion ("which scorer matches the paper?") with no value-add since the
paper-spec metric is already shipped.
Removed:
- src/jaxborg/evaluation/cia/cc4_cia_metric.py (the metric itself)
- scripts/eval/cc4_aggregate_cia.py (CEC-pilot research tool that consumed
red_event_counts/blue_event_counts — fields the resilience score doesn't
carry, so the script was inert against the surviving metric)
Updated:
- evaluation/cia/__init__.py — drop cc4 imports/exports; get_cia_scorer
registry stays so future metrics still register here.
- scripts/eval/cc4_score_trajectories.py — drop the defensive
getattr(s, "impact_counts") or getattr(s, "red_event_counts") that
was straddling both metrics; report only impact_counts now.
- recipe.py:project_eval — change cia_metric default from "cc4" to
"resilience"; update docstring.
- recipes/resilience.yaml — drop the now-stale "cc4: original composite"
comment.
PR #14's slow run got cancelled at 4h05min with the test suite at ~90% (see run 25460410128). The ubuntu-latest runner has 4 vCPUs so xdist only spawns 2 workers, which makes the 545 slow items take ~3.5h end-to-end. 6h gives ~70% headroom on the observed runtime to absorb future test additions before the next bump.
Amend of f347c35 — go big once instead of bumping again. 10h gives ~3× headroom on the current 3.5h runtime; new tests can grow without tripping the limit for a long while.
PaulHax added a commit that referenced this pull request on May 8, 2026
Supersedes #11. Rebases @Dmujt's resilience work onto current main and stacks cleanups on top.
Commits
Rebased from #11 (authorship preserved):
- 699a5d4 Add resilience metric and CIA scoring from CAGE 2 + CIA-specific red agents and env
- a2db6d5 Config to select red agent (incl. CIA-targeted variants) + Dartmouth fixed-host method
- 8939529 Resilience config
- ed69653 Revert default recipe to standard without resilience
- 1c25f51 Brief README description
- 8740892 Combine Resilience and Targeted CIA agent envs for training
- caa0503 Rename targeted red FSM to resilience red FSM
- 193eaeb Refactor: clean up resilience classes from prior cc4
(Dropped 9013c5f and 4b3bd01; those PaulHax commits already landed on main via 1c6b059.)
Stacked cleanups:
- 5add27f fixup: API ports + lint. sample_red_policy_random was renamed/replaced by sample_red_policy_choice(probs) in main's 10a83b1; ScenarioEnv lost the topology_bank_size / sync_red_policy_bank kwargs that never landed. Plus ruff fixes.
- 2da68c8 scorer registry. cc4_score_trajectories.py was duplicating evaluation/cia/__init__.py:get_cia_scorer's dispatch with its own if resilience_mode branch. Now uses the registry.
- df92cc3 single source of truth for role assignment. Three implementations existed (JAX const, trajectory recorder regex, CybORG mirror) and were drifting silently. Consolidated into scenarios/cc4/topology_roles.py with a parity test pinning the canonical fn against hand-written hostname lists.
- ddd3f55 fix: re-export ROLE_NONE from topology_roles, not resilience_metric. The earlier consolidation removed ROLE_NONE from resilience_metric.py but evaluation/cia/__init__.py was still importing it from there, which broke the entire fast pytest suite on import. Re-routes the re-export to the canonical home.
- 202d67a fix: resilience recipe dev placeholders. train.total_timesteps: 10000 → 3000000 (was tripping the "≥1 update" smoke test); cleanrl.num_envs: 1 → 48 (was making batch not divisible by num_minibatches=16). Both values match the recipe's own inline comments and recipes/default.yaml; they were dev-mode overrides the author left in.
- ca0c86f drop cc4_cia_metric (PaulHax's pre-paper CIA scorer). The cc4 metric is a CC4-port of CC2's CIATriadMetric with zone-weighted host scoring and per-event-type CIA mapping, which is not what the resilience paper specifies. Resilience is. Keeping both invited "which scorer matches the paper?" confusion. Also drops scripts/eval/cc4_aggregate_cia.py, a CEC-pilot research tool that consumed red_event_counts / blue_event_counts (fields the resilience score doesn't have, so it was inert against the surviving metric). get_cia_scorer registry indirection kept so future metrics still register here.
How roles are picked
CC4 has no native CIA / auth / db / web concepts. Those labels are a synthetic per-episode overlay this PR introduces so the resilience metric has something to score.
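As a rough picture of such an overlay (the rule itself is described in the next paragraph), the sketch below draws three servers from a seeded generator and tags them. The function name, the hostnames, the use of Python's `random` module, and the way env_seed feeds the generator are all assumptions for illustration, not the PR's implementation.

```python
import random

ROLES = ("auth", "db", "web")


def pick_episode_roles(op_zone_servers, env_seed):
    """Pick three operational-zone servers for this episode and tag them auth/db/web."""
    rng = random.Random(env_seed)
    # Sort first so the result depends only on the seed, not on input order.
    chosen = rng.sample(sorted(op_zone_servers), k=len(ROLES))
    return dict(zip(chosen, ROLES))


# Hypothetical hostnames standing in for the ~6 op-zone servers.
servers = [f"op_server_{i}" for i in range(6)]
role_map = pick_episode_roles(servers, env_seed=7)
replayed = pick_episode_roles(list(reversed(servers)), env_seed=7)
```

Hosts missing from `role_map` are the untagged ones for that episode; replaying with the same seed returns the same mapping.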
Each episode picks 3 of the operational-zone server hostnames at random (out of ~6) and tags them auth/db/web. The metric scores impact actions against those 3; the red bias points at those 3 (or a CIA subset). All other hosts, including untagged op-zone servers, are unbiased and unscored that episode. Same (env_seed) reproduces the same map; over many episodes every op-zone server sees every role.
CC4's hosts also run real services (apache2, mysqld, smtp, sshd, otservice) chosen randomly per episode; the role assignment ignores that. A meaningful follow-up would derive db/web roles from mysqld/apache2 presence; "auth" has no CC4 service equivalent (sshd is on every host).
Validation
- uv run ruff check clean
- tests/test_resilience_roles.py + the affected test_fsm_red_env / test_cc4_env / tests/subsystems/test_recipes_smoke
Architecture follow-up: #15
#15 is stacked on this branch and (a) collapses ResilienceRedCC4Env + the four hand-rolled selectors into a red_selector registry + extras_factory injection, and (b) replaces the divergent role-assignment schemes with a single per-episode-random rule. Net effect: the next biased-red PR becomes ~200 lines instead of ~1300, and the resilience metric stops being implicit-per-side. Review order: this PR first, then #15 against main once this merges.
Closes #11.