Resilience CIA metric + biased red FSM (rebased + cleanups)#14

Merged
PaulHax merged 16 commits into main from pr-11-rebase
May 8, 2026

@PaulHax PaulHax commented May 6, 2026

Collaborator

Supersedes #11. Rebases @Dmujt's resilience work onto current main and stacks cleanups on top.

Commits

Rebased from #11 (authorship preserved):

  • 699a5d4 Add resilience metric and CIA scoring from CAGE 2 + CIA-specific red agents and env
  • a2db6d5 Config to select red agent (incl. CIA-targeted variants) + Dartmouth fixed-host method
  • 8939529 Resilience config
  • ed69653 Revert default recipe to standard without resilience
  • 1c25f51 Brief README description
  • 8740892 Combine Resilience and Targeted CIA agent envs for training
  • caa0503 Rename targeted red FSM to resilience red FSM
  • 193eaeb Refactor: clean up resilience classes from prior cc4

(Dropped 9013c5f and 4b3bd01 — those PaulHax commits already landed on main via 1c6b059.)

Stacked cleanups:

  • 5add27f fixup: API ports + lint. sample_red_policy_random was renamed/replaced by sample_red_policy_choice(probs) in main's 10a83b1; ScenarioEnv lost the topology_bank_size/sync_red_policy_bank kwargs that never landed. Plus ruff fixes.
  • 2da68c8 scorer registry. cc4_score_trajectories.py was duplicating evaluation/cia/__init__.py:get_cia_scorer's dispatch with its own if resilience_mode branch. Now uses the registry.
  • df92cc3 single source of truth for role assignment. Three implementations existed: JAX const, trajectory recorder regex, CybORG mirror — drifting silently. Consolidated into scenarios/cc4/topology_roles.py with a parity test pinning the canonical fn against hand-written hostname lists.
  • ddd3f55 fix: re-export ROLE_NONE from topology_roles, not resilience_metric. The earlier consolidation removed ROLE_NONE from resilience_metric.py but evaluation/cia/__init__.py was still importing it from there — broke the entire fast pytest suite on import. Re-routes the re-export to the canonical home.
  • 202d67a fix: resilience recipe dev placeholders. train.total_timesteps: 10000 → 3000000 (was tripping the "≥1 update" smoke test); cleanrl.num_envs: 1 → 48 (was making batch not divisible by num_minibatches=16). Both values match the recipe's own inline comments and recipes/default.yaml; they were dev-mode overrides the author left in.
  • ca0c86f drop cc4_cia_metric (PaulHax's pre-paper CIA scorer). The cc4 metric is a CC4 port of CC2's CIATriadMetric with zone-weighted host scoring and per-event-type CIA mapping, which is not what the resilience paper specifies. The resilience metric is. Keeping both invited "which scorer matches the paper?" confusion. Also drops scripts/eval/cc4_aggregate_cia.py, a CEC-pilot research tool that consumed red_event_counts/blue_event_counts (fields the resilience score doesn't have, so it was inert against the surviving metric). get_cia_scorer registry indirection kept so future metrics still register here.

How roles are picked

CC4 has no native CIA / auth / db / web concepts. Those labels are a synthetic per-episode overlay this PR introduces so the resilience metric has something to score.

Each episode picks 3 of the operational-zone server hostnames at random (out of ~6) and tags them auth / db / web. The metric scores impact actions against those 3; the red bias points at those 3 (or a CIA subset). All other hosts, including untagged op-zone servers, are unbiased and unscored that episode. The same env_seed reproduces the same map; over many episodes, every op-zone server sees every role.
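
The per-episode pick described above can be sketched in plain Python. This is only an illustration, not the code in this PR or in #15: `assign_roles`, the integer role codes, and the `op_server_*` hostnames are hypothetical, and the real implementation presumably uses JAX PRNG keys rather than `random.Random`.

```python
import random

# Hypothetical integer codes; the real constants live in topology_roles.py.
ROLE_NONE, ROLE_AUTH, ROLE_DB, ROLE_WEB = 0, 1, 2, 3


def assign_roles(op_zone_servers, env_seed):
    """Tag 3 randomly chosen op-zone servers as auth / db / web for one episode."""
    rng = random.Random(env_seed)  # the same seed reproduces the same map
    # Sort first so the draw does not depend on the caller's list order.
    chosen = rng.sample(sorted(op_zone_servers), 3)
    roles = {host: ROLE_NONE for host in op_zone_servers}
    for host, role in zip(chosen, (ROLE_AUTH, ROLE_DB, ROLE_WEB)):
        roles[host] = role
    return roles


hosts = [f"op_server_{i}" for i in range(6)]  # made-up hostnames
episode_roles = assign_roles(hosts, env_seed=7)
tagged = {h: r for h, r in episode_roles.items() if r != ROLE_NONE}
```

Here `tagged` always holds exactly three hosts, one per role, and every other host stays at `ROLE_NONE`, which matches the "unbiased and unscored" wording above.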

State of this PR vs #15: as rebased, this PR's role assignment is "lowest-3 sorted hostnames" (deterministic per topology) on the JAX side and index mod 3 across all op-zone servers on the CybORG side — they diverge. #15 (stacked on this branch) implements the per-episode-random scheme described above and unifies both sides onto one rule. Reviewers who want the final behaviour should look at #15.
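
To make the divergence concrete, here is a minimal sketch of the two rules as this paragraph describes them. The function names and hostnames are hypothetical, and the sketch assumes the CybORG side indexes hosts in sorted order, which the source does not state.

```python
ROLES = ("auth", "db", "web")


def lowest_three_sorted(hosts):
    """JAX-side rule as described: the 3 lowest sorted hostnames get one role each."""
    return dict(zip(sorted(hosts)[:3], ROLES))


def index_mod_three(hosts):
    """CybORG-side rule as described: every op-zone server gets ROLES[index % 3].

    Assumption: the index runs over hosts in sorted order.
    """
    return {host: ROLES[i % 3] for i, host in enumerate(sorted(hosts))}


hosts = [f"op_server_{i}" for i in range(6)]  # made-up hostnames
jax_side = lowest_three_sorted(hosts)
cyborg_side = index_mod_three(hosts)
# Hosts tagged on one side but untagged on the other.
mismatch = sorted(set(cyborg_side) - set(jax_side))
```

Under these assumptions the two rules agree on the first three hosts, but the mod-3 rule also tags the remaining three, so one side treats hosts as role-bearing that the other leaves untagged.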

CC4's hosts also run real services (apache2, mysqld, smtp, sshd, otservice) chosen randomly per episode; the role assignment ignores that. A meaningful follow-up would derive db/web roles from mysqld/apache2 presence; "auth" has no CC4 service equivalent (sshd is on every host).

Validation

  • uv run ruff check clean
  • 850+ tests pass (tests/test_resilience_roles.py + the affected test_fsm_red_env/test_cc4_env/tests/subsystems/test_recipes_smoke)

Architecture follow-up — #15

#15 is stacked on this branch and (a) collapses ResilienceRedCC4Env + the four hand-rolled selectors into a red_selector registry + extras_factory injection, and (b) replaces the divergent role-assignment schemes with a single per-episode-random rule. Net effect: the next biased-red PR becomes ~200 lines instead of ~1300, and the resilience metric stops being implicit-per-side. Review order: this PR first, then #15 against main once this merges.

Closes #11.

Dmujt and others added 11 commits May 6, 2026 11:07
…Also updated env to fixed selection of CIA tied hosts for Dartmouth method
Two API drifts surfaced after rebase, plus lint nits.

resilience_red_fsm.py: ported from old sample_red_policy_random + decode-token +
use_red_policy_randoms cond branch (deleted in main 10a83b1) to the new
sample_red_policy_choice(probs) API. The resilience-specific weighting
(eligible * host_weights / sum) is preserved; tape-based parity replay still
flows through sample_red_policy_choice.

resilience_red_env.py: dropped topology_bank_size + sync_red_policy_bank
kwargs that were removed from ScenarioEnv when the topology-bank work didn't
land.

Lint: import sort, unused GLOBAL_MAX_HOSTS, one E501 wrap, two noqa: E741 on
C/I/A loop variables (CIA-triad domain notation, intentional).
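
As an illustration of the weighting mentioned above (eligible * host_weights / sum), here is a pure-Python sketch. It is not the JAX code in resilience_red_fsm.py: `biased_probs` is a hypothetical helper, and raising on an all-zero vector is this sketch's own choice, not something the source specifies.

```python
def biased_probs(eligible, host_weights):
    """Normalise eligible * host_weights into a probability vector.

    eligible: 0/1 flags, one per host.
    host_weights: non-negative bias weights, one per host.
    """
    raw = [e * w for e, w in zip(eligible, host_weights)]
    total = sum(raw)
    if total == 0:
        # Sketch-only behaviour: refuse to normalise when no host is eligible.
        raise ValueError("no eligible host with positive weight")
    return [r / total for r in raw]


# Host 2 is heavily weighted but ineligible, so it gets zero probability.
probs = biased_probs(eligible=[1, 1, 0, 1], host_weights=[1.0, 3.0, 5.0, 1.0])
```

The resulting vector is the kind of `probs` argument that the commit message says is passed on to sample_red_policy_choice.
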
…stry

The CIA scorer registry already exists (evaluation/cia/__init__.py), keyed
on eval_cfg["cia_metric"]. The script was duplicating that dispatch with its
own resilience_mode if-branch. Replace with a single get_cia_scorer call so
new metrics register in one place and scripts stay metric-agnostic.

No behavioural change — both branches resolve to the same callable they did
before, just looked up via the registry instead of inline.
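
A registry of this kind can be sketched as follows. The names `register_scorer`, `lookup_scorer`, and the placeholder scorer body are hypothetical; the real get_cia_scorer signature is not shown in this PR text, so only the eval_cfg["cia_metric"] key is taken from the source.

```python
from typing import Callable, Dict, List

Scorer = Callable[[List[dict]], float]
_SCORERS: Dict[str, Scorer] = {}


def register_scorer(name: str):
    """Decorator that records a scorer under a config key."""
    def decorator(fn: Scorer) -> Scorer:
        _SCORERS[name] = fn
        return fn
    return decorator


@register_scorer("resilience")
def resilience_scorer(trajectory: List[dict]) -> float:
    # Placeholder body for illustration only; not the paper's metric.
    return float(len(trajectory))


def lookup_scorer(eval_cfg: dict) -> Scorer:
    """Resolve the scorer named by eval_cfg['cia_metric'] in one place."""
    name = eval_cfg["cia_metric"]
    if name not in _SCORERS:
        raise ValueError(f"unknown cia_metric {name!r}; known: {sorted(_SCORERS)}")
    return _SCORERS[name]


scorer = lookup_scorer({"cia_metric": "resilience"})
```

With this shape a script never branches on the metric name itself, so adding a metric means adding one registration rather than editing every caller.
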
Three implementations existed: JAX topology (by host index), trajectory
recorder (regex on hostnames), and CybORG-side biased red agent (its own
regex + index%3 mapping). With no test pinning them they were free to drift
silently — and on inspection the CybORG-side index%3 scheme already does
disagree with the JAX/recorder "lowest 3 sorted" assignment, with no docs.

Consolidate the shared bits — role constants, the operational-server regex,
and the canonical hostname-list role assignment — into
scenarios/cc4/topology_roles.py. JAX topology, trajectory recorder, and
the CybORG mirror agents all import from one place. Resilience metric drops
its private ROLE_* constants. RESILIENCE_ROLE_* aliases preserved on the
JAX side for backwards compat.

The CybORG mirror keeps its own divergent index%3 scheme, but now with a
clear module-docstring explanation of why and a TODO(resilience-parity)
flagging the score↔bias mismatch this divergence creates.

Tests: tests/test_resilience_roles.py pins the canonical fn against
hand-written hostname lists; future refactors can't silently drift the
four call sites.
PaulHax added 3 commits May 6, 2026 16:07
The earlier consolidation moved ROLE_NONE/AUTH/DB/WEB out of
resilience_metric.py into topology_roles.py, but evaluation/cia/__init__.py
kept importing from resilience_metric. ROLE_AUTH/DB/WEB happened to keep
working because resilience_metric still imports them itself
(re-export-by-import) but ROLE_NONE wasn't in that import list, breaking
``from jaxborg.evaluation.cia import score_trajectory_file`` and the entire
fast pytest suite.

Switch the __init__ re-export to topology_roles for all four constants —
that's the canonical home now.
… tests

Two test_recipes_smoke failures, both in recipes/resilience.yaml:

- train.total_timesteps: 10000 → 3000000. With NUM_ENVS=48 * NUM_STEPS=500 =
  24000 steps/update, the dev value yielded 0 updates and tripped
  test_jax_projection's "updates >= 1" sanity check. The recipe's own
  inline comment ("125 updates @ 3M") plus its meta.notes ("Replication
  target — 3M steps, mean across 3 seeds") confirm 3M was the intent;
  10000 was leftover dev override never reverted.

- cleanrl.num_envs: 1 → 48. Made batch (1*500*1=500) not divisible by
  num_minibatches=16, tripping test_minibatch_divides_batch. 48 matches
  the inline comment ("48*500=24000 steps/update") and recipes/default.yaml.
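
The arithmetic behind both failures can be checked with a short sketch. The helper names are hypothetical, and the batch formula drops the trailing factor of 1 from the commit message because the source does not name it.

```python
def num_updates(total_timesteps, num_envs, num_steps):
    """Whole updates that fit into the step budget."""
    return total_timesteps // (num_envs * num_steps)


def minibatch_divides_batch(num_envs, num_steps, num_minibatches):
    """True when the batch splits evenly into minibatches."""
    return (num_envs * num_steps) % num_minibatches == 0


# total_timesteps: the dev value yields no update, the 3M target yields 125.
dev_updates = num_updates(10_000, num_envs=48, num_steps=500)
target_updates = num_updates(3_000_000, num_envs=48, num_steps=500)

# num_envs: a batch of 500 leaves remainder 4 over 16 minibatches; 24000 divides evenly.
dev_ok = minibatch_divides_batch(1, 500, 16)
target_ok = minibatch_divides_batch(48, 500, 16)
```
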
Per discussion: the cc4 metric — a CC4-port of CC2's CIATriadMetric with
zone-weighted host scoring and per-event-type CIA mapping — is not what the
resilience paper specifies. The resilience metric is. Keeping both invites
confusion ("which scorer matches the paper?") with no value-add since the
paper-spec metric is already shipped.

Removed:
- src/jaxborg/evaluation/cia/cc4_cia_metric.py (the metric itself)
- scripts/eval/cc4_aggregate_cia.py (CEC-pilot research tool that consumed
  red_event_counts/blue_event_counts — fields the resilience score doesn't
  carry, so the script was inert against the surviving metric)

Updated:
- evaluation/cia/__init__.py — drop cc4 imports/exports; get_cia_scorer
  registry stays so future metrics still register here.
- scripts/eval/cc4_score_trajectories.py — drop the defensive
  getattr(s, "impact_counts") or getattr(s, "red_event_counts") that
  was straddling both metrics; report only impact_counts now.
- recipe.py:project_eval — change cia_metric default from "cc4" to
  "resilience"; update docstring.
- recipes/resilience.yaml — drop the now-stale "cc4: original composite"
  comment.
PaulHax added 2 commits May 7, 2026 09:35
PR #14's slow run got cancelled at 4h05min with the test suite at ~90%
(see run 25460410128). The ubuntu-latest runner has 4 vCPUs so xdist only
spawns 2 workers, which makes the 545 slow items take ~3.5h end-to-end.
6h gives ~70% headroom on the observed runtime to absorb future test
additions before the next bump.
Amend of f347c35 — go big once instead of bumping again. 10h gives ~3×
headroom on the current 3.5h runtime; new tests can grow without tripping
the limit for a long while.
@PaulHax PaulHax merged commit 2ead5c4 into main May 8, 2026
4 of 6 checks passed
PaulHax added a commit that referenced this pull request May 8, 2026
@PaulHax PaulHax deleted the pr-11-rebase branch May 9, 2026 18:09