Resilience CIA metric + biased red FSM (rebased + cleanups) by PaulHax · Pull Request #14 · ITM-Kitware/jaxborg

PaulHax · 2026-05-06T19:14:06Z

Supersedes #11. Rebases @Dmujt's resilience work onto current main and stacks cleanups on top.

Commits

Rebased from #11 (authorship preserved):

699a5d4 Add resilience metric and CIA scoring from CAGE 2 + CIA-specific red agents and env
a2db6d5 Config to select red agent (incl. CIA-targeted variants) + Dartmouth fixed-host method
8939529 Resilience config
ed69653 Revert default recipe to standard without resilience
1c25f51 Brief README description
8740892 Combine Resilience and Targeted CIA agent envs for training
caa0503 Rename targeted red FSM to resilience red FSM
193eaeb Refactor: clean up resilience classes from prior cc4

(Dropped 9013c5f and 4b3bd01 — those PaulHax commits already landed on main via 1c6b059.)

Stacked cleanups:

5add27f fixup: API ports + lint. sample_red_policy_random was renamed/replaced by sample_red_policy_choice(probs) in main's 10a83b1; ScenarioEnv lost the topology_bank_size/sync_red_policy_bank kwargs that never landed. Plus ruff fixes.
2da68c8 scorer registry. cc4_score_trajectories.py was duplicating evaluation/cia/__init__.py:get_cia_scorer's dispatch with its own if resilience_mode branch. Now uses the registry.
df92cc3 single source of truth for role assignment. Three implementations existed: JAX const, trajectory recorder regex, CybORG mirror — drifting silently. Consolidated into scenarios/cc4/topology_roles.py with a parity test pinning the canonical fn against hand-written hostname lists.
ddd3f55 fix: re-export ROLE_NONE from topology_roles, not resilience_metric. The earlier consolidation removed ROLE_NONE from resilience_metric.py but evaluation/cia/__init__.py was still importing it from there — broke the entire fast pytest suite on import. Re-routes the re-export to the canonical home.
202d67a fix: resilience recipe dev placeholders. train.total_timesteps: 10000 → 3000000 (was tripping the "≥1 update" smoke test); cleanrl.num_envs: 1 → 48 (was making batch not divisible by num_minibatches=16). Both values match the recipe's own inline comments and recipes/default.yaml; they were dev-mode overrides the author left in.
ca0c86f drop cc4_cia_metric (PaulHax's pre-paper CIA scorer). The cc4 metric is a CC4-port of CC2's CIATriadMetric with zone-weighted host scoring and per-event-type CIA mapping, which is not what the resilience paper specifies; the resilience metric is. Keeping both invited "which scorer matches the paper?" confusion. Also drops scripts/eval/cc4_aggregate_cia.py, a CEC-pilot research tool that consumed red_event_counts/blue_event_counts (fields the resilience score doesn't have, so it was inert against the surviving metric). The get_cia_scorer registry indirection is kept so future metrics still register here.

How roles are picked

CC4 has no native CIA / auth / db / web concepts. Those labels are a synthetic per-episode overlay this PR introduces so the resilience metric has something to score.

Each episode picks 3 of the operational-zone server hostnames at random (out of ~6) and tags them auth / db / web. The metric scores impact actions against those 3; the red bias points at those 3 (or a CIA subset). All other hosts, including untagged op-zone servers, are unbiased and unscored that episode. The same env_seed reproduces the same map; over many episodes every op-zone server sees every role.
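A minimal sketch of the per-episode-random rule described above, using only the standard library. The function name `assign_roles`, the hostname strings, and the use of `random.Random` are illustrative assumptions; the actual implementation lives in #15 and may differ (for example, it may draw from a JAX PRNG key rather than an integer seed).

```python
import random

ROLES = ("auth", "db", "web")


def assign_roles(op_zone_servers, env_seed):
    """Tag 3 randomly chosen op-zone servers as auth/db/web.

    Hosts that are not returned in the mapping carry no role for the episode.
    """
    rng = random.Random(env_seed)  # seeded: same env_seed -> same map
    # Sort first so the result does not depend on the caller's input order.
    chosen = rng.sample(sorted(op_zone_servers), k=len(ROLES))
    return dict(zip(chosen, ROLES))


# Hypothetical hostnames, for illustration only.
servers = [f"op_zone_{zone}_server_{i}" for zone in "ab" for i in range(3)]
role_map = assign_roles(servers, env_seed=7)
untagged = [h for h in servers if h not in role_map]
```

Because `rng.sample` returns the selection in random order, both which hosts are chosen and which role each one receives vary with the seed, which is what lets every op-zone server see every role over many episodes.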

State of this PR vs #15: as rebased, this PR's role assignment is "lowest-3 sorted hostnames" (deterministic per topology) on the JAX side and index mod 3 across all op-zone servers on the CybORG side — they diverge. #15 (stacked on this branch) implements the per-episode-random scheme described above and unifies both sides onto one rule. Reviewers who want the final behaviour should look at #15.
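To make the divergence concrete, here is a small sketch of the two rules as this description words them. The function names, the hostnames, the auth/db/web ordering, and the assumption that the CybORG-side index runs over sorted hostnames are all illustrative; the real code may order hosts differently.

```python
ROLES = ("auth", "db", "web")


def lowest_three_sorted(hostnames):
    """JAX-side rule as described: the 3 lowest sorted hostnames get a role."""
    return dict(zip(sorted(hostnames)[:3], ROLES))


def index_mod_three(hostnames):
    """CybORG-side rule as described: every op-zone server gets ROLES[i % 3]."""
    return {h: ROLES[i % 3] for i, h in enumerate(sorted(hostnames))}


# Hypothetical hostnames, for illustration only.
servers = [f"op_server_{i}" for i in range(6)]
jax_side = lowest_three_sorted(servers)
cyborg_side = index_mod_three(servers)

# Hosts whose role differs between the two sides (None means "no role").
mismatched = sorted(h for h in servers if jax_side.get(h) != cyborg_side.get(h))
print(mismatched)  # ['op_server_3', 'op_server_4', 'op_server_5']
```

Under these assumptions the first three hosts agree, but the mod-3 rule also tags the remaining three, so one side biases red toward hosts that the other side never scores.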

CC4's hosts also run real services (apache2, mysqld, smtp, sshd, otservice) chosen randomly per episode; the role assignment ignores that. A meaningful follow-up would derive db/web roles from mysqld/apache2 presence; "auth" has no CC4 service equivalent (sshd is on every host).

Validation

uv run ruff check clean
850+ tests pass (tests/test_resilience_roles.py + the affected test_fsm_red_env/test_cc4_env/tests/subsystems/test_recipes_smoke)

Architecture follow-up — #15

#15 is stacked on this branch and (a) collapses ResilienceRedCC4Env + the four hand-rolled selectors into a red_selector registry + extras_factory injection, and (b) replaces the divergent role-assignment schemes with a single per-episode-random rule. Net effect: the next biased-red PR becomes ~200 lines instead of ~1300, and the resilience metric stops being implicit-per-side. Review order: this PR first, then #15 against main once this merges.

Closes #11.

Two API drifts surfaced after rebase, plus lint nits.

- resilience_red_fsm.py: ported from old sample_red_policy_random + decode-token + use_red_policy_randoms cond branch (deleted in main 10a83b1) to the new sample_red_policy_choice(probs) API. The resilience-specific weighting (eligible * host_weights / sum) is preserved; tape-based parity replay still flows through sample_red_policy_choice.
- resilience_red_env.py: dropped topology_bank_size + sync_red_policy_bank kwargs that were removed from ScenarioEnv when the topology-bank work didn't land.
- Lint: import sort, unused GLOBAL_MAX_HOSTS, one E501 wrap, two noqa: E741 on C/I/A loop variables (CIA-triad domain notation, intentional).
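The weighting mentioned above (eligible * host_weights / sum) can be sketched in plain Python as follows. `biased_host_probs` and `sample_choice` are hypothetical names; `sample_choice` is only a stand-in for `sample_red_policy_choice(probs)`, whose real signature and PRNG handling are not shown in this PR.

```python
import random


def biased_host_probs(eligible, host_weights):
    """Normalise eligible * host_weights into a probability vector."""
    raw = [float(e) * w for e, w in zip(eligible, host_weights)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no eligible host has positive weight")
    return [r / total for r in raw]


def sample_choice(probs, rng):
    """Stand-in for sample_red_policy_choice(probs): draw one index from probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


eligible = [1, 1, 0, 1]               # host 2 is not a valid target this step
host_weights = [1.0, 3.0, 5.0, 1.0]   # bias toward role-tagged hosts (assumed values)
probs = biased_host_probs(eligible, host_weights)
target = sample_choice(probs, random.Random(0))
```

Masking before normalising means an ineligible host gets probability zero no matter how heavily it is weighted, while the remaining mass is shared in proportion to the weights.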

…stry

The CIA scorer registry already exists (evaluation/cia/__init__.py), keyed on eval_cfg["cia_metric"]. The script was duplicating that dispatch with its own resilience_mode if-branch. Replace with a single get_cia_scorer call so new metrics register in one place and scripts stay metric-agnostic.

No behavioural change: both branches resolve to the same callable they did before, just looked up via the registry instead of inline.
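For readers unfamiliar with the pattern, a registry keyed on eval_cfg["cia_metric"] might look roughly like the sketch below. Only the name `get_cia_scorer` and the config key come from this PR; the decorator `register_cia_scorer`, the private dict, the signature, and the placeholder scorer are assumptions made for illustration and are not the repository's code.

```python
from typing import Callable, Dict

# Hypothetical module-level registry: metric name -> scorer callable.
_CIA_SCORERS: Dict[str, Callable[[dict], float]] = {}


def register_cia_scorer(name):
    """Decorator that registers a scorer under a metric name."""
    def decorator(fn):
        _CIA_SCORERS[name] = fn
        return fn
    return decorator


def get_cia_scorer(eval_cfg):
    """Look up the scorer selected by eval_cfg["cia_metric"]."""
    name = eval_cfg["cia_metric"]
    try:
        return _CIA_SCORERS[name]
    except KeyError:
        known = sorted(_CIA_SCORERS)
        raise ValueError(f"unknown cia_metric {name!r}; known: {known}") from None


@register_cia_scorer("resilience")
def score_resilience(trajectory):
    # Placeholder arithmetic, NOT the paper's metric: it only sums impact counts.
    return float(sum(trajectory.get("impact_counts", [])))


scorer = get_cia_scorer({"cia_metric": "resilience"})
score = scorer({"impact_counts": [1, 0, 2]})
```

A script written against `get_cia_scorer` never names a specific metric, so adding a metric means registering one function rather than editing every caller.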

Three implementations existed: JAX topology (by host index), trajectory recorder (regex on hostnames), and CybORG-side biased red agent (its own regex + index%3 mapping). With no test pinning them they were free to drift silently, and on inspection the CybORG-side index%3 scheme already does disagree with the JAX/recorder "lowest 3 sorted" assignment, with no docs.

Consolidate the shared bits (role constants, the operational-server regex, and the canonical hostname-list role assignment) into scenarios/cc4/topology_roles.py. JAX topology, trajectory recorder, and the CybORG mirror agents all import from one place. Resilience metric drops its private ROLE_* constants. RESILIENCE_ROLE_* aliases preserved on the JAX side for backwards compat.

The CybORG mirror keeps its own divergent index%3 scheme, but now with a clear module-docstring explanation of why and a TODO(resilience-parity) flagging the score↔bias mismatch this divergence creates.

Tests: tests/test_resilience_roles.py pins the canonical fn against hand-written hostname lists; future refactors can't silently drift the four call sites.

The earlier consolidation moved ROLE_NONE/AUTH/DB/WEB out of resilience_metric.py into topology_roles.py, but evaluation/cia/__init__.py kept importing from resilience_metric. ROLE_AUTH/DB/WEB happened to keep working because resilience_metric still imports them itself (re-export-by-import) but ROLE_NONE wasn't in that import list, breaking ``from jaxborg.evaluation.cia import score_trajectory_file`` and the entire fast pytest suite. Switch the __init__ re-export to topology_roles for all four constants — that's the canonical home now.

… tests

Two test_recipes_smoke failures, both in recipes/resilience.yaml:

- train.total_timesteps: 10000 → 3000000. With NUM_ENVS=48 * NUM_STEPS=500 = 24000 steps/update, the dev value yielded 0 updates and tripped test_jax_projection's "updates >= 1" sanity check. The recipe's own inline comment ("125 updates @ 3M") plus its meta.notes ("Replication target — 3M steps, mean across 3 seeds") confirm 3M was the intent; 10000 was a leftover dev override never reverted.
- cleanrl.num_envs: 1 → 48. Made batch (1*500*1=500) not divisible by num_minibatches=16, tripping test_minibatch_divides_batch. 48 matches the inline comment ("48*500=24000 steps/update") and recipes/default.yaml.
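The arithmetic behind both failures can be checked with a few lines of Python. The helper names are hypothetical, and the batch size is taken as num_envs * num_steps (the commit message multiplies by a third factor of 1, which is omitted here); the actual smoke tests may compute these values differently.

```python
def num_updates(total_timesteps, num_envs, num_steps):
    """Whole policy updates that fit in the step budget."""
    return total_timesteps // (num_envs * num_steps)


def batch_divides(num_envs, num_steps, num_minibatches):
    """True when the rollout batch splits evenly into minibatches."""
    return (num_envs * num_steps) % num_minibatches == 0


steps_per_update = 48 * 500                        # 24000
dev_updates = num_updates(10_000, 48, 500)         # 0: trips "updates >= 1"
fixed_updates = num_updates(3_000_000, 48, 500)    # 125: matches the inline comment

dev_divides = batch_divides(1, 500, 16)            # False: 500 % 16 == 4
fixed_divides = batch_divides(48, 500, 16)         # True: 24000 == 16 * 1500
```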

Per discussion: the cc4 metric (a CC4-port of CC2's CIATriadMetric with zone-weighted host scoring and per-event-type CIA mapping) is not what the resilience paper specifies. The resilience metric is. Keeping both invites confusion ("which scorer matches the paper?") with no value-add since the paper-spec metric is already shipped.

Removed:

- src/jaxborg/evaluation/cia/cc4_cia_metric.py (the metric itself)
- scripts/eval/cc4_aggregate_cia.py (CEC-pilot research tool that consumed red_event_counts/blue_event_counts, fields the resilience score doesn't carry, so the script was inert against the surviving metric)

Updated:

- evaluation/cia/__init__.py: drop cc4 imports/exports; get_cia_scorer registry stays so future metrics still register here.
- scripts/eval/cc4_score_trajectories.py: drop the defensive getattr(s, "impact_counts") or getattr(s, "red_event_counts") that was straddling both metrics; report only impact_counts now.
- recipe.py:project_eval: change cia_metric default from "cc4" to "resilience"; update docstring.
- recipes/resilience.yaml: drop the now-stale "cc4: original composite" comment.

PR #14's slow run got cancelled at 4h05min with the test suite at ~90% (see run 25460410128). The ubuntu-latest runner has 4 vCPUs so xdist only spawns 2 workers, which makes the 545 slow items take ~3.5h end-to-end. 6h gives ~70% headroom on the observed runtime to absorb future test additions before the next bump.

Amend of f347c35 — go big once instead of bumping again. 10h gives ~3× headroom on the current 3.5h runtime; new tests can grow without tripping the limit for a long while.
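As a quick check of the headroom figures quoted in these two commit messages, assuming headroom is simply timeout divided by observed runtime (the helper name is hypothetical):

```python
def headroom_ratio(timeout_hours, runtime_hours):
    """Timeout expressed as a multiple of the observed runtime."""
    return timeout_hours / runtime_hours


six_hour = headroom_ratio(6, 3.5)    # about 1.71x, i.e. roughly 70% above runtime
ten_hour = headroom_ratio(10, 3.5)   # about 2.86x, i.e. roughly 3x
```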


Dmujt and others added 11 commits May 6, 2026 11:07

Add resilience metric and CIA scoring from CAGE 2. Also add CIA-specific red agents and env for train/test

699a5d4

Added config to select red agent, including targeted CIA red agents. Also updated env to fixed selection of CIA tied hosts for Dartmouth method

a2db6d5

Added resilience config

8939529

Revert default recipe to standard without resilience

ed69653

Added brief description in README for running resilience metric

1c25f51

Combine Resilience and Targeted CIA agent Env for training

8740892

Rename targeted red fsm to resilience red fsm agent

caa0503

Refactor: clean up resilience classes from prior cc4

193eaeb

PaulHax mentioned this pull request May 6, 2026

Pluggable red selector + extras factory (architecture follow-up) #15

Merged

PaulHax added 3 commits May 6, 2026 16:07

PaulHax mentioned this pull request May 6, 2026

Add Quantitative Resilience CIA Metric and Network Topology from CAGE 2 #11

Closed

1 task

PaulHax added 2 commits May 7, 2026 09:35

ci: bump slow timeout 4h → 10h

12bfbba


PaulHax merged commit 2ead5c4 into main May 8, 2026
4 of 6 checks passed

PaulHax deleted the pr-11-rebase branch May 9, 2026 18:09

PaulHax merged 16 commits into main from pr-11-rebase
