feat(chaos): chaos injector with Trigger/Fault registries (Stage 2c) #7

Closed
pradeepvrd wants to merge 3 commits into integration/devops-bench-stage1 from feat/devops-bench-chaos

@pradeepvrd

Owner

Splits the legacy chaos module into devops_bench/chaos/ (← pkg/agents/chaos/chaos.py).

  • base.py (Trigger/Fault ABCs + FAULTS/TRIGGERS registries), agent.py (ChaosAgent loop), faults/generate_load.py.
  • Model-agnostic: the chaos LLM loop runs through devops_bench.models (get_model/LLMClient tool-calling) — no provider SDK. chaos_active_event signaling preserved.
  • Tests under tests/unit/chaos/.

Stacked draft PR — part of the in-place Stage 2/3 restructure (see docs/migration/pr-plan.md). Base is the fork branch shown above; it will be retargeted to gke-labs/main once Stage 1 (gke-labs#89–92) merges. PRs are intended to be reviewed and merged in stage order.

Status: peer-reviewed by 2 teammates + senior sign-off on the full integration branch; full suite green (ruff + 374 unit tests). Do NOT mark ready until its stage is up for merge.

pradeepvrd force-pushed the feat/devops-bench-chaos branch from 2ab9ff8 to 3819a8e on June 18, 2026 07:57
…BCs and registries (2c)

Modules moved/refactored:
- pkg/agents/chaos/chaos.py -> devops_bench/chaos/agent.py (ChaosAgent loop)
                            + devops_bench/chaos/faults/generate_load.py (fault exec)
- new devops_bench/chaos/base.py (Fault/Trigger ABCs + FAULTS/TRIGGERS registries)
- new devops_bench/chaos/__init__.py + devops_bench/chaos/faults/__init__.py (light re-exports; no SDK imports)
- new tests/unit/chaos/test_chaos_agent.py + test_chaos_generate_load.py (legacy chaos_test.py ported to pytest)

Bugs fixed vs legacy:
- none (pure structural move; behavioral fixes land in the following fix(chaos) commit)

Improvements vs legacy:
- split the monolithic ChaosAgent into an orchestration layer (agent.py) and a registered fault (faults/generate_load.py), so faults are pluggable
- added Fault/Trigger ABCs and the FAULTS/TRIGGERS registries (base.py) per the component design, replacing ad-hoc dispatch on action "type"
- made the LLM loop model-agnostic: drive it through the neutral devops_bench.models LLMClient interface (get_model + format_tools/generate_content/extract_function_calls/get_text_content) instead of the hardcoded google.genai chat client, with provider/model from CHAOS_PROVIDER/CHAOS_MODEL falling back to AGENT_PROVIDER/AGENT_MODEL
- preserved the chaos_active_event signaling so the harness can detect an active load spike
- exposed command execution as a single run_command tool and bounded the loop with a turn cap
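
The registry-based dispatch described in this commit can be sketched roughly as follows. This is a minimal illustration, not the contents of base.py: the `inject` method, the `register_fault` decorator, the `GenerateLoad` class body, and `run_fault` are hypothetical names chosen for the example, and the real FAULTS registry may store different values.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class Fault(ABC):
    """A fault the chaos injector can apply (hypothetical interface)."""

    @abstractmethod
    def inject(self, spec: dict) -> str:
        """Apply the fault described by spec and return a short status string."""


# Name -> Fault subclass. Assumed shape; the real registry may differ.
FAULTS: Dict[str, Type[Fault]] = {}


def register_fault(name: str) -> Callable[[Type[Fault]], Type[Fault]]:
    """Class decorator that records a Fault subclass under name (hypothetical helper)."""

    def decorator(cls: Type[Fault]) -> Type[Fault]:
        if name in FAULTS:
            raise ValueError(f"fault {name!r} is already registered")
        FAULTS[name] = cls
        return cls

    return decorator


@register_fault("generate_load")
class GenerateLoad(Fault):
    """Placeholder fault; it only reports what it would do."""

    def inject(self, spec: dict) -> str:
        return f"would generate load against {spec.get('target', 'unknown target')}"


def run_fault(action: dict) -> str:
    """Look the fault up by name instead of branching on action['type']."""
    fault_cls = FAULTS.get(action.get("type", ""))
    if fault_cls is None:
        return f"Error: unknown fault type {action.get('type')!r}"
    return fault_cls().inject(action)
```

With a lookup table like this, adding a fault means registering one more class; the dispatch code itself does not change.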
…s, and event ordering

Modules moved/refactored:
- see base move commit (devops_bench/chaos/agent.py, devops_bench/chaos/faults/generate_load.py)

Bugs fixed vs legacy:
- ChaosAgent._run_async dropped the model's final text when a tool call landed on the last turn (or the turn cap): final_text was only assigned when there were no function calls. Now set final_text on every turn so an accompanying summary is never lost.
- _execute_tool raised AttributeError when the model returned non-dict tool args (str/list/None): args.get(...) was called unconditionally. Now guard with isinstance(args, dict) and return "Error: tool args must be an object"; the caller passes raw args so the guard fires.
- run_chaos_command raised IndexError on an empty command string (shlex.split -> [] -> run([])). Now short-circuit with "Error: command string is empty" before parsing.
- run_chaos_command set chaos_active_event BEFORE parsing, so a command that failed shlex.split still told the harness "load active". Now signal the event only after a successful parse, immediately before execution.
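
The argument guard, the empty-command short-circuit, and the event ordering fixed above might look something like the sketch below. It is written under assumptions: the 60-second timeout, the combined stdout/stderr return value, the parse-error message, the `event` parameter, and the free function `execute_tool` (standing in for `_execute_tool`) are illustrative and not taken from the PR.

```python
import shlex
import subprocess
import threading

# Stand-in for the harness-visible signal named in the PR.
chaos_active_event = threading.Event()


def run_chaos_command(command: str, event: threading.Event = chaos_active_event) -> str:
    """Run one shell-free command, signalling the event only once parsing succeeded."""
    if not command or not command.strip():
        return "Error: command string is empty"
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # for example, an unbalanced quote
        return f"Error: could not parse command: {exc}"
    if not argv:
        return "Error: command string is empty"
    # Signal "load active" after a successful parse, immediately before execution.
    event.set()
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"Error: {exc}"
    return result.stdout + result.stderr


def execute_tool(name: str, args: object) -> str:
    """Validate raw tool args before touching them (hypothetical free function)."""
    if not isinstance(args, dict):
        return "Error: tool args must be an object"
    if name != "run_command":
        return f"Error: unknown tool {name!r}"
    return run_chaos_command(str(args.get("command", "")))
```

Because the event is set after `shlex.split` returns, a malformed command never tells the harness that load is active.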

Improvements vs legacy:
- none (behavioral bug fixes only; further improvements land in the following feat(chaos) commit)
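
The final_text fix in this commit concerns the bounded tool-calling loop. The sketch below shows one reading of it, keeping the most recent non-empty text on every turn. The `Reply` type, the `generate` and `execute_tool` callables, and the default turn cap are hypothetical stand-ins; they are not the devops_bench.models LLMClient interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Reply:
    """Hypothetical stand-in for one model response."""

    text: str = ""
    function_calls: List[dict] = field(default_factory=list)


def run_loop(
    generate: Callable[[List[str]], Reply],
    execute_tool: Callable[[str, object], str],
    goal: str,
    max_turns: int = 5,
) -> str:
    """Drive the model for at most max_turns and return its last text."""
    history = [goal]
    final_text = ""
    for _ in range(max_turns):
        reply = generate(history)
        # Record text on every turn, not only on turns without tool calls,
        # so a summary sent alongside a tool call survives the turn cap.
        if reply.text:
            final_text = reply.text
        if not reply.function_calls:
            break
        for call in reply.function_calls:
            history.append(execute_tool(call["name"], call.get("args")))
    return final_text
```

Passing `execute_tool` in as a callable also mirrors the idea of keeping the loop independent of any single fault.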
…ndency injection

Modules moved/refactored:
- see base move commit (devops_bench/chaos/agent.py, devops_bench/chaos/faults/generate_load.py)

Bugs fixed vs legacy:
- none (fixes landed in the preceding fix(chaos) commit)

Improvements vs legacy:
- expand a leading ~ in each command token (os.path.expanduser) so model-emitted paths like ~/go/bin/fortio resolve under the shell-free argv executor instead of failing execvp; document that only single, non-piped commands are supported (no pipes/redirection/$VAR) in the run_command prompt and docstring.
- drive the fortio target URL from the spec: read target.service_url (rewritten by the harness to the local port-forward) via target_url_from_spec() with a single _DEFAULT_TARGET_URL fallback, and inject it into both the goal and the system instruction (build_system_instruction(target_url)), removing the hardcoded http://localhost:8080 from SYSTEM_INSTRUCTION and goal().
- ChaosAgent.__init__ now accepts optional system_instruction and tools (defaulting to the module constants), used throughout the loop, so the agent is reusable for other faults.
- decouple the orchestrator from the concrete fault: drop the top-level import of run_chaos_command and inject a tool_handler callable into the ctor (lazily defaulting to run_chaos_command); _execute_tool dispatches via self._tool_handler.
@pradeepvrd

Owner Author

Superseded by the reconciled cross-cutting refactor (see docs/refactor/e2e-refactor-sequencing-plan.md). Reworked into the layered devops_bench/ package on branch refactor/integration; replaced by the reworked component PRs and capstone #23. Closing as superseded.

pradeepvrd closed this on Jun 20, 2026