tools/stress: orchestrator skeleton (CLI, sweep, runlog, abort) #3776

Open
elitegreg wants to merge 1 commit into gm/sdk-user-crud-reconcile from gm/stress-orchestrator-skeleton

Conversation

@elitegreg
Contributor

Summary

Adds the device-stress orchestrator skeleton at tools/stress/device-orchestrator/, for the GRE Tunnel Capacity Study. Stacked on top of #3774 (part 1, SDK user CRUD). Part 2 of #3746. Closes #3771.

  • cmd/device-orchestrator: every flag from #3746's CLI list (--target-user-count, --users-per-batch, --hold-seconds, --dut-pubkey, --dut-ssh-host, --dut-ssh-key, --rpc-url, --program-id, --keypair, --controller, --abort-file, --working-dir, plus --client-ip-base, --tunnel-endpoint, --tenant-pubkey, --run-id, --log-level, --dry-run). Dumps orchestrator-config.json on start.
  • pkg/reconcile: PlanFor(current, target, ownerFilter) returns a deterministic Plan{ToCreate, ToDelete} delta. Lifted from the part-1 SDK PR per the discussion, since it's orchestrator policy, not an SDK primitive.
  • pkg/sweep: provision-then-reverse-deprovision loop driven by PlanFor; batches of --users-per-batch with --hold-seconds between batches; reverse-creation-order deprovision tracked by the sweep itself; emits submit | confirm | activate | deprovision_* runlog rows.
  • pkg/runlog: append-only JSONL writer for orchestrator-runlog.json with the row schema {run_id, user_index, user_pubkey, tunnel_id, event, t_ns, n_after_event}.
  • pkg/abort: ticker-based watcher of --abort-file; cancels a derived ctx so the sweep finishes the in-flight user before exiting non-zero, then still tears down what was created.
  • pkg/agent: Runner interface (Start(ctx) error; Events() <-chan Event) with a no-op implementation. The SSH-backed runner and the pre_commit_log / applied row generation land in part 3.
  • pkg/exec: Live impl of sweep.Executor wrapping serviceability.{Client, Executor}; picks deterministic per-user IPs (base + idx) and forwards DevicePubkey / TenantPubkey to UserCreateArgs.
  • Makefile mirrors tools/twamp/Makefile (build, test, lint).
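
For readers who have not seen the part-1 discussion, here is a rough sketch of what a deterministic PlanFor delta could look like. Only the PlanFor name, its three parameters, and the Plan{ToCreate, ToDelete} shape come from this PR; the User type, the field types, the sort-by-pubkey ordering, and the clamping of negative targets are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// User is a hypothetical stand-in for a serviceability user record.
type User struct {
	Pubkey string
	Owner  string
}

// Plan mirrors the Plan{ToCreate, ToDelete} shape; the field types are assumed.
type Plan struct {
	ToCreate int      // how many users to add
	ToDelete []string // pubkeys to remove, in sorted order
}

// PlanFor computes the delta between the users owned by ownerFilter and the
// target count. Users with a different owner are never touched.
func PlanFor(current []User, target int, ownerFilter string) Plan {
	if target < 0 {
		target = 0 // assumption: a negative target is treated as zero
	}
	owned := make([]string, 0, len(current))
	for _, u := range current {
		if u.Owner == ownerFilter {
			owned = append(owned, u.Pubkey)
		}
	}
	sort.Strings(owned) // tie-break by pubkey so the plan is deterministic

	switch {
	case len(owned) < target:
		return Plan{ToCreate: target - len(owned)}
	case len(owned) > target:
		return Plan{ToDelete: owned[target:]}
	}
	return Plan{}
}

func main() {
	current := []User{{"c", "me"}, {"a", "me"}, {"x", "other"}, {"b", "me"}}
	fmt.Printf("%+v\n", PlanFor(current, 1, "me")) // scale down: two deletions
	fmt.Printf("%+v\n", PlanFor(current, 5, "me")) // scale up: two creations
}
```

Because the function is pure, the table-driven cases listed under testing (0→N, N→0, foreign-only, and so on) need no fakes at all.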

Testing Verification

  • pkg/sweep: fake Executor + fake Clock + no-op Agent drive a 0→4 sweep in batches of 2. Asserts ordered submit/confirm/activate x4, reverse-order deprovision_submit/deprovision_confirm/deprovision_activate x4, Hold fires exactly once (between batches, not after reaching target), and n_after_event increments at activate / decrements at deprovision_activate.
  • pkg/sweep abort case: failing the 3rd create still drives deprovision over the first two users so the orchestrator never leaks state on abort.
  • pkg/abort: tempdir + touch the sentinel + assert the derived ctx cancels within 1s; empty-path watch is a no-op that still propagates parent cancellation.
  • pkg/runlog: round-trip rows, auto-fill t_ns, reject writes after Close, Open(path) truncates existing content.
  • pkg/reconcile: table-driven 0→N / N→0 / partial / foreign-only / mixed / negative / tie-break-by-pubkey.
  • Smoke test: make build produces bin/device-orchestrator; ./bin/device-orchestrator --dry-run --target-user-count 4 --users-per-batch 2 --working-dir /tmp/orch writes a valid orchestrator-config.json without contacting RPC.
  • make go-build go-lint go-test all green.

Out of scope

  • SSH-driven agent runner that parses the "Committing config session due to diffs detected: <diff>" log line and the commit-success line into pre_commit_log / applied events. Lands in part 3 of #3746.
  • Live-RPC end-to-end test in CI (the binary is exercisable manually with the dz-local devnet).

Adds tools/stress/device-orchestrator/, the device-stress orchestrator binary
for the GRE Tunnel Capacity Study. The binary parses every flag from #3746's
CLI list, dumps orchestrator-config.json on start, runs a provision-then-
reverse-deprovision sweep against a live serviceability program, and emits
the runlog row schema {run_id, user_index, user_pubkey, tunnel_id, event,
t_ns, n_after_event} for each submit | confirm | activate | deprovision_*
event.
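
As a reading aid, the provision-then-reverse-deprovision sweep can be sketched as below. The Executor and Clock interfaces here are simplified, hypothetical stand-ins (index-based Create/Delete, a single Hold method); the real sweep.Executor, runlog emission, and ctx handling are not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// Executor and Clock are simplified stand-ins for the sweep's interfaces.
type Executor interface {
	Create(idx int) error
	Delete(idx int) error
}

type Clock interface {
	Hold()
}

// Sweep provisions users 0..target-1 in batches, holds between batches (but
// not after the last one), then deprovisions everything it created in
// reverse creation order. Teardown runs even if a create fails.
func Sweep(ex Executor, clk Clock, target, batch int) error {
	if batch <= 0 {
		return errors.New("sweep: batch size must be positive")
	}
	var created []int
	var sweepErr error

provision:
	for start := 0; start < target; start += batch {
		end := start + batch
		if end > target {
			end = target
		}
		for idx := start; idx < end; idx++ {
			if err := ex.Create(idx); err != nil {
				sweepErr = fmt.Errorf("create user %d: %w", idx, err)
				break provision
			}
			created = append(created, idx)
		}
		if end < target {
			clk.Hold()
		}
	}

	for i := len(created) - 1; i >= 0; i-- {
		if err := ex.Delete(created[i]); err != nil && sweepErr == nil {
			sweepErr = fmt.Errorf("delete user %d: %w", created[i], err)
		}
	}
	return sweepErr
}

// fake records every call so the ordering can be inspected.
type fake struct {
	failAt int // index whose Create fails; -1 disables the failure
	trace  []string
}

func (f *fake) Create(idx int) error {
	if idx == f.failAt {
		return errors.New("boom")
	}
	f.trace = append(f.trace, fmt.Sprintf("create:%d", idx))
	return nil
}

func (f *fake) Delete(idx int) error {
	f.trace = append(f.trace, fmt.Sprintf("delete:%d", idx))
	return nil
}

func (f *fake) Hold() { f.trace = append(f.trace, "hold") }

func main() {
	f := &fake{failAt: -1}
	err := Sweep(f, f, 4, 2)
	fmt.Println(f.trace, err)
}
```

Tracking created indices inside the sweep, rather than re-reading state, is what makes reverse-order teardown possible after a partial failure.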

Packages:

- pkg/reconcile  — PlanFor() pure function (lifted from the part-1 SDK PR;
  now lives with the orchestrator as policy, not as an SDK primitive)
- pkg/runlog     — append-only JSONL writer for orchestrator-runlog.json
- pkg/sweep      — provision-then-deprovision loop driven by PlanFor; uses a
  Clock + Executor interface for testability; reverse-creation-order delete
- pkg/abort      — sentinel-file poller that cancels a derived ctx between
  user iterations so an in-flight Create/Delete completes before exit
- pkg/agent      — AgentRunner interface + noop impl; SSH runner lands in
  part 3 along with pre_commit_log / applied event emission
- pkg/exec       — Live impl of sweep.Executor over serviceability.{Client,
  Executor}; picks deterministic per-user IPs from --client-ip-base
- cmd/device-orchestrator — flag parsing, config dump, signal + abort
  handling, sweep wiring

The agent runner is stubbed behind an interface so this PR can land
end-to-end functionality (provision/deprovision + runlog + abort) without
the SSH plumbing. The SSH runner and the corresponding pre_commit_log /
applied row generation land in part 3 of #3746.

Part 2 of #3746. Closes #3771.