stress: implement tools/stress/device-orchestrator #3746

@elitegreg

Why

The orchestrator drives the sweep: it manages onchain user records on the live devnet ledger (alternate serviceability program), records timestamps end-to-end, and runs the agent on the DUT via SSH so that the agent log goes to a known location.

See Notion design for full context.

Depends on

Scope (single Go binary; PR may split — flag at review)

Create tools/stress/device-orchestrator/ (new Go module under the existing tools/ tree alongside gnmi-tunnel, twamp, etc.).

  • CLI flags (per doc): --target-user-count, --users-per-batch, --hold-seconds (default 180), --dut-pubkey, --dut-ssh-host, --dut-ssh-key, --rpc-url, --program-id, --keypair, --controller (IP:PORT, passed to agent), --abort-file, --working-dir.
  • Reconcile-to-target loop: on each batch iteration, query the alternate-program user list, create or delete to reach the next target. Use smartcontract/sdk/go/serviceability directly — do not shell out to the doublezero CLI.
  • Hold: pause --hold-seconds between batches for observer samples.
  • Deprovision phase: reverse-order delete, same timestamp set, distinct provision/deprovision event names.
  • Agent runner: SSH to the DUT, run doublezero-agent ... -verbose, and stream stdout/stderr into <working-dir>/orchestrator.agent.log. Parse "Committing config session due to diffs detected: <diff>" to extract the "+ interface Tunnel<ID>" entries and record t_pre_commit_log; parse the commit-success line (from the scope extension of #3741, "config agent: log config size in bytes") to record t_agent_applied.
  • Outputs in --working-dir:
    • orchestrator-config.json — all CLI options.
    • orchestrator-runlog.json — one row per event: {run_id, user_index, user_pubkey, tunnel_id, event, t_ns, n_after_event}. Events: submit | confirm | activate | pre_commit_log | applied | deprovision_*.
    • orchestrator.agent.log — tailed agent output.
  • Abort handling: poll --abort-file; on appearance, finish the current user, stop, dump partial outputs, exit non-zero.

Acceptance

  • Dry run against local devnet (cEOS dz1) sweeps 0 → 8 in batches of 2, hold 10 s, deprovisions cleanly. Output JSON validates.
  • All five event timestamps recorded for at least one user end-to-end.
  • Respects --abort-file within one batch iteration.
  • No doublezero CLI shellouts in the code.

Notes

If PR exceeds ~500 LOC, split into: (a) SDK-based user CRUD + reconcile-to-target; (b) sweep loop + JSON outputs + abort polling; (c) SSH + agent log parser. Mention this in the PR description per CLAUDE.md.

Tracker: #3744.
