stress: implement tools/stress/device-observer #3747

@elitegreg

Why

The device observer is an independent sampler that writes the breaking-point sentinel the orchestrator polls. Splitting it out of the orchestrator keeps abort logic out of the sweep path.

See Notion design for full context.

Scope

Create tools/stress/device-observer/.

  • CLI flags (per doc): --dut-host, --eapi-user, --eapi-pass, --agent-metrics-url, --sample-interval (default 10s), --abort-file, --working-dir.
  • Every --sample-interval (eAPI commands, JSON output): show hardware capacity | json, show gre tunnel static | json, show processes top once | json, show logging errors, show logging critical. One file per sample named <command>-<iso8601>.json.
  • Every --sample-interval: scrape --agent-metrics-url; append rows to observer.agent_metrics.json as {t_ns, metric_name, value, labels_json}.
  • Continuous tail: parse non-config-commit EOS log entries into observer.logging.json as {t_ns, severity, facility, message}. Also scan <working-dir>/orchestrator.agent.log for the abort-trigger strings below.
  • Abort triggers (write <working-dir>/abort once on first match; include reason):
    • Provision (t_submit → t_agent_applied) p95 over the active batch window > 30 s, or any single user > 30 s.
    • Deprovision p95 > 30 s.
    • Switch CPU > 80% sustained over any 60 s window.
    • doublezero_agent_apply_config_errors_total increments within a sample.
    • doublezero_agent_get_config_errors_total increments within a sample.
    • Agent log: "could not get diff because /usr/bin/Cli command timed out after 60 seconds".
    • Agent log: "not overriding lock since its age is less than" (any occurrence — already a clear race signal).
    • Agent log silent for > 15 s.
    • Ledger unreachable for > 30 s. Detect this either from a heartbeat the orchestrator writes and the observer reads, or from a direct RPC probe; choose one. The heartbeat approach is recommended because it avoids duplicating RPC config.
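
A minimal sketch of three of the triggers above (sustained CPU, counter increment, write-once sentinel), in Python for illustration only; the issue does not prescribe a language. The names `SustainedCpu`, `counter_incremented`, and `write_abort_once` are hypothetical. Two readings are assumed rather than taken from the design doc: "sustained" means every sample in a full 60 s window exceeds 80%, and the sentinel body is a small JSON object carrying the reason.

```python
import json
import os
from collections import deque


class SustainedCpu:
    """Fires when every CPU sample across a full trailing window exceeds the threshold.

    Assumption: "sustained" means no sample in the window is at or below the limit.
    """

    def __init__(self, threshold_pct=80.0, window_s=60.0):
        self.threshold_pct = threshold_pct
        self.window_s = window_s
        self.samples = deque()  # (t_s, cpu_pct), oldest first

    def add(self, t_s, cpu_pct):
        self.samples.append((t_s, cpu_pct))
        # Drop samples older than the window, but keep the one at (or just
        # before) the window start so full coverage can be established.
        while len(self.samples) >= 2 and self.samples[1][0] <= t_s - self.window_s:
            self.samples.popleft()
        covered = t_s - self.samples[0][0] >= self.window_s
        return covered and all(v > self.threshold_pct for _, v in self.samples)


def counter_incremented(prev, curr):
    """True when a counter grew between two consecutive samples.

    Assumption: the first sample (prev is None) and a counter reset
    (curr < prev, e.g. after an agent restart) do not count as increments.
    """
    return prev is not None and curr > prev


def write_abort_once(path, reason, t_ns):
    """Create the sentinel only if it does not already exist.

    Returns True if this call wrote the file, False if an earlier match did.
    O_EXCL makes the create-if-absent check atomic on a local filesystem.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump({"t_ns": t_ns, "reason": reason}, f)
        f.write("\n")
    return True
```

With a 10 s sample interval, `SustainedCpu.add` first returns True on the seventh consecutive over-threshold sample (t = 0 s through t = 60 s). One caveat of this sketch: a poller could observe the sentinel between creation and the write of its body, so the orchestrator should treat the file's existence, not its contents, as the abort signal.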

Acceptance

  • Sampling produces the expected files on a local cEOS dz1 in < 30 s; JSON validates.
  • Each abort trigger has a unit test using a fixture log / fake eAPI response.
  • Manual: cause a fake CPU spike (e.g., fixture-injected show processes JSON) and confirm sentinel is written within one sample interval.
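
One possible shape for the fixture-based unit tests, sketched in Python with the standard `unittest` module. The two trigger strings are copied from the Scope list; everything else is an assumption: the helper names `first_abort_reason` and `log_is_silent` are hypothetical, and the fixture log lines are invented for illustration and may not match the real agent log format.

```python
import unittest

# Trigger strings quoted from the abort-trigger list in this issue.
ABORT_SUBSTRINGS = (
    "could not get diff because /usr/bin/Cli command timed out after 60 seconds",
    "not overriding lock since its age is less than",
)


def first_abort_reason(lines):
    """Return the first trigger string found in the log lines, or None."""
    for line in lines:
        for needle in ABORT_SUBSTRINGS:
            if needle in line:
                return needle
    return None


def log_is_silent(last_line_t_s, now_s, limit_s=15.0):
    """True when no log line has arrived for strictly more than limit_s seconds."""
    return (now_s - last_line_t_s) > limit_s


# Hypothetical fixture lines; the real agent log layout may differ.
FIXTURE_CLEAN = [
    'level=INFO msg="config applied"',
    'level=INFO msg="no changes"',
]
FIXTURE_LOCK_RACE = FIXTURE_CLEAN + [
    'level=WARN msg="not overriding lock since its age is less than 30s"',
]
FIXTURE_CLI_TIMEOUT = FIXTURE_CLEAN + [
    'level=ERROR msg="could not get diff because /usr/bin/Cli command '
    'timed out after 60 seconds"',
]


class AbortTriggerTests(unittest.TestCase):
    def test_clean_log_does_not_trigger(self):
        self.assertIsNone(first_abort_reason(FIXTURE_CLEAN))

    def test_lock_race_triggers(self):
        self.assertEqual(first_abort_reason(FIXTURE_LOCK_RACE), ABORT_SUBSTRINGS[1])

    def test_cli_timeout_triggers(self):
        self.assertEqual(first_abort_reason(FIXTURE_CLI_TIMEOUT), ABORT_SUBSTRINGS[0])

    def test_silence_boundary(self):
        # Exactly 15 s of silence is still within the limit; anything longer trips.
        self.assertFalse(log_is_silent(last_line_t_s=100.0, now_s=115.0))
        self.assertTrue(log_is_silent(last_line_t_s=100.0, now_s=115.5))
```

Saved as a test module, this would run with `python -m unittest`. Keeping the detectors as pure functions over lines and timestamps is what lets the tests use in-memory fixtures with no live device or tailing loop.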

Tracker: #3744.
