Why
Independent sampler that writes the breaking-point sentinel the orchestrator polls. Splitting from the orchestrator keeps abort logic out of the sweep path.
See Notion design for full context.
Scope
Create tools/stress/device-observer/.
- CLI flags (per doc): --dut-host, --eapi-user, --eapi-pass, --agent-metrics-url, --sample-interval (default 10s), --abort-file, --working-dir.
- Every --sample-interval (eAPI commands, JSON output): show hardware capacity | json, show gre tunnel static | json, show processes top once | json, show logging errors, show logging critical. One file per sample named <command>-<iso8601>.json.
- Every --sample-interval: scrape --agent-metrics-url; append rows to observer.agent_metrics.json as {t_ns, metric_name, value, labels_json}.
- Continuous tail: parse non-config-commit EOS log entries into observer.logging.json as {t_ns, severity, facility, message}. Also scan <working-dir>/orchestrator.agent.log for the abort-trigger strings below.
- Abort triggers (write <working-dir>/abort once on first match; include reason):
  - Provision (t_submit → t_agent_applied) p95 over the active batch window > 30 s, or any single user > 30 s.
  - Deprovision p95 > 30 s.
  - Switch CPU > 80% sustained over any 60 s window.
  - doublezero_agent_apply_config_errors_total increments within a sample.
  - doublezero_agent_get_config_errors_total increments within a sample.
  - Agent log: "could not get diff because /usr/bin/Cli command timed out after 60 seconds".
  - Agent log: "not overriding lock since its age is less than" (any occurrence; this is already a clear race signal).
  - Agent log silent for > 15 s.
  - Ledger unreachable for > 30 s. Detect this either via a heartbeat that the orchestrator writes and the observer reads, or via a direct RPC probe. Choose one; the heartbeat approach is recommended because it avoids duplicating RPC config.
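
A few of the triggers above can be sketched as small pure functions plus a write-once sentinel. This is only an illustration, not the agreed implementation. The function names, the nearest-rank p95 definition, and the JSON shape of the sentinel ({t_ns, reason}) are assumptions; the issue only says the file is written once and includes a reason.

```python
import json
import os
import tempfile
from typing import Optional, Sequence

# Substrings quoted verbatim in the abort-trigger list.
ABORT_LOG_STRINGS = (
    "could not get diff because /usr/bin/Cli command timed out after 60 seconds",
    "not overriding lock since its age is less than",
)


def match_log_line(line: str) -> Optional[str]:
    """Return an abort reason if the agent log line contains a trigger string."""
    for needle in ABORT_LOG_STRINGS:
        if needle in line:
            return f"agent log: {needle}"
    return None


def counter_incremented(prev: Optional[float], curr: float) -> bool:
    """True when an error counter grew between two consecutive scrapes.

    The first scrape (prev is None) has nothing to compare against, and a
    lower value (e.g. after an agent restart) is not treated as an increment.
    Both choices are assumptions.
    """
    return prev is not None and curr > prev


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed definition; the issue names none)."""
    if not values:
        raise ValueError("p95 of an empty window is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def provision_breach(latencies_s: Sequence[float], limit_s: float = 30.0) -> bool:
    """p95 over the window > limit, or any single user > limit."""
    return bool(latencies_s) and (p95(latencies_s) > limit_s or max(latencies_s) > limit_s)


def write_abort_once(working_dir: str, reason: str, t_ns: int) -> bool:
    """Create <working-dir>/abort only if absent; return True if this call wrote it."""
    path = os.path.join(working_dir, "abort")
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so only the first matching trigger records its reason.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump({"t_ns": t_ns, "reason": reason}, f)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        reason = match_log_line("agent: not overriding lock since its age is less than 1m")
        if reason:
            write_abort_once(d, reason, t_ns=1)
```

Keeping the checks free of I/O means each one can be exercised with a fixture log line or a fake scrape value, and only write_abort_once touches the filesystem.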
Acceptance
- Sampling produces the expected files on a local cEOS dz1 in < 30 s; JSON validates.
- Each abort trigger has a unit test using a fixture log / fake eAPI response.
- Manual: cause a fake CPU spike (e.g., fixture-injected show processes JSON) and confirm the sentinel is written within one sample interval.
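
For the CPU trigger, one possible shape for a fixture-driven check is a small detector that is fed (timestamp, CPU %) pairs. This sketch assumes "sustained over any 60 s window" means every sample in a contiguous span of at least 60 s is above 80%. The class name is hypothetical, and extracting the CPU figure from the show processes top once | json response is left out because the issue does not give the response shape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedCpuDetector:
    """Fires once CPU has stayed above threshold_pct for at least window_s seconds."""

    threshold_pct: float = 80.0
    window_s: float = 60.0
    # Timestamp of the first sample in the current above-threshold run, if any.
    _above_since: Optional[float] = field(default=None, init=False, repr=False)

    def observe(self, t_s: float, cpu_pct: float) -> bool:
        """Record one sample; return True if the abort condition now holds."""
        if cpu_pct <= self.threshold_pct:
            # Any sample at or below the threshold ends the run.
            self._above_since = None
            return False
        if self._above_since is None:
            self._above_since = t_s
        return t_s - self._above_since >= self.window_s


if __name__ == "__main__":
    # Fixture: one sample every 10 s, all at 95% CPU.
    detector = SustainedCpuDetector()
    for t in range(0, 70, 10):
        if detector.observe(float(t), 95.0):
            print(f"abort condition reached at t={t}s")
```

Under this reading, a spike injected at the default 10 s interval trips the detector on the sample that completes the 60 s run, which is the point at which the sentinel would be expected.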
Tracker: #3744.