Merged
135 changes: 135 additions & 0 deletions docs/schedule-followup-observability.md
@@ -0,0 +1,135 @@
# `schedule_followup` observability

`schedule_followup` (see [schedule-followup.md](./schedule-followup.md)) creates
its cron jobs with `system: true`, which keeps them out of the user-facing
automation list (`GET /api/cron/jobs` filters `system` jobs out). That is the
right default — these are internal self-wake schedules, not automations a user
manages — but it leaves a blind spot during incidents like *"the agent promised
a followup and it never came back."*

The debug endpoint `GET /api/debug/system-cron-jobs` closes that gap: it lists
the system cron jobs (including `schedule_followup`-created followups) together
with each job's recent `RunRecord` history, and offers a system-only manual
fire.

## Authorization

The debug routes live under the gateway's **protected** router, so they inherit
`enforce_gateway_auth`:

- Requests from **loopback** (`127.0.0.1` / `::1`) pass without a token. On the
gateway host, `curl http://127.0.0.1:<port>/api/debug/system-cron-jobs` just
works.
- Any **non-loopback** request must carry a valid gateway token (the same token
used for the rest of the gateway API — `Authorization: Bearer <token>`, the
`x-garyx-token` header, or `?token=`). No token / wrong token → `401`.

This reuses the existing gateway auth token rather than introducing a separate
debug-token config surface. The endpoint is never exposed unauthenticated to
external callers.
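
The rule above can be sketched as a small decision function. This is only an illustrative model, not the actual `enforce_gateway_auth` implementation: the `Access` enum, the `check_access` name, and the way the token is passed in are all assumptions made for the example.

```rust
use std::net::IpAddr;

/// Outcome of the access check, mirroring the two cases described above.
#[derive(Debug, PartialEq, Eq)]
enum Access {
    Allowed,
    /// Would be rendered as HTTP 401.
    Unauthorized,
}

/// Hypothetical model of the rule: loopback peers pass without a token,
/// every other peer must present the configured gateway token.
fn check_access(peer: IpAddr, presented: Option<&str>, gateway_token: &str) -> Access {
    if peer.is_loopback() {
        return Access::Allowed;
    }
    match presented {
        Some(token) if token == gateway_token => Access::Allowed,
        // Missing token and wrong token are treated the same way.
        _ => Access::Unauthorized,
    }
}
```

In this sketch `check_access("10.0.0.5".parse().unwrap(), None, "s3cret")` yields `Access::Unauthorized`, while any call with a `127.0.0.1` or `::1` peer yields `Access::Allowed` regardless of the token.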

## `GET /api/debug/system-cron-jobs`

Lists every cron job with `system == true`, plus each job's recent runs.

### Query parameters

| Param | Type | Default | Notes |
|---|---|---|---|
| `thread_id` | string | — | Exact match on the job's `thread_id`. Empty / whitespace-only is ignored (returns all system jobs). |
| `since` | string | — | Lower bound on the job's `created_at`. Accepts a **unix-second timestamp** (an integer) or an **RFC3339** datetime. Empty / whitespace-only is ignored. Jobs created strictly before this instant are filtered out. A non-empty value that parses as neither form (or an integer outside the representable timestamp range) returns `400 invalid_since`; it is never silently treated as "no filter". |
| `runs_limit` | integer | `20` | Max recent `RunRecord`s attached per job, most-recent-first. |
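
As a rough illustration of how the two job filters combine, the following sketch applies them to a simplified job type. `JobStub`, `keep`, and the use of plain unix seconds for `created_at` are assumptions for the example; the real handler works on the cron service's own job type and timestamps.

```rust
/// Simplified stand-in for a system cron job; only the fields the filters read.
struct JobStub {
    thread_id: Option<String>,
    /// Unix seconds, a simplification of the real timestamp type.
    created_at: i64,
}

/// Returns true when the job survives both the `thread_id` and `since` filters.
fn keep(job: &JobStub, thread_id: Option<&str>, since: Option<i64>) -> bool {
    // An empty or whitespace-only thread_id means "no thread filter".
    let thread_filter = thread_id.map(str::trim).filter(|t| !t.is_empty());
    if let Some(tid) = thread_filter {
        if job.thread_id.as_deref() != Some(tid) {
            return false;
        }
    }
    // Inclusive lower bound: only jobs created strictly before `since` drop out.
    match since {
        Some(ts) => job.created_at >= ts,
        None => true,
    }
}
```

Note that a job whose `created_at` equals `since` exactly is kept, and that a blank `thread_id` never matches "jobs without a thread"; it simply disables the filter.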

### Response

```json
{
  "jobs": [
    {
      "id": "followup_4d7c5b8f12ab9e3a",
      "label": "schedule_followup(thread::abc)",
      "kind": {
        "type": "internal_dispatch",
        "reason": "background build finished",
        "originating_run_id": "run-...",
        "scheduled_at": "2026-05-29T07:25:00+00:00",
        "delay_seconds_requested": 300
      },
      "schedule": { "once": { "at": "2026-05-29T07:30:00+00:00" } },
      "thread_id": "thread::abc",
      "agent_id": null,
      "enabled": true,
      "system": true,
      "delete_after_run": true,
      "next_run": "2026-05-29T07:30:00+00:00",
      "last_status": "never_run",
      "run_count": 0,
      "created_at": "2026-05-29T07:25:00+00:00",
      "last_run_at": null,
      "recent_runs": [
        {
          "run_id": "...",
          "job_id": "followup_4d7c5b8f12ab9e3a",
          "status": "failed",
          "started_at": "2026-05-29T07:30:00+00:00",
          "finished_at": "2026-05-29T07:30:00+00:00",
          "duration_ms": 12,
          "thread_id": "thread::abc",
          "error": "thread not found"
        }
      ]
    }
  ],
  "count": 1,
  "thread_id": null,
  "since": null,
  "runs_limit": 20,
  "service_available": true
}
```

When the cron service is not running, the endpoint returns `200` with
`{"jobs": [], "count": 0, "service_available": false}` (mirroring
`GET /api/cron/jobs`), so a probe never 500s just because cron is disabled.

### Reading a "followup never fired" incident

1. Filter to the thread: `GET /api/debug/system-cron-jobs?thread_id=thread::abc`.
2. If the job is **absent**, either it was never created or it already fired
and self-deleted (`delete_after_run: true`). Check `GET /api/cron/runs` or the
gateway logs for its terminal `RunRecord`.
3. If the job is **present** with `last_status: never_run` and a future
`next_run`, it is still pending — the delay simply has not elapsed.
4. If `recent_runs` shows a `failed` record, the `error` field explains why the
dispatch did not reach the thread (e.g. the thread was deleted or had no
provider attached).
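
The triage steps above can be sketched as a small classifier over one job entry from the response. This is a hypothetical helper, not part of the gateway: `JobView`, `RunView`, `Diagnosis`, and `diagnose` are illustrative names, timestamps are reduced to unix seconds, and checking for a failed run before the pending case is a choice made for this sketch.

```rust
/// Simplified view of one `recent_runs` entry.
struct RunView {
    status: String,
    error: Option<String>,
}

/// Simplified view of one job entry; `next_run` is unix seconds here.
struct JobView {
    last_status: String,
    next_run: i64,
    recent_runs: Vec<RunView>,
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// Not listed: look at run history or gateway logs instead.
    Absent,
    /// Never ran yet and the scheduled time is still in the future.
    Pending,
    /// A recent run failed; carries the recorded error text.
    Failed(String),
    /// None of the documented cases matched; needs manual inspection.
    Inconclusive,
}

fn diagnose(job: Option<&JobView>, now: i64) -> Diagnosis {
    let Some(job) = job else {
        return Diagnosis::Absent;
    };
    // recent_runs is most-recent-first, so the first failed record is the latest failure.
    if let Some(run) = job.recent_runs.iter().find(|r| r.status == "failed") {
        return Diagnosis::Failed(run.error.clone().unwrap_or_default());
    }
    if job.last_status == "never_run" && job.next_run > now {
        return Diagnosis::Pending;
    }
    Diagnosis::Inconclusive
}
```

A job that is present, has never run, and whose `next_run` is already in the past falls into `Inconclusive` in this sketch, since the steps above do not name a cause for it.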

## `POST /api/debug/system-cron-jobs/{id}/run`

Manually fires a system cron job immediately — a system-only wrapper around
`CronService::run_now`. The debug channel must never be a back door to trigger
user-visible automations, so:

- Cron service **not running** → `503 service_unavailable`.
- A **missing** job → `404 not_found`.
- A job that exists but is **not** `system` → `404 not_found` (same shape as
missing; the debug channel does not enumerate or fire user automations).
- A system job that **cannot run right now** (disabled or already running) →
`409 not_runnable`.
- Otherwise → `200` with the resulting `RunRecord`:

```json
{ "ran": true, "run": { "run_id": "...", "job_id": "...", "status": "success", "...": "..." } }
```
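
The status mapping for this route can be summarized in a few lines. The sketch below is a simplified model of the handler's decision order, including the `503` it returns when the cron service is not running. `Candidate`, its `runnable` flag, and `classify` are hypothetical; in the real handler "not runnable" is observed as `CronService::run_now` returning nothing rather than as a flag on the job.

```rust
/// Hypothetical summary of a job as seen by the manual-fire route.
struct Candidate {
    system: bool,
    /// Stand-in for "enabled and not already running".
    runnable: bool,
}

/// Maps the situation to (HTTP status, short label).
/// `job` is `None` when no job with the requested id exists.
fn classify(cron_running: bool, job: Option<&Candidate>) -> (u16, &'static str) {
    if !cron_running {
        return (503, "service_unavailable");
    }
    match job {
        None => (404, "not_found"),
        // Non-system jobs are deliberately indistinguishable from missing ones.
        Some(j) if !j.system => (404, "not_found"),
        Some(j) if !j.runnable => (409, "not_runnable"),
        Some(_) => (200, "ran"),
    }
}
```

Keeping the non-system case on the same `404` as a missing job means a caller cannot use this route to probe which user automation ids exist.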

## Implementation notes

- `GET` reuses `CronService::list_all` (the unfiltered list — `list()` hides
system jobs) and `CronService::list_runs_for_job`. It is strictly read-only
and does not repair or mutate any cron state.
- `POST .../run` reuses `CronService::get` (to enforce the system-only guard)
and `CronService::run_now`.
- Handlers: `garyx-gateway/src/api.rs`
(`debug_system_cron_jobs` / `debug_run_system_cron_job`); routes registered in
`garyx-gateway/src/route_graph.rs` under `operations_routes()`.
- The default `GET /api/cron/jobs` and `GET /api/cron/runs` behavior is
unchanged — these debug routes are additive.
8 changes: 8 additions & 0 deletions docs/schedule-followup.md
@@ -126,6 +126,14 @@ followup-driven runs distinctly from organic user input:
the `AppState` reference is held weakly, no circular `Arc` is
formed between `AppState` and `CronService`.

## Observability

`schedule_followup` jobs are `system: true`, so they do not show up in the
user-facing automation list. To inspect them during an incident — list the
pending followups, see each job's `RunRecord` history, or manually fire one —
use the debug endpoint documented in
[schedule-followup-observability.md](./schedule-followup-observability.md).

## Backwards compatibility

The `CronJobConfig.system` field and the `CronJobKind::InternalDispatch`
230 changes: 230 additions & 0 deletions garyx-gateway/src/api.rs
@@ -2023,6 +2023,236 @@ pub async fn cron_runs(
}))
}

// ---------------------------------------------------------------------------
// GET /api/debug/system-cron-jobs
// ---------------------------------------------------------------------------
//
// Debug observability for system-managed cron jobs (AXON-692). The default
// user-facing `GET /api/cron/jobs` filters `system == true` jobs out, so
// `schedule_followup`-created followups are invisible there. When an incident
// like "agent promised a followup but it never fired" needs triage, SREs /
// developers reach for this endpoint to see the pending system jobs and each
// job's recent RunRecord history.
//
// Auth: registered under the protected router, so `enforce_gateway_auth`
// already gates it — loopback requests pass, everything else needs a valid
// gateway token. It reuses the existing gateway token rather than introducing
// a separate debug-token config surface. It is never exposed unauthenticated
// to non-loopback callers.

/// Default number of recent RunRecords attached to each job.
fn default_debug_runs_limit() -> usize {
    20
}

#[derive(Deserialize)]
pub struct DebugSystemCronParams {
    /// Optional thread filter. Matches `CronJob.thread_id` exactly. An empty
    /// or whitespace-only value is ignored (returns all system jobs) rather
    /// than matching jobs whose `thread_id` is unset.
    #[serde(default)]
    pub thread_id: Option<String>,
    /// Optional lower bound on job `created_at`. Accepts either a unix-second
    /// timestamp (all digits) or an RFC3339 datetime. Jobs created strictly
    /// before this instant are filtered out. A value that parses as neither
    /// form yields `400`, never a silent full list.
    #[serde(default)]
    pub since: Option<String>,
    /// Max recent RunRecords attached per job (most-recent-first).
    #[serde(default = "default_debug_runs_limit")]
    pub runs_limit: usize,
}

/// Parse a `since` query value as a unix-second timestamp or RFC3339 datetime.
fn parse_since(raw: &str) -> Option<chrono::DateTime<Utc>> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return None;
    }
    if let Ok(secs) = trimmed.parse::<i64>() {
        return chrono::DateTime::from_timestamp(secs, 0);
    }
    chrono::DateTime::parse_from_rfc3339(trimmed)
        .ok()
        .map(|dt| dt.with_timezone(&Utc))
}

/// Render a single system cron job (plus its recent runs) into the debug shape.
fn debug_job_json(job: &crate::cron::CronJob, recent_runs: Vec<Value>) -> Value {
    let kind = match &job.kind {
        garyx_models::config::CronJobKind::AutomationPrompt => {
            json!({ "type": "automation_prompt" })
        }
        garyx_models::config::CronJobKind::InternalDispatch { payload } => json!({
            "type": "internal_dispatch",
            "reason": payload.reason,
            "originating_run_id": payload.originating_run_id,
            "scheduled_at": payload.scheduled_at.to_rfc3339(),
            "delay_seconds_requested": payload.delay_seconds_requested,
        }),
    };
    json!({
        "id": job.id,
        "label": job.label,
        "kind": kind,
        "schedule": job.schedule,
        "thread_id": job.thread_id,
        "agent_id": job.agent_id,
        "enabled": job.enabled,
        "system": job.system,
        "delete_after_run": job.delete_after_run,
        "next_run": job.next_run.to_rfc3339(),
        "last_status": job.last_status,
        "run_count": job.run_count,
        "created_at": job.created_at.to_rfc3339(),
        "last_run_at": job.last_run_at.map(|t| t.to_rfc3339()),
        "recent_runs": recent_runs,
    })
}

/// Render a RunRecord into JSON (mirrors the `cron_runs` shape, adds thread_id).
fn debug_run_json(r: &crate::cron::RunRecord) -> Value {
    json!({
        "run_id": r.run_id,
        "job_id": r.job_id,
        "status": r.status,
        "started_at": r.started_at.to_rfc3339(),
        "finished_at": r.finished_at.map(|t| t.to_rfc3339()),
        "duration_ms": r.duration_ms,
        "thread_id": r.thread_id,
        "error": r.error,
    })
}

/// GET /api/debug/system-cron-jobs - list system cron jobs + RunRecord history.
pub async fn debug_system_cron_jobs(
    State(state): State<Arc<AppState>>,
    Query(params): Query<DebugSystemCronParams>,
) -> impl IntoResponse {
    let cron = match &state.ops.cron_service {
        Some(svc) => svc,
        None => {
            return Json(json!({
                "jobs": [],
                "count": 0,
                "service_available": false,
            }))
            .into_response();
        }
    };

    // Parse `since` up front so a bad value fails loudly instead of returning
    // an unfiltered list that an SRE might misread as "no jobs since X".
    let since = match params.since.as_deref().map(str::trim) {
        Some(raw) if !raw.is_empty() => match parse_since(raw) {
            Some(ts) => Some(ts),
            None => {
                return (
                    StatusCode::BAD_REQUEST,
                    Json(json!({
                        "error": "invalid_since",
                        "message": "since must be a unix-second timestamp or an RFC3339 datetime",
                        "got": raw,
                    })),
                )
                    .into_response();
            }
        },
        _ => None,
    };

    let thread_filter = params
        .thread_id
        .as_deref()
        .map(str::trim)
        .filter(|value| !value.is_empty());

    let mut jobs: Vec<Value> = Vec::new();
    for job in cron.list_all().await.into_iter().filter(|j| j.system) {
        if let Some(tid) = thread_filter
            && job.thread_id.as_deref() != Some(tid)
        {
            continue;
        }
        if let Some(since_ts) = since
            && job.created_at < since_ts
        {
            continue;
        }
        let recent_runs: Vec<Value> = cron
            .list_runs_for_job(&job.id, params.runs_limit, 0)
            .await
            .iter()
            .map(debug_run_json)
            .collect();
        jobs.push(debug_job_json(&job, recent_runs));
    }

    Json(json!({
        "jobs": jobs,
        "count": jobs.len(),
        "thread_id": thread_filter,
        "since": since.map(|t| t.to_rfc3339()),
        "runs_limit": params.runs_limit,
        "service_available": true,
    }))
    .into_response()
}

/// POST /api/debug/system-cron-jobs/{id}/run - manually fire a system cron job.
///
/// System-only wrapper around `CronService::run_now` (AXON-692 goal #3): the
/// debug channel must never be a back door to trigger user-visible automations,
/// so a non-system job (or a missing one) returns `404`. A job that exists but
/// can't run right now (disabled / already running) returns `409`.
pub async fn debug_run_system_cron_job(
    State(state): State<Arc<AppState>>,
    Path(id): Path<String>,
) -> impl IntoResponse {
    let cron = match &state.ops.cron_service {
        Some(svc) => svc,
        None => {
            return (
                StatusCode::SERVICE_UNAVAILABLE,
                Json(json!({
                    "error": "service_unavailable",
                    "message": "cron service is not running",
                })),
            )
                .into_response();
        }
    };

    match cron.get(&id).await {
        // Hide non-system jobs behind the same 404 as a missing one: the debug
        // channel only fires system jobs and must not enumerate user automations.
        None => (
            StatusCode::NOT_FOUND,
            Json(json!({ "error": "not_found", "message": "no such system cron job", "id": id })),
        )
            .into_response(),
        Some(job) if !job.system => (
            StatusCode::NOT_FOUND,
            Json(json!({ "error": "not_found", "message": "no such system cron job", "id": id })),
        )
            .into_response(),
        Some(_) => match cron.run_now(&id).await {
            Some(record) => Json(json!({
                "ran": true,
                "run": debug_run_json(&record),
            }))
            .into_response(),
            None => (
                StatusCode::CONFLICT,
                Json(json!({
                    "error": "not_runnable",
                    "message": "job is disabled or already running",
                    "id": id,
                })),
            )
                .into_response(),
        },
    }
}

// ---------------------------------------------------------------------------
// PUT /api/settings
// ---------------------------------------------------------------------------