deepti-96/WatchDog

WatchDog

Release regression detection for engineering teams.

WatchDog answers one question after every deploy: did this release break something? It correlates a deploy event with post-deploy changes in error rate, latency, and log signatures, then saves a triage-ready incident with evidence, notes, exports, and an explanation. The hosted Vercel demo also runs an evidence-bounded triage agent automatically after each deploy event and persists the agent report to Supabase.

The intended user is an engineer, SRE, or engineering manager who needs to understand whether a release caused customer-facing risk without reading raw metrics and logs first.

Why this is a strong Rust project

  • Solves a real production problem that every backend team understands
  • Uses Rust for a low-overhead, always-on streaming process
  • Demonstrates event correlation, rolling baselines, anomaly detection, and alerting
  • Produces measurable benchmark output instead of vague claims

What it does

  • Watches a metrics stream from a JSONL source in the MVP
  • Accepts deploy notifications from a CLI command or deploy script hook
  • Builds a rolling baseline from recent pre-deploy samples
  • Runs CUSUM change detection on error rate and latency
  • Attributes suspicious shifts and repeated new error signatures to a specific deploy
  • Persists incident records with status, notes, explanation cache, and export endpoints
  • Serves a product dashboard for incident review and deploy-event demos
  • Runs autonomous deploy triage in the hosted demo: detect, explain, recommend, and store the audit trail
  • Emits a human-readable verdict to stdout, webhook, and the dashboard
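The rolling baseline step in the list above can be pictured as a fixed-capacity window over recent samples. This is a minimal sketch under stated assumptions, not WatchDog's actual implementation: the `RollingBaseline` type and its method names are hypothetical, and the baseline is reduced to a plain mean of one metric.

```rust
use std::collections::VecDeque;

/// Fixed-capacity window over the most recent values of one metric.
/// `RollingBaseline` is a hypothetical name used for illustration only.
pub struct RollingBaseline {
    capacity: usize,
    samples: VecDeque<f64>,
}

impl RollingBaseline {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            capacity,
            samples: VecDeque::with_capacity(capacity),
        }
    }

    /// Add a sample, evicting the oldest one once the window is full.
    pub fn push(&mut self, value: f64) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    /// Mean of the window, or `None` if no samples have arrived yet.
    pub fn mean(&self) -> Option<f64> {
        if self.samples.is_empty() {
            return None;
        }
        Some(self.samples.iter().sum::<f64>() / self.samples.len() as f64)
    }
}
```

When a deploy event arrives, the value returned by `mean()` would be copied out once and held fixed, so that post-deploy samples cannot drift the reference they are compared against.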

Flow

flowchart TD
    A["Metric samples arrive"] --> B["Rolling baseline buffer"]
    C["Deploy event arrives"] --> D["Snapshot pre-deploy baseline"]
    D --> E["Open post-deploy monitoring window"]
    B --> E
    E --> F["Run CUSUM on error rate and latency"]
    F --> G{"Regression detected?"}
    G -- "No" --> H["Keep monitoring"]
    G -- "Yes" --> I["Correlate metric shift to deploy timestamp"]
    I --> J["Generate plain-English verdict"]
    J --> K["Generate explanation and triage report"]
    K --> L["Persist incident and audit trail"]
    L --> M["Dashboard triage, notes, status, exports"]

Quick start

Create a ready-to-demo bad deploy incident:

cargo run -- demo
cargo run -- serve --state-dir .watchdog-demo --port 3001

Open http://127.0.0.1:3001, then use the dashboard to:

  • Run a checkout or payments deploy regression scenario from the sidebar
  • Select the saved incident from history
  • Generate or refresh the explanation
  • Add investigation notes and mark the incident resolved
  • Export Markdown or JSON for handoff

For hosted demos, see deploy/README.md:

  • Vercel hosted product demo from vercel-demo/ with serverless APIs and Supabase persistence
  • Dockerized Rust dashboard for Render, Railway, Fly.io, or any Docker host
  • Deployment notes for persistent state and lightweight explanations

The live dashboard exposes GET /healthz for deployment checks. Environment examples are in .env.example.

Use SQLite-backed demo storage locally:

WATCHDOG_STORAGE=sqlite \
WATCHDOG_DATABASE_URL=.watchdog-demo/watchdog.sqlite \
WATCHDOG_EXPLAINER=local \
cargo run -- serve --state-dir .watchdog-demo --port 3001

Use Supabase-backed storage:

WATCHDOG_STORAGE=supabase \
SUPABASE_URL=https://your-project.supabase.co \
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key \
WATCHDOG_EXPLAINER=local \
cargo run -- serve --state-dir .watchdog-demo --port 3001

Create the Supabase table first:

create table if not exists incidents (
  id text primary key,
  created_at timestamptz not null,
  severity text not null,
  status text not null default 'open',
  deploy_id text not null,
  environment text not null,
  summary text not null,
  incident_json jsonb not null,
  updated_at timestamptz not null default now()
);

create index if not exists idx_incidents_created_at
  on incidents (created_at desc);

create index if not exists idx_incidents_status
  on incidents (status);

create index if not exists idx_incidents_deploy_id
  on incidents (deploy_id);

Run the streaming synthetic bad deploy demo:

cargo run -- simulate --state-dir .WatchDog --deploy v1.4.2 --bad-deploy
cargo run -- run --state-dir .WatchDog

Run with a JSON config file:

cargo run -- run --state-dir .WatchDog --config watchdog.config.json

Example config:

{
  "baseline_capacity": 120,
  "monitoring_window_secs": 300,
  "log_file": ".WatchDog/app.log",
  "webhook_url": "https://hooks.example.test/watchdog",
  "detector": {
    "error_threshold": 0.08,
    "error_drift": 0.002,
    "latency_threshold": 120.0,
    "latency_drift": 5.0
  }
}

CLI flags such as --log-file, --monitoring-window-secs, and --webhook-url override config file values.
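The `detector` fields in the example config read like the two parameters of a one-sided CUSUM test: a drift (slack) subtracted from every deviation and a threshold on the accumulated sum. The sketch below shows that textbook form. It is an assumption that WatchDog maps `error_threshold` and `error_drift` onto these roles in exactly this way, and the `Cusum` type is hypothetical.

```rust
/// One-sided (upward) CUSUM detector for a single metric.
/// Hypothetical type: a sketch of the standard technique, not WatchDog's code.
pub struct Cusum {
    baseline: f64,  // pre-deploy reference level
    drift: f64,     // slack subtracted from each deviation
    threshold: f64, // alarm level for the cumulative sum
    sum: f64,
}

impl Cusum {
    pub fn new(baseline: f64, drift: f64, threshold: f64) -> Self {
        Self { baseline, drift, threshold, sum: 0.0 }
    }

    /// Feed one post-deploy sample. Returns `true` once the accumulated
    /// excess over `baseline + drift` is above `threshold`.
    pub fn update(&mut self, value: f64) -> bool {
        // The sum is clamped at zero so healthy samples cannot build up credit.
        self.sum = (self.sum + value - self.baseline - self.drift).max(0.0);
        self.sum > self.threshold
    }
}
```

With a baseline error rate of 0.02, a drift of 0.002, and a threshold of 0.08, samples at the baseline keep the sum at zero, while a sustained error rate of 0.06 adds about 0.038 per sample and raises the alarm on the third one. A lower threshold detects sooner at the cost of more false alarms.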

Slack incoming webhook URLs get a richer alert payload with Block Kit sections for the regression summary, metric deltas, dominant error signature, and timeline. Other webhook URLs receive the plain text alert body.
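One simple way to choose between the two payload styles is to inspect the webhook URL. The sketch below assumes the choice is made from the `https://hooks.slack.com/` prefix that Slack incoming webhooks use. How WatchDog actually recognizes Slack URLs is not stated here, and `AlertFormat` and `alert_format` are hypothetical names.

```rust
/// Payload style for an outgoing alert. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub enum AlertFormat {
    /// Block Kit sections for Slack incoming webhooks.
    SlackBlockKit,
    /// Plain text alert body for every other endpoint.
    PlainText,
}

/// Pick a payload style from the webhook URL.
/// Assumption: Slack is recognized by its incoming-webhook host prefix.
pub fn alert_format(webhook_url: &str) -> AlertFormat {
    if webhook_url.starts_with("https://hooks.slack.com/") {
        AlertFormat::SlackBlockKit
    } else {
        AlertFormat::PlainText
    }
}
```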

Incident explanations

The dashboard can explain an incident with Ollama or a built-in lightweight explainer. By default, WATCHDOG_EXPLAINER=auto tries Ollama first and falls back to the local explainer if Ollama is not running, which keeps the demo flow reliable.

# Always use the built-in lightweight explainer
WATCHDOG_EXPLAINER=local cargo run -- serve --state-dir .watchdog-demo --port 3001

# Require Ollama instead of falling back locally
WATCHDOG_EXPLAINER=ollama WATCHDOG_OLLAMA_MODEL=gemma3 cargo run -- serve --state-dir .watchdog-demo --port 3001

The local explainer uses the captured incident evidence only: deploy timing, metric deltas, dominant error signature, request rate, and baseline comparison.
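The three `WATCHDOG_EXPLAINER` modes described above could be resolved roughly as follows. This is a sketch only: the function and type names are hypothetical, the two backends are stand-in closures rather than real Ollama or explainer calls, and rejecting unknown values is an assumption about behavior this README does not specify.

```rust
/// Explainer selection, mirroring the documented values of WATCHDOG_EXPLAINER.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ExplainerMode {
    Auto,
    Local,
    Ollama,
}

/// Parse the raw environment value. An unset or empty value means `auto`.
/// Assumption: unknown values are treated as errors.
pub fn parse_mode(raw: Option<&str>) -> Result<ExplainerMode, String> {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        None | Some("") | Some("auto") => Ok(ExplainerMode::Auto),
        Some("local") => Ok(ExplainerMode::Local),
        Some("ollama") => Ok(ExplainerMode::Ollama),
        Some(other) => Err(format!("unknown WATCHDOG_EXPLAINER value: {other}")),
    }
}

/// Produce an explanation using the selected mode.
/// `ollama` and `local` are placeholders for the real backends.
pub fn explain<O, L>(mode: ExplainerMode, ollama: O, local: L) -> Result<String, String>
where
    O: FnOnce() -> Result<String, String>,
    L: FnOnce() -> String,
{
    match mode {
        ExplainerMode::Local => Ok(local()),
        // Strict mode: surface the Ollama error instead of falling back.
        ExplainerMode::Ollama => ollama(),
        // Auto mode: try Ollama first, fall back to the local explainer.
        ExplainerMode::Auto => Ok(ollama().unwrap_or_else(|_| local())),
    }
}
```

In a real binary the raw value would come from `std::env::var("WATCHDOG_EXPLAINER").ok()`, passed to `parse_mode` as `raw.as_deref()`.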

Real vs simulated

  • Real: Rust detection engine, CUSUM metric shift detection, deploy correlation, log signature extraction, Supabase/SQLite/JSON incident persistence, Vercel serverless APIs, notes/status updates, exports, health endpoint, explanation caching, and persisted triage-agent reports.
  • Simulated for demo: JSONL metrics, deploy-event source, and log lines generated by cargo run -- demo, cargo run -- simulate, or the hosted dashboard deploy buttons.
  • Replaceable in production: JSONL ingestion can be swapped for Prometheus/OpenTelemetry/webhook ingestion while keeping the detection, storage, and triage workflow.
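The "replaceable in production" point above amounts to keeping ingestion behind a narrow interface. A minimal sketch of that idea, assuming a hypothetical `MetricSource` trait (WatchDog's real abstraction may look different), with an in-memory source standing in for JSONL, Prometheus, or OpenTelemetry readers:

```rust
/// One metric observation, using the field names from the demo data format.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricSample {
    pub timestamp: String,
    pub error_rate: f64,
    pub p95_latency_ms: f64,
    pub request_rate: f64,
}

/// Hypothetical ingestion interface: anything that can yield samples in order.
pub trait MetricSource {
    /// Next sample, or `None` when the source is exhausted.
    fn next_sample(&mut self) -> Option<MetricSample>;
}

/// Stand-in source that replays samples held in memory.
pub struct ReplaySource {
    samples: std::vec::IntoIter<MetricSample>,
}

impl ReplaySource {
    pub fn new(samples: Vec<MetricSample>) -> Self {
        Self { samples: samples.into_iter() }
    }
}

impl MetricSource for ReplaySource {
    fn next_sample(&mut self) -> Option<MetricSample> {
        self.samples.next()
    }
}

/// Example consumer that depends only on the trait, not on the source type.
pub fn peak_error_rate(source: &mut dyn MetricSource) -> Option<f64> {
    let mut peak: Option<f64> = None;
    while let Some(sample) = source.next_sample() {
        peak = Some(peak.map_or(sample.error_rate, |p| p.max(sample.error_rate)));
    }
    peak
}
```

Because `peak_error_rate` takes `&mut dyn MetricSource`, a different ingestion backend only has to implement `next_sample`; the consumer stays unchanged.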

Hosted demo architecture

flowchart LR
    A["Demo user<br/>deploys checkout/payments"] --> B["Vercel frontend<br/>vercel-demo/index.html"]
    B --> C["POST /api/deployments/start<br/>Vercel serverless API"]
    C --> D["WatchDog deploy monitor<br/>baseline vs new release"]
    D --> E["Evidence explanation<br/>deterministic incident summary"]
    E --> F["Triage agent<br/>confidence, action, limits"]
    F --> G["Supabase Postgres<br/>incidents.incident_json"]
    G --> H["Dashboard detail<br/>history, notes, status, audit trail"]
    H --> B

The hosted demo is autonomous after the deploy event: it detects the regression, generates the explanation, runs the triage agent, and stores the incident in Supabase. It deliberately does not auto-rollback production; rollback remains a human-approved action.

Rust service architecture

flowchart TD
    A["Deploy event<br/>CLI, deploy hook, or hosted scenario"] --> B["WatchDog Engine"]
    C["Metric samples<br/>JSONL demo stream"] --> B
    D["Log events<br/>app.log / JSON lines"] --> B

    B --> E["Rolling baseline buffer"]
    B --> F["CUSUM detector"]
    B --> G["Error signature extractor"]

    E --> H["Regression verdict"]
    F --> H
    G --> H

    H --> I["Storage adapter"]
    I --> J["Supabase Postgres<br/>hosted cloud demo"]
    I --> K["SQLite DB<br/>local/Docker demo"]
    I --> P["Incident JSON files<br/>fallback mode"]

    J --> L["Axum dashboard API"]
    K --> L
    P --> L
    L --> M["Web console<br/>history, detail, notes, status"]
    L --> N["Explain Incident<br/>local explainer or Ollama"]
    L --> O["Exports<br/>Markdown / JSON"]
    N --> I
    M --> I

Record a real deploy event:

cargo run -- notify --state-dir .WatchDog --deploy v1.4.2 --environment production

Run benchmark scenarios:

cargo run -- benchmark --trials 100

Example benchmark output

WatchDog benchmark summary
trials: 100
healthy false positives: 0
bad deploys detected: 100
bad deploys missed: 0
average detection latency: 4.00s
best detection latency: 4s
worst detection latency: 4s

This benchmark is deterministic and scoped to the built-in synthetic scenarios. It is a repo quality signal, not a universal production guarantee.
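The detection figures in a summary like the one above can be derived from per-trial outcomes. The sketch below is illustrative only: `BenchmarkSummary` and `summarize` are hypothetical names, each bad-deploy trial is assumed to report either a detection latency in whole seconds or a miss, and healthy-trial false positives are left out.

```rust
/// Aggregate of the bad-deploy trials in one benchmark run.
/// Field names are illustrative, not WatchDog's actual types.
#[derive(Debug, PartialEq)]
pub struct BenchmarkSummary {
    pub trials: usize,
    pub detected: usize,
    pub missed: usize,
    pub average_latency_secs: Option<f64>,
    pub best_latency_secs: Option<u64>,
    pub worst_latency_secs: Option<u64>,
}

/// Each entry is `Some(seconds)` for a detected bad deploy, `None` for a miss.
pub fn summarize(outcomes: &[Option<u64>]) -> BenchmarkSummary {
    // Keep only the latencies of trials that were detected.
    let latencies: Vec<u64> = outcomes.iter().flatten().copied().collect();
    let detected = latencies.len();
    let average = if detected == 0 {
        None
    } else {
        Some(latencies.iter().sum::<u64>() as f64 / detected as f64)
    };
    BenchmarkSummary {
        trials: outcomes.len(),
        detected,
        missed: outcomes.len() - detected,
        average_latency_secs: average,
        best_latency_secs: latencies.iter().copied().min(),
        worst_latency_secs: latencies.iter().copied().max(),
    }
}
```

The latency fields are `Option`s so that a run with no detections reports no latency rather than a misleading zero.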

Demo data format

WatchDog reads and writes its demo data inside the state directory, as JSONL files plus an optional SQLite database:

  • metrics.jsonl
  • deploy-events.jsonl
  • watchdog.sqlite when WATCHDOG_STORAGE=sqlite

Example metric sample:

{"timestamp":"2026-03-30T19:30:00Z","error_rate":0.02,"p95_latency_ms":190.0,"request_rate":1200.0}
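To make the sample format concrete, the sketch below pulls the four fields out of one such line using only the standard library. It is deliberately naive: it assumes compact, flat JSON exactly like the example above (no whitespace around colons, no nesting, no escaped quotes), and the helper names are hypothetical. A real reader would more likely use a JSON library.

```rust
/// One parsed line of metrics.jsonl, using the field names from the example.
#[derive(Debug, PartialEq)]
pub struct MetricSample {
    pub timestamp: String,
    pub error_rate: f64,
    pub p95_latency_ms: f64,
    pub request_rate: f64,
}

/// Extract a numeric field from a compact, flat JSON object.
fn number_field(line: &str, key: &str) -> Option<f64> {
    let needle = format!("\"{key}\":");
    let start = line.find(&needle)? + needle.len();
    let rest = &line[start..];
    // The value runs until the next comma or the closing brace.
    let end = rest.find(|c: char| c == ',' || c == '}')?;
    rest[..end].trim().parse().ok()
}

/// Extract a string field that contains no escaped quotes.
fn string_field(line: &str, key: &str) -> Option<String> {
    let needle = format!("\"{key}\":\"");
    let start = line.find(&needle)? + needle.len();
    let rest = &line[start..];
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

/// Parse one JSONL line, returning `None` if any field is missing or malformed.
pub fn parse_sample(line: &str) -> Option<MetricSample> {
    Some(MetricSample {
        timestamp: string_field(line, "timestamp")?,
        error_rate: number_field(line, "error_rate")?,
        p95_latency_ms: number_field(line, "p95_latency_ms")?,
        request_rate: number_field(line, "request_rate")?,
    })
}
```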

Integration example

A tiny deploy hook is included at examples/deploy.sh. It shows how a deploy pipeline can notify WatchDog with one line.

What to build next

  • Prometheus or OpenTelemetry metrics ingestion
  • Database-backed multi-tenant storage
  • GitHub Actions or deploy-platform integration for automatic deploy notifications

About

Rust daemon that detects deployment-linked regressions by correlating deploy events with shifts in error rate and latency.
