Agent Factory

Agent Factory is a streaming traffic analysis engine that detects bugs in production, confirms they're real, and replicates them — grounded in actual request/response data from the Speedscale forwarder.

How it works

flowchart LR
    apps([Services]) --> fwd[Speedscale Forwarder]
    fwd -- "OTLP gRPC\n(EXPORTERS fan-out)" --> intake[intake-api\nOTLP receiver :4317]
    intake --> buf[Per-service buffer\n60s tumbling windows]
    buf --> analyze[Signal detection\nbaseline comparison]
    analyze --> archive[(Findings archive\nDO Spaces / S3)]
    analyze --> run[AgentRun queue\nreproduce + confirm]
    run --> worker[Worker\nproxymock replay]
    worker --> ticket[Linear ticket\nwith evidence]

    style apps fill:#6e7681,stroke:#6e7681,color:#fff
    style fwd fill:#1f6feb,stroke:#1f6feb,color:#fff
    style intake fill:#1a7f37,stroke:#1a7f37,color:#fff
    style buf fill:#1a7f37,stroke:#1a7f37,color:#fff
    style analyze fill:#1a7f37,stroke:#1a7f37,color:#fff
    style archive fill:#8957e5,stroke:#8957e5,color:#fff
    style run fill:#1a7f37,stroke:#1a7f37,color:#fff
    style worker fill:#1a7f37,stroke:#1a7f37,color:#fff
    style ticket fill:#8957e5,stroke:#8957e5,color:#fff

The forwarder already captures all traffic (RRPairs) from instrumented services and streams it as OTLP log records. Agent Factory registers as another OTLP destination via the forwarder's EXPORTERS env var — no snapshot creation, no cloud round-trip, no batch processing.
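The per-service 60-second tumbling windows in the diagram above can be sketched roughly as follows. This is an illustrative sketch, not the intake-api implementation: the `TrafficRecord` shape, the `ServiceWindowBuffer` class, and the `closeExpired` method are hypothetical names, and real OTLP log records carry many more fields.

```typescript
// Hypothetical, simplified record shape (an assumption, not the real OTLP schema).
interface TrafficRecord {
  service: string;
  timestampMs: number;
  endpoint: string;
  status: number;
  latencyMs: number;
}

interface ClosedWindow {
  service: string;
  startMs: number;
  records: TrafficRecord[];
}

const WINDOW_MS = 60_000; // 60-second tumbling windows

// Start of the window containing a timestamp. Tumbling windows never overlap,
// so every record belongs to exactly one window.
function windowStart(timestampMs: number): number {
  return Math.floor(timestampMs / WINDOW_MS) * WINDOW_MS;
}

class ServiceWindowBuffer {
  // service name -> window start -> buffered records
  private readonly open = new Map<string, Map<number, TrafficRecord[]>>();

  add(record: TrafficRecord): void {
    const startMs = windowStart(record.timestampMs);
    let windows = this.open.get(record.service);
    if (!windows) {
      windows = new Map<number, TrafficRecord[]>();
      this.open.set(record.service, windows);
    }
    const bucket = windows.get(startMs) ?? [];
    bucket.push(record);
    windows.set(startMs, bucket);
  }

  // Remove and return every window whose 60 seconds have fully elapsed at nowMs.
  closeExpired(nowMs: number): ClosedWindow[] {
    const closed: ClosedWindow[] = [];
    for (const [service, windows] of this.open) {
      for (const [startMs, records] of windows) {
        if (startMs + WINDOW_MS <= nowMs) {
          closed.push({ service, startMs, records });
          windows.delete(startMs);
        }
      }
      if (windows.size === 0) this.open.delete(service);
    }
    return closed;
  }
}
```

In this sketch, signal detection would run on each `ClosedWindow` returned by `closeExpired`, which a timer could call every few seconds.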

What works today

- OTLP gRPC receiver: intake-api accepts LogsService/Export RPCs from the forwarder on port 4317
- Per-service buffering: records grouped by service name in 60-second tumbling windows
- Signal detection: on window close, finds error rate spikes, latency anomalies, N+1 query patterns, slow endpoints
- Baseline comparison: signals compared against rolling per-endpoint baselines (2x p95 threshold)
- Correlation: related signals (slow endpoint + slow downstream query) merged into incident groups
- Findings archive: JSON findings with evidence uploaded to S3-compatible storage
- Prometheus metrics: records received, windows processed, signals found, buffer depth per service
- Baseline accumulation: every window's per-endpoint stats feed a rolling baseline so regressions are detected relative to normal, not just static thresholds
- Evidence archival: the failing RRPairs for each signal are tarred to S3, keyed by fingerprint, so the bug stays replayable
- Detect → confirm → replicate loop: high-severity regressions enqueue a reproduce AgentRun; the worker replays the archived traffic, confirms the signal reappears, and files a Linear ticket with the evidence
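One way to picture the baseline comparison and accumulation items is the sketch below. It is an assumption-laden illustration rather than the project's actual detector: the `EndpointBaseline` class, the nearest-rank percentile, the choice of averaging prior window p95 values, and the history and warm-up sizes (30 and 5 windows) are all hypothetical. Only the 2x p95 threshold comes from the list above.

```typescript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const value = sorted[Math.max(0, rank - 1)];
  return value ?? NaN;
}

const THRESHOLD_FACTOR = 2; // "2x p95 threshold"

class EndpointBaseline {
  private readonly history: number[] = []; // p95 of recent windows
  private readonly maxWindows: number;
  private readonly minWindows: number;

  // Both sizes are hypothetical defaults chosen for illustration.
  constructor(maxWindows: number = 30, minWindows: number = 5) {
    this.maxWindows = maxWindows;
    this.minWindows = minWindows;
  }

  // Compares one window against the rolling baseline, then folds the window
  // into the baseline. Returns true when the window looks like a regression.
  observe(windowLatenciesMs: number[]): boolean {
    const p95 = percentile(windowLatenciesMs, 95);
    if (Number.isNaN(p95)) return false; // empty window: nothing to compare

    let regression = false;
    if (this.history.length >= this.minWindows) {
      const sum = this.history.reduce((acc, v) => acc + v, 0);
      const baseline = sum / this.history.length;
      regression = p95 > THRESHOLD_FACTOR * baseline;
    }

    // Every window feeds the baseline, including anomalous ones.
    this.history.push(p95);
    if (this.history.length > this.maxWindows) this.history.shift();
    return regression;
  }
}
```

Comparing before updating keeps a spike from diluting the very baseline it is judged against. Whether anomalous windows should be excluded from the baseline altogether is a tuning decision this sketch does not make.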

The killer feature

The closed loop (detect, confirm, replicate) is wired end-to-end. What makes it possible is the combination of full request/response payloads and an AI agent that can act on them. Production rollout still needs live config (REPRODUCE_REPLAY_TARGET, LINEAR_API_KEY, LINEAR_REPRODUCE_TEAM_ID) and threshold tuning.
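A startup check for the three settings named above might look like the following sketch. The function name `missingReproduceConfig` is hypothetical, and treating blank values as missing is an assumption; docs/CONFIG.md remains the authority on how the binary actually reads these variables.

```typescript
// Env vars the reproduce loop needs in production, as named in this README.
const REQUIRED_REPRODUCE_ENV = [
  "REPRODUCE_REPLAY_TARGET",
  "LINEAR_API_KEY",
  "LINEAR_REPRODUCE_TEAM_ID",
] as const;

// Returns the names of required variables that are unset or blank.
function missingReproduceConfig(
  env: Record<string, string | undefined>,
): string[] {
  return REQUIRED_REPRODUCE_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling it with `process.env` at startup and refusing to enqueue reproduce runs while the result is non-empty would surface a misconfigured rollout early.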

See docs/plan.md for the roadmap and remaining P1/P2 work.

Architecture

Three processes, one image:

| Process | Role |
| --- | --- |
| `intake-api` | HTTP API (:8080) + OTLP gRPC receiver (:4317) + run queue + metrics |
| `worker` | Polls queue, executes agent runs (triage, bug-fix, reproduce) |

See docs/architecture.md for full system design.

Deployment

Install the Helm chart alongside speedscale-operator:

helm install agent-factory ./charts/agent-factory \
  --namespace agent-factory --create-namespace \
  --set engine.kind=claude-sdk \
  --set engine.authSecret.name=anthropic-api-key \
  --set intakeApi.otlp.enabled=true \
  --set intakeApi.otlp.archiveSecret.name=agent-factory-archive-s3

Then add Agent Factory as a forwarder OTLP destination in the operator ConfigMap or forwarder ConfigMap:

{
  "agent_factory": {
    "otel_endpoint": "http://agent-factory-intake-api.agent-factory.svc.cluster.local:4317",
    "dlp_config_id": "standard",
    "filter_rule": "standard"
  }
}

CLI mode

For one-off runs against existing traffic:

npm install
export ANTHROPIC_API_KEY=<your-key>

npm run llm-run -- \
  --title "Service X returning 429 errors on /api/sync" \
  --body  "Errors cluster in short bursts suggesting a concurrency problem." \
  --snapshot /path/to/snapshot/inner-dir \
  --source  /path/to/service/src \
  --workdir /tmp/llm-run-work \
  --verbose

Documentation

| Doc | Contents |
| --- | --- |
| `docs/architecture.md` | System design, streaming pipeline, deployment |
| `docs/plan.md` | Roadmap: close the detect/confirm/replicate loop |
| `docs/CONFIG.md` | Every env var the binary accepts |
| `docs/operations.md` | Metrics, thresholds, runbook |
| `docs/engine.md` | LLM engine: tool catalog, agent loop |
| `docs/developers.md` | Development workflow |
| `docs/history.md` | Refactor history and design decisions |
| `docs/release.md` | Version bump + publish flow |
| `docs/EVALS.md` | Eval substrate |
