sre-on-call

A multi-agent bot that automatically investigates infrastructure alerts across Slack and Discord. When an alert is posted to a channel, the system acknowledges it, fans out investigation work to specialized agents in parallel, and posts a structured incident analysis as a thread reply.

Architecture

Slack/Discord webhook
        │
        ▼
   Lambda Adapter ── (signature verify, dedup, classify, ack) ──▶ chat platform
        │
        │ bedrock-agentcore.invoke_agent_runtime (JSON-RPC 2.0 / A2A)
        ▼
   Master Agent  ── investigate_alert tool ──▶ Orchestrator
                                                   │
                            ┌──────────────────────┼──────────────────────┐
                            ▼                      ▼                      ▼
                     Slack Scanner   Discord Scanner
                     CloudWatch Logs  EKS              (4 specialized agents,
                                                        configured via config.yaml)

Lambda Adapter — receives Slack/Discord webhooks, verifies signatures, deduplicates via DynamoDB, classifies the mention (suppresses non-alert chatter before the fan-out — see below), then invokes the Master Agent runtime.
Master Agent — orchestrates the investigation: fans out to specialized agents, enforces deadlines, posts the Incident Report. See agents/master/README.md.
Specialized agents — one per data source. Each has its own README:
- Slack Scanner — Slack channel history correlation.
- Discord Scanner — Discord channel history correlation.
- CloudWatch Logs — Logs Insights queries.
- EKS — Kubernetes cluster state.

A Prometheus agent (agents/prometheus/) is checked-in but not deployed and not in the orchestrator fan-out — it isn't listed in config.yaml and has no terraform plumbing.

For the full docs index (deployment, testing, architecture, design specs), see docs/README.md.

All agents run on AWS Bedrock AgentCore Runtime, communicate via the A2A protocol (JSON-RPC 2.0 over the runtime's /invocations endpoint), and use the Strands Agents SDK with Claude Haiku 4.5. Each agent's tool surface is described declaratively: config.yaml lists the agent's enabled skills and MCP servers, every skill is a SKILL.md bundle under agents/<name>/skills/<skill>/, and shared.a2a_factory is the single entry point that loads the config, resolves skills, opens MCP connections, and starts the A2A server.

Project Structure

├── config.yaml                     # Per-agent skills + MCP servers (single source of truth)
├── lambda_adapter/                 # Lambda webhook ingestion
│   ├── handler.py                  # Lambda entry point
│   ├── intake.py                   # Dedup + classification gate + master agent invocation
│   ├── classifier.py               # Alert-vs-chatter classification (heuristics + optional LLM)
│   └── dedup.py                    # DynamoDB deduplication store
├── agents/
│   ├── master/                     # Master orchestration agent
│   │   ├── tools.py                # investigate_alert (single tool, fire-and-forget)
│   │   ├── orchestrator.py         # InvestigationOrchestrator: fan-out + deadlines
│   │   ├── report_formatter.py     # Incident report assembly
│   │   ├── skills/<skill>/SKILL.md # Skill bundles (frontmatter -> tool symbol)
│   │   ├── tests/test_tools.py     # Per-agent unit tests
│   │   └── agent_card.json
│   ├── slack_scanner/              # tools.py + skills/ + tests/ + agent_card.json
│   ├── discord_scanner/            # same layout
│   ├── cloudwatch_logs/            # same layout (also wires the aws_docs MCP)
│   ├── eks/                        # same layout (network_mode: VPC)
│   └── prometheus/                 # Not deployed; not in config.yaml
├── shared/                         # Cross-agent utilities
│   ├── models.py                   # AlertContext, AgentResult, Finding, AgentFailure, AgentMetadata, CommandRequest
│   ├── constants.py
│   ├── a2a_factory.py              # Loads config + skills + MCPs; A2AServer + uvicorn + /ping
│   ├── a2a_protocol.py             # JSON-RPC envelope build/extract helpers
│   ├── agent_telemetry.py          # Per-agent metadata footer (model, tokens, cost)
│   ├── config.py                   # ProjectConfig (Pydantic) + loader for config.yaml
│   ├── skill_loader.py             # SKILL.md parser + tool-symbol resolver
│   ├── mcp_loader.py               # Context-managed MCPConnections handle
│   ├── platforms/                  # ChatPlatform per chat platform (Slack, Discord)
│   │   ├── __init__.py             # Protocol, WebhookEvent tagged union, deliver_with_retry, registry
│   │   ├── slack.py                # SlackChatPlatform: signature, parse, ack, deliver
│   │   └── discord.py              # DiscordChatPlatform: signature, parse, ack, deliver
│   ├── channel_scan.py             # Shared channel-scanning algorithm
│   ├── channel_utils.py
│   ├── report_renderer.py          # MarkupDialect-driven section renderer (Slack mrkdwn, Discord MD)
│   ├── secrets.py                  # Secrets Manager ARN -> plaintext resolver (cached)
│   ├── time_utils.py               # Investigation window + ISO timestamp helpers
│   ├── tool_result.py
│   ├── experiment.py
│   ├── experiment_store.py
│   ├── experiment_results_store.py
│   └── trace_store.py              # S3 + DDB per-investigation trace archive (fail-open)
├── tests/                          # Cross-cutting / shared unit tests
│   ├── integration/                # Handler, orchestrator, A2A factory, synthetic webhook
│   └── property/                   # Hypothesis property-based tests
├── modules/sre-on-call/            # Reusable module (no provider/backend)
│   ├── versions.tf                 # Provider requirements only
│   ├── variables.tf                # Inputs (incl. config_path, source_root)
│   ├── networking.tf               # EKS-VPC reference + agent SG
│   ├── ecr.tf                      # ECR repos for the 5 agent images
│   ├── dynamodb.tf                 # Dedup table
│   ├── dynamodb_experiments.tf     # A/B experiment tables
│   ├── secrets.tf                  # Slack/Discord secret containers
│   ├── lambda.tf                   # Lambda function + URL
│   ├── iam.tf                      # Lambda + agent IAM roles
│   ├── iam_agentcore.tf            # AgentCore-specific IAM
│   ├── agentcore.tf                # 5 aws_bedrockagentcore_agent_runtime resources
│   ├── traces.tf                   # S3 trace bucket + DDB index + KMS CMK + IAM grants
│   └── observability.tf            # CloudWatch alarms + SNS topic for AgentCore
├── examples/complete/             # Reference root: provider + backend + module call
│   ├── main.tf                     # provider + module "sre_on_call"
│   ├── outputs.tf                  # Re-exports module outputs
│   └── moved.tf                    # State re-keying for the old flat root
├── scripts/
│   ├── build_and_push_agents.sh    # Build 5 linux/arm64 images and push to ECR
│   ├── hydrate_secrets.sh          # Push Slack/Discord secret values
│   ├── enable_observability.sh     # One-time CloudWatch Transaction Search enablement
│   └── synthetic_slack_webhook.py  # Send a signed synthetic alert to the Lambda URL
├── docs/
│   ├── README.md                   # Docs index
│   ├── deployment.md               # Build, deploy, scoped testing
│   ├── testing.md                  # Synthetic + real Slack alert procedures
│   ├── architecture.d2             # Source for architecture.svg
│   ├── architecture.svg
│   ├── icons/                      # AWS + vendor icons used by the diagram
│   └── superpowers/                # Living design specs and implementation plans
├── CONTEXT.md                      # Domain vocabulary
└── pyproject.toml

Prerequisites

Python 3.12+
Terraform >= 1.5 (for infrastructure deployment only)
Docker buildx with linux/arm64 support (AgentCore runtime requires arm64)
AWS CLI with SSO or static credentials for the target account

Installation

git clone <repository-url>
cd sre-on-call
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python -c "from shared.models import AlertContext; print('OK')"

Running tests

pytest                                              # full suite
pytest -v                                           # verbose
pytest agents/eks/tests/test_tools.py               # one agent's unit tests
pytest tests/integration/test_orchestrator.py       # one integration test
pytest tests/property/                              # property-based tests only

Current count: 582 collected, 582 passing. (Prometheus tests run and pass even though the agent isn't deployed.)

Test layout

agents/<name>/tests/test_tools.py — per-agent unit tests for the tool surface.
tests/ — cross-cutting unit tests for shared modules (config, skill loader, MCP loader, channel utils, telemetry, dedup, parser, signature, time utils, report formatter, A/B experiments, postmortem command).
tests/integration/ — the master orchestrator, the Lambda handler, the A2A factory, and the synthetic-webhook signing round-trip.
tests/property/ — Hypothesis property-based tests for the parser, signature verifier, dedup, time utils, channel utils, report formatter, and CloudWatch Logs query helpers.

Infrastructure deployment

See docs/deployment.md for the full procedure (build images → ECR repos → terraform apply → secret hydration). High level:

Configure the AWS profile and required Terraform variables.
terraform apply -target=module.sre_on_call.aws_ecr_repository.agents to create the repos.
./scripts/build_and_push_agents.sh <tag> to build + push the 5 agent images.
terraform apply -var "agent_image_tag=<tag>" for the rest.
./scripts/hydrate_secrets.sh to populate Slack/Discord secret values.

Required Terraform variables

# examples/complete/terraform.tfvars
eks_cluster_name         = "eks-uat"                           # existing cluster the EKS agent inspects
agent_container_registry = "<account-id>.dkr.ecr.<region>.amazonaws.com"

Optional:

Variable	Default	Purpose
`aws_region`	`us-east-1`
`environment`	`dev`	Resource-name prefix
`project_name`	`sre-on-call`	Resource-name prefix
`agent_image_tag`	`latest`	Pin image tag at apply time
`model_id`	`us.anthropic.claude-haiku-4-5-20251001-v1:0`	Bedrock model used by all agents
`lambda_memory_size`	`256`
`lambda_timeout`	`30`

Configure Slack

See docs/testing.md for the full Slack App + bot setup. Quick version:

Set the Event Subscriptions URL to the deployed Lambda function URL.
Subscribe to the app_mention bot event only. The intake classification gate suppresses obvious non-alert chatter, but subscribing to broader events still wastes classifier work — keep it to app_mention.
Bot scopes: app_mentions:read, chat:write, channels:history.
Hydrate secrets with the real Bot Token (xoxb-…) and Signing Secret.

Alert classification gate

Not every @bot mention is an alert — a casual "thanks!" should not launch a full agent fan-out. The Lambda intake classifies each new mention before dispatch:

Tier 1 — heuristics (lambda_adapter/classifier.py, pure & deterministic): scans for alert-shaped markers (severity keywords, Alertmanager/Grafana formatting, dashboard/console links) and, conversely, obvious chatter (greetings, acknowledgements, bare mentions). A confident verdict wins here.
Tier 2 — LLM (optional, CLASSIFIER_LLM_ENABLED=true): one Bedrock Converse turn (Haiku by default) judges messages Tier 1 can't call.
Manual override: include the word investigate in the mention to force an investigation regardless of classification.

The gate is fail-open — an ambiguous message, a disabled/erroring LLM, or any unexpected error all default to investigate, so a real page is never silently dropped. A gated mention gets a one-line in-thread nudge instead of a fan-out. Disable the whole gate with ALERT_CLASSIFICATION_ENABLED=false (Terraform: enable_alert_classification = false).

Testing

See docs/testing.md. Three paths:

Synthetic alert — scripts/synthetic_slack_webhook.py builds a correctly-signed app_mention payload and POSTs to the Lambda URL. Useful for fast smoke tests.
Real Slack alert — invite the bot to a channel and @bot … to trigger an investigation end-to-end.
/sre-snapshot snapshot — the same script with --command /sre-snapshot (or run /sre-snapshot in any channel after Slack registration). Posts a top-level snapshot of cluster state, top log groups by ingestion, and chat platform reachability.

Environment variables

These are set on the Lambda function and the AgentCore runtimes by Terraform; you typically do not need to set them by hand.

Variable	Component	Description
`SLACK_SIGNING_SECRET`	Lambda	Secrets Manager ARN holding the Slack signing secret
`SLACK_BOT_TOKEN`	Lambda, Master, Slack Scanner	Secrets Manager ARN holding the Slack bot OAuth token
`DISCORD_PUBLIC_KEY`	Lambda	Secrets Manager ARN holding the Discord application public key
`DISCORD_BOT_TOKEN`	Lambda, Master, Discord Scanner	Secrets Manager ARN holding the Discord bot token
`DEDUP_TABLE_NAME`	Lambda	DynamoDB deduplication table name
`ALERT_CLASSIFICATION_ENABLED`	Lambda	Gate non-alert mentions out of the fan-out (default `true`; kill-switch)
`CLASSIFIER_LLM_ENABLED`	Lambda	Enable the Tier 2 LLM classifier for ambiguous mentions (default `false`)
`CLASSIFIER_MODEL_ID`	Lambda	Bedrock model for the Tier 2 classifier (falls back to `MODEL_ID`, then Haiku)
`EXPERIMENTS_TABLE_NAME`	Lambda	DynamoDB A/B experiment config table name
`MASTER_AGENT_RUNTIME_ARN`	Lambda	AgentCore runtime ARN of the master agent
`TRACES_BUCKET_NAME`	Lambda, Master	S3 bucket for per-investigation trace archive (optional — unset disables tracing)
`TRACES_TABLE_NAME`	Lambda, Master	DynamoDB index table for trace archive lookups
`SLACK_SCANNER_AGENT_RUNTIME_ARN`	Master	AgentCore runtime ARN of the Slack Scanner
`DISCORD_SCANNER_AGENT_RUNTIME_ARN`	Master	AgentCore runtime ARN of the Discord Scanner
`CLOUDWATCH_LOGS_AGENT_RUNTIME_ARN`	Master	AgentCore runtime ARN of CloudWatch Logs
`EKS_AGENT_RUNTIME_ARN`	Master	AgentCore runtime ARN of EKS
`MODEL_ID`	All agents	Bedrock model ID or cross-region inference profile
`EKS_CLUSTER_NAME`	EKS agent	Cluster the EKS agent inspects
`A2A_PORT` / `A2A_HOST`	All agents	A2A server bind port (9000) / host (0.0.0.0)

For local-dev work where agents run as plain HTTP A2A servers (not on AgentCore), each *_AGENT_RUNTIME_ARN falls back to a corresponding *_AGENT_URL (e.g. EKS_AGENT_URL=http://localhost:9005). When unset, the orchestrator uses localhost defaults.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sre-on-call

Architecture

Project Structure

Prerequisites

Installation

Running tests

Test layout

Infrastructure deployment

Required Terraform variables

Configure Slack

Alert classification gate

Testing

Environment variables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
.claude		.claude
agents		agents
docs		docs
examples/complete		examples/complete
lambda_adapter		lambda_adapter
modules/sre-on-call		modules/sre-on-call
page_renderer		page_renderer
scripts		scripts
shared		shared
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CONTEXT.md		CONTEXT.md
Dockerfile		Dockerfile
README.md		README.md
config.yaml		config.yaml
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

sre-on-call

Architecture

Project Structure

Prerequisites

Installation

Running tests

Test layout

Infrastructure deployment

Required Terraform variables

Configure Slack

Alert classification gate

Testing

Environment variables

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages