An AI-powered incident response system that intercepts infrastructure alerts, diagnoses root causes using RAG-augmented reasoning, and assists on-call engineers via a chat interface.
See SUMMARY.md for detailed architectural decisions, model selection benchmarks, and design rationale.
When an alert fires from an alerting tool (Grafana, PagerDuty), the system:
- Normalises the alert payload into a standard internal schema
- Classifies the incident (service, severity, failure class)
- Retrieves relevant runbooks and past incidents from the knowledge base
- Produces a structured incident brief with root cause hypothesis and recommended actions
- Notifies the on-call engineer via email with a link to the chat interface
- Allows the engineer to ask follow-up questions and mark the incident as resolved
Resolved incidents are automatically ingested into the knowledge base, so later retrievals can draw on how earlier incidents were diagnosed and fixed.
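The first step above, normalising vendor payloads into one internal schema, could be organised as a small registry keyed by alert source. The sketch below is illustrative only: `NormalisedAlert`, `register`, `normalise`, and the payload field names are assumptions, not the project's actual schema or any vendor's webhook format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class NormalisedAlert:
    """Hypothetical internal schema shared by all alert sources."""
    service: str
    severity: str
    summary: str


Normaliser = Callable[[Dict[str, Any]], NormalisedAlert]
_REGISTRY: Dict[str, Normaliser] = {}


def register(source: str) -> Callable[[Normaliser], Normaliser]:
    """Decorator that registers a normaliser under an alert-source name."""
    def decorator(fn: Normaliser) -> Normaliser:
        _REGISTRY[source] = fn
        return fn
    return decorator


@register("generic")
def normalise_generic(payload: Dict[str, Any]) -> NormalisedAlert:
    # Field names are illustrative, not a real vendor payload format.
    return NormalisedAlert(
        service=str(payload.get("service", "unknown")),
        severity=str(payload.get("severity", "unknown")).lower(),
        summary=str(payload.get("summary", "")),
    )


def normalise(source: str, payload: Dict[str, Any]) -> NormalisedAlert:
    """Dispatch to the registered normaliser, falling back to the generic one."""
    return _REGISTRY.get(source, _REGISTRY["generic"])(payload)


# An unregistered source falls back to the generic normaliser.
alert = normalise(
    "some-unregistered-tool",
    {"service": "checkout", "severity": "CRITICAL", "summary": "p99 latency high"},
)
```

Keeping each source behind the same callable signature means a new alerting tool only needs one new registered function; the rest of the pipeline sees the same schema.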
/analyse endpoint or SQS                /chat endpoint
            │                                  │
            ▼                                  ▼
       Normaliser                         Chat Model
            │                                  │
            ▼                                  │
    Triage Classifier                          │
            │                                  │
            └──────────────┬───────────────────┘
                           ▼
                     RAG Retrieval
                (S3 Vectors + DynamoDB)
                           │
                           ▼
                    Reasoning Agent
                           │
                           ▼
                    LLM as a Judge
                           │
                           ▼
            Store Incident Brief (DynamoDB)
                           │
                 ┌─────────┴──────────┐
                 ▼                    ▼
       Return incident_id    SQS → Lambda → SES
       + status to caller   (async notification)
                                      │
                                      ▼
                              Email → Engineer
                                      │
                                      ▼
                          Engineer resolves via UI
                                      │
                                      ▼
                        POST /incidents/{id}/resolve
                                      │
                           ┌──────────┴──────────┐
                           ▼                     ▼
                    Return response        SQS → Lambda
                                     (async corpus ingestion)
                                                 │
                                                 ▼
                                       Chunk → Embed → Store
                                      (S3 Vectors + DynamoDB)
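The final Chunk → Embed → Store stage of the diagram can be illustrated with a minimal chunking sketch. The source does not say how chunks are produced, so the strategy here (split on Markdown headings, cap each piece at `max_chars`, derive `chunk_id` from a content hash) is an assumption, as are the names `Chunk` and `chunk_markdown`. Embedding and storage are left out.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # derived from content, so re-ingesting unchanged text yields the same id
    source: str    # e.g. a runbook path or an incident_id
    text: str


def chunk_markdown(source: str, markdown: str, max_chars: int = 1200) -> List[Chunk]:
    """Split on Markdown headings, then cap each section at max_chars."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[Chunk] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        for start in range(0, len(section), max_chars):
            text = section[start:start + max_chars]
            digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
            chunks.append(Chunk(chunk_id=digest[:16], source=source, text=text))
    return chunks


runbook = "# Title\nintro\n## Step 1\ndo x\n## Step 2\ndo y\n"
chunks = chunk_markdown("runbooks/example.md", runbook)
```

A content-derived id is one way to make repeated ingestion of the same file overwrite rather than duplicate entries; a real pipeline might prefer token-based limits or overlapping windows.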
| Layer | Choice |
|---|---|
| Frontend framework | React + Vite |
| UI components | Shadcn/ui + Tailwind CSS |
| Markdown rendering | React Markdown + rehype-highlight |
| State management | Zustand |
| Data fetching | TanStack Query |
| Frontend hosting | Amazon S3 |
| Frontend access | Internal ALB (private VPC) |
| API server | FastAPI on ECS Fargate |
| Data validation | Pydantic |
| LLM orchestration | Native Bedrock SDK (boto3) |
| Database | DynamoDB |
| Triage model | claude-haiku-3.5 (via Bedrock) |
| Chat model | claude-haiku-3.5 (via Bedrock) |
| Reasoning model | claude-sonnet-4 (via Bedrock) |
| Embedding model | Cohere embed-v4 (via Bedrock) |
| Vector store | Amazon S3 Vectors |
| Notification service | Amazon SES (async via SQS + Lambda) |
| Observability | Amazon CloudWatch + AWS X-Ray + Langfuse |
| Authentication | AWS Cognito User Pool |
| Secrets | AWS Secrets Manager + Parameter Store |
| Container registry | Amazon ECR |
| IaC | Terraform |
| CI/CD | GitHub Actions |
| Cloud | AWS |
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /analyse | POST | Receives alert, runs pipeline, returns incident_id |
| /chat | POST | Follow-up questions on an incident |
| /incidents/{id} | GET | Retrieve incident brief |
| /incidents/{id}/history | GET | Retrieve chat history |
| /incidents/{id}/resolve | POST | Mark incident as resolved |
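As a rough illustration of calling the API, the sketch below builds a `POST /analyse` request with the standard library. The request body fields are hypothetical, since the alert schema is not documented here, and authentication is omitted; that matches only the local setup, where Cognito auth is disabled.

```python
import json
import urllib.request


def build_analyse_request(base_url: str, alert: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /analyse request carrying a JSON alert."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analyse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyse_request(
    "http://localhost:8000",
    # Hypothetical payload; the real alert fields depend on the alerting tool.
    {"service": "checkout", "severity": "critical", "summary": "p99 latency high"},
)

# To send it against a running local server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```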
cognis/
├── backend/
│   ├── app/
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── api/
│   │   │   ├── analyse.py
│   │   │   ├── chat.py
│   │   │   ├── incidents.py
│   │   │   ├── resolve.py
│   │   │   └── health.py
│   │   ├── pipeline/
│   │   │   ├── normaliser.py
│   │   │   ├── triage.py
│   │   │   ├── retrieval.py
│   │   │   ├── agent.py
│   │   │   └── judge.py
│   │   ├── providers/
│   │   │   ├── base.py
│   │   │   ├── notifications/
│   │   │   │   ├── ses.py
│   │   │   │   ├── slack.py
│   │   │   │   └── teams.py
│   │   │   └── normalisers/
│   │   │       ├── grafana.py
│   │   │       ├── pagerduty.py
│   │   │       └── generic.py
│   │   ├── registry/
│   │   │   ├── notification_registry.py
│   │   │   └── normaliser_registry.py
│   │   ├── models/
│   │   │   ├── alert.py
│   │   │   ├── incident.py
│   │   │   ├── chat.py
│   │   │   └── resolve.py
│   │   ├── services/
│   │   │   ├── bedrock.py
│   │   │   ├── dynamodb.py
│   │   │   ├── s3vectors.py
│   │   │   └── sqs.py
│   │   └── lambdas/
│   │       ├── notify.py
│   │       └── ingest.py
│   ├── tests/
│   │   ├── unit/
│   │   ├── integration/
│   │   └── smoke/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── .dockerignore
│
├── frontend/
│   ├── src/
│   │   ├── pages/
│   │   ├── components/
│   │   ├── hooks/
│   │   ├── store/
│   │   ├── lib/
│   │   └── types/
│   ├── vite.config.ts
│   ├── tailwind.config.ts
│   └── package.json
│
├── terraform/
│   ├── modules/
│   │   ├── networking/
│   │   ├── compute/
│   │   ├── storage/
│   │   ├── messaging/
│   │   ├── ai/
│   │   ├── auth/
│   │   ├── secrets/
│   │   └── observability/
│   └── environments/
│       ├── dev/
│       └── prod/
│
├── runbooks/
│   ├── kubernetes/
│   ├── redis/
│   ├── postgres/
│   ├── post-mortems/
│   ├── sre-general/
│   └── org-specific/
│
├── scripts/
│   ├── seed_local.py
│   ├── ingest_corpus.py
│   ├── evaluate_retrieval.py
│   └── evaluate_agent.py
│
├── tests/
│   └── evaluation/
│       ├── retrieval_eval_dataset.json
│       └── agent_eval_dataset.json
│
├── .github/
│   └── workflows/
│       ├── backend.yml
│       ├── frontend.yml
│       ├── terraform.yml
│       └── corpus.yml
│
├── .env.local.example
├── .gitignore
├── docker-compose.yml
└── README.md
Prerequisites: Docker, Docker Compose, Node.js, Python 3.12, AWS credentials (for Bedrock)
1. Copy environment template:
cp .env.local.example .env.local
# Fill in AWS credentials for Bedrock
2. Start backend services:
docker compose up
3. Seed local database:
python scripts/seed_local.py
4. Start frontend:
cd frontend
npm install
npm run dev
Services running locally:
| Service | URL |
|---|---|
| FastAPI | http://localhost:8000 |
| React | http://localhost:3000 |
| DynamoDB Local | http://localhost:8001 |
| ElasticMQ (SQS) | http://localhost:9324 |
Cognito auth is disabled locally. S3 Vectors is mocked. Emails are logged to console.
Backend:
cd backend
pytest tests/unit/
pytest tests/integration/
pytest tests/smoke/ # requires running dev environment
Frontend:
cd frontend
npm run test # Vitest unit tests
npm run test:integration # React Testing Library
Runbooks are stored as Markdown files in /runbooks. Changes merged to main automatically trigger re-ingestion of updated files and retrieval evaluation via GitHub Actions.
Initial corpus load:
python scripts/ingest_corpus.py --all
Re-embed specific files (called automatically by CI/CD):
python scripts/ingest_corpus.py --files runbooks/kubernetes/oomkilled.md
Retrieval evaluation (dev environment only):
python scripts/evaluate_retrieval.py # full evaluation
python scripts/evaluate_retrieval.py --disable-rerank # baseline without reranking
Agent output evaluation (dev environment only):
python scripts/evaluate_agent.py
Deployment is handled automatically via GitHub Actions:
| Trigger | Action |
|---|---|
| Merge to main | Deploy backend + frontend to dev |
| Manual trigger | Deploy backend + frontend to prod |
| PR opened | Terraform plan posted as PR comment |
| Merge to main | Terraform apply to dev |
| Manual trigger | Terraform apply to prod |
Three DynamoDB tables:
Incidents — PK: incident_id
Stores incident briefs created by /analyse and updated by /incidents/{id}/resolve.
ChatMessages — PK: incident_id, SK: message_id
Stores conversation history per incident for /chat and /incidents/{id}/history.
CorpusChunks — PK: chunk_id
Stores the text of runbook and incident chunks, which is fetched by chunk_id after an S3 Vectors similarity search returns the matching vectors.
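The key layout of the three tables can be written out as `create_table` arguments, for example when creating them in DynamoDB Local. This is a sketch under assumptions: the helper `table_definition` is hypothetical, all key attributes are assumed to be strings, and on-demand billing is assumed. In deployed environments the tables would presumably come from the Terraform modules rather than from code like this.

```python
from typing import Optional


def table_definition(name: str, partition_key: str, sort_key: Optional[str] = None) -> dict:
    """Build keyword arguments in the shape boto3's create_table expects."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]
    if sort_key is not None:
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})
    return {
        "TableName": name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


TABLES = [
    table_definition("Incidents", "incident_id"),
    table_definition("ChatMessages", "incident_id", sort_key="message_id"),
    table_definition("CorpusChunks", "chunk_id"),
]

# Against DynamoDB Local (requires boto3 and the local container running):
# import boto3
# client = boto3.client("dynamodb", endpoint_url="http://localhost:8001")
# for definition in TABLES:
#     client.create_table(**definition)
```

Using incident_id as the partition key and message_id as the sort key lets a single query return the full chat history of one incident.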
| Variable | Description |
|---|---|
| ENVIRONMENT | local, dev, or prod |
| AWS_REGION | AWS region |
| DYNAMODB_ENDPOINT | DynamoDB endpoint (local only) |
| SQS_ENDPOINT | SQS endpoint (local only) |
| S3_VECTORS_MOCK | Mock S3 Vectors locally (true/false) |
| SES_MODE | log for local, send for deployed |
| AUTH_DISABLED | Disable Cognito auth locally (true/false) |
| AWS_ACCESS_KEY_ID | AWS credentials (local only, for Bedrock) |
| AWS_SECRET_ACCESS_KEY | AWS credentials (local only, for Bedrock) |
In deployed environments all config is injected at runtime from AWS Secrets Manager and Parameter Store.
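A minimal sketch of how these variables might be read into a typed settings object follows. The `Settings` class, `load_settings` function, default values, and the guard against disabling auth outside local are all assumptions for illustration, not the contents of the project's config.py. The AWS credential variables are not read here because boto3 picks them up from the environment on its own.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class Settings:
    environment: str
    aws_region: str
    dynamodb_endpoint: Optional[str]
    sqs_endpoint: Optional[str]
    s3_vectors_mock: bool
    ses_mode: str
    auth_disabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (os.environ by default) and validate them."""
    environment = env.get("ENVIRONMENT", "local")
    if environment not in {"local", "dev", "prod"}:
        raise ValueError(f"Unknown ENVIRONMENT: {environment!r}")
    is_local = environment == "local"
    settings = Settings(
        environment=environment,
        aws_region=env["AWS_REGION"],
        dynamodb_endpoint=env.get("DYNAMODB_ENDPOINT") or None,
        sqs_endpoint=env.get("SQS_ENDPOINT") or None,
        s3_vectors_mock=_as_bool(env.get("S3_VECTORS_MOCK", "false")),
        ses_mode=env.get("SES_MODE", "log" if is_local else "send"),
        auth_disabled=_as_bool(env.get("AUTH_DISABLED", "false")),
    )
    # Assumed guard rail: refuse to start a deployed environment without auth.
    if settings.auth_disabled and not is_local:
        raise ValueError("AUTH_DISABLED is only permitted when ENVIRONMENT=local")
    return settings


local = load_settings({
    "ENVIRONMENT": "local",
    "AWS_REGION": "eu-west-1",
    "DYNAMODB_ENDPOINT": "http://localhost:8001",
    "S3_VECTORS_MOCK": "true",
    "AUTH_DISABLED": "true",
})
```

Passing the mapping in as a parameter keeps the loader easy to unit test without touching the real process environment.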