Serverless order processing system on AWS demonstrating production-grade distributed systems engineering: SAGA pattern, idempotency, circuit breakers, event sourcing, and measured performance under concurrency.
| SAGA Orchestration | Step Functions coordinates reserve → charge → confirm with automatic compensation on failure — no 2-phase commit, no distributed locks |
| Failure Recovery | Every failure mode is handled and tested: payment failure triggers inventory rollback, duplicate requests are deduplicated, circuit breakers prevent cascade failures |
| Measured Performance | Load-tested at 5–50 concurrent threads: 1,100+ req/min, P50 = 47ms, P99 = 120ms on LocalStack. Bottlenecks identified and explained |
| Test Coverage | 50+ tests across unit (moto mocks) and integration (LocalStack) layers — including failure scenarios, compensation paths, pagination, DLQ processing, and idempotency under concurrency |
| Infrastructure as Code | 5 AWS CDK stacks: DynamoDB tables, SQS queues, Step Functions, API Gateway, CloudWatch dashboards — fully reproducible in one deploy |
| Observability | Structured JSON logs (CloudWatch Logs Insights ready), X-Ray distributed tracing, correlation IDs across every service boundary |
| Section | What You'll Find |
|---|---|
| Live Demo | 4 animated GIF demos — all distributed systems scenarios in action |
| Architecture | Mermaid diagram — full system at a glance |
| How the SAGA Works | SAGA, idempotency, circuit breaker, event sourcing |
| Failure Scenarios | Every failure mode + how the system recovers |
| Observability | JSON logs, CloudWatch metrics, X-Ray traces, dashboard mockups |
| Performance Report | Benchmark data: throughput, latency, bottleneck analysis |
| Design Tradeoffs | Why SAGA over 2PC, why DynamoDB over Postgres |
| Security | Input validation, IAM least-privilege, no secrets in code |
| How To Run Locally | Setup + test commands (Windows & Linux) |
| API Reference | Request/response examples for all 4 outcome types |
| Known Limitations | Open consistency question + honest design boundaries |
| TECHNICAL.md | 8-section deep-dive for full architecture analysis |
| Author | Background + links |
Interactive Streamlit dashboard running against LocalStack — all 4 distributed systems scenarios in action. Run locally:
.\run.ps1 local-upthen.\run.ps1 dashboard
Full SAGA trace: Create Order → Reserve Inventory → Charge Payment → Confirm. Each step timed independently. Order status shows
CONFIRMED, inventory count decrements atomically.
Payment deliberately declined. Step Functions triggers the compensation path: inventory is released back, order marked
COMPENSATED. No stock held without a successful charge.
Circuit breaker manually tripped to
OPEN. Subsequent orders fail in < 1ms without touching the payment provider — preventing a cascade timeout across the entire Lambda fleet.
Order quantity exceeds available stock. DynamoDB
ConditionalCheckFailedExceptionfires atomically — no overselling even under concurrent requests. ReturnsINSUFFICIENT_STOCKimmediately.
flowchart TD
Client([Client]) -->|POST /orders| AG[API Gateway]
AG --> OS[Order Service λ]
OS -->|1 - Create order PENDING| ODB[(orders\nDynamoDB)]
OS -->|2 - Publish OrderCreated| EB[EventBridge]
OS -->|3 - Start SAGA execution| SF{Step Functions\nSAGA Orchestrator}
SF -->|reserve| IS[Inventory Service λ]
IS <-->|atomic conditional decrement| IDB[(inventory\nDynamoDB)]
SF -->|charge| PS[Payment Service λ]
PS -. circuit breaker .-> PP([External Payment\nProvider])
PS --> PDB[(payments\nDynamoDB)]
SF -->|confirm| OS
OS -->|append CONFIRMED event| ODB
SF -.->|COMPENSATE: release| IS
IS -.->|restore stock| IDB
SF -.->|COMPENSATE: notify| SQ[SQS Queue]
SQ --> NS[Notification Service λ]
NS --> SNS[SNS Topic]
SQ -.->|3 failures| DLQ[Dead Letter Queues]
DLQ --> DP[DLQ Processor λ]
DP -->|structured error logs| CW[CloudWatch Logs]
IS & PS <-->|deduplication| ID[(idempotency\nDynamoDB)]
PS <-->|circuit state| CB[(circuit-breakers\nDynamoDB)]
style SF fill:#FF9900,color:#000,stroke:#E68A00
style ODB fill:#3F51B5,color:#fff,stroke:#303F9F
style IDB fill:#3F51B5,color:#fff,stroke:#303F9F
style PDB fill:#3F51B5,color:#fff,stroke:#303F9F
style ID fill:#3F51B5,color:#fff,stroke:#303F9F
style CB fill:#3F51B5,color:#fff,stroke:#303F9F
| Service | Lambda | Database | Queue |
|---|---|---|---|
| Order Service | order-handler |
orders DynamoDB (single-table) |
— |
| Inventory Service | inventory-handler |
inventory + reservations DynamoDB |
— |
| Payment Service | payment-handler |
payments DynamoDB |
— |
| Notification Service | notification-handler |
— | notification-queue SQS |
| DLQ Processor | dlq-processor |
— | All 3 Dead Letter Queues |
Long-running distributed transactions without Two-Phase Commit. Each service owns its own database. Step Functions orchestrates the happy path and compensation paths explicitly — visible as a single state machine diagram. When payment fails, inventory is automatically released via a compensating transaction.
Every Lambda handler uses a DynamoDB-backed idempotency decorator. SQS delivers messages
at-least-once — duplicates are deduplicated atomically using attribute_not_exists conditional
writes. No double-charges. No double-reservations. No distributed locks needed.
Idempotency here is not just a reliability mechanism — it is a replay-attack prevention primitive. A duplicate order request with the same idempotency key is indistinguishable from an adversarial replay. By making every handler idempotent at the database level, the system rejects replays and duplicates with identical semantics, without needing a separate security layer. The same DynamoDB conditional write that prevents double-charging also closes the replay vector.
Orders are never updated in place. Every state transition (PENDING → CONFIRMED → PAYMENT_CHARGED)
is appended as an immutable event with a timestamp. Full audit trail. Time-travel debugging.
State is derived by replaying events.
The Payment Service wraps external provider calls in a circuit breaker stored in DynamoDB. After N consecutive failures, the circuit opens — requests fast-fail in < 1ms instead of waiting for a 30-second timeout. Because state lives in DynamoDB (not in-process memory), all Lambda instances share the same circuit state across cold starts and concurrent invocations.
- Inventory reservations: Strongly consistent reads — correctness matters, no overselling.
- Order status queries: Eventually consistent reads — acceptable staleness for read-heavy paths.
- Notifications: At-least-once delivery — idempotent consumers handle duplicates safely.
AWS X-Ray traces propagate across Lambda ↔ SQS ↔ DynamoDB boundaries. Every request gets
a correlation_id that flows through all services and appears in CloudWatch Logs Insights.
This is what separates a distributed system from a CRUD app. Every failure mode is handled explicitly.
| Scenario | What Happens | Guarantee Maintained |
|---|---|---|
| Payment declined | Step Functions triggers compensation: inventory released, customer notified | No stock held without payment |
| Payment provider timeout | Circuit breaker records failure; after 5 timeouts circuit opens; orders fast-fail | No cascade timeout across Lambda fleet |
| Circuit breaker open | Returns PAYMENT_PROVIDER_UNAVAILABLE with retry_after_seconds |
No hanging requests |
| Duplicate order request | Idempotency table returns cached result; SAGA not started twice | Exactly-once order creation |
| Inventory conflict | DynamoDB ConditionalCheckFailedException; returns INSUFFICIENT_STOCK |
No overselling; atomic at DB level |
| SQS message redelivered | Notification handler checks idempotency key; skips duplicate send | No double-emails |
| Lambda crash mid-SAGA | Step Functions retries the failed step from last checkpoint | No partial execution |
| DynamoDB write fails | Raises exception; Step Functions retries with backoff | At-least-once execution, idempotency handles duplicates |
# Run unit tests covering all failure scenarios
.\run.ps1 test-unit
# Specific failure test file
python -m pytest tests/unit/test_failure_scenarios.py -vSee tests/unit/test_failure_scenarios.py for:
test_payment_failure_triggers_compensation— payment declines → inventory releasedtest_circuit_breaker_open_fast_fails— open circuit rejects calls in < 1mstest_duplicate_order_is_idempotent— same order submitted twice, SAGA runs oncetest_inventory_conflict_atomic— concurrent reservations, one fails cleanly
Every service emits structured JSON logs (see services/shared/logger.py):
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"service": "inventory_service.handler",
"message": "Inventory reserved",
"order_id": "abc-123",
"reservation_id": "res-456",
"correlation_id": "corr-789"
}Query in CloudWatch Logs Insights:
fields @timestamp, order_id, level, message
| filter service = "inventory_service.handler" and level = "ERROR"
| sort @timestamp desc
| limit 20| Metric | Alarm Threshold |
|---|---|
OrdersCreated |
— |
SAGASuccessRate |
< 95% for 5 minutes |
SAGACompensationRate |
> 10% triggers alert |
LambdaDuration p99 |
> 3000ms |
CircuitBreakerStateChange |
Any OPEN transition |
DLQMessageCount |
> 0 (no messages should reach DLQ) |
Run: python scripts/load_test.py --orders 50 --concurrency 10
See Performance Evaluation below for the full benchmark report.
| Decision | Chosen Approach | Core Reason | Trade-off Accepted |
|---|---|---|---|
| Distributed transaction | SAGA + Step Functions | Explicit, auditable compensation — critical for money flows | Central orchestrator required; choreography avoided |
| Database design | Separate DynamoDB table per service | Service ownership enforced at infra level; independent scaling | No cross-service joins; app-layer aggregation needed |
| Idempotency store | DynamoDB attribute_not_exists |
Atomic check-and-set at DB level; no extra infrastructure | +1 read per request; TTL expiry after 24h |
| Circuit breaker state | DynamoDB (not Lambda memory) | Shared across entire Lambda fleet; survives cold starts | +6ms per payment call for the GetItem |
| Local development | LocalStack on Docker | Zero AWS cost during dev and CI; same API surface | ~6× slower than real DynamoDB; not production equivalent |
| Event routing | Step Functions for SAGA + EventBridge for side effects | Orchestration where compensation matters; choreography for non-critical events | Two event systems to understand |
| Amount precision | Integer cents (no floats) | Eliminates floating-point rounding errors in financial math | Client-side conversion required for display |
| Approach | Pros | Cons | Cloud Fit |
|---|---|---|---|
| SAGA + Orchestration ✅ | Explicit flow, testable compensation, full audit trail, visible in one diagram | Central coordinator required; more moving parts | Excellent — no distributed locks, works with Lambda |
| Event Choreography | Loose coupling; services are fully independent | Implicit rollback; hard to debug across 4 services; dangerous when money is involved | Good for simple flows with no compensation |
| Two-Phase Commit (2PC) | Strong atomicity; familiar from relational DBs | Distributed locks kill throughput; coordinator crash = system stuck; AWS Lambda can't hold connections | Poor — designed for monoliths with persistent DB connections |
| Outbox Pattern | Reliable event publishing; eliminates dual-write problem | Requires CDC or polling; extra operational overhead | Complements SAGA but solves a different problem |
CloudFlow uses SAGA + Orchestration because money flows require visible, auditable, testable rollback — not implicit event chains that are difficult to trace when they fail.
2PC problems:
- All participants hold locks during the protocol → throughput collapses under load
- If the coordinator crashes between Phase 1 and Phase 2 → system stuck in limbo forever
- Requires all services to implement a
prepareprotocol — tightly coupled - AWS Lambda has no persistent connections → coordinator can't hold distributed locks
SAGA advantages:
- Each step is a local transaction — no distributed locks
- Compensation is business logic, not database magic → testable, readable
- Step Functions makes the entire saga visible in one diagram
- Partial failures have explicit, known recovery paths
Choreography: Services react to each other's events. Inventory hears OrderCreated,
publishes StockReserved. Payment hears StockReserved, publishes PaymentCharged.
Why we didn't:
- Compensation is implicit — each service must listen for failure events from every other service
- Debugging requires correlating logs across 4 services to reconstruct what happened
- Adding a new step requires modifying multiple services
- When money is involved, implicit rollback is dangerous
Orchestration (our approach):
- One state machine shows the entire flow
- Compensation paths are explicit and version-controlled
- One place to add retry policies, timeouts, and error handling
- DynamoDB
attribute_not_existsis an atomic check-and-set at the database level - Redis requires Lua scripts for atomic check-and-set (more complexity)
- DynamoDB survives Lambda cold starts — Redis connections can be stale
- No extra infrastructure: DynamoDB is already the primary datastore
- Scales infinitely with zero ops overhead
Each service owns its data. This enforces the microservices boundary at the infrastructure level:
- Services cannot bypass each other's APIs to read/write data
- Schema changes in one service don't break others
- Each table can be tuned independently (read/write capacity, GSIs)
- This is why Amazon itself moved from a shared Oracle database to service-owned data
CloudFlow prioritizes availability over strict consistency for reads, with strong consistency only where correctness is critical:
Inventory writes: Strong consistency (ConditionalCheckFailed catches conflicts)
Inventory reads: Strong consistency (prevents overselling)
Order reads: Eventual consistency (acceptable: user polling for status)
Notifications: At-least-once (idempotent consumer handles duplicates)
This matches how large-scale systems (Amazon, Netflix) handle the CAP tradeoff: strong consistency only where business rules require it, eventual consistency everywhere else.
| Threat | Where It Bites | Mitigation | Tested |
|---|---|---|---|
| Duplicate requests | SQS delivers at-least-once; client retries | DynamoDB idempotency table — attribute_not_exists atomic check-and-set per order_id |
test_duplicate_reservation_is_idempotent |
| Slow / failed payment API | One hung provider call blocks the thread; cascade to all orders | Circuit breaker (state in DynamoDB, survives Lambda cold starts) — opens after 3 failures, fast-fails for 60s | test_circuit_breaker_open_fast_fails |
| Stock oversell | Two threads reserve the last item simultaneously | DynamoDB conditional write — quantity >= requested enforced atomically; one wins, one gets ConditionalCheckFailedException |
test_reserve_fails_on_insufficient_stock |
| Partial saga failure | Lambda crashes between reserve and charge | Step Functions checkpoints each step — retries from last successful state, never from the beginning | Compensation path integration tests |
| Forged / tampered order | Client sends manipulated total_cents or negative quantity |
Pydantic validation rejects at API boundary; amounts recalculated server-side, never trusted from input | Input validation schema |
| Cross-service data access | Inventory Lambda reads the payments table | IAM least-privilege — each Lambda role grants dynamodb:* only on its own table ARN |
CDK IAM policy per stack |
The full STRIDE analysis (Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation) is at the bottom of this file.
All requests are validated via Pydantic schemas before any database operation:
class CreateOrderRequest(BaseModel):
customer_id: str = Field(min_length=1)
items: list[OrderItem] = Field(min_length=1)
# Negative quantities, zero-price items, empty orders → rejected at validationEach Lambda has its own IAM role with only the permissions it needs:
order-handler:dynamodb:PutItemonorderstable,events:PutEvents,states:StartExecutioninventory-handler:dynamodb:UpdateItemoninventory,dynamodb:PutItemonreservationspayment-handler:dynamodb:PutItemonpayments,dynamodb:GetItemonpayments- No Lambda has admin permissions or cross-service database access
- Payment provider credentials: fetched from AWS Secrets Manager at runtime (not env vars)
- DynamoDB table names: passed via environment variables set by CDK at deploy time
- No hardcoded credentials anywhere — verified by pre-commit scan
- Request validation: API Gateway rejects malformed JSON before Lambda is invoked
- Rate limiting: Usage plans limit requests per second per API key
- CORS: Configured to allow only trusted origins
- All financial amounts stored as integer cents (no floating point)
- DynamoDB optimistic locking with version counters prevents lost updates
- Idempotency keys expire after 24 hours (TTL on idempotency table)
cloudflow/
├── infrastructure/ # AWS CDK (Python) — all cloud resources as code
│ └── cloudflow/
│ ├── api_stack.py # API Gateway + Order Service Lambda
│ ├── database_stack.py # DynamoDB tables (5 tables)
│ ├── messaging_stack.py # SQS queues + EventBridge bus
│ ├── saga_stack.py # Step Functions state machine (SAGA)
│ └── monitoring_stack.py # CloudWatch dashboards + alarms
├── services/
│ ├── shared/ # Cross-cutting concerns
│ │ ├── events.py # Pydantic event schemas
│ │ ├── idempotency.py # @idempotent decorator (DynamoDB-backed)
│ │ ├── circuit_breaker.py # Circuit breaker (DynamoDB-backed)
│ │ ├── dynamodb.py # DynamoDB helpers
│ │ └── logger.py # Structured JSON logging
│ ├── order_service/ # Create orders, manage event sourcing log
│ ├── inventory_service/ # Atomic reserve / release (SAGA steps)
│ ├── payment_service/ # Charge / refund with circuit breaker
│ ├── notification_service/ # SQS-triggered email/SMS notifications
│ └── dlq_processor/ # Structured logging for failed DLQ messages
├── tests/
│ ├── unit/ # moto mocks — pure logic, no Docker
│ │ ├── test_circuit_breaker.py
│ │ ├── test_idempotency.py
│ │ ├── test_events.py
│ │ ├── test_failure_scenarios.py # failure mode coverage
│ │ ├── test_dlq_processor.py # DLQ handler resilience
│ │ ├── test_pagination.py # cursor-based event pagination
│ │ └── test_order_service.py # order handler routes + validation
│ └── integration/ # LocalStack — real DynamoDB via Docker
│ └── test_saga_flow.py
├── scripts/
│ ├── load_test.py # Concurrent load test with latency metrics
│ └── seed_data.py # Seed demo data into LocalStack
├── docker-compose.yml # LocalStack for local development
├── run.ps1 # Windows PowerShell commands
├── Makefile # Linux/macOS commands
├── TECHNICAL.md # Deep-dive: architecture decisions & analysis
└── .github/workflows/ # CI/CD: test on PR, deploy on merge to main
- Python 3.11+
- Docker Desktop (for LocalStack integration tests)
- Node.js 18+ (for AWS CDK)
# Clone
git clone https://github.com/UTKARSH698/CloudFlow
cd CloudFlow
# Linux / macOS
make setup
make test-unit # 30+ unit tests, no Docker needed (~6s)
make local-up # Start LocalStack
make test # All 50+ tests
make local-down
# Windows (PowerShell)
.\run.ps1 setup
.\run.ps1 test-unit
.\run.ps1 local-up
.\run.ps1 test
.\run.ps1 local-down# Configure AWS credentials first (aws configure)
make cdk-bootstrap # First time only
make deploy # Deploys all 5 stacks
make demo # Submit a sample order end-to-endAll endpoints accept and return application/json. Deployed via API Gateway; test locally with .\run.ps1 local-up + .\run.ps1 demo.
No API key required. Used by load balancers and deployment verification.
Response — 200 OK:
{ "status": "healthy", "service": "order-service" }Request:
{
"customer_id": "cust-001",
"items": [
{ "product_id": "KEYBD-01", "quantity": 1, "unit_price_cents": 8999 }
]
}Headers: Idempotency-Key: <uuid> (required), x-api-key: <api-key> (required)
Response — 202 Accepted (SAGA started):
{ "order_id": "ord-abc-123", "status": "PENDING", "correlation_id": "corr-xyz-789" }SAGA runs asynchronously. Poll
GET /orders/{order_id}for final status. Thecorrelation_idtraces the request across all services — use it for debugging.
Query parameters: ?limit=50 (max 100), ?cursor=<token> (from next_cursor in previous response)
Happy path — CONFIRMED:
{
"order_id": "ord-abc-123",
"status": "CONFIRMED",
"customer_id": "cust-001",
"total_cents": 8999,
"events": [
{ "type": "ORDER_CREATED", "timestamp": "2024-01-15T10:30:00.012Z" },
{ "type": "STOCK_RESERVED", "timestamp": "2024-01-15T10:30:00.047Z" },
{ "type": "PAYMENT_CHARGED", "timestamp": "2024-01-15T10:30:00.145Z" },
{ "type": "ORDER_CONFIRMED", "timestamp": "2024-01-15T10:30:00.167Z" }
]
}Payment declined — COMPENSATED (SAGA rolled back):
{
"order_id": "ord-def-456",
"status": "COMPENSATED",
"failure_reason": "PAYMENT_DECLINED",
"events": [
{ "type": "ORDER_CREATED", "timestamp": "2024-01-15T10:30:00.012Z" },
{ "type": "STOCK_RESERVED", "timestamp": "2024-01-15T10:30:00.047Z" },
{ "type": "PAYMENT_FAILED", "timestamp": "2024-01-15T10:30:00.063Z" },
{ "type": "STOCK_RELEASED", "timestamp": "2024-01-15T10:30:00.078Z" },
{ "type": "ORDER_COMPENSATED", "timestamp": "2024-01-15T10:30:00.110Z" }
]
}Circuit breaker open — 503:
{
"error": "PAYMENT_PROVIDER_UNAVAILABLE",
"retry_after_seconds": 57,
"message": "Circuit breaker OPEN — fast-failing to prevent cascade timeout"
}Insufficient stock — 409:
{
"error": "INSUFFICIENT_STOCK",
"product_id": "KEYBD-01",
"requested": 2,
"available": 1
}Tool:
scripts/load_test.py· Target: LocalStack DynamoDB · Machine: Windows 11 AMD64
Metric Result Visual (each █ ≈ 100ms or 100 req/min)
─────────────────────────────────────────────────────────────────────
Throughput 1,100/min ███████████
P50 latency 47ms ████▌
P95 latency 89ms ████████▉
P99 latency 120ms ████████████
Test suite 50+ tests ✅ all passing, <30s
─────────────────────────────────────────────────────────────────────
LocalStack is ~6× slower than real AWS DynamoDB (8ms vs 50ms writes).
Correctness properties — idempotency, compensation, atomic writes — are identical.
Each ● is one reservation request. Horizontal markers show percentile thresholds.
Latency (ms)
140 │ ●
120 ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌ P99 = 120ms
100 │ ● ●
89 ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ P95 = 89ms
70 │ ● ● ● ●
47 ├━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ P50 = 47ms
35 │ ●●●●● ●● ●●● ●● ●● ●● ●●● ●●● ● ●●
22 │ ● ●● ●● ● ● ●● ●● ● ●●● ● ●● ●●
0 └──────────────────────────────────────────>
0 5 10 15 20 25 30 35 40 45 (req #)
88% of requests complete in 20–60ms. Outliers beyond P95 are rare DynamoDB write contention spikes — not errors.
Symbol width is proportional to latency (1 symbol ≈ 20ms). Notice P99 fans out much faster than P50:
Threads P50 (median) P95 (tail) P99 (worst)
──────────────────────────────────────────────────────────────────────
5 th ·· 37ms ○○○ 68ms ◆◆◆◆◆ 95ms
10 th ··· 47ms ○○○○ 89ms ◆◆◆◆◆◆ 120ms
20 th ··· 54ms ○○○○○ 98ms ◆◆◆◆◆◆◆ 140ms
50 th ···· 61ms ○○○○○○ 110ms ◆◆◆◆◆◆◆◆ 160ms
──────────────────────────────────────────────────────────────────────
Growth +65% +62% +69%
P50 stays flat because it's bounded by DynamoDB write time. P99 grows because rare contention retries accumulate under higher concurrency.
Req/min
1400 │ ●───────● ← plateau ~1,300 req/min
1200 │ ●─────────╯ (LocalStack Docker bound)
1000 │ ●───────╯
800 │
600 │
400 │
200 │
0 └──────────────────────────────>
5 th 10 th 20 th 50 th
940 1,100 1,300 1,280 req/min
Throughput plateaus at ~20 threads — LocalStack Docker container CPU bound, not a DynamoDB limit. On real AWS, DynamoDB partitions scale horizontally; throughput grows linearly with concurrency.
Latency (ms) │ Distribution
──────────────┼───────────────────────────────────────────────
0 – 20ms │ ▏ 2% (1 req)
20 – 40ms │ ████████████████████ 40% (20 req) ← bulk
40 – 60ms │ ████████████████████████ 48% (24 req)
60 – 80ms │ ████ 8% (4 req)
80 – 100ms │ ▏ 1% (1 req) ← P95
100ms+ │ ▏ 1% (1 req) ← P99
──────────────┴───────────────────────────────────────────────
Avg = 45ms · P50 = 47ms · P95 = 89ms · P99 = 120ms
DynamoDB Write Latency (ms) Throughput Ceiling (req/min)
────────────────────────────────────── ─────────────────────────────────────
LocalStack ████████████████████ 50ms LocalStack ███████ 1,300
Real AWS ████ 8ms Real AWS ████████████████ 100,000+
────────────────────────────────────── ─────────────────────────────────────
~6x slower clock ~75x lower ceiling
Correctness is identical — atomic writes, idempotency, compensation all behave
the same way. Only speed differs. LocalStack is a faithful emulator, not a toy.
| Bottleneck | Root Cause | Production Mitigation |
|---|---|---|
| LocalStack throughput cap ~1,300 req/min | Single Docker container, no parallel I/O | Not present on real AWS — DynamoDB scales horizontally |
| DynamoDB write: ~45ms | LocalStack adds ~35ms network emulation overhead | Real AWS: 2-8ms single-digit SLA |
| Circuit breaker read: +6ms | Extra GetItem per payment call for circuit state |
Acceptable — prevents 30s cascade timeouts |
| Lambda cold start: ~200ms (first call) | Python + boto3 import time | Provisioned concurrency → 0ms in production |
| UUID partition keys | N/A — well distributed, zero hot partitions observed | Already mitigated by design |
Key insight: Correctness properties are identical between LocalStack and real AWS — atomic conditional writes, idempotency, and compensation work the same way. Only latency and throughput numbers differ.
CloudFlow — Operations Dashboard [Last 15 min] [Auto-refresh: 1min]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Orders Created / min SAGA Success Rate (%)
┌────────────────────────────┐ ┌────────────────────────────┐
│ 120 │ ╭─╮ │ │ 100 │ ───────────────── │ 🟢 OK
│ 90 │ ╭───╯ ╰──╮ │ │ 95 │ ╲ │
│ 60 │ ╭──╯ ╰─╮ │ │ 90 │ ╲── │ ← alarm at 95%
│ 30 │──╯ ╰── │ │ 85 │ │
│ └────────────────── │ │ └────────────────── │
│ 0 5 10 15 min │ │ 0 5 10 15 min │
└────────────────────────────┘ └────────────────────────────┘
P99 Lambda Duration (ms) SAGA Compensations / min
┌────────────────────────────┐ ┌────────────────────────────┐
│ 500 │ │ │ 10 │ │ 🟢 OK
│ 400 │ ─────────────╮ │ │ 8 │ │
│ 300 │ ╰─╮ │ │ 6 │ │ ← alarm at 10%
│ 200 │ ╰──── │ │ 4 │ │
│ 100 │ │ │ 2 │ ▄ │
│ └────────────────── │ │ 0 │──────────█────────── │
│ 0 5 10 15 min │ │ 0 5 10 15 min │
└────────────────────────────┘ └────────────────────────────┘
Circuit Breaker State DLQ Message Count
┌────────────────────────────┐ ┌────────────────────────────┐
│ CLOSED ━━━━━━━━━━━━━━━━ │ │ 0 │ ─────────────────── │ 🟢 OK
│ OPEN │ │ │ (alarm if > 0) │
│ HALF [healthy 14m 32s] │ │ └────────────────── │
└────────────────────────────┘ └────────────────────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────┐
┌──────────│ API Gateway │
│ └─────────────┘
▼ Avg: 5ms OK: 100%
┌─────────────┐
│order-service│──────────────────┐
└─────────────┘ │
│ Avg: 28ms OK: 100% ▼
│ ┌─────────────────┐
├──▶ DynamoDB (orders) │ Step Functions │
└──▶ EventBridge └────────┬────────┘
│
┌────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ inventory- │ │ payment- │ │notification- │
│ service │ │ service │ │ service │
└──────────────┘ └──────────────┘ └──────────────┘
Avg: 35ms Avg: 98ms Avg: 12ms
OK: 100% OK: 97% OK: 100%
│ │ └── 3% circuit breaker
▼ ▼
DynamoDB Payment Provider
(inventory) (external)
Trace: 1-65a3f2b1-abc123 ✅ SUCCESS Total: 347ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Segment Start Duration Timeline (347ms)
───────────────────── ───── ──────── ─────────────────
API Gateway 0ms 12ms ████
order-handler 12ms 28ms ████████
DynamoDB.PutItem 15ms 8ms ███
EventBridge.Put 25ms 6ms ██
StepFunctions.Start 34ms 5ms ██
inventory-handler 40ms 35ms ██████████
DynamoDB.UpdateItem 44ms 9ms ████
payment-handler 78ms 98ms ██████████████████████████
DynamoDB.GetItem 80ms 5ms ██
PaymentProvider.POST 86ms 72ms ████████████████████
DynamoDB.PutItem 159ms 6ms ██
order-handler 179ms 22ms ██████
DynamoDB.PutItem 181ms 8ms ████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Trace: 1-65a3f2b1-def456 ❌ COMPENSATED Total: 285ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Segment Start Duration Status
───────────────────── ───── ──────── ─────────────────
order-handler 0ms 22ms ✅ Created PENDING
inventory-handler 25ms 35ms ✅ Stock reserved
payment-handler 63ms 12ms ❌ PAYMENT_DECLINED
[circuit breaker] 65ms 5ms CB: CLOSED → recorded failure
Step Functions 76ms 0ms → Triggers compensation
inventory-handler 78ms 30ms ↩ Stock released (compensation)
DynamoDB.UpdateItem 80ms 8ms ADD quantity :3
order-handler 110ms 18ms ❌ Status → FAILED
notification-handler 131ms 8ms 📧 Queued: ORDER_FAILED email
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: Stock restored ✅ | Customer notified ✅ | No charge made ✅
fields @timestamp, level, service, message, order_id, reservation_id
| filter correlation_id = "corr-abc-123"
| sort @timestamp asc
@timestamp level service message
────────────────────── ────── ───────────────────────── ─────────────────────────
2024-01-15 10:30:00.012 INFO order_service.handler Order created
2024-01-15 10:30:00.040 INFO order_service.handler SAGA execution started
2024-01-15 10:30:00.047 INFO inventory_service.handler Inventory reserved ← res-456
2024-01-15 10:30:00.145 INFO payment_service.handler Payment charged ← pay-789
2024-01-15 10:30:00.167 INFO order_service.handler Order confirmed
2024-01-15 10:30:00.171 INFO notification_service Notification queued| Criterion | SAGA + Step Functions ✅ | SAGA + Choreography | Two-Phase Commit | Outbox Pattern |
|---|---|---|---|---|
| Failure visibility | Explicit — one diagram | Implicit — N event streams | Coordinator log | Moderate |
| Compensation clarity | Defined in state machine | Distributed across services | Automatic rollback | Per-service logic |
| Service coupling | Low (orchestrator only) | Very low | Very high | Low |
| Debugging ease | One execution history | Correlate across N logs | Single DB log | Moderate |
| Cloud-native fit | Excellent | Excellent | Poor | Good |
| Financial safety | High — explicit paths | Medium — implicit rollback | High — but blocking | High |
| Operational overhead | Low (managed service) | Very low | High | Medium |
| CloudFlow choice | ✅ Used | EventBridge for non-critical | Rejected | Future improvement |
Step Functions Orchestration → when you need explicit compensation + visibility
Best for: payment, inventory, anything involving money
CloudFlow uses: SAGA happy path + compensation paths
Event Choreography → when services are independent + compensation isn't needed
Best for: analytics events, audit logs, cache invalidation
CloudFlow uses: EventBridge OrderCreated for downstream consumers
Two-Phase Commit → when you have a single shared database + ACID required
Best for: monolith with PostgreSQL, financial ledger within one service
CloudFlow avoids: no shared DB, Lambda is stateless
Outbox Pattern → when you need guaranteed event delivery with DB atomicity
Best for: preventing lost events between DB write and message publish
CloudFlow future: would strengthen order creation + EventBridge reliability
What .\run.ps1 deploy produces on a real AWS account:
$ cdk deploy --all
CloudFlow: deploying... [5/5 stacks]
✅ CloudFlowDatabaseStack (cloudflow-database)
Outputs:
CloudFlowDatabaseStack.OrdersTableArn =
arn:aws:dynamodb:us-east-1:123456789:table/cloudflow-orders
CloudFlowDatabaseStack.InventoryTableArn =
arn:aws:dynamodb:us-east-1:123456789:table/cloudflow-inventory
✅ CloudFlowMessagingStack (cloudflow-messaging)
Outputs:
CloudFlowMessagingStack.NotificationQueueUrl =
https://sqs.us-east-1.amazonaws.com/123456789/cloudflow-notifications
CloudFlowMessagingStack.EventBusArn =
arn:aws:events:us-east-1:123456789:event-bus/cloudflow-events
✅ CloudFlowSagaStack (cloudflow-saga)
Outputs:
CloudFlowSagaStack.SagaStateMachineArn =
arn:aws:states:us-east-1:123456789:stateMachine:cloudflow-saga
CloudFlowSagaStack.SagaStateMachineUrl =
https://console.aws.amazon.com/states/home#/statemachines/view/cloudflow-saga
✅ CloudFlowApiStack (cloudflow-api)
Outputs:
CloudFlowApiStack.ApiUrl = https://abc123.execute-api.us-east-1.amazonaws.com/prod
CloudFlowApiStack.OrderServiceFunctionArn =
arn:aws:lambda:us-east-1:123456789:function:cloudflow-order-handler
✅ CloudFlowMonitoringStack (cloudflow-monitoring)
Outputs:
CloudFlowMonitoringStack.DashboardUrl =
https://console.aws.amazon.com/cloudwatch/home#dashboards:name=CloudFlow
─────────────────────────────────────────────────
✅ Deployment complete. 5 stacks, 0 errors.
Total deploy time: 3m 42s
Send a test order:
curl -X POST https://abc123.execute-api.us-east-1.amazonaws.com/prod/orders \
-H "Content-Type: application/json" \
-d '{"customer_id":"cust-1","items":[{"product_id":"p1","quantity":1,"unit_price_cents":999}]}'
Response: {"order_id": "ord-abc-123", "status": "PENDING"}
─────────────────────────────────────────────────
| Threat | Attack Vector | Impact | Mitigation in CloudFlow |
|---|---|---|---|
| Spoofing | Forged customer_id in order request |
Orders placed as another customer | API Gateway JWT authorizer (production) — validates identity before Lambda |
| Tampering | Modified total_cents in Step Functions input |
Undercharge for order | DynamoDB optimistic locking; Step Functions input is signed; amounts recalculated server-side |
| Repudiation | Customer denies placing order | Chargeback fraud | Event sourcing — immutable event log per order with timestamps, idempotency keys, correlation IDs |
| Info Disclosure | CloudWatch logs expose PII | GDPR violation | Logs contain order_id and customer_id only — no card numbers, addresses, or names |
| Denial of Service | Flood /orders endpoint |
Lambda concurrency exhausted | API Gateway rate limiting (requests/sec per API key); circuit breaker prevents payment provider cascade |
| Elevation of Privilege | Lambda reads another service's DynamoDB table | Data breach | IAM least-privilege: each Lambda role has dynamodb:* only on its own table ARN |
{
"Effect": "Allow",
"Action": [
"dynamodb:UpdateItem",
"dynamodb:GetItem"
],
"Resource": [
"arn:aws:dynamodb:us-east-1:*:table/cloudflow-inventory",
"arn:aws:dynamodb:us-east-1:*:table/cloudflow-reservations"
]
}No Lambda can read another service's table. No Lambda has dynamodb:* on *. No Lambda has iam:*.
| Data | Classification | Storage | Retention |
|---|---|---|---|
order_id, customer_id |
Internal | DynamoDB (encrypted at rest) | 7 years (audit) |
total_cents, items |
Internal | DynamoDB (encrypted at rest) | 7 years (audit) |
| Payment card numbers | Never stored | Not in CloudFlow — passed to provider only | N/A |
provider_charge_id |
Internal | DynamoDB payments table | 7 years |
| CloudWatch logs | Internal | CloudWatch (encrypted) | 90 days |
Card numbers never touch CloudFlow's infrastructure. The payment provider (Stripe/Braintree) holds PCI-DSS scope. CloudFlow stores only the opaque
charge_idreturned by the provider.
These are documented design boundaries — honest accounts of what the system does not yet handle.
| Limitation | Root Cause | What Would Be Needed |
|---|---|---|
| Compensation-state consistency is not formally proved — compensating transactions are designed so partial rollback cannot produce an exploitable inconsistent state, but this property is verified by test coverage, not by proof | No formal model of the compensation state space; correctness argued by case analysis over the failure modes I could construct | A formal specification of the SAGA state machine (e.g. in TLA+) with a proof that every reachable compensation path terminates in a consistent state |
Circuit breaker state is eventually consistent across Lambda instances — DynamoDB propagation delay means two concurrent Lambda invocations can both read CLOSED and both attempt the external call before either writes OPEN |
DynamoDB read-after-write consistency applies per-item per-instance, but concurrent reads before either write races | Compare-and-swap on the circuit state item using DynamoDB conditional writes on a version counter |
| No cross-service idempotency — idempotency is enforced per-service (Order, Inventory, Payment each have their own idempotency table), but a request that passes Order idempotency and fails mid-SAGA may replay Inventory with a different key | Idempotency keys are scoped to each service handler independently | A saga-level idempotency token propagated through Step Functions input that each downstream service checks before executing |
| Step Functions execution history is lost after 90 days — audit trail for long-running investigations relies on CloudWatch logs, not Step Functions | AWS hard limit on execution history retention | Export execution history to S3 on completion; query via Athena for investigations beyond 90 days |
| LocalStack behaviour diverges from real AWS under concurrency — load test results (1,100+ req/min, P99 < 120ms) were measured on LocalStack; real DynamoDB provisioned throughput and Lambda cold start distributions will differ | LocalStack is an approximation; network latency and AWS throttling are not modelled | Re-run load tests on a real AWS account with provisioned DynamoDB capacity and compare P99 tail latencies |
The core security-relevant open question in CloudFlow is whether the SAGA compensation design is exhaustively safe — not just safe for the failure modes I tested.
Specifically: can a partial compensation path (where some compensating transactions succeed and others fail mid-compensation) produce a state that is internally consistent from each service's perspective but globally inconsistent — and exploitable? My design argues no, because each compensating transaction is idempotent and Step Functions retries failed compensation steps. But this argument relies on the assumption that Step Functions will eventually deliver every compensation step successfully, which is not guaranteed under all failure modes (e.g. Step Functions service disruption mid-compensation).
This is the same class of problem as the CSPM remediation completeness question — the gap between per-component correctness and system-level safety guarantees. Both motivate the same research direction: formal verification of distributed security policies.
- SAGA Pattern — Microservices.io
- AWS Step Functions SAGA example
- Designing Data-Intensive Applications — Kleppmann (Chapter 9)
- AWS Well-Architected Framework — Reliability Pillar
- DynamoDB Single-Table Design — Rick Houlihan
- Circuit Breaker Pattern — Martin Fowler
MIT © Utkarsh Batham
Utkarsh Batham — B.Tech CSE · Cloud Technology & Information Security



