Skip to content

UTKARSH698/CloudFlow

Repository files navigation

CloudFlow — Distributed SAGA Order Processing on AWS

Serverless order processing system on AWS demonstrating production-grade distributed systems engineering: SAGA pattern, idempotency, circuit breakers, event sourcing, and measured performance under concurrency.

Python AWS Lambda Step Functions AWS CDK GitHub Actions Tests Serverless License


Key Takeaways

SAGA Orchestration Step Functions coordinates reserve → charge → confirm with automatic compensation on failure — no 2-phase commit, no distributed locks
Failure Recovery Every failure mode is handled and tested: payment failure triggers inventory rollback, duplicate requests are deduplicated, circuit breakers prevent cascade failures
Measured Performance Load-tested at 5–50 concurrent threads: 1,100+ req/min, P50 = 47ms, P99 = 120ms on LocalStack. Bottlenecks identified and explained
Test Coverage 50+ tests across unit (moto mocks) and integration (LocalStack) layers — including failure scenarios, compensation paths, pagination, DLQ processing, and idempotency under concurrency
Infrastructure as Code 5 AWS CDK stacks: DynamoDB tables, SQS queues, Step Functions, API Gateway, CloudWatch dashboards — fully reproducible in one deploy
Observability Structured JSON logs (CloudWatch Logs Insights ready), X-Ray distributed tracing, correlation IDs across every service boundary

Quick Navigation

Section What You'll Find
Live Demo 4 animated GIF demos — all distributed systems scenarios in action
Architecture Mermaid diagram — full system at a glance
How the SAGA Works SAGA, idempotency, circuit breaker, event sourcing
Failure Scenarios Every failure mode + how the system recovers
Observability JSON logs, CloudWatch metrics, X-Ray traces, dashboard mockups
Performance Report Benchmark data: throughput, latency, bottleneck analysis
Design Tradeoffs Why SAGA over 2PC, why DynamoDB over Postgres
Security Input validation, IAM least-privilege, no secrets in code
How To Run Locally Setup + test commands (Windows & Linux)
API Reference Request/response examples for all 4 outcome types
Known Limitations Open consistency question + honest design boundaries
TECHNICAL.md 8-section deep-dive for full architecture analysis
Author Background + links

Live Demo

Interactive Streamlit dashboard running against LocalStack — all 4 distributed systems scenarios in action. Run locally: .\run.ps1 local-up then .\run.ps1 dashboard

Scenario 1 — Happy Path (SAGA completes end-to-end)

Happy Path

Full SAGA trace: Create Order → Reserve Inventory → Charge Payment → Confirm. Each step timed independently. Order status shows CONFIRMED, inventory count decrements atomically.

Scenario 2 — Payment Failure + Automatic Compensation

Payment Compensation

Payment deliberately declined. Step Functions triggers the compensation path: inventory is released back, order marked COMPENSATED. No stock held without a successful charge.

Scenario 3 — Circuit Breaker Fast-Fail

Circuit Breaker Open

Circuit breaker manually tripped to OPEN. Subsequent orders fail in < 1ms without touching the payment provider — preventing a cascade timeout across the entire Lambda fleet.

Scenario 4 — Oversell Protection

Oversell Protection

Order quantity exceeds available stock. DynamoDB ConditionalCheckFailedException fires atomically — no overselling even under concurrent requests. Returns INSUFFICIENT_STOCK immediately.


Architecture

flowchart TD
    Client([Client]) -->|POST /orders| AG[API Gateway]
    AG --> OS[Order Service λ]

    OS -->|1 - Create order PENDING| ODB[(orders\nDynamoDB)]
    OS -->|2 - Publish OrderCreated| EB[EventBridge]
    OS -->|3 - Start SAGA execution| SF{Step Functions\nSAGA Orchestrator}

    SF -->|reserve| IS[Inventory Service λ]
    IS <-->|atomic conditional decrement| IDB[(inventory\nDynamoDB)]

    SF -->|charge| PS[Payment Service λ]
    PS -. circuit breaker .-> PP([External Payment\nProvider])
    PS --> PDB[(payments\nDynamoDB)]

    SF -->|confirm| OS
    OS -->|append CONFIRMED event| ODB

    SF -.->|COMPENSATE: release| IS
    IS -.->|restore stock| IDB
    SF -.->|COMPENSATE: notify| SQ[SQS Queue]
    SQ --> NS[Notification Service λ]
    NS --> SNS[SNS Topic]

    SQ -.->|3 failures| DLQ[Dead Letter Queues]
    DLQ --> DP[DLQ Processor λ]
    DP -->|structured error logs| CW[CloudWatch Logs]

    IS & PS <-->|deduplication| ID[(idempotency\nDynamoDB)]
    PS <-->|circuit state| CB[(circuit-breakers\nDynamoDB)]

    style SF fill:#FF9900,color:#000,stroke:#E68A00
    style ODB fill:#3F51B5,color:#fff,stroke:#303F9F
    style IDB fill:#3F51B5,color:#fff,stroke:#303F9F
    style PDB fill:#3F51B5,color:#fff,stroke:#303F9F
    style ID fill:#3F51B5,color:#fff,stroke:#303F9F
    style CB fill:#3F51B5,color:#fff,stroke:#303F9F
Loading

Services & AWS Resources

Service Lambda Database Queue
Order Service order-handler orders DynamoDB (single-table)
Inventory Service inventory-handler inventory + reservations DynamoDB
Payment Service payment-handler payments DynamoDB
Notification Service notification-handler notification-queue SQS
DLQ Processor dlq-processor All 3 Dead Letter Queues

Distributed Systems Concepts

1. SAGA Pattern (Orchestration)

Long-running distributed transactions without Two-Phase Commit. Each service owns its own database. Step Functions orchestrates the happy path and compensation paths explicitly — visible as a single state machine diagram. When payment fails, inventory is automatically released via a compensating transaction.

2. Idempotency as a Security Primitive

Every Lambda handler uses a DynamoDB-backed idempotency decorator. SQS delivers messages at-least-once — duplicates are deduplicated atomically using attribute_not_exists conditional writes. No double-charges. No double-reservations. No distributed locks needed.

Idempotency here is not just a reliability mechanism — it is a replay-attack prevention primitive. A duplicate order request with the same idempotency key is indistinguishable from an adversarial replay. By making every handler idempotent at the database level, the system rejects replays and duplicates with identical semantics, without needing a separate security layer. The same DynamoDB conditional write that prevents double-charging also closes the replay vector.

3. Event Sourcing

Orders are never updated in place. Every state transition (PENDING → CONFIRMED → PAYMENT_CHARGED) is appended as an immutable event with a timestamp. Full audit trail. Time-travel debugging. State is derived by replaying events.

4. Circuit Breaker (DynamoDB-backed)

The Payment Service wraps external provider calls in a circuit breaker stored in DynamoDB. After N consecutive failures, the circuit opens — requests fast-fail in < 1ms instead of waiting for a 30-second timeout. Because state lives in DynamoDB (not in-process memory), all Lambda instances share the same circuit state across cold starts and concurrent invocations.

5. CAP Theorem Tradeoffs

  • Inventory reservations: Strongly consistent reads — correctness matters, no overselling.
  • Order status queries: Eventually consistent reads — acceptable staleness for read-heavy paths.
  • Notifications: At-least-once delivery — idempotent consumers handle duplicates safely.

6. Distributed Tracing

AWS X-Ray traces propagate across Lambda ↔ SQS ↔ DynamoDB boundaries. Every request gets a correlation_id that flows through all services and appears in CloudWatch Logs Insights.


Failure Scenarios & Recovery

This is what separates a distributed system from a CRUD app. Every failure mode is handled explicitly.

Scenario What Happens Guarantee Maintained
Payment declined Step Functions triggers compensation: inventory released, customer notified No stock held without payment
Payment provider timeout Circuit breaker records failure; after 5 timeouts circuit opens; orders fast-fail No cascade timeout across Lambda fleet
Circuit breaker open Returns PAYMENT_PROVIDER_UNAVAILABLE with retry_after_seconds No hanging requests
Duplicate order request Idempotency table returns cached result; SAGA not started twice Exactly-once order creation
Inventory conflict DynamoDB ConditionalCheckFailedException; returns INSUFFICIENT_STOCK No overselling; atomic at DB level
SQS message redelivered Notification handler checks idempotency key; skips duplicate send No double-emails
Lambda crash mid-SAGA Step Functions retries the failed step from last checkpoint No partial execution
DynamoDB write fails Raises exception; Step Functions retries with backoff At-least-once execution, idempotency handles duplicates

Failure Scenario Tests

# Run unit tests covering all failure scenarios
.\run.ps1 test-unit

# Specific failure test file
python -m pytest tests/unit/test_failure_scenarios.py -v

See tests/unit/test_failure_scenarios.py for:

  • test_payment_failure_triggers_compensation — payment declines → inventory released
  • test_circuit_breaker_open_fast_fails — open circuit rejects calls in < 1ms
  • test_duplicate_order_is_idempotent — same order submitted twice, SAGA runs once
  • test_inventory_conflict_atomic — concurrent reservations, one fails cleanly

Observability

Structured JSON Logging

Every service emits structured JSON logs (see services/shared/logger.py):

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "service": "inventory_service.handler",
  "message": "Inventory reserved",
  "order_id": "abc-123",
  "reservation_id": "res-456",
  "correlation_id": "corr-789"
}

Query in CloudWatch Logs Insights:

fields @timestamp, order_id, level, message
| filter service = "inventory_service.handler" and level = "ERROR"
| sort @timestamp desc
| limit 20

CloudWatch Metrics (via Monitoring Stack)

Metric Alarm Threshold
OrdersCreated
SAGASuccessRate < 95% for 5 minutes
SAGACompensationRate > 10% triggers alert
LambdaDuration p99 > 3000ms
CircuitBreakerStateChange Any OPEN transition
DLQMessageCount > 0 (no messages should reach DLQ)

Load Test Results

Run: python scripts/load_test.py --orders 50 --concurrency 10

See Performance Evaluation below for the full benchmark report.


Key Design Decisions & Tradeoffs

Decision Summary

Decision Chosen Approach Core Reason Trade-off Accepted
Distributed transaction SAGA + Step Functions Explicit, auditable compensation — critical for money flows Central orchestrator required; choreography avoided
Database design Separate DynamoDB table per service Service ownership enforced at infra level; independent scaling No cross-service joins; app-layer aggregation needed
Idempotency store DynamoDB attribute_not_exists Atomic check-and-set at DB level; no extra infrastructure +1 read per request; TTL expiry after 24h
Circuit breaker state DynamoDB (not Lambda memory) Shared across entire Lambda fleet; survives cold starts +6ms per payment call for the GetItem
Local development LocalStack on Docker Zero AWS cost during dev and CI; same API surface ~6× slower than real DynamoDB; not production equivalent
Event routing Step Functions for SAGA + EventBridge for side effects Orchestration where compensation matters; choreography for non-critical events Two event systems to understand
Amount precision Integer cents (no floats) Eliminates floating-point rounding errors in financial math Client-side conversion required for display

Distributed Transaction Approach Comparison

Approach Pros Cons Cloud Fit
SAGA + Orchestration Explicit flow, testable compensation, full audit trail, visible in one diagram Central coordinator required; more moving parts Excellent — no distributed locks, works with Lambda
Event Choreography Loose coupling; services are fully independent Implicit rollback; hard to debug across 4 services; dangerous when money is involved Good for simple flows with no compensation
Two-Phase Commit (2PC) Strong atomicity; familiar from relational DBs Distributed locks kill throughput; coordinator crash = system stuck; AWS Lambda can't hold connections Poor — designed for monoliths with persistent DB connections
Outbox Pattern Reliable event publishing; eliminates dual-write problem Requires CDC or polling; extra operational overhead Complements SAGA but solves a different problem

CloudFlow uses SAGA + Orchestration because money flows require visible, auditable, testable rollback — not implicit event chains that are difficult to trace when they fail.


Why SAGA instead of Two-Phase Commit (2PC)?

2PC problems:

  • All participants hold locks during the protocol → throughput collapses under load
  • If the coordinator crashes between Phase 1 and Phase 2 → system stuck in limbo forever
  • Requires all services to implement a prepare protocol — tightly coupled
  • AWS Lambda has no persistent connections → coordinator can't hold distributed locks

SAGA advantages:

  • Each step is a local transaction — no distributed locks
  • Compensation is business logic, not database magic → testable, readable
  • Step Functions makes the entire saga visible in one diagram
  • Partial failures have explicit, known recovery paths

Why Orchestration (Step Functions) instead of Choreography (events)?

Choreography: Services react to each other's events. Inventory hears OrderCreated, publishes StockReserved. Payment hears StockReserved, publishes PaymentCharged.

Why we didn't:

  • Compensation is implicit — each service must listen for failure events from every other service
  • Debugging requires correlating logs across 4 services to reconstruct what happened
  • Adding a new step requires modifying multiple services
  • When money is involved, implicit rollback is dangerous

Orchestration (our approach):

  • One state machine shows the entire flow
  • Compensation paths are explicit and version-controlled
  • One place to add retry policies, timeouts, and error handling

Why DynamoDB for idempotency instead of Redis?

  • DynamoDB attribute_not_exists is an atomic check-and-set at the database level
  • Redis requires Lua scripts for atomic check-and-set (more complexity)
  • DynamoDB survives Lambda cold starts — Redis connections can be stale
  • No extra infrastructure: DynamoDB is already the primary datastore
  • Scales infinitely with zero ops overhead

Why per-service DynamoDB tables instead of a shared database?

Each service owns its data. This enforces the microservices boundary at the infrastructure level:

  • Services cannot bypass each other's APIs to read/write data
  • Schema changes in one service don't break others
  • Each table can be tuned independently (read/write capacity, GSIs)
  • This is why Amazon itself moved from a shared Oracle database to service-owned data

CAP Theorem Positioning

CloudFlow prioritizes availability over strict consistency for reads, with strong consistency only where correctness is critical:

Inventory writes:  Strong consistency  (ConditionalCheckFailed catches conflicts)
Inventory reads:   Strong consistency  (prevents overselling)
Order reads:       Eventual consistency (acceptable: user polling for status)
Notifications:     At-least-once       (idempotent consumer handles duplicates)

This matches how large-scale systems (Amazon, Netflix) handle the CAP tradeoff: strong consistency only where business rules require it, eventual consistency everywhere else.


Security

Brief Threat Model

Threat Where It Bites Mitigation Tested
Duplicate requests SQS delivers at-least-once; client retries DynamoDB idempotency table — attribute_not_exists atomic check-and-set per order_id test_duplicate_reservation_is_idempotent
Slow / failed payment API One hung provider call blocks the thread; cascade to all orders Circuit breaker (state in DynamoDB, survives Lambda cold starts) — opens after 3 failures, fast-fails for 60s test_circuit_breaker_open_fast_fails
Stock oversell Two threads reserve the last item simultaneously DynamoDB conditional write — quantity >= requested enforced atomically; one wins, one gets ConditionalCheckFailedException test_reserve_fails_on_insufficient_stock
Partial saga failure Lambda crashes between reserve and charge Step Functions checkpoints each step — retries from last successful state, never from the beginning Compensation path integration tests
Forged / tampered order Client sends manipulated total_cents or negative quantity Pydantic validation rejects at API boundary; amounts recalculated server-side, never trusted from input Input validation schema
Cross-service data access Inventory Lambda reads the payments table IAM least-privilege — each Lambda role grants dynamodb:* only on its own table ARN CDK IAM policy per stack

The full STRIDE analysis (Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation) is at the bottom of this file.

Input Validation

All requests are validated via Pydantic schemas before any database operation:

class CreateOrderRequest(BaseModel):
    customer_id: str = Field(min_length=1)
    items: list[OrderItem] = Field(min_length=1)
    # Negative quantities, zero-price items, empty orders → rejected at validation

IAM Least Privilege

Each Lambda has its own IAM role with only the permissions it needs:

  • order-handler: dynamodb:PutItem on orders table, events:PutEvents, states:StartExecution
  • inventory-handler: dynamodb:UpdateItem on inventory, dynamodb:PutItem on reservations
  • payment-handler: dynamodb:PutItem on payments, dynamodb:GetItem on payments
  • No Lambda has admin permissions or cross-service database access

No Secrets in Code

  • Payment provider credentials: fetched from AWS Secrets Manager at runtime (not env vars)
  • DynamoDB table names: passed via environment variables set by CDK at deploy time
  • No hardcoded credentials anywhere — verified by pre-commit scan

API Gateway Protections

  • Request validation: API Gateway rejects malformed JSON before Lambda is invoked
  • Rate limiting: Usage plans limit requests per second per API key
  • CORS: Configured to allow only trusted origins

Data Integrity

  • All financial amounts stored as integer cents (no floating point)
  • DynamoDB optimistic locking with version counters prevents lost updates
  • Idempotency keys expire after 24 hours (TTL on idempotency table)

Project Structure

cloudflow/
├── infrastructure/              # AWS CDK (Python) — all cloud resources as code
│   └── cloudflow/
│       ├── api_stack.py         # API Gateway + Order Service Lambda
│       ├── database_stack.py    # DynamoDB tables (5 tables)
│       ├── messaging_stack.py   # SQS queues + EventBridge bus
│       ├── saga_stack.py        # Step Functions state machine (SAGA)
│       └── monitoring_stack.py  # CloudWatch dashboards + alarms
├── services/
│   ├── shared/                  # Cross-cutting concerns
│   │   ├── events.py            # Pydantic event schemas
│   │   ├── idempotency.py       # @idempotent decorator (DynamoDB-backed)
│   │   ├── circuit_breaker.py   # Circuit breaker (DynamoDB-backed)
│   │   ├── dynamodb.py          # DynamoDB helpers
│   │   └── logger.py            # Structured JSON logging
│   ├── order_service/           # Create orders, manage event sourcing log
│   ├── inventory_service/       # Atomic reserve / release (SAGA steps)
│   ├── payment_service/         # Charge / refund with circuit breaker
│   ├── notification_service/    # SQS-triggered email/SMS notifications
│   └── dlq_processor/           # Structured logging for failed DLQ messages
├── tests/
│   ├── unit/                    # moto mocks — pure logic, no Docker
│   │   ├── test_circuit_breaker.py
│   │   ├── test_idempotency.py
│   │   ├── test_events.py
│   │   ├── test_failure_scenarios.py   # failure mode coverage
│   │   ├── test_dlq_processor.py       # DLQ handler resilience
│   │   ├── test_pagination.py          # cursor-based event pagination
│   │   └── test_order_service.py       # order handler routes + validation
│   └── integration/             # LocalStack — real DynamoDB via Docker
│       └── test_saga_flow.py
├── scripts/
│   ├── load_test.py             # Concurrent load test with latency metrics
│   └── seed_data.py             # Seed demo data into LocalStack
├── docker-compose.yml           # LocalStack for local development
├── run.ps1                      # Windows PowerShell commands
├── Makefile                     # Linux/macOS commands
├── TECHNICAL.md                 # Deep-dive: architecture decisions & analysis
└── .github/workflows/           # CI/CD: test on PR, deploy on merge to main

Quick Start

Prerequisites

  • Python 3.11+
  • Docker Desktop (for LocalStack integration tests)
  • Node.js 18+ (for AWS CDK)

Install & Run Tests

# Clone
git clone https://github.com/UTKARSH698/CloudFlow
cd CloudFlow

# Linux / macOS
make setup
make test-unit          # 30+ unit tests, no Docker needed (~6s)
make local-up           # Start LocalStack
make test               # All 50+ tests
make local-down

# Windows (PowerShell)
.\run.ps1 setup
.\run.ps1 test-unit
.\run.ps1 local-up
.\run.ps1 test
.\run.ps1 local-down

Deploy to AWS (optional)

# Configure AWS credentials first (aws configure)
make cdk-bootstrap      # First time only
make deploy             # Deploys all 5 stacks
make demo               # Submit a sample order end-to-end

API Reference

All endpoints accept and return application/json. Deployed via API Gateway; test locally with .\run.ps1 local-up + .\run.ps1 demo.

GET /health — Health check

No API key required. Used by load balancers and deployment verification.

Response — 200 OK:

{ "status": "healthy", "service": "order-service" }

POST /orders — Place an order

Request:

{
  "customer_id": "cust-001",
  "items": [
    { "product_id": "KEYBD-01", "quantity": 1, "unit_price_cents": 8999 }
  ]
}

Headers: Idempotency-Key: <uuid> (required), x-api-key: <api-key> (required)

Response — 202 Accepted (SAGA started):

{ "order_id": "ord-abc-123", "status": "PENDING", "correlation_id": "corr-xyz-789" }

SAGA runs asynchronously. Poll GET /orders/{order_id} for final status. The correlation_id traces the request across all services — use it for debugging.


GET /orders/{order_id} — Get order status

Query parameters: ?limit=50 (max 100), ?cursor=<token> (from next_cursor in previous response)

Happy path — CONFIRMED:

{
  "order_id": "ord-abc-123",
  "status": "CONFIRMED",
  "customer_id": "cust-001",
  "total_cents": 8999,
  "events": [
    { "type": "ORDER_CREATED",   "timestamp": "2024-01-15T10:30:00.012Z" },
    { "type": "STOCK_RESERVED",  "timestamp": "2024-01-15T10:30:00.047Z" },
    { "type": "PAYMENT_CHARGED", "timestamp": "2024-01-15T10:30:00.145Z" },
    { "type": "ORDER_CONFIRMED", "timestamp": "2024-01-15T10:30:00.167Z" }
  ]
}

Payment declined — COMPENSATED (SAGA rolled back):

{
  "order_id": "ord-def-456",
  "status": "COMPENSATED",
  "failure_reason": "PAYMENT_DECLINED",
  "events": [
    { "type": "ORDER_CREATED",     "timestamp": "2024-01-15T10:30:00.012Z" },
    { "type": "STOCK_RESERVED",    "timestamp": "2024-01-15T10:30:00.047Z" },
    { "type": "PAYMENT_FAILED",    "timestamp": "2024-01-15T10:30:00.063Z" },
    { "type": "STOCK_RELEASED",    "timestamp": "2024-01-15T10:30:00.078Z" },
    { "type": "ORDER_COMPENSATED", "timestamp": "2024-01-15T10:30:00.110Z" }
  ]
}

Circuit breaker open — 503:

{
  "error": "PAYMENT_PROVIDER_UNAVAILABLE",
  "retry_after_seconds": 57,
  "message": "Circuit breaker OPEN — fast-failing to prevent cascade timeout"
}

Insufficient stock — 409:

{
  "error": "INSUFFICIENT_STOCK",
  "product_id": "KEYBD-01",
  "requested": 2,
  "available": 1
}

Performance Evaluation

Tool: scripts/load_test.py · Target: LocalStack DynamoDB · Machine: Windows 11 AMD64

At a Glance — 10 threads, 50 orders

Metric           Result      Visual (each █ ≈ 100ms or 100 req/min)
─────────────────────────────────────────────────────────────────────
Throughput       1,100/min   ███████████
P50 latency         47ms     ████▌
P95 latency         89ms     ████████▉
P99 latency        120ms     ████████████
Test suite      50+ tests    ✅ all passing, <30s
─────────────────────────────────────────────────────────────────────
LocalStack is ~6× slower than real AWS DynamoDB (8ms vs 50ms writes).
Correctness properties — idempotency, compensation, atomic writes — are identical.

Latency per Request — Scatter Plot

Each is one reservation request. Horizontal markers show percentile thresholds.

Latency (ms)
  140 │                                              ●
  120 ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌  P99 = 120ms
  100 │           ●                    ●
   89 ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  P95 = 89ms
   70 │      ●       ●       ●    ●
   47 ├━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  P50 = 47ms
   35 │ ●●●●● ●●  ●●●  ●●  ●●  ●●  ●●● ●●● ●  ●●
   22 │  ●  ●● ●● ●  ●   ●● ●● ●  ●●●  ●  ●●  ●●
    0 └──────────────────────────────────────────>
       0   5  10  15  20  25  30  35  40  45  (req #)

88% of requests complete in 20–60ms. Outliers beyond P95 are rare DynamoDB write contention spikes — not errors.

Tail Latency Fan — P50 / P95 / P99 vs Concurrency

Symbol width is proportional to latency (1 symbol ≈ 20ms). Notice P99 fans out much faster than P50:

Threads  P50 (median)          P95 (tail)            P99 (worst)
──────────────────────────────────────────────────────────────────────
   5 th  ··  37ms              ○○○  68ms             ◆◆◆◆◆  95ms
  10 th  ···  47ms             ○○○○  89ms            ◆◆◆◆◆◆  120ms
  20 th  ···  54ms             ○○○○○  98ms           ◆◆◆◆◆◆◆  140ms
  50 th  ····  61ms            ○○○○○○  110ms         ◆◆◆◆◆◆◆◆  160ms
──────────────────────────────────────────────────────────────────────
 Growth        +65%                   +62%                    +69%

P50 stays flat because it's bounded by DynamoDB write time. P99 grows because rare contention retries accumulate under higher concurrency.

Throughput Curve

Req/min
 1400 │                    ●───────●  ← plateau ~1,300 req/min
 1200 │          ●─────────╯          (LocalStack Docker bound)
 1000 │  ●───────╯
  800 │
  600 │
  400 │
  200 │
    0 └──────────────────────────────>
        5 th    10 th   20 th  50 th
        940     1,100   1,300  1,280   req/min

Throughput plateaus at ~20 threads — LocalStack Docker container CPU bound, not a DynamoDB limit. On real AWS, DynamoDB partitions scale horizontally; throughput grows linearly with concurrency.

Latency Distribution (10 threads, 50 orders)

Latency (ms)  │ Distribution
──────────────┼───────────────────────────────────────────────
  0 –  20ms   │ ▏  2%   (1 req)
 20 –  40ms   │ ████████████████████  40%  (20 req)  ← bulk
 40 –  60ms   │ ████████████████████████  48%  (24 req)
 60 –  80ms   │ ████  8%  (4 req)
 80 – 100ms   │ ▏  1%  (1 req)   ← P95
    100ms+    │ ▏  1%  (1 req)   ← P99
──────────────┴───────────────────────────────────────────────
 Avg = 45ms · P50 = 47ms · P95 = 89ms · P99 = 120ms

AWS vs LocalStack — Side-by-Side

DynamoDB Write Latency (ms)              Throughput Ceiling (req/min)
──────────────────────────────────────   ─────────────────────────────────────
LocalStack  ████████████████████  50ms   LocalStack  ███████  1,300
Real AWS    ████  8ms                    Real AWS    ████████████████  100,000+
──────────────────────────────────────   ─────────────────────────────────────
            ~6x slower clock                         ~75x lower ceiling

  Correctness is identical — atomic writes, idempotency, compensation all behave
  the same way. Only speed differs. LocalStack is a faithful emulator, not a toy.

Bottleneck Analysis

Bottleneck Root Cause Production Mitigation
LocalStack throughput cap ~1,300 req/min Single Docker container, no parallel I/O Not present on real AWS — DynamoDB scales horizontally
DynamoDB write: ~45ms LocalStack adds ~35ms network emulation overhead Real AWS: 2-8ms single-digit SLA
Circuit breaker read: +6ms Extra GetItem per payment call for circuit state Acceptable — prevents 30s cascade timeouts
Lambda cold start: ~200ms (first call) Python + boto3 import time Provisioned concurrency → 0ms in production
UUID partition keys N/A — well distributed, zero hot partitions observed Already mitigated by design

Key insight: Correctness properties are identical between LocalStack and real AWS — atomic conditional writes, idempotency, and compensation work the same way. Only latency and throughput numbers differ.


Observability Mockups

CloudWatch Metrics Dashboard

CloudFlow — Operations Dashboard               [Last 15 min] [Auto-refresh: 1min]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Orders Created / min              SAGA Success Rate (%)
 ┌────────────────────────────┐    ┌────────────────────────────┐
 │ 120 │         ╭─╮          │    │ 100 │ ─────────────────    │  🟢 OK
 │  90 │     ╭───╯ ╰──╮       │    │  95 │                 ╲    │
 │  60 │  ╭──╯        ╰─╮     │    │  90 │                  ╲── │  ← alarm at 95%
 │  30 │──╯             ╰──   │    │  85 │                      │
 │     └──────────────────    │    │     └──────────────────    │
 │     0    5    10   15 min  │    │     0    5    10   15 min  │
 └────────────────────────────┘    └────────────────────────────┘

 P99 Lambda Duration (ms)          SAGA Compensations / min
 ┌────────────────────────────┐    ┌────────────────────────────┐
 │ 500 │                      │    │  10 │                      │  🟢 OK
 │ 400 │ ─────────────╮       │    │   8 │                      │
 │ 300 │              ╰─╮     │    │   6 │                      │  ← alarm at 10%
 │ 200 │                ╰──── │    │   4 │                      │
 │ 100 │                      │    │   2 │          ▄           │
 │     └──────────────────    │    │   0 │──────────█────────── │
 │     0    5    10   15 min  │    │     0    5    10   15 min  │
 └────────────────────────────┘    └────────────────────────────┘

 Circuit Breaker State              DLQ Message Count
 ┌────────────────────────────┐    ┌────────────────────────────┐
 │  CLOSED  ━━━━━━━━━━━━━━━━  │    │   0 │ ─────────────────── │  🟢 OK
 │  OPEN                      │    │     │  (alarm if > 0)      │
 │  HALF    [healthy 14m 32s] │    │     └──────────────────    │
 └────────────────────────────┘    └────────────────────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

X-Ray Service Map

                    ┌─────────────┐
         ┌──────────│ API Gateway │
         │          └─────────────┘
         ▼               Avg: 5ms  OK: 100%
  ┌─────────────┐
  │order-service│──────────────────┐
  └─────────────┘                  │
    │   Avg: 28ms  OK: 100%        ▼
    │                       ┌─────────────────┐
    ├──▶ DynamoDB (orders)  │ Step Functions  │
    └──▶ EventBridge        └────────┬────────┘
                                     │
                    ┌────────────────┼─────────────────┐
                    ▼                ▼                  ▼
           ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
           │  inventory-  │  │  payment-    │  │notification- │
           │  service     │  │  service     │  │  service     │
           └──────────────┘  └──────────────┘  └──────────────┘
             Avg: 35ms          Avg: 98ms         Avg: 12ms
             OK: 100%           OK: 97%           OK: 100%
                │                  │   └── 3% circuit breaker
                ▼                  ▼
           DynamoDB            Payment Provider
           (inventory)         (external)

X-Ray Trace — Happy Path (347ms total)

Trace: 1-65a3f2b1-abc123   ✅ SUCCESS   Total: 347ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Segment                  Start    Duration  Timeline (347ms)
─────────────────────    ─────    ────────  ─────────────────
API Gateway              0ms      12ms      ████
order-handler            12ms     28ms          ████████
  DynamoDB.PutItem       15ms     8ms               ███
  EventBridge.Put        25ms     6ms                 ██
  StepFunctions.Start    34ms     5ms                  ██
inventory-handler        40ms     35ms                   ██████████
  DynamoDB.UpdateItem    44ms     9ms                        ████
payment-handler          78ms     98ms                              ██████████████████████████
  DynamoDB.GetItem       80ms     5ms                               ██
  PaymentProvider.POST   86ms     72ms                                ████████████████████
  DynamoDB.PutItem       159ms    6ms                                                  ██
order-handler            179ms    22ms                                                    ██████
  DynamoDB.PutItem       181ms    8ms                                                     ████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

X-Ray Trace — Compensation Path (payment failed)

Trace: 1-65a3f2b1-def456   ❌ COMPENSATED   Total: 285ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Segment                  Start    Duration  Status
─────────────────────    ─────    ────────  ─────────────────
order-handler            0ms      22ms      ✅ Created PENDING
inventory-handler        25ms     35ms      ✅ Stock reserved
payment-handler          63ms     12ms      ❌ PAYMENT_DECLINED
  [circuit breaker]      65ms     5ms          CB: CLOSED → recorded failure
Step Functions           76ms     0ms       → Triggers compensation
inventory-handler        78ms     30ms      ↩ Stock released (compensation)
  DynamoDB.UpdateItem    80ms     8ms          ADD quantity :3
order-handler            110ms    18ms      ❌ Status → FAILED
notification-handler     131ms    8ms       📧 Queued: ORDER_FAILED email
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Result: Stock restored ✅ | Customer notified ✅ | No charge made ✅

CloudWatch Logs Insights — Full Order Trace

fields @timestamp, level, service, message, order_id, reservation_id
| filter correlation_id = "corr-abc-123"
| sort @timestamp asc

@timestamp              level   service                    message
──────────────────────  ──────  ─────────────────────────  ─────────────────────────
2024-01-15 10:30:00.012  INFO   order_service.handler      Order created
2024-01-15 10:30:00.040  INFO   order_service.handler      SAGA execution started
2024-01-15 10:30:00.047  INFO   inventory_service.handler  Inventory reserved         ← res-456
2024-01-15 10:30:00.145  INFO   payment_service.handler    Payment charged             ← pay-789
2024-01-15 10:30:00.167  INFO   order_service.handler      Order confirmed
2024-01-15 10:30:00.171  INFO   notification_service       Notification queued

Architectural Pattern Comparison

Decision Matrix

Criterion SAGA + Step Functions ✅ SAGA + Choreography Two-Phase Commit Outbox Pattern
Failure visibility Explicit — one diagram Implicit — N event streams Coordinator log Moderate
Compensation clarity Defined in state machine Distributed across services Automatic rollback Per-service logic
Service coupling Low (orchestrator only) Very low Very high Low
Debugging ease One execution history Correlate across N logs Single DB log Moderate
Cloud-native fit Excellent Excellent Poor Good
Financial safety High — explicit paths Medium — implicit rollback High — but blocking High
Operational overhead Low (managed service) Very low High Medium
CloudFlow choice Used EventBridge for non-critical Rejected Future improvement

When Each Pattern Fits

Step Functions Orchestration → when you need explicit compensation + visibility
  Best for: payment, inventory, anything involving money
  CloudFlow uses: SAGA happy path + compensation paths

Event Choreography → when services are independent + compensation isn't needed
  Best for: analytics events, audit logs, cache invalidation
  CloudFlow uses: EventBridge OrderCreated for downstream consumers

Two-Phase Commit → when you have a single shared database + ACID required
  Best for: monolith with PostgreSQL, financial ledger within one service
  CloudFlow avoids: no shared DB, Lambda is stateless

Outbox Pattern → when you need guaranteed event delivery with DB atomicity
  Best for: preventing lost events between DB write and message publish
  CloudFlow future: would strengthen order creation + EventBridge reliability

Sample Deployment Output

What .\run.ps1 deploy produces on a real AWS account:

$ cdk deploy --all

CloudFlow: deploying... [5/5 stacks]
✅  CloudFlowDatabaseStack (cloudflow-database)

Outputs:
CloudFlowDatabaseStack.OrdersTableArn =
  arn:aws:dynamodb:us-east-1:123456789:table/cloudflow-orders
CloudFlowDatabaseStack.InventoryTableArn =
  arn:aws:dynamodb:us-east-1:123456789:table/cloudflow-inventory

✅  CloudFlowMessagingStack (cloudflow-messaging)

Outputs:
CloudFlowMessagingStack.NotificationQueueUrl =
  https://sqs.us-east-1.amazonaws.com/123456789/cloudflow-notifications
CloudFlowMessagingStack.EventBusArn =
  arn:aws:events:us-east-1:123456789:event-bus/cloudflow-events

✅  CloudFlowSagaStack (cloudflow-saga)

Outputs:
CloudFlowSagaStack.SagaStateMachineArn =
  arn:aws:states:us-east-1:123456789:stateMachine:cloudflow-saga
CloudFlowSagaStack.SagaStateMachineUrl =
  https://console.aws.amazon.com/states/home#/statemachines/view/cloudflow-saga

✅  CloudFlowApiStack (cloudflow-api)

Outputs:
CloudFlowApiStack.ApiUrl = https://abc123.execute-api.us-east-1.amazonaws.com/prod
CloudFlowApiStack.OrderServiceFunctionArn =
  arn:aws:lambda:us-east-1:123456789:function:cloudflow-order-handler

✅  CloudFlowMonitoringStack (cloudflow-monitoring)

Outputs:
CloudFlowMonitoringStack.DashboardUrl =
  https://console.aws.amazon.com/cloudwatch/home#dashboards:name=CloudFlow

─────────────────────────────────────────────────
✅ Deployment complete. 5 stacks, 0 errors.
   Total deploy time: 3m 42s

Send a test order:
  curl -X POST https://abc123.execute-api.us-east-1.amazonaws.com/prod/orders \
    -H "Content-Type: application/json" \
    -d '{"customer_id":"cust-1","items":[{"product_id":"p1","quantity":1,"unit_price_cents":999}]}'

Response: {"order_id": "ord-abc-123", "status": "PENDING"}
─────────────────────────────────────────────────

Threat Model

STRIDE Analysis

Threat Attack Vector Impact Mitigation in CloudFlow
Spoofing Forged customer_id in order request Orders placed as another customer API Gateway JWT authorizer (production) — validates identity before Lambda
Tampering Modified total_cents in Step Functions input Undercharge for order DynamoDB optimistic locking; Step Functions input is signed; amounts recalculated server-side
Repudiation Customer denies placing order Chargeback fraud Event sourcing — immutable event log per order with timestamps, idempotency keys, correlation IDs
Info Disclosure CloudWatch logs expose PII GDPR violation Logs contain order_id and customer_id only — no card numbers, addresses, or names
Denial of Service Flood /orders endpoint Lambda concurrency exhausted API Gateway rate limiting (requests/sec per API key); circuit breaker prevents payment provider cascade
Elevation of Privilege Lambda reads another service's DynamoDB table Data breach IAM least-privilege: each Lambda role has dynamodb:* only on its own table ARN

IAM Policy Example (Inventory Service)

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:UpdateItem",
    "dynamodb:GetItem"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:*:table/cloudflow-inventory",
    "arn:aws:dynamodb:us-east-1:*:table/cloudflow-reservations"
  ]
}

No Lambda can read another service's table. No Lambda has dynamodb:* on *. No Lambda has iam:*.

Data Classification

Data Classification Storage Retention
order_id, customer_id Internal DynamoDB (encrypted at rest) 7 years (audit)
total_cents, items Internal DynamoDB (encrypted at rest) 7 years (audit)
Payment card numbers Never stored Not in CloudFlow — passed to provider only N/A
provider_charge_id Internal DynamoDB payments table 7 years
CloudWatch logs Internal CloudWatch (encrypted) 90 days

Card numbers never touch CloudFlow's infrastructure. The payment provider (Stripe/Braintree) holds PCI-DSS scope. CloudFlow stores only the opaque charge_id returned by the provider.


Known Limitations

These are documented design boundaries — honest accounts of what the system does not yet handle.

Limitation Root Cause What Would Be Needed
Compensation-state consistency is not formally proved — compensating transactions are designed so partial rollback cannot produce an exploitable inconsistent state, but this property is verified by test coverage, not by proof No formal model of the compensation state space; correctness argued by case analysis over the failure modes I could construct A formal specification of the SAGA state machine (e.g. in TLA+) with a proof that every reachable compensation path terminates in a consistent state
Circuit breaker state is eventually consistent across Lambda instances — DynamoDB propagation delay means two concurrent Lambda invocations can both read CLOSED and both attempt the external call before either writes OPEN DynamoDB read-after-write consistency applies per-item per-instance, but concurrent reads before either write races Compare-and-swap on the circuit state item using DynamoDB conditional writes on a version counter
No cross-service idempotency — idempotency is enforced per-service (Order, Inventory, Payment each have their own idempotency table), but a request that passes Order idempotency and fails mid-SAGA may replay Inventory with a different key Idempotency keys are scoped to each service handler independently A saga-level idempotency token propagated through Step Functions input that each downstream service checks before executing
Step Functions execution history is lost after 90 days — audit trail for long-running investigations relies on CloudWatch logs, not Step Functions AWS hard limit on execution history retention Export execution history to S3 on completion; query via Athena for investigations beyond 90 days
LocalStack behaviour diverges from real AWS under concurrency — load test results (1,100+ req/min, P99 < 120ms) were measured on LocalStack; real DynamoDB provisioned throughput and Lambda cold start distributions will differ LocalStack is an approximation; network latency and AWS throttling are not modelled Re-run load tests on a real AWS account with provisioned DynamoDB capacity and compare P99 tail latencies

The Open Consistency Question

The core security-relevant open question in CloudFlow is whether the SAGA compensation design is exhaustively safe — not just safe for the failure modes I tested.

Specifically: can a partial compensation path (where some compensating transactions succeed and others fail mid-compensation) produce a state that is internally consistent from each service's perspective but globally inconsistent — and exploitable? My design argues no, because each compensating transaction is idempotent and Step Functions retries failed compensation steps. But this argument relies on the assumption that Step Functions will eventually deliver every compensation step successfully, which is not guaranteed under all failure modes (e.g. Step Functions service disruption mid-compensation).

This is the same class of problem as the CSPM remediation completeness question — the gap between per-component correctness and system-level safety guarantees. Both motivate the same research direction: formal verification of distributed security policies.


References & Further Reading


License

MIT © Utkarsh Batham


Author

Utkarsh Batham — B.Tech CSE · Cloud Technology & Information Security

GitHub LinkedIn

About

Production-grade SAGA orchestration on AWS Serverless — Step Functions, DynamoDB circuit breakers, event sourcing, idempotency-as-security, explicit compensation. Load-tested 1,100+ req/min.

Topics

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors