leopardcodeai/aws-haystack-architecture-compliance-agent

Architecture Compliance Agent — Haystack 2.x RAG Pipeline

An enterprise-grade Haystack 2.x RAG agent that automatically checks IT requirements against internal architecture standards.

Deployed on AWS ECS Fargate with a FastAPI REST API, GitHub Actions CI/CD, and OIDC-based AWS authentication. LLM: Claude Sonnet 4.6 via the AWS Bedrock Converse API, protected by Bedrock Guardrails.

Built and maintained by LeopardCode.AI.


Architecture Overview

graph TB
    subgraph "Client"
        A[IT Requirement]
    end

    subgraph "AWS ECS Fargate"
        B["FastAPI REST API (port 8000)"]
        C["Haystack 2.x Pipeline"]
        D["InMemoryDocumentStore<br/>3 Architecture Standards"]
        G["Bedrock Guardrails<br/>Content Filter + PII Block"]
    end

    subgraph "CI/CD (GitHub Actions → OIDC)"
        E["GitHub Actions"]
        F["Amazon ECR"]
        H["ECS Deployment"]
    end

    A -- "POST /check {requirement}" --> B
    B -- "JSON request" --> C
    C -- "semantic search" --> D
    D -- "relevant standards" --> C
    C -- "LLM call" --> G
    G -- "filtered response" --> C
    C -- "structured report (Pydantic)" --> B
    B -- "JSON response" --> A

    E -- "docker build & push" --> F
    F -- "image pull" --> H
    H -- "force new deployment" --> B

Haystack 2.x Pipeline (Detail)

graph LR
    subgraph "Production Pipeline"
        A["SentenceTransformersTextEmbedder<br/>all-MiniLM-L6-v2"] -- "embedding" --> B["InMemoryEmbeddingRetriever"]
        B -- "documents" --> C["ChatPromptBuilder<br/>Compliance Agent"]
        C -- "ChatMessage[]" --> D["AmazonBedrockChatGenerator<br/>Claude Sonnet 4.6"]
        D -- "LLM reply (JSON)" --> E["ComplianceOutputValidator<br/>→ Pydantic Schema"]
    end

    subgraph "Fallback / Test Mode"
        F["MockComplianceEngine<br/>keyword-based, deterministic"] --> E
    end

    subgraph "Data Indexing"
        H["Architecture Standards<br/>3 standards"] --> I["SentenceTransformersDocumentEmbedder"]
        I --> J["InMemoryDocumentStore"]
    end

    J --> B
    E -- "structured Dict[str, Any]" --> K["FastAPI Response / Test Assertion"]
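
The retrieval step in the diagram above (query embedding, then InMemoryEmbeddingRetriever, then documents) amounts to ranking stored vectors by their similarity to the query vector. The following is a toy sketch of that idea only, not Haystack's implementation: the three-dimensional vectors stand in for real all-MiniLM-L6-v2 embeddings, and the names `cosine`, `top_k`, and `TOY_STORE` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored documents most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made toy vectors (assumption): one per architecture standard.
TOY_STORE = {
    "Standard-01": [0.9, 0.1, 0.0],
    "Standard-02": [0.1, 0.9, 0.1],
    "Standard-03": [0.0, 0.2, 0.9],
}

ranked = top_k([1.0, 0.2, 0.0], TOY_STORE, k=2)
print([doc_id for doc_id, _ in ranked])  # ['Standard-01', 'Standard-02']
```

With only three standards in the store, retrieval mainly determines the order in which the standards reach the prompt rather than filtering a large corpus.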

Two Operation Modes

| Mode | Component | Use Case |
| --- | --- | --- |
| Production | ChatPromptBuilder → AmazonBedrockChatGenerator (Sonnet 4.6) | AWS Bedrock runtime; handles negation, context, natural language |
| Mock | MockComplianceEngine (keyword rules with negation handling) | Local development, CI/CD, offline testing; deterministic, no AWS required |

Set HAYSTACK_COMPLIANCE_MOCK=1 or pass use_mock=True to force mock mode.
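
As a minimal sketch of how the two switches could be combined, assuming that either one alone is enough to force mock mode: the function name `resolve_mock_mode` and the `env` parameter are hypothetical and are not taken from the repository's code.

```python
import os
from typing import Mapping, Optional


def resolve_mock_mode(use_mock: Optional[bool] = None,
                      env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the mock engine should be used.

    Assumption: mock mode is forced if EITHER use_mock is True
    OR the environment variable HAYSTACK_COMPLIANCE_MOCK equals "1".
    """
    env = os.environ if env is None else env
    return bool(use_mock) or env.get("HAYSTACK_COMPLIANCE_MOCK", "") == "1"


print(resolve_mock_mode(use_mock=True, env={}))                      # True
print(resolve_mock_mode(env={"HAYSTACK_COMPLIANCE_MOCK": "1"}))      # True
print(resolve_mock_mode(env={}))                                     # False
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.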


Project Structure

├── app/
│   ├── __init__.py              # FastAPI (health, /check, /standards)
│   ├── models.py                # Pydantic: StandardFinding, ComplianceReport
│   ├── components.py            # Haystack 2.x @component classes
│   │   ├── MockComplianceEngine       # Deterministic keyword engine
│   │   └── ComplianceOutputValidator  # LLM → Pydantic (with mock fallback)
│   └── pipeline.py              # ArchitectureCompliancePipeline (RAG)
├── infra/
│   ├── app.py                   # CDK entry point
│   ├── cdk.json                 # CDK configuration
│   ├── stacks.py                # CDK stacks + Bedrock Guardrail resource
│   └── requirements.txt         # CDK dependencies
├── scripts/
│   ├── deploy-aws.sh            # Full deployment
│   ├── integration_test.py      # HTTP integration tests
│   ├── test_analysis.py         # 18-scenario analysis suite
│   ├── test_analysis_report.json
│   ├── evaluate_compliance.py   # Mock vs. Bedrock → HTML dashboard
│   └── evaluation_report.html   # Generated HTML report
├── tests/
│   ├── conftest.py              # Shared fixtures (mock + Bedrock pipelines)
│   ├── test_data.json           # 18 categorized test scenarios
│   ├── test_pipeline.py         # 14 original unit tests
│   ├── test_comprehensive.py    # 48 tests across 9 categories
│   └── test_bedrock_integration.py  # 20 real Bedrock pipeline tests
├── docs/
│   └── evaluation_dashboard.png # Screenshot of the evaluation report
├── .github/workflows/
│   └── deploy.yml               # Lint → Type Check → Test → Deploy
├── pyproject.toml               # Poetry + ruff + mypy + pytest
├── Dockerfile                   # Multi-stage build
└── README.md

Architecture Standards

| ID | Standard | MockEngine Rule | LLM Capability |
| --- | --- | --- | --- |
| Standard-01 | Databases: AES-256 encryption + EU data residency (eu-central-1) | unverschlüsselt keyword + region list | Full semantic analysis |
| Standard-02 | Microservices: gRPC/REST over TLS 1.3 | http && !tls | Protocol inference |
| Standard-03 | External APIs: API Gateway + OAuth2 | api gateway substring + oauth | Contextual API analysis |

All three standards are checked on every run. The LLM can infer violations the MockEngine misses (e.g., gRPC without TLS, implicit non-EU regions).


Pipeline Output Schema

class StandardFinding(BaseModel):
    standard_id: str
    is_compliant: bool
    finding_details: str          # Description of the finding (German)
    remediation: str              # Corrective action (empty if compliant)

class ComplianceReport(BaseModel):
    status: Literal["KONFORM", "NICHT KONFORM", "RELEVANZ UNKLAR"]
    reasoning_summary: str        # Concise per-standard bullet list
    findings: List[StandardFinding]  # One per standard, always 3 entries
    full_report: str              # Formatted compliance report (German)

The LLM is instructed to output JSON matching this Pydantic schema exactly. The ComplianceOutputValidator strips markdown fences, validates against the schema, and falls back to the MockComplianceEngine on any parsing failure. Report text is generated in German, the working language of the target compliance workflow.
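
The strip, validate, and fall-back behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's ComplianceOutputValidator: plain key and status checks stand in for Pydantic validation, and `strip_fences`, `parse_report`, and the `fallback` callable (a stand-in for the MockComplianceEngine) are hypothetical names.

```python
import json
from typing import Any, Callable, Dict

FENCE_MARK = "`" * 3  # a markdown code fence, built indirectly to keep this block readable
ALLOWED_STATUS = {"KONFORM", "NICHT KONFORM", "RELEVANZ UNKLAR"}
REQUIRED_KEYS = {"status", "reasoning_summary", "findings", "full_report"}


def strip_fences(text: str) -> str:
    """Remove a leading fence line (with optional language tag) and a trailing fence line."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE_MARK):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE_MARK:
        lines = lines[:-1]
    return "\n".join(lines).strip()


def parse_report(raw: str, fallback: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Parse the LLM reply; on any parsing or schema problem, return fallback()."""
    try:
        data = json.loads(strip_fences(raw))
        if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
            raise ValueError("reply is not an object with the required keys")
        if data["status"] not in ALLOWED_STATUS:
            raise ValueError("unknown status value")
        return data
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return fallback()
```

A caller would pass something like `lambda: mock_engine_result` as `fallback`, so a malformed LLM reply still yields a report in the expected shape.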


MockComplianceEngine

A lightweight test substitute that matches keywords deterministically — intentionally simple so tests are fast, reproducible, and free of external dependencies.

| Capability | MockEngine | LLM (Sonnet 4.6) |
| --- | --- | --- |
| Encryption check | unverschlüsselt keyword | Full NLP understanding |
| Region check | Keyword list (us-east, us-west, …) | All AWS regions known |
| HTTP/TLS | http && !tls | Protocol inference |
| API Gateway | api gateway substring + oauth | Contextual reasoning |
| Negation ("nicht unverschlüsselt") | Explicit double-negative rule | Inherent comprehension |
| Negation ("ohne API Gateway") | 3-token lookback for ohne/kein/nicht | Full sentence comprehension |
| RELEVANZ UNKLAR detection | 20+ infrastructure keywords | Semantic relevance scoring |
| English input | Partial (keywords catch us-east, etc.) | Full English support |
| gRPC without TLS | Not detected (no rule) | Flagged as violation |
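
To make the keyword rules in the table concrete, here is a minimal sketch of three of them: the `unverschlüsselt` keyword, `http && !tls`, and the 3-token negation lookback. It is not the repository's MockComplianceEngine. The region list and the RELEVANZ UNKLAR keywords are omitted, the exact Standard-03 logic (an API mention needs a non-negated "api gateway" plus "oauth") is an assumption, and `tokenize`, `negated`, and `mock_check` are hypothetical names.

```python
from typing import Dict, List

NEGATIONS = {"ohne", "kein", "nicht"}  # negation words named in the table above


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split on whitespace after dropping basic punctuation."""
    for ch in ",.;:!?()\"":
        text = text.replace(ch, " ")
    return text.lower().split()


def negated(tokens: List[str], index: int, lookback: int = 3) -> bool:
    """True if a negation word occurs within `lookback` tokens before tokens[index]."""
    return any(tok in NEGATIONS for tok in tokens[max(0, index - lookback):index])


def mock_check(requirement: str) -> Dict[str, bool]:
    """Return is_compliant per standard, using keyword rules only."""
    text = requirement.lower()
    tokens = tokenize(requirement)

    # Standard-01: "unverschlüsselt" counts as a violation unless it is negated.
    unencrypted = any(
        tok.startswith("unverschlüsselt") and not negated(tokens, i)
        for i, tok in enumerate(tokens)
    )

    # Standard-02: the `http && !tls` rule from the table.
    plain_http = "http" in text and "tls" not in text

    # Standard-03 (assumed logic): an API mention needs a non-negated
    # "api gateway" and an "oauth" substring.
    mentions_api = "api" in tokens
    gateway_ok = any(
        tok == "api" and tokens[i + 1:i + 2] == ["gateway"] and not negated(tokens, i)
        for i, tok in enumerate(tokens)
    )
    api_violation = mentions_api and not (gateway_ok and "oauth" in text)

    return {
        "Standard-01": not unencrypted,
        "Standard-02": not plain_http,
        "Standard-03": not api_violation,
    }
```

For example, `mock_check("Die Datenbank ist nicht unverschlüsselt")` reports Standard-01 as compliant because `nicht` falls inside the lookback window, while a requirement mentioning gRPC without TLS passes Standard-02, which mirrors the gap listed in the last table row.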

Test Suite

Architecture

graph TB
    subgraph "Test Layers"
        U["test_pipeline.py<br/>14 tests"] --> C["test_comprehensive.py<br/>48 tests, 9 categories"]
        C --> B["test_bedrock_integration.py<br/>20 tests, real Claude Sonnet 4.6"]
        B --> E["evaluate_compliance.py<br/>18 scenarios, Mock vs. LLM comparison"]
    end

    subgraph "Engine under Test"
        M["MockComplianceEngine<br/>deterministic, offline"]
        L["Claude Sonnet 4.6<br/>AWS Bedrock Converse API"]
    end

    U --> M
    C --> M
    B --> L
    E --> M
    E --> L

    subgraph "Output"
        R["HTML Dashboard<br/>evaluation_report.html"]
        T["Terminal Summary"]
    end

    E --> R
    E --> T

Test Layers

| Layer | File | Tests | Runtime | AWS Required |
| --- | --- | --- | --- | --- |
| Original | test_pipeline.py | 14 | < 1 s | No |
| Comprehensive | test_comprehensive.py | 48 | < 1 s | No |
| Bedrock Integration | test_bedrock_integration.py | 20 | ~3 min | Yes |
| Cross-Validation | evaluate_compliance.py | 18 scenarios | ~3 min | Yes |

9 Test Categories (Comprehensive Suite)

| # | Category | Tests | Coverage |
| --- | --- | --- | --- |
| 1 | Happy Path | 5 | Fully compliant across all 3 standards |
| 2 | Negative Single | 8 | Each standard violated individually |
| 3 | Negative Multiple | 2 | Two or more standards violated simultaneously |
| 4 | Edge: Negation | 3 | "nicht unverschlüsselt", "ohne", "kein" |
| 5 | Edge: Region | 3 | us-west-2, no region, encrypted without region |
| 6 | Edge: Input | 5 | English, mixed language, all-caps, newlines, special characters |
| 7 | Edge: Boundary | 4 | Minimal, single-word, short, non-architectural |
| 8 | Failure Inputs | 6 | Empty, whitespace, very long, numeric, stopwords, injection |
| 9 | Output Structure | 12 | Schema keys, finding format, status consistency, JSON |

Running Tests

# All unit tests (62 tests, < 1 s)
python3 -m pytest tests/test_pipeline.py tests/test_comprehensive.py -v

# Bedrock integration (20 tests, ~3 min, requires AWS)
python3 -m pytest tests/test_bedrock_integration.py -v

# Full evaluation: Mock vs. Bedrock + HTML dashboard
python3 scripts/evaluate_compliance.py

# 18-scenario pipeline analysis
python3 scripts/test_analysis.py

Test Results

Unit Tests — 62/62 Pass

Category              Tests    Pass Rate
Happy Path             5/5     100% ████████████████████
Negative Single        8/8     100% ████████████████████
Negative Multiple      2/2     100% ████████████████████
Edge Negation          3/3     100% ████████████████████
Edge Region            3/3     100% ████████████████████
Edge Input             5/5     100% ████████████████████
Edge Boundary          4/4     100% ████████████████████
Failure Inputs         6/6     100% ████████████████████
Output Structure      12/12    100% ████████████████████

Bedrock Integration — 20/20 Pass

Claude Sonnet 4.6 (eu-central-1) correctly handles:

| Scenario | Status | Notes |
| --- | --- | --- |
| All standards met | KONFORM | AES-256, eu-central-1, gRPC+TLS, API Gateway+OAuth2 |
| Unencrypted DB | NICHT KONFORM | Correctly identified |
| Plain HTTP | NICHT KONFORM | Correctly identified |
| Non-EU region (us-east-1) | NICHT KONFORM | Correctly flagged |
| Missing OAuth2 | NICHT KONFORM | API Gateway without OAuth2 |
| Double negative ("nicht unverschlüsselt") | KONFORM | LLM understands negation |
| us-west-2 region | NICHT KONFORM | LLM knows us-west-2 is non-EU |
| gRPC without TLS | NICHT KONFORM | MockEngine misses this |
| "ohne API Gateway" | NICHT KONFORM | LLM treats as violation (defensible) |
| Non-architectural input ("Hallo") | RELEVANZ UNKLAR | Correct classification |

Mock vs. Bedrock — Evaluation Dashboard

[Image: Evaluation Dashboard]

The evaluation runs all 18 test_data.json scenarios through both engines and generates a side-by-side HTML comparison.

Latest results (Claude Sonnet 4.6, eu-central-1):

| Metric | MockEngine | Bedrock LLM | Δ | Interpretation |
| --- | --- | --- | --- | --- |
| Pass Rate | 100.0% | 83.3% | −16.7 pp | LLM is stricter on some edge cases |
| Avg F1 Score | 0.981 | 0.811 | −0.170 | Trade-off: stricter means more false positives |
| Avg Precision | 0.972 | 0.737 | −0.235 | LLM flags more items as violations |
| Avg Recall | 1.0 | 0.972 | −0.028 | Both catch nearly all actual violations |
| Avg Response Time | 0.0 s | 10.7 s | +10.7 s | LLM inference vs. instant keyword matching |

The LLM's lower pass rate reflects test expectations that were designed around the keyword-based MockEngine, not weaker analysis: the LLM catches violations the MockEngine misses entirely (e.g., gRPC without TLS). The three discrepancies (3 of 18 scenarios, hence the 83.3% pass rate) are edge cases where both interpretations are defensible.
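
For readers who want to see how such numbers arise, the sketch below computes precision, recall, and F1 for a single scenario from the set of standards expected to be violated and the set an engine actually flagged, plus the pass-rate arithmetic. How evaluate_compliance.py aggregates its metrics is not stated here, so the per-scenario set comparison, the handling of empty sets, and the name `prf` are assumptions.

```python
from typing import Set, Tuple


def prf(expected: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over violated-standard sets for one scenario.

    Assumption: an empty denominator counts as a perfect score (nothing to get wrong).
    """
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A stricter engine flags one extra standard: recall stays at 1.0, precision drops.
p, r, f1 = prf(expected={"Standard-01"}, predicted={"Standard-01", "Standard-03"})
print(p, r, round(f1, 3))  # 0.5 1.0 0.667

# Pass rate with 3 disagreements out of 18 scenarios.
print(round(100 * (18 - 3) / 18, 1))  # 83.3
```

This matches the pattern in the table: extra flags from a stricter engine lower precision and F1 while leaving recall nearly unchanged.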

Dashboard features: summary cards (pass rate, F1, precision, recall, response time), per-engine category breakdown, confusion matrix, per-test-case comparison with reasoning snippets, disagreement analysis, and an error details table.

open scripts/evaluation_report.html

Deployment

Prerequisites

  • AWS CLI with OIDC-configured GitHub Actions
  • Docker, Python 3.9+
  • Bedrock model access: Claude Sonnet 4.6 in eu-central-1

GitHub Actions CI/CD

Uses OIDC — no long-lived AWS credentials:

graph LR
    A["push to main"] --> B["Lint (ruff)"]
    B --> C["Type Check (mypy)"]
    C --> D["Test (pytest, mock mode)"]
    D --> E["Docker Build & Push to ECR"]
    E --> F["ECS Force New Deployment"]
    F --> G["Health Check Wait"]

Required secret: AWS_ROLE_ARN — an IAM role with an OIDC trust policy (permissions: ECR push, ECS update-service, Bedrock InvokeModel, ApplyGuardrail).
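
A trust policy for such a role typically lets GitHub's OIDC provider assume it only for one repository and branch. The sketch below builds that policy document as a Python dict; the account ID is a placeholder, restricting the role to the main branch is an assumption, and `github_oidc_trust_policy` is a hypothetical helper, not part of the repository's CDK code.

```python
import json
from typing import Any, Dict

OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> Dict[str, Any]:
    """Build an IAM trust policy that lets GitHub Actions runs of one repo/branch assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                        f"{OIDC_PROVIDER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


policy = github_oidc_trust_policy(
    account_id="123456789012",  # placeholder account ID
    repo="leopardcodeai/aws-haystack-architecture-compliance-agent",
)
print(json.dumps(policy, indent=2))
```

Pinning the `sub` claim to a specific repository and branch keeps workflows in other repositories or on other branches from assuming the role.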

Infrastructure (CDK)

| Resource | Details |
| --- | --- |
| VPC | 2 AZs, 1 NAT Gateway, public/private subnets |
| ECS Cluster | Fargate (serverless), container insights |
| ECR | haystack-compliance, retains the last 10 images |
| ECS Service | 1–5 tasks, auto-scaling at 70% CPU/memory |
| ALB | Health check on /health |
| IAM Role | Bedrock: InvokeModel, ApplyGuardrail |
| Bedrock Guardrail | Content filters (sexual/hate/violence/insults) + PII (email anonymized, AWS keys blocked) |
| Fargate Spot | Cost-optimized capacity provider |
| Region | eu-central-1 (Frankfurt) |

Local Development

# Install
pip install -r requirements.txt

# Mock mode (no AWS)
HAYSTACK_COMPLIANCE_MOCK=1 python3 architecture_compliance_checker.py

# FastAPI dev server
HAYSTACK_COMPLIANCE_MOCK=1 uvicorn app.main:app --reload --port 8000
# http://localhost:8000/docs (Swagger UI)

API Endpoints

GET /health

{ "status": "healthy", "pipeline": "haystack-architecture-compliance" }

POST /check

Request:

{
  "requirement": "Wir planen eine unverschlüsselte PostgreSQL-Datenbank in us-east-1."
}

Response:

{
  "status": "NICHT KONFORM",
  "reasoning_summary": "   - Standard-01: Datenbankverschlüsselung (AES-256) nicht eingehalten\n   - Standard-01: Datenhaltung außerhalb der EU (eu-central-1)",
  "findings": [
    {
      "standard_id": "Standard-01",
      "is_compliant": false,
      "finding_details": "Datenbankverschlüsselung (AES-256) nicht eingehalten\n   - Datenhaltung außerhalb der EU (eu-central-1)",
      "remediation": "Aktivieren Sie AES-256 Verschlüsselung für alle Datenbanken\nVerlagerung der Daten in die EU (eu-central-1)"
    },
    {
      "standard_id": "Standard-02",
      "is_compliant": true,
      "finding_details": "Keine Aussage zur Kommunikation — Standard als erfüllt betrachtet.",
      "remediation": ""
    },
    {
      "standard_id": "Standard-03",
      "is_compliant": true,
      "finding_details": "Keine externe API genannt — Standard als erfüllt betrachtet.",
      "remediation": ""
    }
  ],
  "full_report": "1. STATUS: NICHT KONFORM\n2. BEGRÜNDUNG:\n   ..."
}
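
A client can build the request and reduce the response to the list of violated standards with the standard library alone. This is a minimal sketch based on the request and response shapes shown above; `build_check_request` and `violated_standards` are hypothetical helper names, and the base URL assumes the local dev server.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_check_request(base_url: str, requirement: str) -> urllib.request.Request:
    """Build the POST /check request with a JSON body."""
    body = json.dumps({"requirement": requirement}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/check",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def violated_standards(report: Dict[str, Any]) -> List[str]:
    """Return the standard IDs whose finding is marked non-compliant."""
    return [f["standard_id"] for f in report["findings"] if not f["is_compliant"]]


# Usage against a running server (not executed here):
# request = build_check_request("http://localhost:8000", "Unverschlüsselte DB in us-east-1")
# with urllib.request.urlopen(request) as response:
#     print(violated_standards(json.load(response)))
```

For the example response above, `violated_standards` would return `["Standard-01"]`.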

GET /standards

{
  "standards": [
    { "id": 0, "content": "Standard-01: Alle produktiven Datenbanken..." },
    { "id": 1, "content": "Standard-02: Für den Datenaustausch..." },
    { "id": 2, "content": "Standard-03: Externe APIs müssen..." }
  ],
  "count": 3
}

Security & Compliance

  • OIDC authentication: GitHub Actions → AWS via IAM roles, no access keys
  • Bedrock Guardrails: content filters (hate, sexual, violence, insults) + PII handling (email anonymized, AWS keys blocked)
  • IAM least privilege: ECS task role limited to bedrock:InvokeModel and bedrock:ApplyGuardrail
  • No secrets in the repository: .env, *.pyc, __pycache__, cdk.out/ are gitignored
  • Deterministic LLM output: temperature=0.0 to keep results as reproducible as possible
  • Input validation: FastAPI Pydantic schema validation on /check

CI/CD Pipeline

Deploys on push to main or manual workflow_dispatch:

  1. Lint: ruff check app/ tests/ scripts/
  2. Type check: mypy app/
  3. Test: pytest tests/ -v (MockEngine, no AWS needed, < 1 s)
  4. Docker build & push: multi-arch → ECR
  5. ECS deploy: force-new-deployment, waits for service stability

Required secret: AWS_ROLE_ARN


Known Limitations

| Area | Limitation | Impact |
| --- | --- | --- |
| MockEngine | Keyword-based; misses gRPC without TLS, non-standard regions | Only used in CI; the LLM covers these cases |
| Region detection | us-*, ap-* prefix list | The LLM knows all AWS regions |
| TLS version | Mock checks the tls keyword only | The LLM distinguishes TLS 1.2 vs. 1.3 |
| Language | MockEngine is German-only | The LLM handles German + English |
| LLM cost and latency | ~10 s per inference, ~$0.01 per check | Acceptable for a compliance workflow |
| Bedrock availability | Requires Claude Sonnet 4.6 access in eu-central-1 | MockEngine works offline |
| No persistent store | Results are ephemeral (in-memory) | No audit trail without extensions |

Quick Reference

# Install & run tests
pip install -r requirements.txt
python3 -m pytest tests/ -v

# FastAPI server
uvicorn app.main:app --reload --port 8000

# Check compliance
curl -X POST http://localhost:8000/check \
  -H "Content-Type: application/json" \
  -d '{"requirement": "Unverschlüsselte DB in us-east-1"}'

# Full evaluation
python3 scripts/evaluate_compliance.py
open scripts/evaluation_report.html

Built by LeopardCode.AI — AI Engineering & Consulting
