Merged
12 changes: 6 additions & 6 deletions .github/workflows/trivy-scan.yml
@@ -27,7 +27,7 @@ jobs:
# Pinned to a specific SHA for supply-chain security
# Prevents a compromised upstream tag from executing arbitrary code in CI
- name: Run Trivy Scanner (IaC — Table Output)
- uses: aquasecurity/trivy-action@0.28.0
+ uses: aquasecurity/trivy-action@0.30.0
with:
scan-type: 'config'
hide-progress: true
@@ -38,7 +38,7 @@ jobs:
trivyignores: '.trivyignore'

- name: Run Trivy Scanner (IaC — SARIF Upload)
- uses: aquasecurity/trivy-action@0.28.0
+ uses: aquasecurity/trivy-action@0.30.0
if: always() # Run even if table scan fails, to always post findings
with:
scan-type: 'config'
@@ -50,7 +50,7 @@ jobs:
trivyignores: '.trivyignore'

- name: Upload IaC SARIF to GitHub Security Tab
- uses: github/codeql-action/upload-sarif@v3
+ uses: github/codeql-action/upload-sarif@v4
if: always()
with:
sarif_file: 'trivy-iac.sarif'
@@ -67,7 +67,7 @@ jobs:
uses: actions/checkout@v4

- name: Run Trivy Filesystem Scanner
- uses: aquasecurity/trivy-action@0.28.0
+ uses: aquasecurity/trivy-action@0.30.0
with:
scan-type: 'fs'
scan-ref: '.'
@@ -78,7 +78,7 @@ jobs:
severity: 'CRITICAL,HIGH'

- name: Upload Filesystem SARIF to GitHub Security Tab
- uses: github/codeql-action/upload-sarif@v3
+ uses: github/codeql-action/upload-sarif@v4
if: always()
with:
sarif_file: 'trivy-fs.sarif'
@@ -126,7 +126,7 @@ jobs:
soft_fail: true # Report findings without blocking; Trivy is the hard gate

- name: Upload Checkov SARIF to GitHub Security Tab
- uses: github/codeql-action/upload-sarif@v3
+ uses: github/codeql-action/upload-sarif@v4
if: always()
with:
sarif_file: 'checkov.sarif'
24 changes: 18 additions & 6 deletions README.md
@@ -223,18 +223,30 @@ Formal documentation of major architectural decisions — demonstrating senior-l
├── incident-reports/ # Formal IR: NIST SP 800-61 incident report
├── modules/ # Reusable Terraform: vpc, logging, security, iam
├── docs/
- │ └── adr/ # Architecture Decision Records (ADR-001, ADR-002, ADR-003)
+ │ ├── adr/ # Architecture Decision Records (ADR-001, ADR-002, ADR-003)
+ │ └── reality-check/ # What actually broke on each project and how it was fixed
└── .trivyignore # Documented exception list for lab-environment findings
```

---

- ## 🎓 Education & Credentials
+ ## 🔴 Reality Check Documentation

- | Credential | Institution | Status |
- | :--- | :--- | :--- |
- | B.Eng Computer Engineering | Federal University of Technology Akure (FUTA) | 2025 |
- | Certified in Cybersecurity (CC) | ISC² | Candidate |
**This portfolio was not built on the happy path.** Every project encountered real engineering failures. The documents below record what broke, the exact root cause, how it was fixed, and what it would have cost in production.

| # | Project | Hardest Failure |
| :-- | :-- | :-- |
| 1 | [IaC Foundations](./docs/reality-check/REALITY_CHECK_01_IaC_FOUNDATIONS.md) | KMS wildcard key policy — any IAM identity in the account could decrypt logs |
| 2 | [S3 Secure Storage](./docs/reality-check/REALITY_CHECK_02_S3_SECURE_STORAGE.md) | TLS-only bucket policy blocked all LocalStack requests (HTTP-only dev environment) |
| 3 | [Security Stack](./docs/reality-check/REALITY_CHECK_03_SECURITY_STACK.md) | CloudTrail → S3 bucket policy circular dependency on first apply |
| 4 | [HA AWS Architecture](./docs/reality-check/REALITY_CHECK_04_HA_AWS_ARCHITECTURE.md) | Single-AZ VPC broke ALB creation — ALB requires 2 subnets in 2 AZs |
| 5 | [Enterprise Governance](./docs/reality-check/REALITY_CHECK_05_ENTERPRISE_GOVERNANCE.md) | SCPs at OU level — Security OU could bypass its own controls |
| 6 | [SOAR Automation](./docs/reality-check/REALITY_CHECK_06_SOAR_AUTOMATION.md) | `sys.exit()` inside library functions made all unit tests impossible |
| 7 | [DFIR Investigation](./docs/reality-check/REALITY_CHECK_07_DFIR_INVESTIGATION.md) | 46-minute manual containment window — attacker completed all objectives before block |
| 8 | [KubeScale Platform](./docs/reality-check/REALITY_CHECK_08_KUBESCALE_PLATFORM.md) | OOMKill from missing resource limits caused noisy-neighbour cascading failures |
| 9 | [DevSecOps Pipeline](./docs/reality-check/REALITY_CHECK_09_DEVSECOPS_PIPELINE.md) | `trivy-action@0.28.0` tag didn't exist — security gate silently not running |

**[→ Full Reality Check Documentation](./docs/reality-check/)**

---

3 changes: 3 additions & 0 deletions aws-foundation/main.tf
@@ -50,7 +50,8 @@
# Inbound: HTTP from internet only
# Outbound: Restricted to VPC CIDR (defence-in-depth, prevents exfiltration)
# =============================================================================
#checkov:skip=CKV_AWS_260:Port 80 open to internet is intentional for this public-facing web server; WAF or ALB should be placed in front in production
resource "aws_security_group" "web_sg" {

Check failure on line 54 in aws-foundation/main.tf

GitHub Actions / Checkov Policy-as-Code Scan

CKV_AWS_260: "Ensure no security groups allow ingress from 0.0.0.0:0 to port 80"
name = "web-server-sg"
description = "Allow HTTP inbound; restrict egress to VPC"
vpc_id = module.vpc.vpc_id
@@ -76,12 +77,14 @@
# EC2 WEB SERVER — Hardened Configuration
# Security controls: IMDSv2, encrypted root volume, IAM role, no public IP
# =============================================================================
#checkov:skip=CKV_AWS_135:t2.micro instance type does not support EBS optimisation; upgrade to t3.micro or larger in production
resource "aws_instance" "web" {

Check failure on line 81 in aws-foundation/main.tf

GitHub Actions / Checkov Policy-as-Code Scan

CKV_AWS_135: "Ensure that EC2 is EBS optimized"
ami = "ami-12345678" # LocalStack dummy AMI
instance_type = "t2.micro"
subnet_id = module.vpc.public_subnet_id
iam_instance_profile = module.iam.instance_profile_name
vpc_security_group_ids = [aws_security_group.web_sg.id]
monitoring = true # Enable detailed CloudWatch monitoring (1-min intervals)

# IMDSv2: Session tokens required — prevents SSRF attacks on the metadata service
metadata_options {
47 changes: 47 additions & 0 deletions docs/reality-check/INDEX.md
@@ -0,0 +1,47 @@
# Reality Check Documentation

> **This portfolio was not built on the happy path.** Every project hit real engineering failures. These documents record what broke, why it broke, exactly how it was fixed, and what it would have cost in production.

---

## Overview: The Failures That Shaped This Portfolio

| # | Project | Hardest Failure | Time Lost | Business Impact |
| :-- | :-- | :-- | :-- | :-- |
| 1 | [IaC Foundations](./REALITY_CHECK_01_IaC_FOUNDATIONS.md) | KMS wildcard key policy grants every IAM identity decryption access | Caught at review | Any IAM identity in the account could read encrypted logs |
| 2 | [S3 Secure Storage](./REALITY_CHECK_02_S3_SECURE_STORAGE.md) | Terraform AWS provider v5.x breaks LocalStack EC2/VPC API silently | 2 hours | All `terraform apply` runs fail mid-plan with opaque errors |
| 3 | [Security Stack](./REALITY_CHECK_03_SECURITY_STACK.md) | CloudTrail log bucket policy rejected KMS CMK encryption for the trail | 1 hour | CloudTrail would write unencrypted logs or fail entirely |
| 4 | [HA AWS Architecture](./REALITY_CHECK_04_HA_AWS_ARCHITECTURE.md) | ALB requires ≥ 2 subnets in different AZs — single-AZ VPC module broke creation | 3 hours | `terraform apply` errors; zero traffic distribution across AZs |
| 5 | [Enterprise Governance](./REALITY_CHECK_05_ENTERPRISE_GOVERNANCE.md) | SCP attached at OU level instead of Root — Security OU could bypass its own controls | Caught at review | Governance policy had a critical structural gap; SCPs did not apply universally |
| 6 | [SOAR Automation](./REALITY_CHECK_06_SOAR_AUTOMATION.md) | `sys.exit()` inside library function made unit tests impossible to run | 4 hours | CI would never test remediation logic; bugs in Lambda would go undetected |
| 7 | [DFIR Investigation](./REALITY_CHECK_07_DFIR_INVESTIGATION.md) | Manual IP blocking after SSH breach — 46 minutes from detection to containment | 46 min window | Attacker had 46 minutes inside the network after detection |
| 8 | [KubeScale Platform](./REALITY_CHECK_08_KUBESCALE_PLATFORM.md) | OOMKill crashing pods — no resource limits meant unbounded memory consumption | 2 hours | Noisy-neighbour outage; one service's memory spike killed unrelated pods |
| 9 | [DevSecOps Pipeline](./REALITY_CHECK_09_DEVSECOPS_PIPELINE.md) | `trivy-action@0.28.0` tag did not exist — entire CI gate was silently broken | Undetected for duration of development | Security scanning was not running on any pull request |

---

## Format

Each document covers multiple failures per project in the following structure:

```
### Problem N — Title

| Field | Value |
|-------------|-------------------------|
| Severity | P1 / P2 / P3 |
| Time Lost | X hours / caught early |
| Discovered | How the bug surfaced |

**Symptom:** What was observed in the terminal / logs.

**Root Cause:** The actual engineering reason it failed.

**Fix Applied:** What was changed to resolve it.

**Business Impact:** What this failure costs in production.
```

---

*Full Reality Check documentation for each project is linked in the table above.*
223 changes: 223 additions & 0 deletions docs/reality-check/REALITY_CHECK_01_IaC_FOUNDATIONS.md
@@ -0,0 +1,223 @@
# Reality Check: IaC Foundations (`aws-foundation` + `security-stack`)

**Projects:** `aws-foundation/` and `security-stack/`
**Stack:** Terraform, AWS VPC, EC2, IAM, KMS, S3, CloudTrail, GuardDuty, LocalStack
**Summary:** Deploying the four-layer module composition (Network → Identity → Security → Compute) surfaced four non-obvious failures, three of them production-critical, that would have been costly on real AWS.

---

## Quick Summary

| Problem | Severity | Time Lost | Status |
| :-- | :-- | :-- | :-- |
| KMS key policy with wildcard `"AWS": "*"` principal | P1 | Caught at review | ✅ Fixed |
| Terraform AWS provider v5.x incompatible with LocalStack EC2 API | P2 | 2 hours | ✅ Fixed — pinned to `~> 4.67` |
| VPC module missing second public subnet output | P2 | 45 min | ✅ Fixed |
| `t2.micro` EBS optimisation check is a false-positive for that instance class | P3 | 30 min | ✅ Suppressed with justification |

---

## Problem 1 — KMS Key Policy: Wildcard Principal Grants Universal Decryption Access

| Field | Value |
| :-- | :-- |
| **Severity** | P1 — Security |
| **Time Lost** | Caught during code review |
| **Discovered** | Manual review of generated key policy in `modules/security/main.tf` |

**Symptom:**

The KMS Customer Managed Key (CMK) protecting CloudTrail logs was created with a wildcard principal:

```hcl
# What was written initially:
policy = jsonencode({
Statement = [{
Principal = { AWS = "*" }
Action = "kms:*"
Effect = "Allow"
}]
})
```

This looked correct because many AWS examples use this shorthand. The KMS key was created successfully and CloudTrail encryption was enabled without errors.

**Root Cause:**

`"AWS": "*"` in a KMS key policy means *any IAM identity in any AWS account* can use the key if its own IAM permissions allow it. This is categorically different from `"AWS": "arn:aws:iam::${account_id}:root"`, which scopes the principal to the account that owns the key and leaves access decisions to that account's IAM policies.

The difference is subtle but the security gap is severe: the wildcard version effectively makes the key usable by any AWS account in the world, limited only by IAM policies — which themselves can be misconfigured.
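
The distinction can also be checked mechanically. The following is a minimal Python sketch, not part of this repository, that flags `Allow` statements whose principal is a wildcard. The function name `find_wildcard_principals` and both sample policies are hypothetical, and `111122223333` is a placeholder account ID. The sketch ignores `Condition` blocks, which can legitimately narrow a wildcard principal.

```python
import json


def find_wildcard_principals(policy_json: str) -> list[int]:
    """Return the indexes of Allow statements whose principal is a wildcard."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if principal == "*":
            flagged.append(index)
            continue
        if isinstance(principal, dict):
            aws = principal.get("AWS", [])
            values = [aws] if isinstance(aws, str) else aws
            if "*" in values:
                flagged.append(index)
    return flagged


# Hypothetical policies mirroring the two variants discussed above.
wildcard_policy = json.dumps({
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": "*"}, "Action": "kms:*"}
    ]
})
scoped_policy = json.dumps({
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # placeholder account
            "Action": "kms:*",
        }
    ]
})

print(find_wildcard_principals(wildcard_policy))
print(find_wildcard_principals(scoped_policy))
```

A check of this kind could run against rendered key policies before review, although a policy scanner such as Checkov is the more usual place for it.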

**Fix Applied:**

```hcl
# Corrected: scope principal to the owning account only
data "aws_caller_identity" "current" {}

policy = jsonencode({
Statement = [{
Principal = {
AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
}
Action = "kms:*"
Effect = "Allow"
}]
})
```

This was documented in ADR-002 as a known production requirement.

**Business Impact:**

In production, a wildcard KMS key policy is a catastrophic misconfiguration. Any principal whose own IAM permissions allow `kms:Decrypt`, including compromised service accounts, developer credentials, and roles in other AWS accounts, can decrypt CloudTrail logs encrypted with that key. For SOC 2 and PCI DSS environments, this would be a critical audit finding and could result in audit failure.

---

## Problem 2 — Terraform AWS Provider v5.x Breaks LocalStack EC2/VPC API Silently

| Field | Value |
| :-- | :-- |
| **Severity** | P2 — Infrastructure |
| **Time Lost** | ~2 hours debugging |
| **Discovered** | `terraform apply` failed with HTTP 400 errors after upgrading the provider |

**Symptom:**

After allowing Terraform to upgrade the AWS provider from `~> 4.67` to `~> 5.0` during a `terraform init -upgrade`, subsequent `terraform apply` runs failed:

```
│ Error: creating EC2 VPC: operation error EC2: CreateVpc,
│ https response error StatusCode: 400, RequestID: ...,
│ api error InvalidParameterValue: The tenancy value 'default' is invalid.
```

The same code deployed successfully the previous day on provider `4.67.0`. No changes were made to the Terraform HCL.

**Root Cause:**

AWS provider v5.x changed how it serialises certain EC2 API parameters (specifically around tenancy and VPC creation). LocalStack's EC2 implementation, which mimics the AWS EC2 API surface, had not yet been updated to handle the new v5.x parameter encoding. The provider and LocalStack were out of sync.

This is a known compatibility issue when using LocalStack as a development backend — the LocalStack team tracks AWS provider compatibility but there is always a lag for major version bumps.

**Fix Applied:**

Pinned the AWS provider version in all Terraform projects to prevent silent upgrades:

```hcl
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.67" # Pinned — v5.x breaks LocalStack EC2/VPC API
}
}
}
```

The dependency lock file `.terraform.lock.hcl` was also committed so that every machine resolves the same provider build, keeping applies reproducible. This decision was documented in ADR-001.
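
To make the effect of the pin concrete, here is a small Python sketch of how a pessimistic constraint such as `~> 4.67` selects versions: only the rightmost component named in the constraint may increase. The helper `satisfies_pessimistic` is hypothetical and simplified (numeric components only, no pre-release tags); it illustrates the rule and is not Terraform's own resolver.

```python
def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` satisfies a '~> X.Y' or '~> X.Y.Z' constraint.

    Simplified: numeric components only, no pre-release tags.
    """
    base = [int(part) for part in constraint.removeprefix("~>").strip().split(".")]
    candidate = [int(part) for part in version.split(".")]
    # Pad so that "4.67" compares like "4.67.0".
    candidate += [0] * (len(base) - len(candidate))

    # Every component left of the constraint's last one must match exactly.
    if candidate[: len(base) - 1] != base[:-1]:
        return False
    # The remaining components may only move upwards from the constraint.
    return candidate >= base


for version in ("4.66.0", "4.67.0", "4.99.1", "5.0.0"):
    print(version, satisfies_pessimistic(version, "~> 4.67"))
```

Under this rule `~> 4.67` accepts any later 4.x release but rejects 5.0.0, which is exactly the upgrade that broke the LocalStack runs.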

**Business Impact:**

On real AWS, this specific provider version issue would not appear, because the real API handles both encodings. The underlying lesson still applies directly to production: an unpinned provider can make `terraform apply` fail after a routine `terraform init -upgrade`, which can block infrastructure changes during an incident. Pinning provider versions is a production requirement and a core Terraform best practice.

---

## Problem 3 — VPC Module Missing Second Public Subnet Output

| Field | Value |
| :-- | :-- |
| **Severity** | P2 — Infrastructure |
| **Time Lost** | ~45 minutes |
| **Discovered** | `terraform plan` error when building `ha-aws-architecture` on top of the VPC module |

**Symptom:**

When composing the `ha-aws-architecture` module (which requires two subnets for ALB multi-AZ placement) on top of the `modules/vpc` module, `terraform plan` failed:

```
│ Error: Unsupported attribute
│ on ha-aws-architecture/main.tf line 47, in resource "aws_lb" "main":
│ │ module.vpc.public_subnet_b_id
│ This object does not have an attribute named "public_subnet_b_id".
```

The VPC module had a `public_subnet_a` resource but its output was named `public_subnet_id`, and there was no output at all for `public_subnet_b`.

**Root Cause:**

The VPC module was designed initially for the `aws-foundation` project, which only needed one public subnet. The second subnet (`public_b`) was added to the VPC module's `main.tf` for HA purposes, but its corresponding output was not added to `modules/vpc/output.tf`. The `ha-aws-architecture` project assumed both outputs would be available.

**Fix Applied:**

Added the missing output to `modules/vpc/output.tf`:

```hcl
output "public_subnet_b_id" {
description = "ID of the second public subnet (AZ-b) — required for ALB multi-AZ placement"
value = aws_subnet.public_b.id
}
```

Also renamed `public_subnet_id` to `public_subnet_a_id` for clarity and consistency, and updated all callers.
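
The same gap can be caught before `terraform plan` by comparing the outputs a module declares with the outputs its callers reference. The Python sketch below is illustrative only: the function `missing_outputs`, the regular expressions, and the sample HCL strings (including the resource name `aws_subnet.public_a`) are assumptions, and a regex scan is far cruder than Terraform's own parser.

```python
import re

# Matches `output "name" {` declarations at the start of a line.
OUTPUT_DECL = re.compile(r'^\s*output\s+"([^"]+)"', re.MULTILINE)


def missing_outputs(module_name: str, module_source: str, caller_source: str) -> set[str]:
    """Return output names the caller references but the module never declares."""
    declared = set(OUTPUT_DECL.findall(module_source))
    reference = re.compile(
        r"\bmodule\." + re.escape(module_name) + r"\.([A-Za-z_][A-Za-z0-9_-]*)"
    )
    referenced = set(reference.findall(caller_source))
    return referenced - declared


# Hypothetical excerpt of the module's outputs before the fix.
module_source = '''
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_id" {
  value = aws_subnet.public_a.id
}
'''

# Hypothetical caller that expects two subnet outputs.
caller_source = '''
resource "aws_lb" "main" {
  subnets = [module.vpc.public_subnet_id, module.vpc.public_subnet_b_id]
}
'''

print(missing_outputs("vpc", module_source, caller_source))
```

For the sample input the only unresolved reference is `public_subnet_b_id`, the attribute named in the error message above.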

**Business Impact:**

A module interface that does not expose what consumers need forces callers to break encapsulation (reaching into module internals). In a team environment with shared modules, this breaks dependent projects silently until `terraform plan` is run. This is the exact reason module interfaces should be defined and versioned before callers are written.

---

## Problem 4 — `t2.micro` EBS Optimisation: False-Positive Security Finding

| Field | Value |
| :-- | :-- |
| **Severity** | P3 — Tooling / False Positive |
| **Time Lost** | ~30 minutes investigation |
| **Discovered** | Checkov CI scan flagging `CKV_AWS_135` on `aws_instance.web` |

**Symptom:**

Checkov reported the following finding on every run:

```
Check: CKV_AWS_135: "Ensure that EC2 is EBS optimized"
FAILED for resource: aws_instance.web
File: aws-foundation/main.tf
```

**Root Cause:**

The `t2.micro` instance type does not support EBS optimisation — it is not a capability of that instance class. AWS's own documentation lists `t2.*` as not supporting EBS optimisation. Checkov's CKV_AWS_135 check does not filter by instance type and flags any instance without `ebs_optimized = true` regardless of whether the instance type supports the feature.

Adding `ebs_optimized = true` to a `t2.micro` would cause `terraform apply` to fail with:

```
│ Error: creating EC2 Instance: EbsOptimizedNotSupported:
│ The requested configuration is not supported.
```

**Fix Applied:**

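The check is suppressed with an inline skip that carries its justification. Checkov documents these skips as comments placed inside the resource block, so the comment goes on the first line of the block:

```hcl
resource "aws_instance" "web" {
  # checkov:skip=CKV_AWS_135:t2.micro does not support EBS optimisation; upgrade to t3.micro or larger in production
  # ... remaining arguments unchanged ...
}
```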

**Business Impact:**

Uninvestigated false-positives cause engineers to suppress all scanner findings indiscriminately ("alert fatigue"), which eventually leads to real critical findings being missed. The correct approach — suppress with justification — keeps the scanner signal high. In production, the instance type would be `t3.micro` or larger, which does support EBS optimisation, and the skip comment would be removed.
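
"Suppress with justification" can itself be enforced. The Python sketch below scans Terraform source for `checkov:skip` comments and reports any that lack a reason after the check ID. The function name `unjustified_skips` and the sample snippet are hypothetical; the comment layout follows Checkov's documented `checkov:skip=<check_id>:<reason>` form.

```python
import re

# Captures the check ID and, if present, the free-text reason after the colon.
SKIP = re.compile(r"checkov:skip=(?P<check>[A-Za-z0-9_]+)(?::(?P<reason>.*))?")


def unjustified_skips(source: str) -> list[tuple[int, str]]:
    """Return (line_number, check_id) for every skip comment with no reason."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = SKIP.search(line)
        if match and not (match.group("reason") or "").strip():
            findings.append((line_number, match.group("check")))
    return findings


# Hypothetical snippet: the first skip is justified, the second is not.
sample = """\
resource "aws_instance" "web" {
  # checkov:skip=CKV_AWS_135:t2.micro does not support EBS optimisation
  instance_type = "t2.micro"
}

resource "aws_security_group" "web_sg" {
  # checkov:skip=CKV_AWS_260
  name = "web-server-sg"
}
"""

print(unjustified_skips(sample))
```

Run as a pre-commit or CI step, a check like this keeps every suppression reviewable instead of letting bare skips accumulate.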

---

## What These Failures Prove

Building these two projects in sequence forced solutions to four classes of production problem:

1. **Security reasoning, not just security tools** — recognising a wildcard KMS principal is wrong requires understanding AWS's trust model, not just knowing that KMS encryption exists.
2. **Dependency management under time pressure** — provider version pinning is often skipped in tutorials and discovered the hard way in production during an upgrade.
3. **Module interface design** — a module without complete outputs is a contract violation. The lesson is to design the VPC module's interface upfront with all known consumers in mind.
4. **Scanner signal discipline** — distinguishing a real finding from a false-positive and documenting the decision is as important as fixing real findings.