diff --git a/Taskfile.yaml b/Taskfile.yaml new file mode 100644 index 0000000..9fab6f1 --- /dev/null +++ b/Taskfile.yaml @@ -0,0 +1,13 @@ +version: '3' + +includes: + # Documentation tasks + docs: + taskfile: ./docs/Taskfile.yaml + dir: ./docs + +tasks: + generate: + desc: Run code generation (deepcopy, defaults) + deps: + - task: docs:generate \ No newline at end of file diff --git a/docs/Taskfile.yaml b/docs/Taskfile.yaml new file mode 100644 index 0000000..a526bfc --- /dev/null +++ b/docs/Taskfile.yaml @@ -0,0 +1,69 @@ +version: '3' + +vars: + DIAGRAMS_DIR: "{{.ROOT_DIR}}/docs/diagrams" + OUTPUT_FORMAT: "png" + PLANTUML_IMAGE: plantuml/plantuml:1.2026.4 + +tasks: + generate: + desc: Generate all documentation artifacts (diagrams, etc.) + cmds: + - task: diagrams:render + silent: true + + diagrams: + desc: Generate all architecture diagrams from PlantUML + cmds: + - task: diagrams:render + silent: true + + diagrams:render: + desc: Render PlantUML diagrams to PNG format using Docker + cmds: + - | + set -e + echo "Rendering PlantUML diagrams..." + echo "" + + # Check if PlantUML files exist + if ! ls {{.DIAGRAMS_DIR}}/*.puml >/dev/null 2>&1; then + echo "❌ Error: PlantUML source files (*.puml) not found in {{.DIAGRAMS_DIR}}" + exit 1 + fi + + # Render using Docker (no local installation required) + docker run --rm \ + -v "{{.DIAGRAMS_DIR}}":/data \ + {{.PLANTUML_IMAGE}} \ + -t{{.OUTPUT_FORMAT}} \ + /data/*.puml + + echo "" + echo "✅ Diagrams rendered in {{.DIAGRAMS_DIR}}" + echo "" + echo "Generated files:" + ls -1 {{.DIAGRAMS_DIR}}/*.{{.OUTPUT_FORMAT}} 2>/dev/null | xargs -n1 basename || echo "No output files found" + silent: true + + diagrams:clean: + desc: Remove generated diagram files + cmds: + - | + rm -f {{.DIAGRAMS_DIR}}/*.png {{.DIAGRAMS_DIR}}/*.svg + echo "✅ Generated diagram files removed" + silent: true + + diagrams:validate: + desc: Validate PlantUML syntax using Docker + cmds: + - | + set -e + echo "Validating PlantUML diagrams..." 
+ docker run --rm \ + -v "{{.DIAGRAMS_DIR}}":/data \ + {{.PLANTUML_IMAGE}} \ + -checkonly \ + /data/*.puml + echo "✅ All diagrams are valid" + silent: true diff --git a/docs/diagrams/fraudulent-login-flow.png b/docs/diagrams/fraudulent-login-flow.png new file mode 100644 index 0000000..6a4bbbf Binary files /dev/null and b/docs/diagrams/fraudulent-login-flow.png differ diff --git a/docs/diagrams/fraudulent-login-sequence.png b/docs/diagrams/fraudulent-login-sequence.png new file mode 100644 index 0000000..c922013 Binary files /dev/null and b/docs/diagrams/fraudulent-login-sequence.png differ diff --git a/docs/diagrams/fraudulent-login-sequence.puml b/docs/diagrams/fraudulent-login-sequence.puml new file mode 100644 index 0000000..2de98b9 --- /dev/null +++ b/docs/diagrams/fraudulent-login-sequence.puml @@ -0,0 +1,51 @@ +@startuml +skinparam handwritten false +skinparam participantPadding 10 +skinparam boxPadding 10 + +box "Authentication System" #LightBlue +participant "Zitadel Event Handler" as Zitadel +end box + +box "Kubernetes Control Plane" #LightYellow +database "K8s API Server" as K8s +end box + +box "Fraud System" #LightPink +participant "Fraud Controller" as FraudCtrl +end box + +box "Notification System" #LightGreen +participant "Notification Operator" as Notif +end box + +== 1. Session Created Event == +Zitadel -> Zitadel: Webhook received:\noidc_session.added +Zitadel -> K8s: Create LoginEvaluation resource\n(Spec: UserRef, loginEmail, LoginContext) + +== 2. 
Reconcile & Fraud Evaluation == +K8s -> FraudCtrl: Watch event: LoginEvaluation Created +activate FraudCtrl + +FraudCtrl -> K8s: List Sessions (for UserRef) +K8s --> FraudCtrl: Return historical Session resources + +FraudCtrl -> FraudCtrl: Compare current LoginContext\nagainst historical sessions\n(Compare IP, UserAgent, Fingerprint) + +alt Login is Fraudulent (Suspicious) + FraudCtrl -> K8s: Resolve Location (via GraphQL Gateway LookupIP) + K8s --> FraudCtrl: Return resolved Location details + FraudCtrl -> K8s: Parse User-Agent (via GraphQL Gateway ParseUserAgent) + K8s --> FraudCtrl: Return parsed Device & Browser + + FraudCtrl -> K8s: Create Email resource\n(Spec: Recipient, Template, Variables) + activate Notif + Notif -> K8s: Update Email status to Sent + deactivate Notif + + FraudCtrl -> K8s: Update LoginEvaluation Status\n(isFraudulent=true, phase=Completed) +else Login is Normal + FraudCtrl -> K8s: Update LoginEvaluation Status\n(isFraudulent=false, phase=Completed) +end +deactivate FraudCtrl +@enduml diff --git a/docs/enhancements/fraudulent-login.md b/docs/enhancements/fraudulent-login.md new file mode 100644 index 0000000..d14497c --- /dev/null +++ b/docs/enhancements/fraudulent-login.md @@ -0,0 +1,650 @@ +--- +status: provisional +stage: alpha +latest-milestone: "v0.x" +--- + + + + +# Fraudulent Login Evaluation + + + + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring 
Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Summary + + + +This enhancement proposes shifting the responsibility of evaluating suspicious user logins and alerting users of anomalous access from the identity provider layer to the central fraud detection system. + +Instead of the authentication gateway performing inline fraud risk checks and sending email alerts synchronously, it will delegate login event data directly to the fraud detection service. The fraud service then evaluates the login context against the user's historical session patterns to determine if it is anomalous (e.g., a new IP, browser, or device). When a suspicious login is identified, the fraud system automatically enriches the metadata with geographic location and device details and sends a security alert to the user. + +## Motivation + + + +Currently, the authentication provider is coupled with security and fraud rules. Evaluating whether a login context (IP, User-Agent, or Fingerprint) is anomalous requires knowledge of session histories, geolocation lookups, and user-agent analysis. Housing this capability inside the authentication gateway introduces several disadvantages: +- **Feature Coupling**: The authentication system should focus exclusively on validating user credentials, rather than performing geolocation enrichment and complex risk analysis. +- **Fragmented Fraud Policies**: Security policies and risk assessment logic are split across different systems, making it difficult to maintain and audit consistently. +- **Lack of Central Audit Logging**: Suspicious login decisions are made in-memory and logged, but they are not stored as persistent audit records for security administrators. 
+ +By centralizing login evaluation within the fraud system, we establish a clean separation of responsibilities, improve the auditability of security decisions, and ensure a unified security and fraud policy engine. + +### Goals + + + +- Decouple the login flow from fraud and alert policies. +- Centralize login risk assessment within the dedicated fraud detection system. +- Utilize historical user session characteristics to recognize anomalous login attempts. +- Provide persistent audit records for all evaluated login attempts. +- Deliver automated, metadata-enriched security notifications to users upon detection of suspicious logins. + +### Non-Goals + + + +- Modifying the Zitadel event delivery system or changing Zitadel webhook payloads. +- Replacing or modifying the existing `FraudEvaluation` pipeline, which focuses on long-term user risk profiles rather than transient login events. +- Creating an independent geo-IP database; the fraud operator will leverage the existing GraphQL gateway. + +## Proposal + + + +We propose an event-driven flow for evaluating user login security: + +1. **Login Event Propagation**: Upon a new user login, the authentication system publishes a login attempt record containing the login context (IP, User-Agent, device fingerprint, and timestamp). +2. **Historical Analysis**: The fraud system receives the login event and queries the historical record of that user's sessions. +3. **Anomalous Context Detection**: The fraud system checks if the incoming login context is new or unseen compared to the user's past active sessions. +4. **Metadata Enrichment**: If the login is flagged as anomalous, the fraud system translates the raw client IP and User-Agent strings into human-readable geographic locations and device descriptions. +5. **Security Alerting**: The fraud system triggers a high-priority notification to alert the user of the suspicious access attempt. +6. 
**Audit Persistence**: The outcome of the evaluation (whether flagged or not) is recorded in the fraud system's audit logs. + +### User Stories (Optional) + + + +#### Story 1 +As a User, I want to receive an email alert when a new login occurs on my account from a device or location I have not used before, so that I can secure my account. + +#### Story 2 +As a Security Admin, I want to query a list of login evaluations (`kubectl get loginevaluations`) to see all evaluated login events, their details, and whether they were flagged as fraudulent. + +### Notes/Constraints/Caveats (Optional) + + + +- **Race Conditions**: When a new session is added, `Session` resources in the cluster may be updated asynchronously. The fraud controller must ignore the current session itself when looking at historical data to avoid comparing a login to itself. +- **Gateway Availability**: Geolocation and user-agent parsing depend on the GraphQL gateway. If the gateway is down, the system should fall back gracefully to raw values. + +### Risks and Mitigations + + + +- **Resource Proliferation**: A high volume of login events could produce many `LoginEvaluation` resources, leading to API server stress. + *Mitigation*: Implement a garbage-collection policy (e.g., TTL controller or owner references) to delete old `LoginEvaluation` resources after a configured retention period. +- **Performance Overhead**: Fetching session lists and performing HTTP lookups during reconciliation can delay evaluation. + *Mitigation*: Use client caching for `Session` lookups, run network requests concurrently, and handle transient errors with proper exponential backoff retries. + +## Design Details + + + +### LoginEvaluation CRD Schema + +The new custom resource `LoginEvaluation` will represent a login event under the `fraud.miloapis.com` group. 
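The manifest that follows could be backed by Go API types along these lines. This is a minimal sketch, not the proposed implementation: the type names, the JSON tags, and the `isSuspicious` helper are assumptions made for illustration. Historical `Session` resources are modelled with the same `LoginContext` struct because the `Session` schema is not given in this proposal, and `status.conditions` is omitted to keep the sketch free of Kubernetes dependencies. The helper follows the rule under "Evaluation Logic & Flow" literally: a login is suspicious only when no other session shares its IP, User-Agent, or fingerprint, so a user with no prior sessions is flagged too.

```go
package main

import "fmt"

// LoginContext mirrors spec.loginContext in the sample manifest.
type LoginContext struct {
	SessionID     string `json:"sessionID"`
	IP            string `json:"ip"`
	UserAgent     string `json:"userAgent"`
	FingerprintID string `json:"fingerprintID"`
	CreatedAt     string `json:"createdAt"`
}

// UserReference points at the User resource by name (hypothetical type name).
type UserReference struct {
	Name string `json:"name"`
}

// LoginEvaluationSpec mirrors the spec block of the sample manifest.
type LoginEvaluationSpec struct {
	UserRef      UserReference `json:"userRef"`
	LoginEmail   string        `json:"loginEmail,omitempty"`
	LoginContext LoginContext  `json:"loginContext"`
}

// LoginEvaluationStatus mirrors the status block, minus conditions.
type LoginEvaluationStatus struct {
	Phase        string `json:"phase,omitempty"`
	IsFraudulent bool   `json:"isFraudulent"`
}

// sameNonEmpty reports whether two values are equal and not blank, so that
// two missing fingerprints are not treated as a match.
func sameNonEmpty(a, b string) bool {
	return a != "" && a == b
}

// isSuspicious returns true when no historical session, other than the
// current one, shares the IP, User-Agent, or fingerprint of the login.
func isSuspicious(current LoginContext, history []LoginContext) bool {
	for _, past := range history {
		if past.SessionID == current.SessionID {
			continue // skip the session created by this very login
		}
		if sameNonEmpty(past.IP, current.IP) ||
			sameNonEmpty(past.UserAgent, current.UserAgent) ||
			sameNonEmpty(past.FingerprintID, current.FingerprintID) {
			return false
		}
	}
	return true
}

func main() {
	spec := LoginEvaluationSpec{
		UserRef:    UserReference{Name: "user-zitadel-id-123"},
		LoginEmail: "user@example.com",
		LoginContext: LoginContext{
			SessionID:     "sess-98765",
			IP:            "203.0.113.88",
			UserAgent:     "Mozilla/5.0 (Macintosh)",
			FingerprintID: "fp-ab12cd34",
		},
	}
	history := []LoginContext{
		{SessionID: "sess-11111", IP: "198.51.100.7", UserAgent: "Mozilla/5.0 (Windows)", FingerprintID: "fp-0000"},
		spec.LoginContext, // the current session may already be listed
	}
	status := LoginEvaluationStatus{
		Phase:        "Completed",
		IsFraudulent: isSuspicious(spec.LoginContext, history),
	}
	fmt.Printf("user=%s fraudulent=%t phase=%s\n", spec.UserRef.Name, status.IsFraudulent, status.Phase)
}
```

Whether a first-ever login should raise an alert, and whether one new attribute alone should be enough to flag a login, are policy choices that the sketch does not settle.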
+ +```yaml +apiVersion: fraud.miloapis.com/v1alpha1 +kind: LoginEvaluation +metadata: + name: login-eval-sample + namespace: fraud-system +spec: + # Reference to the User resource + userRef: + name: user-zitadel-id-123 + # Optional email address used for this specific login attempt (essential when users can log in with different emails/OIDC providers) + loginEmail: "user@example.com" + # Context details about the login attempt + loginContext: + sessionID: sess-98765 + ip: 203.0.113.88 + userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" + fingerprintID: fp-ab12cd34 + createdAt: "2026-06-17T16:11:00Z" +status: + # Current phase: Pending, Running, Completed, Error + phase: Completed + # Evaluation result + isFraudulent: true + # Status conditions representing evaluation steps + conditions: + - type: Ready + status: "True" + lastTransitionTime: "2026-06-17T16:11:03Z" + reason: EvaluationCompleted + message: "Login evaluated and processed successfully." + - type: UserRefValid + status: "True" + lastTransitionTime: "2026-06-17T16:11:02Z" + reason: UserRefExists + message: "Subject user-zitadel-id-123 is valid and exists." + - type: NotificationSent + status: "True" + lastTransitionTime: "2026-06-17T16:11:03Z" + reason: NotificationDispatched + message: "Alert notification created for delivery." +``` + +### Sequence Diagram + +![Sequence Diagram](../diagrams/fraudulent-login-sequence.png) + +### Evaluation Logic & Flow + +1. **Triggering**: The Zitadel handler receives the `oidc_session.added` payload. Instead of running analysis logic, it builds a `LoginEvaluation` resource and writes it to the Kubernetes API. +2. **Session Retrieval**: The fraud controller uses the UserRef from the spec to retrieve all existing `Session` resources under the `identity.miloapis.com/v1alpha1` group. +3. 
**Suspicious Context Check**: + - The controller filters out the current session ID to avoid checking against itself. + - It checks if the current IP address, User-Agent string, or fingerprint ID matches any historical session records. + - If *none* of the historical sessions match the current IP, User-Agent, or fingerprint, the login is marked as suspicious. +4. **Geolocation and UA Parsing**: + - The fraud controller calls the external GraphQL Gateway to get human-readable location details for the IP. + - The user agent string is resolved to determine the OS (device) and Browser. +5. **Notification**: + - If flagged, a high-priority `Email` resource is created in the notification namespace, targeting the recipient user with variables: `UserName`, `Email`, `Location`, `SignInTime`, `Browser`, `Device`, and `IpAddress`. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +#### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control plane? + - Will enabling / disabling the feature require downtime or reprovisioning of a node? + +#### Does enabling the feature change any default behavior? + + + +#### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +#### What happens if we reenable the feature if it was previously rolled back? + +#### Are there any tests for feature enablement/disablement? + +### Rollout, Upgrade and Rollback Planning + + + +#### How can a rollout or rollback fail? Can it impact already running workloads? + + + +#### What specific metrics should inform a rollback? + + + +#### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +#### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +#### How can an operator determine if the feature is in use by workloads? + + + +#### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +#### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +#### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +#### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +#### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +#### Will enabling / using this feature result in any new API calls? + + + +#### Will enabling / using this feature result in introducing new API types? + + + +#### Will enabling / using this feature result in any new calls to the cloud provider? + + + +#### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +#### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +#### Will enabling / using this feature result in non-negligible increase of resource usage in any components? + + + +#### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +#### How does this feature react if the API server is unavailable? + +#### What are other known failure modes? 
+ + + +#### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +- **2026-06-17**: Initial enhancement proposal drafted (Alpha). + +## Drawbacks + + + +- **Increased API Overhead**: Each user login now triggers at least one additional write to the Kubernetes API server (`LoginEvaluation` creation) and several reads. +- **Dependency on CRD**: If the `LoginEvaluation` CRD is deleted or misconfigured, it breaks the fraud-alert pipeline. + +## Alternatives + + + +- **Kafka / Event Bus integration**: Send authentication events directly to a broker like Kafka or RabbitMQ, which the fraud operator listens to. While scalable, it introduces a massive external infrastructure requirement. Kubernetes CRDs offer a simple, native control plane fit for the existing environment. + +## Infrastructure Needed (Optional) + + + +None. \ No newline at end of file