Add KEDA-based autoscaling support with scale-to-zero #74
Introduces KEDA ScaledObject management for Agent CRDs, enabling event-driven autoscaling including scale-to-zero for idle agents. This addresses the real cost problem of running unused agent replicas without building custom autoscaling infrastructure.
Changes:
- Add ScalingSpec to AgentSpec with triggers, min/max replicas, cooldown period, and polling interval configuration
- Create ScaledObject builder using unstructured objects to avoid hard KEDA module dependency
- Add ScaledObjectRepo with create/update/delete via controller-runtime
- Integrate ScaledObject reconciliation into agent app service with graceful handling when KEDA is not installed
- Add ScalingReady condition for observability
- Add RBAC for keda.sh/scaledobjects
- Update CRD schema and Helm chart
- Include sample configs for Prometheus and cron-based scaling
https://claude.ai/code/session_018ctEFbksh15TJzfJwhBNvs
Pull request overview
This PR adds KEDA-based autoscaling support to the Flokoa operator, enabling agents to dynamically scale based on custom metrics and scale to zero when idle. The implementation uses unstructured Kubernetes objects to avoid a hard dependency on the KEDA Go module, making KEDA an optional cluster component.
Changes:
- Added ScalingSpec, ScalingTrigger, and ScalingTriggerAuth types to the Agent CRD API for configuring KEDA autoscaling
- Implemented ScaledObject builder and repository layers using unstructured objects to create/update/delete KEDA resources
- Integrated ScaledObject reconciliation into the agent reconciliation loop with non-fatal error handling
- Added comprehensive unit tests for builder and reconciliation logic with fake repository implementations
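
To illustrate the "unstructured objects" approach listed above, here is a minimal sketch that builds a ScaledObject-shaped document from plain Go maps, which is the same shape an unstructured object carries. The `trigger` and `scalingSpec` types, the `deployment` parameter, and the function names are simplified stand-ins assumed for this example, not the PR's actual types; the `-scaler` suffix follows the name asserted in the PR's tests.

```go
package main

import "fmt"

// trigger and scalingSpec are simplified, hypothetical stand-ins for the
// PR's ScalingTrigger and ScalingSpec API types.
type trigger struct {
	Type     string
	Metadata map[string]string
}

type scalingSpec struct {
	MinReplicaCount *int32
	MaxReplicaCount *int32
	Triggers        []trigger
}

// scaledObjectName mirrors the "<agent>-scaler" name asserted in the PR's tests.
func scaledObjectName(agentName string) string { return agentName + "-scaler" }

// buildScaledObject returns a plain map shaped like a KEDA ScaledObject.
// Optional replica bounds are omitted when unset. No KEDA Go types are needed.
func buildScaledObject(agentName, namespace, deployment string, s scalingSpec) map[string]interface{} {
	triggers := make([]interface{}, 0, len(s.Triggers))
	for _, t := range s.Triggers {
		md := make(map[string]interface{}, len(t.Metadata))
		for k, v := range t.Metadata {
			md[k] = v
		}
		triggers = append(triggers, map[string]interface{}{"type": t.Type, "metadata": md})
	}
	spec := map[string]interface{}{
		"scaleTargetRef": map[string]interface{}{"name": deployment},
		"triggers":       triggers,
	}
	if s.MinReplicaCount != nil {
		spec["minReplicaCount"] = int64(*s.MinReplicaCount)
	}
	if s.MaxReplicaCount != nil {
		spec["maxReplicaCount"] = int64(*s.MaxReplicaCount)
	}
	return map[string]interface{}{
		"apiVersion": "keda.sh/v1alpha1",
		"kind":       "ScaledObject",
		"metadata": map[string]interface{}{
			"name":      scaledObjectName(agentName),
			"namespace": namespace,
		},
		"spec": spec,
	}
}

func main() {
	zero, five := int32(0), int32(5)
	obj := buildScaledObject("test-agent", "default", "test-agent", scalingSpec{
		MinReplicaCount: &zero,
		MaxReplicaCount: &five,
		Triggers:        []trigger{{Type: "prometheus", Metadata: map[string]string{"threshold": "100"}}},
	})
	fmt.Println(obj["metadata"].(map[string]interface{})["name"])
}
```

Because the result is only nested maps, slices, and scalars, it can be handed to an unstructured client without importing the KEDA module, which is the trade-off the PR describes.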
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| operator/api/v1alpha1/agent_types.go | Defines ScalingSpec, ScalingTrigger, and ScalingTriggerAuth API types with kubebuilder validation markers |
| operator/api/v1alpha1/zz_generated.deepcopy.go | Auto-generated DeepCopy methods for new scaling-related types |
| operator/internal/infra/builder/scaledobject.go | Pure function to build unstructured KEDA ScaledObjects from agent configuration |
| operator/internal/infra/builder/scaledobject_test.go | Unit tests covering ScaledObject construction with various trigger configurations |
| operator/internal/infra/repo/scaledobject.go | Repository implementation for CRUD operations on KEDA ScaledObjects using unstructured client |
| operator/internal/infra/repo/fakes/scaledobject_fake.go | In-memory fake implementation for testing ScaledObject operations |
| operator/internal/infra/repo/interfaces.go | Added ScaledObjectRepo interface to repository layer |
| operator/internal/app/agent/reconcile.go | Added reconcileScaledObject method with non-fatal error handling and KEDA CRD detection |
| operator/internal/app/agent/reconcile_scaling_test.go | Integration tests for ScaledObject lifecycle (create, update, delete) |
| operator/internal/domain/agent/status.go | Added ScalingReady condition type and related reason constants |
| operator/internal/controller/agent_controller.go | Added RBAC markers for ScaledObject permissions |
| operator/cmd/main.go | Wired ScaledObjectRepoImpl into agent service dependencies |
| operator/config/samples/agent_v1alpha1_agent_keda.yaml | Example Agent manifests demonstrating Prometheus and cron-based scaling |
| operator/config/rbac/role.yaml | Generated RBAC role with ScaledObject permissions |
| operator/config/crd/bases/agent.flokoa.ai_agents.yaml | Generated CRD with scaling spec and status fields |
| operator/charts/flokoa/files/crds/agent.flokoa.ai_agents.yaml | Helm chart CRD with scaling support |

            "spec": spec,
        },
    }

The constructed unstructured object should have its GVK explicitly set using SetGroupVersionKind. While the apiVersion and kind are present in the Object map, explicitly setting the GVK ensures proper handling by the Kubernetes client, especially for owner reference resolution. Consider adding obj.SetGroupVersionKind(ScaledObjectGVK) after constructing the object.
Suggested change:

    + obj.SetGroupVersionKind(ScaledObjectGVK)

    // Preserve the existing metadata (resourceVersion, etc.) and update spec + labels
    existing.Object["spec"] = desired.Object["spec"]
    existing.SetLabels(desired.GetLabels())

When updating an existing ScaledObject, consider preserving annotations in addition to labels. KEDA and other controllers may add annotations to ScaledObjects that should be preserved across updates. The current implementation only preserves labels but directly overwrites the spec, which could lose important annotations added by KEDA itself.
Suggested change:

    - existing.SetLabels(desired.GetLabels())
    + existing.SetLabels(desired.GetLabels())
    + // Preserve existing annotations while merging in desired annotations.
    + // This avoids dropping annotations that may have been added by KEDA or other controllers.
    + desiredAnnotations := desired.GetAnnotations()
    + if len(desiredAnnotations) > 0 {
    +     existingAnnotations := existing.GetAnnotations()
    +     if existingAnnotations == nil {
    +         existingAnnotations = map[string]string{}
    +     }
    +     for k, v := range desiredAnnotations {
    +         existingAnnotations[k] = v
    +     }
    +     existing.SetAnnotations(existingAnnotations)
    + }
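
The merge behaviour the suggestion describes can be isolated into a small helper and checked without a cluster. The following is a minimal sketch under that assumption; `mergeAnnotations` and the annotation keys are hypothetical names chosen for illustration, not code from the PR.

```go
package main

import "fmt"

// mergeAnnotations returns a new map holding every existing annotation,
// overlaid with the desired ones. Keys present only on the existing object
// (for example, ones written by another controller) are preserved.
func mergeAnnotations(existing, desired map[string]string) map[string]string {
	if len(existing) == 0 && len(desired) == 0 {
		return nil
	}
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range desired {
		merged[k] = v // desired values win on conflict
	}
	return merged
}

func main() {
	existing := map[string]string{
		"example.com/added-by-controller": "keep-me", // hypothetical key
		"example.com/owner":               "old",
	}
	desired := map[string]string{"example.com/owner": "new"}
	fmt.Println(mergeAnnotations(existing, desired))
}
```

Returning a fresh map rather than mutating the input keeps the helper free of side effects, at the cost of one extra allocation per update.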

        }
        agent.Status.ScaledObjectName = ""
        agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
            metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")

The condition should be set to True (not False) when a ScaledObject is successfully removed, or the condition should be removed entirely. Setting status to ConditionFalse with reason ReasonScaledObjectRemoved is semantically confusing - "ScalingReady=False" suggests scaling is not working, but removal is actually the correct desired state when scaling is disabled. Consider either setting it to True with a message like "Scaling disabled" or removing the condition entirely when scaling is not configured.
Suggested change:

    - metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")
    + metav1.ConditionTrue, agentdomain.ReasonScaledObjectRemoved, "Scaling disabled: ScaledObject removed")

    if agent.Spec.Scaling == nil {
        // Scaling removed: delete ScaledObject if it exists and clear status
        if agent.Status.ScaledObjectName != "" {
            if err := s.deps.ScaledObjects.DeleteScaledObject(ctx, types.NamespacedName{
                Name:      agent.Status.ScaledObjectName,
                Namespace: agent.Namespace,
            }); err != nil {
                agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
                    metav1.ConditionFalse, agentdomain.ReasonScaledObjectFailed,
                    fmt.Sprintf("Failed to delete ScaledObject: %v", err))
                return err
            }
            agent.Status.ScaledObjectName = ""
            agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
                metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")
        }
        return nil

There's a potential issue when agent.Spec.Scaling is nil but agent.Status.ScaledObjectName is also empty - the deletion will be skipped even if a ScaledObject with the standard naming convention exists. This could happen if the status was never updated or was cleared. Consider attempting deletion based on the naming convention (builder.ScaledObjectName(agent.Name)) rather than only relying on agent.Status.ScaledObjectName, treating NotFound errors as success.
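
A minimal sketch of the convention-based deletion proposed here, using only the standard library. `errNotFound`, `scaledObjectDeleter`, `fakeRepo`, and `deleteScaledObjectIfPresent` are hypothetical stand-ins; in the operator the check would more likely be `apierrors.IsNotFound` or `client.IgnoreNotFound` against the real API error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// scaledObjectDeleter is a hypothetical, narrowed view of the repository.
type scaledObjectDeleter interface {
	Delete(namespace, name string) error
}

// fakeRepo is an in-memory repository keyed by "namespace/name".
type fakeRepo struct{ objects map[string]bool }

func (r *fakeRepo) Delete(namespace, name string) error {
	key := namespace + "/" + name
	if !r.objects[key] {
		return errNotFound
	}
	delete(r.objects, key)
	return nil
}

// scaledObjectName mirrors the "<agent>-scaler" name asserted in the PR's tests.
func scaledObjectName(agentName string) string { return agentName + "-scaler" }

// deleteScaledObjectIfPresent deletes by naming convention instead of relying
// on a status field, and treats "not found" as success so the call is idempotent.
func deleteScaledObjectIfPresent(repo scaledObjectDeleter, namespace, agentName string) error {
	err := repo.Delete(namespace, scaledObjectName(agentName))
	if err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete ScaledObject: %w", err)
	}
	return nil
}

func main() {
	repo := &fakeRepo{objects: map[string]bool{"default/test-agent-scaler": true}}
	fmt.Println(deleteScaledObjectIfPresent(repo, "default", "test-agent")) // object existed
	fmt.Println(deleteScaledObjectIfPresent(repo, "default", "test-agent")) // already gone
}
```

With this shape, an empty or stale `agent.Status.ScaledObjectName` no longer leaves an orphaned ScaledObject behind, at the cost of one extra delete call per reconcile when scaling is disabled.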

    func TestReconcileScaledObject_CreatesWhenScalingConfigured(t *testing.T) {
        scaledObjectRepo := fakes.NewFakeScaledObjectRepo()
        svc := newTestServiceWithScaling(scaledObjectRepo)

        agent := &agentv1alpha1.Agent{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "test-agent",
                Namespace: "default",
            },
            Spec: agentv1alpha1.AgentSpec{
                Scaling: &agentv1alpha1.ScalingSpec{
                    MinReplicaCount: int32Ptr(0),
                    MaxReplicaCount: int32Ptr(5),
                    CooldownPeriod:  int32Ptr(300),
                    PollingInterval: int32Ptr(15),
                    Triggers: []agentv1alpha1.ScalingTrigger{
                        {
                            Type:     "prometheus",
                            Metadata: map[string]string{"threshold": "100"},
                        },
                    },
                },
            },
        }

        err := svc.reconcileScaledObject(context.Background(), agent)
        if err != nil {
            t.Fatalf("reconcileScaledObject() error = %v", err)
        }

        // Verify ScaledObject was created
        key := types.NamespacedName{
            Name:      builder.ScaledObjectName("test-agent"),
            Namespace: "default",
        }
        if _, ok := scaledObjectRepo.ScaledObjects[key]; !ok {
            t.Error("ScaledObject was not created")
        }

        // Verify status updated
        if agent.Status.ScaledObjectName != "test-agent-scaler" {
            t.Errorf("ScaledObjectName = %q, want test-agent-scaler", agent.Status.ScaledObjectName)
        }

        // Verify condition set
        cond := meta.FindStatusCondition(agent.Status.Conditions, agentdomain.ConditionTypeScalingReady)
        if cond == nil {
            t.Fatal("ScalingReady condition not set")
        }
        if cond.Status != metav1.ConditionTrue {
            t.Errorf("ScalingReady status = %q, want True", cond.Status)
        }
        if cond.Reason != agentdomain.ReasonScaledObjectReady {
            t.Errorf("ScalingReady reason = %q, want %q", cond.Reason, agentdomain.ReasonScaledObjectReady)
        }
    }

The test should also verify that the ScaledObject's spec contains the expected trigger configuration, not just that it exists. Consider adding assertions to check that the trigger type, metadata, and other scaling parameters were correctly transferred to the ScaledObject to ensure the builder and repository layer work end-to-end.
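
One way to make such assertions is to walk the stored object's `spec.triggers` and compare the values. The sketch below does this over plain nested maps, which is the shape an unstructured object holds; `triggerAt` is a hypothetical helper, and in the real test `unstructured.NestedSlice` from apimachinery could serve the same purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// triggerAt walks spec.triggers[i] in unstructured-style content (nested
// map[string]interface{} values) and returns that trigger's type and metadata.
func triggerAt(obj map[string]interface{}, i int) (string, map[string]interface{}, error) {
	spec, ok := obj["spec"].(map[string]interface{})
	if !ok {
		return "", nil, errors.New("spec missing or not a map")
	}
	triggers, ok := spec["triggers"].([]interface{})
	if !ok || i < 0 || i >= len(triggers) {
		return "", nil, fmt.Errorf("trigger %d not found", i)
	}
	trig, ok := triggers[i].(map[string]interface{})
	if !ok {
		return "", nil, fmt.Errorf("trigger %d is not a map", i)
	}
	typ, _ := trig["type"].(string)
	metadata, _ := trig["metadata"].(map[string]interface{})
	return typ, metadata, nil
}

func main() {
	// Sample content matching the trigger configured in the test above.
	obj := map[string]interface{}{
		"spec": map[string]interface{}{
			"triggers": []interface{}{
				map[string]interface{}{
					"type":     "prometheus",
					"metadata": map[string]interface{}{"threshold": "100"},
				},
			},
		},
	}
	typ, metadata, err := triggerAt(obj, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(typ, metadata["threshold"])
}
```

In the test itself, the returned type and metadata would be compared against the values set on the Agent's ScalingSpec, failing with `t.Errorf` on a mismatch.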

    runtime:
      type: standard
      standard:
        replicas: 1

The agent specifies both runtime.standard.replicas: 1 and a KEDA scaling configuration with minReplicaCount: 0. When KEDA takes over autoscaling, it manages the Deployment replicas through an HPA, which may conflict with the static replica count. Consider removing the replicas field from the standard runtime spec when scaling is configured, or add documentation clarifying that KEDA will override this value. The second example (agent-cron-scaling) correctly omits the replicas field.
Suggested change:

    - replicas: 1

Summary
This PR adds KEDA-based autoscaling capabilities to agents, enabling dynamic scaling based on custom metrics and scale-to-zero functionality. When an agent specifies scaling configuration, the operator automatically creates and manages a KEDA ScaledObject targeting the agent's Deployment.
Key Changes
New Scaling API Types (agent_types.go):
- ScalingSpec to configure KEDA autoscaling parameters (min/max replicas, cooldown, polling interval)
- ScalingTrigger to define KEDA triggers (Prometheus, CPU, cron, etc.) with metadata and optional authentication
- ScalingTriggerAuth to reference KEDA TriggerAuthentication resources
ScaledObject Builder (scaledobject.go):
- BuildScaledObject() that constructs unstructured KEDA ScaledObjects from agent configuration
- ScaledObjectName() for consistent naming convention
Repository Layer (scaledobject.go, scaledobject_fake.go):
- ScaledObjectRepoImpl for Kubernetes API interactions using unstructured objects (avoids hard KEDA dependency)
- FakeScaledObjectRepo for testing with in-memory storage
- ScaledObjectRepo interface added to interfaces.go
Reconciliation Logic (reconcile.go):
- reconcileScaledObject() method that creates or updates the ScaledObject when agent.Spec.Scaling is set, sets a status condition (ScalingReady), and updates agent.Status.ScaledObjectName
Status Management (status.go):
- ConditionTypeScalingReady condition type
- Reason constants: ReasonScaledObjectReady, ReasonScaledObjectRemoved, ReasonScaledObjectFailed
RBAC & Configuration:
- RBAC for keda.sh/scaledobjects
- Sample manifest (agent_v1alpha1_agent_keda.yaml) demonstrating Prometheus and cron-based scaling
Comprehensive Tests:
- scaledobject_test.go: Tests for ScaledObject builder covering basic fields, triggers with auth, named triggers, optional field omission, and multiple triggers
- reconcile_scaling_test.go: Integration tests for reconciliation lifecycle (create, update, delete, skip when repo nil)
Implementation Details
https://claude.ai/code/session_018ctEFbksh15TJzfJwhBNvs