Add KEDA-based autoscaling support with scale-to-zero #74

Open
danielnyari wants to merge 1 commit into main from claude/add-keda-integration-gKJSv

@danielnyari

Owner

Summary

This PR adds KEDA-based autoscaling capabilities to agents, enabling dynamic scaling based on custom metrics and scale-to-zero functionality. When an agent specifies scaling configuration, the operator automatically creates and manages a KEDA ScaledObject targeting the agent's Deployment.

Key Changes

  • New Scaling API Types (agent_types.go):

    • Added ScalingSpec to configure KEDA autoscaling parameters (min/max replicas, cooldown, polling interval)
    • Added ScalingTrigger to define KEDA triggers (Prometheus, CPU, cron, etc.) with metadata and optional authentication
    • Added ScalingTriggerAuth to reference KEDA TriggerAuthentication resources
  • ScaledObject Builder (scaledobject.go):

    • Pure function BuildScaledObject() that constructs unstructured KEDA ScaledObjects from agent configuration
    • Handles optional fields (only sets values when non-nil) and converts Go types to unstructured format
    • Helper function ScaledObjectName() for consistent naming convention
  • Repository Layer (scaledobject.go, scaledobject_fake.go):

    • ScaledObjectRepoImpl for Kubernetes API interactions using unstructured objects (avoids hard KEDA dependency)
    • FakeScaledObjectRepo for testing with in-memory storage
    • Added ScaledObjectRepo interface to interfaces.go
  • Reconciliation Logic (reconcile.go):

    • New reconcileScaledObject() method that:
      • Creates ScaledObject when agent.Spec.Scaling is set
      • Updates existing ScaledObject when scaling config changes
      • Deletes ScaledObject when scaling is removed
      • Sets appropriate status conditions (ScalingReady) and updates agent.Status.ScaledObjectName
    • Gracefully skips if ScaledObject repo is not configured (optional feature)
  • Status Management (status.go):

    • Added ConditionTypeScalingReady condition type
    • Added reason constants: ReasonScaledObjectReady, ReasonScaledObjectRemoved, ReasonScaledObjectFailed
  • RBAC & Configuration:

    • Updated operator RBAC role to allow create/delete/update/patch/watch on keda.sh/scaledobjects
    • Updated controller-gen markers for ScaledObject management
    • Added example Agent manifests (agent_v1alpha1_agent_keda.yaml) demonstrating Prometheus and cron-based scaling
  • Comprehensive Tests:

    • scaledobject_test.go: Tests for ScaledObject builder covering basic fields, triggers with auth, named triggers, optional field omission, and multiple triggers
    • reconcile_scaling_test.go: Integration tests for reconciliation lifecycle (create, update, delete, skip when repo nil)

Implementation Details

  • Uses unstructured objects for ScaledObject to avoid hard dependency on KEDA Go module
  • ScaledObject failures are non-fatal—agents continue running without autoscaling if KEDA is unavailable
  • Follows existing patterns: builder pattern for object construction, repository pattern for persistence, condition-based status tracking
  • Supports all major KEDA trigger types through flexible metadata maps
  • Properly handles owner references for garbage collection
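
The builder behaviour described above (unstructured output, optional fields set only when non-nil) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: a plain `map[string]interface{}` stands in for `unstructured.Unstructured.Object`, and the local `ScalingSpec`/`ScalingTrigger` structs and the `buildScaledObject` signature are hypothetical simplifications. The `-scaler` suffix is taken from the test expectation in this PR, and the `keda.sh/v1alpha1` apiVersion and spec keys follow KEDA's ScaledObject schema.

```go
package main

import "fmt"

// ScalingTrigger and ScalingSpec are simplified, hypothetical stand-ins for
// the PR's API types; the real definitions live in agent_types.go.
type ScalingTrigger struct {
	Type     string
	Name     string            // optional trigger name
	Metadata map[string]string // trigger-specific settings
	AuthRef  string            // optional TriggerAuthentication name
}

type ScalingSpec struct {
	MinReplicaCount *int32
	MaxReplicaCount *int32
	CooldownPeriod  *int32
	PollingInterval *int32
	Triggers        []ScalingTrigger
}

// scaledObjectName mirrors the "<agent>-scaler" convention seen in the tests.
func scaledObjectName(agentName string) string { return agentName + "-scaler" }

// setIfNonNil writes an optional value only when it was provided. Values are
// stored as int64 because unstructured content must use JSON-compatible types.
func setIfNonNil(spec map[string]interface{}, key string, v *int32) {
	if v != nil {
		spec[key] = int64(*v)
	}
}

// buildScaledObject assembles the object map without touching any cluster.
func buildScaledObject(agentName, namespace, deployment string, s ScalingSpec) map[string]interface{} {
	triggers := make([]interface{}, 0, len(s.Triggers))
	for _, t := range s.Triggers {
		md := make(map[string]interface{}, len(t.Metadata))
		for k, v := range t.Metadata {
			md[k] = v
		}
		tr := map[string]interface{}{"type": t.Type, "metadata": md}
		if t.Name != "" {
			tr["name"] = t.Name
		}
		if t.AuthRef != "" {
			tr["authenticationRef"] = map[string]interface{}{"name": t.AuthRef}
		}
		triggers = append(triggers, tr)
	}

	spec := map[string]interface{}{
		"scaleTargetRef": map[string]interface{}{"name": deployment},
		"triggers":       triggers,
	}
	setIfNonNil(spec, "minReplicaCount", s.MinReplicaCount)
	setIfNonNil(spec, "maxReplicaCount", s.MaxReplicaCount)
	setIfNonNil(spec, "cooldownPeriod", s.CooldownPeriod)
	setIfNonNil(spec, "pollingInterval", s.PollingInterval)

	return map[string]interface{}{
		"apiVersion": "keda.sh/v1alpha1",
		"kind":       "ScaledObject",
		"metadata": map[string]interface{}{
			"name":      scaledObjectName(agentName),
			"namespace": namespace,
		},
		"spec": spec,
	}
}

func main() {
	zero, five := int32(0), int32(5)
	obj := buildScaledObject("test-agent", "default", "test-agent", ScalingSpec{
		MinReplicaCount: &zero,
		MaxReplicaCount: &five,
		Triggers: []ScalingTrigger{
			{Type: "prometheus", Metadata: map[string]string{"threshold": "100"}},
		},
	})
	fmt.Println(obj["metadata"], obj["spec"])
}
```

Keeping the builder a pure function over plain data is what lets the unit tests exercise it without a cluster or the KEDA Go module.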

https://claude.ai/code/session_018ctEFbksh15TJzfJwhBNvs

Introduces KEDA ScaledObject management for Agent CRDs, enabling
event-driven autoscaling including scale-to-zero for idle agents.
This addresses the real cost problem of running unused agent replicas
without building custom autoscaling infrastructure.

Changes:
- Add ScalingSpec to AgentSpec with triggers, min/max replicas,
  cooldown period, and polling interval configuration
- Create ScaledObject builder using unstructured objects to avoid
  hard KEDA module dependency
- Add ScaledObjectRepo with create/update/delete via controller-runtime
- Integrate ScaledObject reconciliation into agent app service with
  graceful handling when KEDA is not installed
- Add ScalingReady condition for observability
- Add RBAC for keda.sh/scaledobjects
- Update CRD schema and Helm chart
- Include sample configs for Prometheus and cron-based scaling

https://claude.ai/code/session_018ctEFbksh15TJzfJwhBNvs
Copilot AI review requested due to automatic review settings February 17, 2026 20:51

Copilot AI left a comment

Pull request overview

This PR adds KEDA-based autoscaling support to the Flokoa operator, enabling agents to dynamically scale based on custom metrics and scale to zero when idle. The implementation uses unstructured Kubernetes objects to avoid a hard dependency on the KEDA Go module, making KEDA an optional cluster component.

Changes:

  • Added ScalingSpec, ScalingTrigger, and ScalingTriggerAuth types to the Agent CRD API for configuring KEDA autoscaling
  • Implemented ScaledObject builder and repository layers using unstructured objects to create/update/delete KEDA resources
  • Integrated ScaledObject reconciliation into the agent reconciliation loop with non-fatal error handling
  • Added comprehensive unit tests for builder and reconciliation logic with fake repository implementations

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| operator/api/v1alpha1/agent_types.go | Defines ScalingSpec, ScalingTrigger, and ScalingTriggerAuth API types with kubebuilder validation markers |
| operator/api/v1alpha1/zz_generated.deepcopy.go | Auto-generated DeepCopy methods for new scaling-related types |
| operator/internal/infra/builder/scaledobject.go | Pure function to build unstructured KEDA ScaledObjects from agent configuration |
| operator/internal/infra/builder/scaledobject_test.go | Unit tests covering ScaledObject construction with various trigger configurations |
| operator/internal/infra/repo/scaledobject.go | Repository implementation for CRUD operations on KEDA ScaledObjects using unstructured client |
| operator/internal/infra/repo/fakes/scaledobject_fake.go | In-memory fake implementation for testing ScaledObject operations |
| operator/internal/infra/repo/interfaces.go | Added ScaledObjectRepo interface to repository layer |
| operator/internal/app/agent/reconcile.go | Added reconcileScaledObject method with non-fatal error handling and KEDA CRD detection |
| operator/internal/app/agent/reconcile_scaling_test.go | Integration tests for ScaledObject lifecycle (create, update, delete) |
| operator/internal/domain/agent/status.go | Added ScalingReady condition type and related reason constants |
| operator/internal/controller/agent_controller.go | Added RBAC markers for ScaledObject permissions |
| operator/cmd/main.go | Wired ScaledObjectRepoImpl into agent service dependencies |
| operator/config/samples/agent_v1alpha1_agent_keda.yaml | Example Agent manifests demonstrating Prometheus and cron-based scaling |
| operator/config/rbac/role.yaml | Generated RBAC role with ScaledObject permissions |
| operator/config/crd/bases/agent.flokoa.ai_agents.yaml | Generated CRD with scaling spec and status fields |
| operator/charts/flokoa/files/crds/agent.flokoa.ai_agents.yaml | Helm chart CRD with scaling support |

"spec": spec,
},
}

Copilot AI Feb 17, 2026

The constructed unstructured object should have its GVK explicitly set using SetGroupVersionKind. While the apiVersion and kind are present in the Object map, explicitly setting the GVK ensures proper handling by the Kubernetes client, especially for owner reference resolution. Consider adding obj.SetGroupVersionKind(ScaledObjectGVK) after constructing the object.

Suggested change
obj.SetGroupVersionKind(ScaledObjectGVK)

// Preserve the existing metadata (resourceVersion, etc.) and update spec + labels
existing.Object["spec"] = desired.Object["spec"]
existing.SetLabels(desired.GetLabels())

Copilot AI Feb 17, 2026

When updating an existing ScaledObject, consider preserving annotations in addition to labels. KEDA and other controllers may add annotations to ScaledObjects that should be preserved across updates. The current implementation only preserves labels but directly overwrites the spec, which could lose important annotations added by KEDA itself.

Suggested change
- existing.SetLabels(desired.GetLabels())
+ existing.SetLabels(desired.GetLabels())
+ // Preserve existing annotations while merging in desired annotations.
+ // This avoids dropping annotations that may have been added by KEDA or other controllers.
+ desiredAnnotations := desired.GetAnnotations()
+ if len(desiredAnnotations) > 0 {
+ 	existingAnnotations := existing.GetAnnotations()
+ 	if existingAnnotations == nil {
+ 		existingAnnotations = map[string]string{}
+ 	}
+ 	for k, v := range desiredAnnotations {
+ 		existingAnnotations[k] = v
+ 	}
+ 	existing.SetAnnotations(existingAnnotations)
+ }
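
The merge in the suggestion above can be isolated as a small helper that is easy to test on its own. This is only a sketch: `mergeAnnotations` is a hypothetical name, the annotation keys are made up, and it works on plain maps of the kind `GetAnnotations()` returns. Desired keys win on conflict, and keys present only on the existing object survive.

```go
package main

import "fmt"

// mergeAnnotations returns the annotations to write back on update.
// Keys found only in existing are kept; desired values overwrite on conflict.
func mergeAnnotations(existing, desired map[string]string) map[string]string {
	if len(desired) == 0 {
		return existing // nothing to merge, leave the object untouched
	}
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range desired {
		merged[k] = v
	}
	return merged
}

func main() {
	// Hypothetical annotation keys, for illustration only.
	existing := map[string]string{"example.com/set-by-other-controller": "keep"}
	desired := map[string]string{"example.com/managed-by": "operator"}
	fmt.Println(mergeAnnotations(existing, desired))
}
```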

}
agent.Status.ScaledObjectName = ""
agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")

Copilot AI Feb 17, 2026

The condition should be set to True (not False) when a ScaledObject is successfully removed, or the condition should be removed entirely. Setting status to ConditionFalse with reason ReasonScaledObjectRemoved is semantically confusing - "ScalingReady=False" suggests scaling is not working, but removal is actually the correct desired state when scaling is disabled. Consider either setting it to True with a message like "Scaling disabled" or removing the condition entirely when scaling is not configured.

Suggested change
- metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")
+ metav1.ConditionTrue, agentdomain.ReasonScaledObjectRemoved, "Scaling disabled: ScaledObject removed")

Comment on lines +346 to +362
if agent.Spec.Scaling == nil {
	// Scaling removed: delete ScaledObject if it exists and clear status
	if agent.Status.ScaledObjectName != "" {
		if err := s.deps.ScaledObjects.DeleteScaledObject(ctx, types.NamespacedName{
			Name:      agent.Status.ScaledObjectName,
			Namespace: agent.Namespace,
		}); err != nil {
			agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
				metav1.ConditionFalse, agentdomain.ReasonScaledObjectFailed,
				fmt.Sprintf("Failed to delete ScaledObject: %v", err))
			return err
		}
		agent.Status.ScaledObjectName = ""
		agentdomain.SetCondition(agent, agentdomain.ConditionTypeScalingReady,
			metav1.ConditionFalse, agentdomain.ReasonScaledObjectRemoved, "ScaledObject removed")
	}
	return nil

Copilot AI Feb 17, 2026

There's a potential issue when agent.Spec.Scaling is nil but agent.Status.ScaledObjectName is also empty - the deletion will be skipped even if a ScaledObject with the standard naming convention exists. This could happen if the status was never updated or was cleared. Consider attempting deletion based on the naming convention (builder.ScaledObjectName(agent.Name)) rather than only relying on agent.Status.ScaledObjectName, treating NotFound errors as success.
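
A minimal sketch of that idea, using only the standard library: deletion is keyed on the conventional name rather than on recorded status, and a not-found result counts as success. The `fakeRepo` type, `errNotFound` sentinel, and `cleanupScaledObject` function are hypothetical; in the operator the check would more likely be `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a Kubernetes
// NotFound API error.
var errNotFound = errors.New("scaledobject not found")

// fakeRepo is an in-memory stand-in for the ScaledObject repository.
type fakeRepo struct{ objects map[string]bool }

func (r *fakeRepo) Delete(namespace, name string) error {
	key := namespace + "/" + name
	if !r.objects[key] {
		return errNotFound
	}
	delete(r.objects, key)
	return nil
}

// scaledObjectName mirrors the "<agent>-scaler" convention seen in the tests.
func scaledObjectName(agentName string) string { return agentName + "-scaler" }

// cleanupScaledObject always attempts deletion by conventional name, so a
// stale or empty status field cannot leave an orphaned ScaledObject behind.
func cleanupScaledObject(r *fakeRepo, namespace, agentName string) error {
	err := r.Delete(namespace, scaledObjectName(agentName))
	if err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete ScaledObject: %w", err)
	}
	return nil // deleted now, or already absent
}

func main() {
	repo := &fakeRepo{objects: map[string]bool{"default/test-agent-scaler": true}}
	fmt.Println(cleanupScaledObject(repo, "default", "test-agent")) // object exists
	fmt.Println(cleanupScaledObject(repo, "default", "test-agent")) // already gone
}
```

The trade-off is one extra delete call per reconcile when scaling is disabled, in exchange for idempotent cleanup.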

Comment on lines +36 to +91
func TestReconcileScaledObject_CreatesWhenScalingConfigured(t *testing.T) {
	scaledObjectRepo := fakes.NewFakeScaledObjectRepo()
	svc := newTestServiceWithScaling(scaledObjectRepo)

	agent := &agentv1alpha1.Agent{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "test-agent",
			Namespace: "default",
		},
		Spec: agentv1alpha1.AgentSpec{
			Scaling: &agentv1alpha1.ScalingSpec{
				MinReplicaCount: int32Ptr(0),
				MaxReplicaCount: int32Ptr(5),
				CooldownPeriod:  int32Ptr(300),
				PollingInterval: int32Ptr(15),
				Triggers: []agentv1alpha1.ScalingTrigger{
					{
						Type:     "prometheus",
						Metadata: map[string]string{"threshold": "100"},
					},
				},
			},
		},
	}

	err := svc.reconcileScaledObject(context.Background(), agent)
	if err != nil {
		t.Fatalf("reconcileScaledObject() error = %v", err)
	}

	// Verify ScaledObject was created
	key := types.NamespacedName{
		Name:      builder.ScaledObjectName("test-agent"),
		Namespace: "default",
	}
	if _, ok := scaledObjectRepo.ScaledObjects[key]; !ok {
		t.Error("ScaledObject was not created")
	}

	// Verify status updated
	if agent.Status.ScaledObjectName != "test-agent-scaler" {
		t.Errorf("ScaledObjectName = %q, want test-agent-scaler", agent.Status.ScaledObjectName)
	}

	// Verify condition set
	cond := meta.FindStatusCondition(agent.Status.Conditions, agentdomain.ConditionTypeScalingReady)
	if cond == nil {
		t.Fatal("ScalingReady condition not set")
	}
	if cond.Status != metav1.ConditionTrue {
		t.Errorf("ScalingReady status = %q, want True", cond.Status)
	}
	if cond.Reason != agentdomain.ReasonScaledObjectReady {
		t.Errorf("ScalingReady reason = %q, want %q", cond.Reason, agentdomain.ReasonScaledObjectReady)
	}
}

Copilot AI Feb 17, 2026

The test should also verify that the ScaledObject's spec contains the expected trigger configuration, not just that it exists. Consider adding assertions to check that the trigger type, metadata, and other scaling parameters were correctly transferred to the ScaledObject to ensure the builder and repository layer work end-to-end.
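
One way such assertions could look is sketched below with the standard library only. It assumes the fake repo stores an object whose content map has the usual ScaledObject shape (`spec.triggers[].type`, `spec.triggers[].metadata`); `nestedValue` and `firstTrigger` are hypothetical helpers, and in the real test `unstructured.NestedSlice` and `unstructured.NestedString` from apimachinery would likely serve the same purpose.

```go
package main

import "fmt"

// nestedValue walks nested maps by key and reports whether the path exists.
func nestedValue(obj map[string]interface{}, fields ...string) (interface{}, bool) {
	var cur interface{} = obj
	for _, f := range fields {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur, ok = m[f]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// firstTrigger returns the first entry of spec.triggers, or an error that
// explains which part of the expected structure was missing.
func firstTrigger(obj map[string]interface{}) (map[string]interface{}, error) {
	v, ok := nestedValue(obj, "spec", "triggers")
	if !ok {
		return nil, fmt.Errorf("spec.triggers not set")
	}
	list, ok := v.([]interface{})
	if !ok || len(list) == 0 {
		return nil, fmt.Errorf("spec.triggers is empty or not a list")
	}
	tr, ok := list[0].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("trigger has unexpected type %T", list[0])
	}
	return tr, nil
}

func main() {
	// Stand-in for the stored ScaledObject's content map.
	obj := map[string]interface{}{
		"spec": map[string]interface{}{
			"maxReplicaCount": int64(5),
			"triggers": []interface{}{
				map[string]interface{}{
					"type":     "prometheus",
					"metadata": map[string]interface{}{"threshold": "100"},
				},
			},
		},
	}
	tr, err := firstTrigger(obj)
	if err != nil {
		panic(err)
	}
	threshold, _ := nestedValue(tr, "metadata", "threshold")
	fmt.Println(tr["type"], threshold)
}
```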

runtime:
  type: standard
  standard:
    replicas: 1

Copilot AI Feb 17, 2026

The agent specifies both runtime.standard.replicas: 1 and a KEDA scaling configuration with minReplicaCount: 0. When KEDA takes over autoscaling, it manages the Deployment replicas through an HPA, which may conflict with the static replica count. Consider removing the replicas field from the standard runtime spec when scaling is configured, or add documentation clarifying that KEDA will override this value. The second example (agent-cron-scaling) correctly omits the replicas field.

Suggested change
replicas: 1
