[DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828) by mayavkrishnan25 · Pull Request #829 · scaleapi/llm-engine

mayavkrishnan25 · 2026-05-13T12:07:21Z

Also scan 'default' namespace in celery autoscaler

PR #770 scoped the autoscaler's deployment scan to a single namespace (hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed. This inadvertently broke autoscaling for non-launch celery deployments that live in the 'default' namespace, e.g. nucleus-embed-image-clip-continuous-sqs, which use the same celery autoscaler annotations but are not model-engine async endpoints.

Add 'default' as a second namespace to scan, restoring the previous behavior for those deployments while keeping the startup-speed win.

A follow-up could make this configurable via env var; hardcoding for now to keep this change small and surgical.

Pin types-setuptools below 82.0.0.20260508 to fix mypy CI

The 2026-05-08 release of types-setuptools tightened the type for package_data in ways that fail strict mypy on clients/python/setup.py. Pin to the previous compatible version range.

mypy --install-types spawns its own pip install, so add PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored for transitive deps too (types-pyOpenSSL pulls in types-cffi which otherwise upgrades types-setuptools).

Revert "Pin types-setuptools below 82.0.0.20260508 to fix mypy CI"

This reverts commit 439555a.

Address review: harden per-namespace errors, dedupe, ignore stub regression

celery_autoscaler: wrap list_namespaced_deployment in try/except ApiException per namespace so a failure in one (e.g. missing RBAC in "default") doesn't silence autoscaling for the other.
celery_autoscaler: dedupe namespaces_to_scan via dict.fromkeys in case hmi_config.endpoint_namespace is "default" in a dev/test env.
clients/python/setup.py: add # type: ignore[arg-type] for package_data to work around a type stub regression in types-setuptools 82.0.0.20260508.
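
The two autoscaler changes above (dedupe via dict.fromkeys, per-namespace error handling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `ApiException` is a local stand-in for the Kubernetes client's exception class, and `list_fn` and `fake_list` are hypothetical callables standing in for `list_namespaced_deployment`, so the snippet runs without a cluster.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""


def scan_namespaces(
    endpoint_namespace: str,
    list_fn: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return deployment names per namespace, skipping namespaces that fail."""
    # dict.fromkeys keeps insertion order and drops duplicates, so an
    # endpoint_namespace of "default" is scanned only once.
    namespaces_to_scan = list(dict.fromkeys([endpoint_namespace, "default"]))

    found: Dict[str, List[str]] = {}
    for namespace_name in namespaces_to_scan:
        try:
            found[namespace_name] = list_fn(namespace_name)
        except ApiException as exc:
            # A failure in one namespace must not prevent scanning the other.
            logger.error(
                "Failed to list deployments in namespace %s: %s", namespace_name, exc
            )
            continue
    return found


def fake_list(namespace: str) -> List[str]:
    """Hypothetical lister: 'default' raises as if RBAC were missing."""
    if namespace == "default":
        raise ApiException("403 Forbidden")
    return ["some-endpoint"]


result = scan_namespaces("scale-deploy", fake_list)
```

With `fake_list`, the failure in "default" is logged and the deployments found in 'scale-deploy' are still returned.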

Fix black formatting on logger.error call

Greptile Summary

Restores autoscaling for non-launch celery deployments in the "default" namespace by scanning both endpoint_namespace and "default" in list_deployments; uses dict.fromkeys to dedupe and wraps each namespace call in a per-namespace ApiException guard so a single RBAC failure doesn't silently break the other namespace.
Adds a nullable temporal_task_queue column to the endpoints table via a new Alembic migration; the migration chain is correct but the revision ID uses a manually-crafted sequential string with a non-hex character (g), unlike every other migration in the project.
Adds # type: ignore[arg-type] to clients/python/setup.py to work around a mypy strict-mode regression in types-setuptools 82.x.

Confidence Score: 5/5

Safe to merge; the autoscaler logic is correct and the only notable issue is a style concern in the migration file.

No P0 or P1 findings. The core autoscaler change (dual-namespace scan with deduplication and per-namespace error handling) is logically sound. The migration chain is valid. The only finding is a P2 style issue with the non-hex revision ID.

The Alembic migration file should have its revision ID regenerated with alembic tooling to follow project conventions.

Important Files Changed

Filename	Overview
model-engine/model_engine_server/core/celery/celery_autoscaler.py	Adds "default" as a second namespace to scan; deduplication via dict.fromkeys and per-namespace ApiException handling are correct. No functional issues found.
clients/python/setup.py	Adds # type: ignore[arg-type] to suppress mypy stub regression in types-setuptools 82.x; minimal and targeted.
model-engine/model_engine_server/db/migrations/alembic/versions/2026_04_24_0000-b2c3d4e5f6g7_add_temporal_task_queue_column.py	Adds nullable temporal_task_queue column to the endpoints table; migration chain is correct (down_revision matches a1b2c3d4e5f6), but the revision ID uses a non-hex character (g) unlike all other migrations in the project.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[list_deployments called] --> B[Build namespaces_to_scan\ndict.fromkeys endpoint_namespace + default]
    B --> C{For each namespace}
    C --> D[list_namespaced_deployment]
    D -->|ApiException| E[logger.error\ncontinue to next namespace]
    D -->|Success| F[logger.info timing]
    F --> G{For each deployment}
    G --> H{Has annotations?}
    H -->|No| G
    H -->|Yes| I{Broker matches\nautoscaler_broker?}
    I -->|No| G
    I -->|Yes| J[Parse CeleryAutoscalerParams]
    J -->|TypeError| G
    J -->|OK| K[Store in celery_deployments_params\nkeyed by name + namespace]
    K --> G
    G -->|Done| C
    C -->|Done| L[Return celery_deployments_params]
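
The per-deployment filtering in the flowchart could look something like the sketch below. The annotation prefix `celery.example/`, the dict-shaped deployments, and the two-field `CeleryAutoscalerParams` are all assumptions made for illustration; the real annotation keys and parameter class are not shown on this page.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical annotation prefix; the real key names are not shown in this PR.
PREFIX = "celery.example/"


@dataclass
class CeleryAutoscalerParams:
    """Illustrative subset of fields; the real class may differ."""

    queue: str
    broker: Optional[str] = None


def collect_params(
    deployments: List[Dict[str, Any]], autoscaler_broker: str
) -> Dict[Tuple[str, str], CeleryAutoscalerParams]:
    params: Dict[Tuple[str, str], CeleryAutoscalerParams] = {}
    for dep in deployments:
        annotations = dep.get("annotations")
        if not annotations:
            continue  # no annotations: not managed by the autoscaler
        raw = {
            key[len(PREFIX):]: value
            for key, value in annotations.items()
            if key.startswith(PREFIX)
        }
        if raw.get("broker") != autoscaler_broker:
            continue  # belongs to an autoscaler for a different broker
        try:
            parsed = CeleryAutoscalerParams(**raw)
        except TypeError:
            continue  # missing or unexpected fields: skip this deployment
        # Key by (name, namespace) so equal names in two namespaces don't collide.
        params[(dep["name"], dep["namespace"])] = parsed
    return params


deployments = [
    {"name": "a", "namespace": "default",
     "annotations": {PREFIX + "queue": "q1", PREFIX + "broker": "sqs"}},
    {"name": "b", "namespace": "default", "annotations": None},
    {"name": "c", "namespace": "scale-deploy",
     "annotations": {PREFIX + "queue": "q2", PREFIX + "broker": "redis"}},
    {"name": "d", "namespace": "scale-deploy",
     "annotations": {PREFIX + "broker": "sqs"}},  # no queue, so TypeError
]
selected = collect_params(deployments, "sqs")
```

Only deployment `a` survives here: `b` has no annotations, `c` targets a different broker, and `d` fails to parse.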

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
model-engine/model_engine_server/db/migrations/alembic/versions/2026_04_24_0000-b2c3d4e5f6g7_add_temporal_task_queue_column.py:12-13
**Non-hex revision ID deviates from project convention**

The revision ID `b2c3d4e5f6g7` contains `g`, which is not a valid hexadecimal character. Every other migration in this project uses a proper 12-char hex string (e.g. `fa3267c80731`, `62da4f8b3403`). While Alembic treats revision IDs as opaque strings and the migration will technically function, the manually-crafted sequential pattern could collide with another developer's migration if they follow the same convention. Consider regenerating this file with `alembic revision --autogenerate` to get a real random ID.
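
A quick way to check whether a revision ID follows the 12-character hex convention described above is sketched here. `is_hex_revision` and `random_revision` are hypothetical helpers, not part of Alembic or this repository, and deriving the ID from a UUID is only one plausible way to get a random 12-char hex string.

```python
import re
import uuid

_HEX12 = re.compile(r"[0-9a-f]{12}")


def is_hex_revision(revision: str) -> bool:
    """True if the ID is exactly 12 lowercase hexadecimal characters."""
    return _HEX12.fullmatch(revision) is not None


def random_revision() -> str:
    """Illustrative generator: last 12 hex characters of a random UUID."""
    return uuid.uuid4().hex[-12:]


checks = {
    rev: is_hex_revision(rev)
    for rev in ("b2c3d4e5f6g7", "a1b2c3d4e5f6", "fa3267c80731", "62da4f8b3403")
}
```

In `checks`, only `b2c3d4e5f6g7` fails, because of the trailing `g`.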

_{Reviews (2): Last reviewed commit: "Cherry-pick temporal_task_queue migratio..." | Re-trigger Greptile}

* Also scan 'default' namespace in celery autoscaler

PR #770 scoped the autoscaler's deployment scan to a single namespace (hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed. This inadvertently broke autoscaling for non-launch celery deployments that live in the 'default' namespace, e.g. nucleus-embed-image-clip-continuous-sqs, which use the same celery autoscaler annotations but are not model-engine async endpoints.

Add 'default' as a second namespace to scan, restoring the previous behavior for those deployments while keeping the startup-speed win.

A follow-up could make this configurable via env var; hardcoding for now to keep this change small and surgical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin types-setuptools below 82.0.0.20260508 to fix mypy CI

The 2026-05-08 release of types-setuptools tightened the type for `package_data` in ways that fail strict mypy on clients/python/setup.py. Pin to the previous compatible version range.

mypy --install-types spawns its own `pip install`, so add PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored for transitive deps too (types-pyOpenSSL pulls in types-cffi which otherwise upgrades types-setuptools).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "Pin types-setuptools below 82.0.0.20260508 to fix mypy CI"

This reverts commit 439555a.

* Address review: harden per-namespace errors, dedupe, ignore stub regression

- celery_autoscaler: wrap list_namespaced_deployment in try/except ApiException per namespace so a failure in one (e.g. missing RBAC in "default") doesn't silence autoscaling for the other.
- celery_autoscaler: dedupe namespaces_to_scan via dict.fromkeys in case hmi_config.endpoint_namespace is "default" in a dev/test env.
- clients/python/setup.py: add `# type: ignore[arg-type]` for package_data to work around a type stub regression in types-setuptools 82.0.0.20260508.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix black formatting on logger.error call

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-13T12:09:51Z

+        except ApiException as exc:
+            # Don't let a failure in one namespace (e.g. missing RBAC) wipe out scaling for the
+            # other. Log and move on; the next iteration of the outer loop will retry.
+            logger.error(f"Failed to list deployments in namespace {namespace_name}: {exc}")
            continue


Persistent ERROR spam when RBAC for "default" is not granted

If the autoscaler pod lacks RBAC permission to list deployments in the "default" namespace (which is a likely production config where only endpoint_namespace access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.
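
One way to implement the suggested "re-log only after a backoff interval" is sketched below. `ThrottledLogger`, its 300-second default, and the injectable `clock` are hypothetical choices for illustration, not something taken from the autoscaler code.

```python
import logging
import time
from typing import Callable, Dict


class ThrottledLogger:
    """Emit at most one WARNING per key within each backoff interval."""

    def __init__(
        self,
        logger: logging.Logger,
        interval_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._logger = logger
        self._interval_s = interval_s
        self._clock = clock
        self._last_logged: Dict[str, float] = {}

    def warning(self, key: str, message: str) -> bool:
        """Log unless `key` was logged recently; return True if a record was emitted."""
        now = self._clock()
        last = self._last_logged.get(key)
        if last is not None and now - last < self._interval_s:
            return False  # still inside the backoff window: stay quiet
        self._last_logged[key] = now
        self._logger.warning(message)
        return True


# Simulated time, so the example does not need to sleep.
fake_now = [0.0]
throttled = ThrottledLogger(
    logging.getLogger("autoscaler-demo"), interval_s=60.0, clock=lambda: fake_now[0]
)
emitted = []
for t in (0.0, 3.0, 6.0, 61.0):
    fake_now[0] = t
    emitted.append(throttled.warning("default", "Failed to list deployments in default"))
```

Keying by namespace means a persistent RBAC failure in "default" is throttled without suppressing a first-time failure in another namespace.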

Prompt To Fix With AI

This is a comment left during a code review.
Path: model-engine/model_engine_server/core/celery/celery_autoscaler.py
Line: 93-97
Comment:
**Persistent ERROR spam when RBAC for "default" is not granted**

If the autoscaler pod lacks RBAC permission to list deployments in the `"default"` namespace (which is a likely production config where only `endpoint_namespace` access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.

How can I resolve this? If you propose a fix, please make it concise.

The prod DB currently has migration b2c3d4e5f6g7 applied, from the lilyz-ai/temporal-endpoint-type branch (commit 04729ce, not yet on main). Any deploy from a base that doesn't include this migration fails the alembic pre-upgrade hook with:

    alembic.util.exc.CommandError: Can't locate revision identified by 'b2c3d4e5f6g7'

Bring just the migration file in (the rest of the temporal feature code is intentionally NOT cherry-picked here). The migration adds a single nullable column temporal_task_queue on the endpoints table. Since the column is nullable and our ORM model doesn't reference it, this is purely a schema-acknowledgement change, with no behavior impact on the autoscaler fix in c251213.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mayavkrishnan25 changed the title ~~Also scan 'default' namespace in celery autoscaler (#828)~~ [DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828) May 13, 2026

greptile-apps Bot reviewed May 13, 2026

mayavkrishnan25 wants to merge 2 commits into main from maya/celery-autoscaler-default-ns-onto-2f90
