Skip to content

[DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828)#829

Open
mayavkrishnan25 wants to merge 2 commits into
mainfrom
maya/celery-autoscaler-default-ns-onto-2f90
Open

[DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828)#829
mayavkrishnan25 wants to merge 2 commits into
mainfrom
maya/celery-autoscaler-default-ns-onto-2f90

Conversation

@mayavkrishnan25
Copy link
Copy Markdown
Collaborator

@mayavkrishnan25 mayavkrishnan25 commented May 13, 2026

  • Also scan 'default' namespace in celery autoscaler

PR #770 scoped the autoscaler's deployment scan to a single namespace (hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed. This inadvertently broke autoscaling for non-launch celery deployments that live in the 'default' namespace, e.g.
nucleus-embed-image-clip-continuous-sqs, which use the same celery autoscaler annotations but are not model-engine async endpoints.

Add 'default' as a second namespace to scan, restoring the previous behavior for those deployments while keeping the startup-speed win.

A follow-up could make this configurable via env var; hardcoding for now to keep this change small and surgical.

  • Pin types-setuptools below 82.0.0.20260508 to fix mypy CI

The 2026-05-08 release of types-setuptools tightened the type for package_data in ways that fail strict mypy on clients/python/setup.py. Pin to the previous compatible version range.

mypy --install-types spawns its own pip install, so add PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored for transitive deps too (types-pyOpenSSL pulls in types-cffi which otherwise upgrades types-setuptools).

  • Revert "Pin types-setuptools below 82.0.0.20260508 to fix mypy CI"

This reverts commit 439555a.

  • Address review: harden per-namespace errors, dedupe, ignore stub regression
  • celery_autoscaler: wrap list_namespaced_deployment in try/except ApiException per namespace so a failure in one (e.g. missing RBAC in "default") doesn't silence autoscaling for the other.
  • celery_autoscaler: dedupe namespaces_to_scan via dict.fromkeys in case hmi_config.endpoint_namespace is "default" in a dev/test env.
  • clients/python/setup.py: add # type: ignore[arg-type] for package_data to work around a type stub regression in types-setuptools 82.0.0.20260508.
  • Fix black formatting on logger.error call

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Greptile Summary

  • Restores autoscaling for non-launch celery deployments in the "default" namespace by scanning both endpoint_namespace and "default" in list_deployments; uses dict.fromkeys to dedupe and wraps each namespace call in a per-namespace ApiException guard so a single RBAC failure doesn't silently break the other namespace.
  • Adds a nullable temporal_task_queue column to the endpoints table via a new Alembic migration; the migration chain is correct but the revision ID uses a manually-crafted sequential string with a non-hex character (g), unlike every other migration in the project.
  • Adds # type: ignore[arg-type] to clients/python/setup.py to work around a mypy strict-mode regression in types-setuptools 82.x.

Confidence Score: 5/5

Safe to merge; the autoscaler logic is correct and the only notable issue is a style concern in the migration file.

No P0 or P1 findings. The core autoscaler change (dual-namespace scan with deduplication and per-namespace error handling) is logically sound. The migration chain is valid. The only finding is a P2 style issue with the non-hex revision ID.

The Alembic migration file should have its revision ID regenerated with alembic tooling to follow project conventions.

Important Files Changed

Filename Overview
model-engine/model_engine_server/core/celery/celery_autoscaler.py Adds "default" as a second namespace to scan; deduplication via dict.fromkeys and per-namespace ApiException handling are correct. No functional issues found.
clients/python/setup.py Adds # type: ignore[arg-type] to suppress mypy stub regression in types-setuptools 82.x; minimal and targeted.
model-engine/model_engine_server/db/migrations/alembic/versions/2026_04_24_0000-b2c3d4e5f6g7_add_temporal_task_queue_column.py Adds nullable temporal_task_queue column to the endpoints table; migration chain is correct (down_revision matches a1b2c3d4e5f6), but the revision ID uses a non-hex character (g) unlike all other migrations in the project.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[list_deployments called] --> B[Build namespaces_to_scan\ndict.fromkeys endpoint_namespace + default]
    B --> C{For each namespace}
    C --> D[list_namespaced_deployment]
    D -->|ApiException| E[logger.error\ncontinue to next namespace]
    D -->|Success| F[logger.info timing]
    F --> G{For each deployment}
    G --> H{Has annotations?}
    H -->|No| G
    H -->|Yes| I{Broker matches\nautoscaler_broker?}
    I -->|No| G
    I -->|Yes| J[Parse CeleryAutoscalerParams]
    J -->|TypeError| G
    J -->|OK| K[Store in celery_deployments_params\nkeyed by name + namespace]
    K --> G
    G -->|Done| C
    C -->|Done| L[Return celery_deployments_params]
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
model-engine/model_engine_server/db/migrations/alembic/versions/2026_04_24_0000-b2c3d4e5f6g7_add_temporal_task_queue_column.py:12-13
**Non-hex revision ID deviates from project convention**

The revision ID `b2c3d4e5f6g7` contains `g`, which is not a valid hexadecimal character. Every other migration in this project uses a proper 12-char hex string (e.g. `fa3267c80731`, `62da4f8b3403`). While Alembic treats revision IDs as opaque strings and the migration will technically function, the manually-crafted sequential pattern could collide with another developer's migration if they follow the same convention. Consider regenerating this file with `alembic revision --autogenerate` to get a real random ID.

Reviews (2): Last reviewed commit: "Cherry-pick temporal_task_queue migratio..." | Re-trigger Greptile

* Also scan 'default' namespace in celery autoscaler

PR #770 scoped the autoscaler's deployment scan to a single namespace
(hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed.
This inadvertently broke autoscaling for non-launch celery deployments
that live in the 'default' namespace, e.g.
nucleus-embed-image-clip-continuous-sqs, which use the same celery
autoscaler annotations but are not model-engine async endpoints.

Add 'default' as a second namespace to scan, restoring the previous
behavior for those deployments while keeping the startup-speed win.

A follow-up could make this configurable via env var; hardcoding for now
to keep this change small and surgical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin types-setuptools below 82.0.0.20260508 to fix mypy CI

The 2026-05-08 release of types-setuptools tightened the type for
`package_data` in ways that fail strict mypy on clients/python/setup.py.
Pin to the previous compatible version range.

mypy --install-types spawns its own `pip install`, so add
PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored
for transitive deps too (types-pyOpenSSL pulls in types-cffi which
otherwise upgrades types-setuptools).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "Pin types-setuptools below 82.0.0.20260508 to fix mypy CI"

This reverts commit 439555a.

* Address review: harden per-namespace errors, dedupe, ignore stub regression

- celery_autoscaler: wrap list_namespaced_deployment in try/except ApiException
  per namespace so a failure in one (e.g. missing RBAC in "default") doesn't
  silence autoscaling for the other.
- celery_autoscaler: dedupe namespaces_to_scan via dict.fromkeys in case
  hmi_config.endpoint_namespace is "default" in a dev/test env.
- clients/python/setup.py: add `# type: ignore[arg-type]` for package_data
  to work around a type stub regression in types-setuptools 82.0.0.20260508.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix black formatting on logger.error call

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mayavkrishnan25 mayavkrishnan25 changed the title Also scan 'default' namespace in celery autoscaler (#828) [DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828) May 13, 2026
Comment on lines +93 to 97
except ApiException as exc:
# Don't let a failure in one namespace (e.g. missing RBAC) wipe out scaling for the
# other. Log and move on; the next iteration of the outer loop will retry.
logger.error(f"Failed to list deployments in namespace {namespace_name}: {exc}")
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Persistent ERROR spam when RBAC for "default" is not granted

If the autoscaler pod lacks RBAC permission to list deployments in the "default" namespace (which is a likely production config where only endpoint_namespace access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.

Prompt To Fix With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/core/celery/celery_autoscaler.py
Line: 93-97

Comment:
**Persistent ERROR spam when RBAC for "default" is not granted**

If the autoscaler pod lacks RBAC permission to list deployments in the `"default"` namespace (which is a likely production config where only `endpoint_namespace` access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

The prod DB currently has migration b2c3d4e5f6g7 applied, from
lilyz-ai/temporal-endpoint-type branch (commit 04729ce, not yet on
main). Any deploy from a base that doesn't include this migration
fails the alembic pre-upgrade hook with:
  alembic.util.exc.CommandError: Can't locate revision identified
  by 'b2c3d4e5f6g7'

Bring just the migration file in (the rest of the temporal feature
code is intentionally NOT cherry-picked here). The migration adds a
single nullable column temporal_task_queue on the endpoints table.
Since the column is nullable and our ORM model doesn't reference it,
this is purely a schema-acknowledgement change — no behavior impact
on the autoscaler fix in c251213.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant