[DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828)#829
Open
mayavkrishnan25 wants to merge 2 commits into
Open
[DO NOT MERGE] Also scan 'default' namespace in celery autoscaler (#828)#829mayavkrishnan25 wants to merge 2 commits into
mayavkrishnan25 wants to merge 2 commits into
Conversation
* Also scan 'default' namespace in celery autoscaler PR #770 scoped the autoscaler's deployment scan to a single namespace (hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed. This inadvertently broke autoscaling for non-launch celery deployments that live in the 'default' namespace, e.g. nucleus-embed-image-clip-continuous-sqs, which use the same celery autoscaler annotations but are not model-engine async endpoints. Add 'default' as a second namespace to scan, restoring the previous behavior for those deployments while keeping the startup-speed win. A follow-up could make this configurable via env var; hardcoding for now to keep this change small and surgical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Pin types-setuptools below 82.0.0.20260508 to fix mypy CI The 2026-05-08 release of types-setuptools tightened the type for `package_data` in ways that fail strict mypy on clients/python/setup.py. Pin to the previous compatible version range. mypy --install-types spawns its own `pip install`, so add PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored for transitive deps too (types-pyOpenSSL pulls in types-cffi which otherwise upgrades types-setuptools). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert "Pin types-setuptools below 82.0.0.20260508 to fix mypy CI" This reverts commit 439555a. * Address review: harden per-namespace errors, dedupe, ignore stub regression - celery_autoscaler: wrap list_namespaced_deployment in try/except ApiException per namespace so a failure in one (e.g. missing RBAC in "default") doesn't silence autoscaling for the other. - celery_autoscaler: dedupe namespaces_to_scan via dict.fromkeys in case hmi_config.endpoint_namespace is "default" in a dev/test env. - clients/python/setup.py: add `# type: ignore[arg-type]` for package_data to work around a type stub regression in types-setuptools 82.0.0.20260508. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix black formatting on logger.error call --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines
+93
to
97
| except ApiException as exc: | ||
| # Don't let a failure in one namespace (e.g. missing RBAC) wipe out scaling for the | ||
| # other. Log and move on; the next iteration of the outer loop will retry. | ||
| logger.error(f"Failed to list deployments in namespace {namespace_name}: {exc}") | ||
| continue |
There was a problem hiding this comment.
Persistent ERROR spam when RBAC for "default" is not granted
If the autoscaler pod lacks RBAC permission to list deployments in the "default" namespace (which is a likely production config where only endpoint_namespace access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.
Prompt To Fix With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/core/celery/celery_autoscaler.py
Line: 93-97
Comment:
**Persistent ERROR spam when RBAC for "default" is not granted**
If the autoscaler pod lacks RBAC permission to list deployments in the `"default"` namespace (which is a likely production config where only `endpoint_namespace` access was ever granted), every loop iteration (~3 s) will emit an ERROR-level log for the failed namespace. Over time this floods log pipelines and may mask real errors. Consider logging at WARNING level, or tracking the error and only re-logging after a backoff interval.
How can I resolve this? If you propose a fix, please make it concise.The prod DB currently has migration b2c3d4e5f6g7 applied, from lilyz-ai/temporal-endpoint-type branch (commit 04729ce, not yet on main). Any deploy from a base that doesn't include this migration fails the alembic pre-upgrade hook with: alembic.util.exc.CommandError: Can't locate revision identified by 'b2c3d4e5f6g7' Bring just the migration file in (the rest of the temporal feature code is intentionally NOT cherry-picked here). The migration adds a single nullable column temporal_task_queue on the endpoints table. Since the column is nullable and our ORM model doesn't reference it, this is purely a schema-acknowledgement change — no behavior impact on the autoscaler fix in c251213. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR #770 scoped the autoscaler's deployment scan to a single namespace (hmi_config.endpoint_namespace, i.e. 'scale-deploy') for startup speed. This inadvertently broke autoscaling for non-launch celery deployments that live in the 'default' namespace, e.g.
nucleus-embed-image-clip-continuous-sqs, which use the same celery autoscaler annotations but are not model-engine async endpoints.
Add 'default' as a second namespace to scan, restoring the previous behavior for those deployments while keeping the startup-speed win.
A follow-up could make this configurable via env var; hardcoding for now to keep this change small and surgical.
The 2026-05-08 release of types-setuptools tightened the type for
package_datain ways that fail strict mypy on clients/python/setup.py. Pin to the previous compatible version range.mypy --install-types spawns its own
pip install, so add PIP_CONSTRAINT pointing at requirements-dev.txt so the pin is honored for transitive deps too (types-pyOpenSSL pulls in types-cffi which otherwise upgrades types-setuptools).This reverts commit 439555a.
# type: ignore[arg-type]for package_data to work around a type stub regression in types-setuptools 82.0.0.20260508.Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.
Greptile Summary
"default"namespace by scanning bothendpoint_namespaceand"default"inlist_deployments; usesdict.fromkeysto dedupe and wraps each namespace call in a per-namespaceApiExceptionguard so a single RBAC failure doesn't silently break the other namespace.temporal_task_queuecolumn to theendpointstable via a new Alembic migration; the migration chain is correct but the revision ID uses a manually-crafted sequential string with a non-hex character (g), unlike every other migration in the project.# type: ignore[arg-type]toclients/python/setup.pyto work around a mypy strict-mode regression intypes-setuptools82.x.Confidence Score: 5/5
Safe to merge; the autoscaler logic is correct and the only notable issue is a style concern in the migration file.
No P0 or P1 findings. The core autoscaler change (dual-namespace scan with deduplication and per-namespace error handling) is logically sound. The migration chain is valid. The only finding is a P2 style issue with the non-hex revision ID.
The Alembic migration file should have its revision ID regenerated with alembic tooling to follow project conventions.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[list_deployments called] --> B[Build namespaces_to_scan\ndict.fromkeys endpoint_namespace + default] B --> C{For each namespace} C --> D[list_namespaced_deployment] D -->|ApiException| E[logger.error\ncontinue to next namespace] D -->|Success| F[logger.info timing] F --> G{For each deployment} G --> H{Has annotations?} H -->|No| G H -->|Yes| I{Broker matches\nautoscaler_broker?} I -->|No| G I -->|Yes| J[Parse CeleryAutoscalerParams] J -->|TypeError| G J -->|OK| K[Store in celery_deployments_params\nkeyed by name + namespace] K --> G G -->|Done| C C -->|Done| L[Return celery_deployments_params]Prompt To Fix All With AI
Reviews (2): Last reviewed commit: "Cherry-pick temporal_task_queue migratio..." | Re-trigger Greptile