feat(model-engine): add GcpPubSubQueueEndpointResourceDelegate for async endpoints on GCP #827
Open
postevanus-scale wants to merge 13 commits into
Conversation
Three small fixes uncovered while diagnosing a stuck endpoint build on GKE running model-engine 798747b…:

1. service_template_config_map.yaml: HPA `type: Pods` was paired with `target.type: Value`, which is invalid for HPA v2. Kubernetes rejects the metric (status: `InvalidMetricSourceType`) and falls back to `minReplicas`. Switch to the correct `AverageValue`.
2. service_template_config_map.yaml: `virtual-service.yaml` and `destination-rule.yaml` are gated on `Values.virtualservice.enabled` and `Values.destinationrule.enabled` respectively, but the runtime code (`k8s_endpoint_resource_delegate.py:_create_or_update_resources`) reads `config.values.launch.istio_enabled` and unconditionally calls `load_k8s_yaml("virtual-service.yaml", …)` / `destination-rule.yaml` whenever that flag is true. When the two flags disagree, the build task crashes with `FileNotFoundError` and the endpoint never reaches READY (SGP reports `deployment_timeout`). Couple the chart gating to the runtime flag — the templates are now emitted when *either* `virtualservice.enabled` / `destinationrule.enabled` *or* `config.values.launch.istio_enabled` is true. Existing operators who explicitly set the chart flags see no change.
3. endpoint_builder_deployment.yaml: `readinessProbe` hardcoded `bash -c 'test -f /tmp/readyz'`. Minimal-base images (e.g. the 798747b… build) no longer ship `bash`, so the probe permanently errors with `executable file not found in $PATH` and the pod stays `0/1` Ready, which times out `helm --wait` at 1200s and stalls the whole HelmRelease. Make the probe overridable via `endpointBuilder.readinessProbe`; default behavior unchanged.

Render-verified with `helm template`:
- HPA target.type renders as `AverageValue`.
- VS / DR templates appear when either flag is true; absent otherwise.
- Default readinessProbe still uses `bash`; override via values works.
…ync endpoints on GCP

Async inference endpoints fail on GCP clusters with `NoCredentialsError` because the codebase only supports SQS / ASB / OnPrem queue delegates. This wires a Pub/Sub-based delegate so `cloud_provider=gcp` can create and manage the queues that async workers consume from. Affects: launch namespace queue resource creation for async deploys.
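As a rough shape of what the new delegate does (a minimal sketch, not the PR's exact code: helper naming is simplified, and a later commit in this PR replaces the eager client construction shown here with lazy properties):

```python
# Sketch of the create path: one topic + one pull subscription per endpoint,
# idempotent against AlreadyExists. The "launch-endpoint-id-" naming follows
# the default mentioned in the review below; everything else is illustrative.
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1


class GcpPubSubQueueSketch:
    def __init__(self, project_id: str) -> None:
        self.project_id = project_id
        # Note: eager construction triggers Google ADC auth; the final
        # revision of this PR makes these clients lazy (see later commit).
        self.publisher = pubsub_v1.PublisherClient()
        self.subscriber = pubsub_v1.SubscriberClient()

    def create_queue_if_not_exists(self, endpoint_id: str) -> None:
        topic_path = self.publisher.topic_path(
            self.project_id, f"launch-endpoint-id-{endpoint_id}"
        )
        sub_path = self.subscriber.subscription_path(
            self.project_id, f"launch-endpoint-id-{endpoint_id}"
        )
        try:
            self.publisher.create_topic(request={"name": topic_path})
        except AlreadyExists:
            pass  # idempotent: topic already provisioned
        try:
            self.subscriber.create_subscription(
                request={"name": sub_path, "topic": topic_path}
            )
        except AlreadyExists:
            pass  # idempotent: subscription already provisioned
```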
lilyz-ai
reviewed
May 11, 2026
Collaborator
lilyz-ai
left a comment
Please address Greptile comments and fix CI tests.
…bsub delegate

- Cache PublisherClient/SubscriberClient in `__init__` (avoid per-call gRPC handshake)
- Refresh ack_deadline on AlreadyExists in create_subscription
- Wrap non-NotFound errors in EndpointResourceInfraException on delete
- Validate project_id at construction time (fail loud on misconfig)
- Either thread topic_prefix/subscription_prefix or drop unused args (worker A's call)
- live_endpoint_resource_gateway: guard num_undelivered_messages=-1 sentinel
- start_batch_job_orchestration: add gcp branch for task-queue gateway routing
- Tests: assert subscription-deleted-before-topic order
- Helm: hoist `$gcp_cloud_provider` local variable
- Fix: apply black formatting to pass CI formatting check
Addresses Greptile P1 on PR #827: the Helm chart injects GCP_PROJECT_ID as a pod env var (from `.Values.gcp.project_id`), but the prior code only read it from `infra_config().gcp_project_id` (a different source, rendered from `.Values.config.values.infra`). On any GCP cluster that follows the sample values, `infra_config().gcp_project_id` was None and the delegate's project_id guard always raised ValueError at startup.

Mirror the SQS_PROFILE pattern (`os.getenv("SQS_PROFILE", hmi_config.sqs_profile)`) in all three factory sites:
- api/dependencies.py
- entrypoints/start_batch_job_orchestration.py
- entrypoints/k8s_cache.py

Also: `black --config .black.toml` + isort to clear the run_unit_tests_server CI check (formatting drift introduced by the previous commits).
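A minimal sketch of the mirrored pattern at one factory site (the import paths are assumptions for illustration, not confirmed by this PR page):

```python
import os

# Assumed import paths; the delegate class is the one this PR adds.
from model_engine_server.core.config import infra_config
from model_engine_server.infra.gateways.resources.gcp_pubsub_queue_endpoint_resource_delegate import (
    GcpPubSubQueueEndpointResourceDelegate,
)

# Env-first read, mirroring os.getenv("SQS_PROFILE", hmi_config.sqs_profile):
# prefer the Helm-injected pod env var, fall back to the infra-config value.
project_id = os.getenv("GCP_PROJECT_ID", infra_config().gcp_project_id)
queue_delegate = GcpPubSubQueueEndpointResourceDelegate(project_id=project_id)
```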
…import

Greptile P1 (gcp_pubsub_queue_endpoint_resource_delegate.py:118): `delete_queue` previously short-circuited after a non-NotFound failure on `delete_subscription`, leaving the topic orphaned. Now we attempt BOTH deletions, collect any GoogleAPIErrors, and raise a single EndpointResourceInfraException with all cleanup failures after both attempts. NotFound on either step remains silent (idempotent).

Ruff F401 (api/dependencies.py:111): RedisQueueEndpointResourceDelegate was imported but is no longer used — removed in the prior commit when GCP switched from the redis fallback to the new Pub/Sub delegate. Dropping the import fixes the run_unit_tests_server Ruff Lint Check.

Test: adds test_delete_queue_subscription_failure_does_not_orphan_topic to pin the orphan-prevention invariant.
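A sketch of the resulting delete path (hedged: the exception class is stubbed here and the client/path plumbing is simplified into function arguments; the real code lives inside the delegate):

```python
from google.api_core.exceptions import GoogleAPIError, NotFound
from google.cloud import pubsub_v1


class EndpointResourceInfraException(Exception):
    """Stub standing in for the project's real exception class."""


def delete_queue(
    publisher: pubsub_v1.PublisherClient,
    subscriber: pubsub_v1.SubscriberClient,
    subscription_path: str,
    topic_path: str,
) -> None:
    errors: list[GoogleAPIError] = []
    # Attempt BOTH deletions, subscription first (the order the tests pin),
    # so a subscription failure can no longer orphan the topic.
    try:
        subscriber.delete_subscription(request={"subscription": subscription_path})
    except NotFound:
        pass  # idempotent: already gone
    except GoogleAPIError as e:
        errors.append(e)
    try:
        publisher.delete_topic(request={"topic": topic_path})
    except NotFound:
        pass  # idempotent: already gone
    except GoogleAPIError as e:
        errors.append(e)
    if errors:
        # Single exception carrying all cleanup failures, raised only after
        # both attempts have run.
        raise EndpointResourceInfraException(f"Pub/Sub cleanup failed: {errors}")
```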
The Type Check (mypy) CI step on run_unit_tests_server failed because
os.getenv("GCP_PROJECT_ID") or infra_config().gcp_project_id
evaluates to str | None, but GcpPubSubQueueEndpointResourceDelegate
declares project_id: str. The delegate's runtime guard catches misconfig,
but mypy can't see through __init__ to narrow the type at the call site.
Narrow explicitly at the 3 factory sites — extract the project_id into
a local, check `if not project_id: raise ValueError(...)`, then pass to
the delegate. mypy is happy, and misconfig surfaces with a clearer error
naming both possible config sources rather than the generic ValueError
from the delegate.
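The narrowed factory-site pattern might look like this (same assumed import paths as in the earlier sketch; the exact error wording is illustrative):

```python
import os

from model_engine_server.core.config import infra_config  # assumed path
from model_engine_server.infra.gateways.resources.gcp_pubsub_queue_endpoint_resource_delegate import (
    GcpPubSubQueueEndpointResourceDelegate,
)

project_id = os.getenv("GCP_PROJECT_ID") or infra_config().gcp_project_id
if not project_id:
    # Clearer than the delegate's generic ValueError: names both config sources.
    raise ValueError(
        "GCP project id missing: set the GCP_PROJECT_ID env var "
        "or infra config gcp_project_id"
    )
# mypy now sees project_id as str, matching the delegate's annotation.
queue_delegate = GcpPubSubQueueEndpointResourceDelegate(project_id=project_id)
```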
lilyz-ai
approved these changes
May 13, 2026
…-pubsub
Previous commits added `google-cloud-pubsub>=2.18` to requirements.in but
left requirements.txt stale. CI installs from the locked file, so the
new dependency wasn't available and the unit-test collection failed with:
ImportError: cannot import name 'pubsub_v1' from 'google.cloud'
Regenerated with the command in the file header:
uv pip compile requirements.in --python-version 3.13 --no-cache -o requirements.txt
Adding google-cloud-pubsub indirectly broke test collection because:

1. google-cloud-pubsub depends on opentelemetry-sdk (transitive).
2. common/startup_tracing/correlation.py sets OTEL_AVAILABLE=True when opentelemetry.trace is importable — which is now the case.
3. With OTEL_AVAILABLE=True, common/startup_tracing/tracer.py:19 imports opentelemetry.exporter.otlp.proto.grpc.metric_exporter, which lives in opentelemetry-exporter-otlp-proto-grpc — not previously a declared dependency anywhere in the project.

CI was failing test collection on both startup_tracing tests with ModuleNotFoundError: No module named 'opentelemetry.exporter'. Pin opentelemetry-exporter-otlp-proto-grpc explicitly so the import chain in tracer.py resolves whenever OTEL_AVAILABLE is True.

Note: the underlying inconsistency (OTEL_AVAILABLE flagged by a subset of the imports that the actual code path uses) is a pre-existing latent bug in tracer.py — a follow-up could either tighten the import guard or make the exporter imports lazy. Out of scope for this PR.
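For reference, the suggested follow-up guard could look roughly like this (a sketch, explicitly out of scope for this PR; the module names are taken from the error chain above):

```python
# Sketch of a tighter guard for common/startup_tracing/correlation.py:
# flag OTEL as available only if the exporter that tracer.py actually
# imports is also importable, so a transitive opentelemetry-sdk install
# can't flip the flag on its own.
try:
    import opentelemetry.trace  # noqa: F401
    import opentelemetry.exporter.otlp.proto.grpc.metric_exporter  # noqa: F401

    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False
```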
```python
subscription_path = (
    f"projects/{self.project_id}/subscriptions/{self._subscription_id(endpoint_id)}"
)
ack_deadline = min(queue_message_timeout_seconds or 60, GCP_PUBSUB_MAX_ACK_DEADLINE_SECONDS)
```
In `model-engine/model_engine_server/infra/gateways/resources/gcp_pubsub_queue_endpoint_resource_delegate.py`, line 64: Pub/Sub enforces a 10–600 second range for `ack_deadline_seconds`. The current expression only clamps to the 600-second ceiling; values of 1–9 (a valid user-supplied `queue_message_timeout_seconds`) will be forwarded to the API and rejected with `INVALID_ARGUMENT`. Add a lower bound of 10 to mirror the ceiling clamp.
Suggested change:
```diff
-ack_deadline = min(queue_message_timeout_seconds or 60, GCP_PUBSUB_MAX_ACK_DEADLINE_SECONDS)
+GCP_PUBSUB_MIN_ACK_DEADLINE_SECONDS = 10  # Pub/Sub hard minimum
+ack_deadline = max(
+    GCP_PUBSUB_MIN_ACK_DEADLINE_SECONDS,
+    min(queue_message_timeout_seconds or 60, GCP_PUBSUB_MAX_ACK_DEADLINE_SECONDS),
+)
```
Two regressions from prior commits surfaced when CI actually ran the
unit tests:
1) test_gcp_provider_selects_gcp_implementations failed with
google.auth.exceptions.DefaultCredentialsError. Eagerly constructing
pubsub_v1.PublisherClient() / SubscriberClient() in __init__ triggers
Google ADC auth at delegate-construction time; CI has no credentials.
Make both clients lazy: store None on __init__, materialize on first
property access, cache thereafter. Still avoids per-call construction
(the Greptile P2 we addressed previously), but doesn't trip auth when
the delegate is merely constructed by a factory under test (a sketch of
the lazy-client pattern follows after item 2).
2) test_k8s_endpoint_resource_delegate's helm-template subprocess tests
failed with "block sequence entries are not allowed in this context".
The $gcp_cloud_provider local-variable line in _helpers.tpl used
{{- ... -}} which stripped the trailing newline, smashing the
preceding LAUNCH_SERVICE_TEMPLATE_FOLDER value: into the next
MODEL_CACHE_ENABLED entry.
Change to {{- ... }} so the newline survives.
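A sketch of the lazy-client pattern described in item 1 (class and attribute names assumed, not the exact diff):

```python
from typing import Optional

from google.cloud import pubsub_v1


class LazyClientSketch:
    """Sketch only: lazy Pub/Sub clients, materialized on first access."""

    def __init__(self) -> None:
        # No Google ADC auth here: constructing the delegate (e.g. in a
        # factory under test with no credentials) stays cheap and safe.
        self._publisher: Optional[pubsub_v1.PublisherClient] = None
        self._subscriber: Optional[pubsub_v1.SubscriberClient] = None

    @property
    def publisher(self) -> pubsub_v1.PublisherClient:
        if self._publisher is None:
            # Auth happens here, on first real use; cached thereafter,
            # so per-call construction is still avoided.
            self._publisher = pubsub_v1.PublisherClient()
        return self._publisher

    @property
    def subscriber(self) -> pubsub_v1.SubscriberClient:
        if self._subscriber is None:
            self._subscriber = pubsub_v1.SubscriberClient()
        return self._subscriber
```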
…regression

This is the third consecutive CI run failing on the same line in a file this PR does not own. The cause is environmental:

* CircleCI runs `mypy --install-types --non-interactive`, which fetches the latest type stubs on every job.
* types-setuptools 82.0.0.20260508 tightened `package_data` to expect `_DictLike[str, list[str]]`. The literal `dict[str, list[str]]` we have here is runtime-compatible, but the new stub disagrees.

Suppress at the call site with `# type: ignore[arg-type]`. The runtime behavior is unchanged. Annotating here is cheaper than pinning the stub (which would mask future legitimate tightenings).

This is in clients/python/setup.py, which is not part of this PR's feature work; treating as a required courtesy to unblock CI since the same regression hits every PR opened today.
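The suppression amounts to a one-line annotation in clients/python/setup.py; a sketch with placeholder metadata (the actual package name and data mapping are not shown on this PR page):

```python
from setuptools import setup

setup(
    name="example-client",  # placeholder: real metadata unchanged by this PR
    # Runtime-compatible dict literal that types-setuptools 82.0.0.20260508
    # now flags as not matching its stricter _DictLike stub.
    package_data={"example_client": ["py.typed"]},  # type: ignore[arg-type]
)
```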
…he new delegate

The test was still asserting the pre-PR contract — that cloud_provider=gcp selects RedisQueueEndpointResourceDelegate. This PR's whole point is to replace that Redis fallback with the new GcpPubSubQueueEndpointResourceDelegate, so the assertion has to match the new behavior.

This was the last failing test (1 of 751 in run_unit_tests_server). All other unit tests, integration_tests, run_unit_tests_python_client, build_docs, build_image, and Socket Security checks are green.
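The updated assertion might read roughly as follows (a sketch: the fixture and factory helper are assumed names, not the repository's actual test plumbing):

```python
def test_gcp_provider_selects_gcp_implementations(fake_gcp_env) -> None:
    # fake_gcp_env and build_live_endpoint_resource_gateway are assumed
    # helpers for illustration; only the asserted contract is the point.
    gateway = build_live_endpoint_resource_gateway(cloud_provider="gcp")
    assert isinstance(
        gateway.queue_delegate, GcpPubSubQueueEndpointResourceDelegate
    )
```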
lilyz-ai
approved these changes
May 13, 2026
Summary
Adds a Google Cloud Pub/Sub-backed implementation of `QueueEndpointResourceDelegate` so that `cloud_provider=gcp` clusters can deploy async inference endpoints. Before this PR, async deploys on GCP fail at `live_endpoint_resource_gateway.create_queue` because the only wired delegate is SQS (AWS), which raises `NoCredentialsError` on a GCP node. We see this consistently on `gke_scale-dev-mofa_asia-northeast3_sgp-pmodev-kubernetes-cluster`.

Repro
Temporal workflow IDs from a recent attempt on the GCP cluster:
- `1020f4b2-e55f-4b6e-acf3-7c0090a4c5c2` (qwen3-e-4b-7, sync — fails at workflow create_activity, separate bug)
- `9f86bd97-4804-4945-b2e1-29e0dd2c794a` (qwen3-e-4b-9-async, async — fails at `sqs_queue_endpoint_resource_delegate.create_queue_if_not_exists` with `NoCredentialsError: Unable to locate credentials`)

Changes
- `infra/gateways/resources/gcp_pubsub_queue_endpoint_resource_delegate.py` mirrors the ASB shape. Creates a topic + pull subscription per endpoint, idempotent (catches `AlreadyExists`/`NotFound`).
- Wires the new delegate into the factory sites: `api/dependencies.py`, `entrypoints/start_batch_job_orchestration.py`, `entrypoints/k8s_cache.py`.
- Adds an `elif "num_undelivered_messages" in sqs_attributes` branch in `live_endpoint_resource_gateway.py` so the gateway recognizes the new attribute shape (a sketch follows this list).
- Adds `google-cloud-pubsub` to `requirements.in`.
- Helm: adds `gcp.project_id`, `gcp.pubsub_topic_prefix`, `gcp.pubsub_subscription_prefix` and pipes them as env vars to the deployments that build/manage queues.
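A sketch of the gateway branch referenced above (only the quoted `elif` condition comes from the PR; the surrounding names and the `-1` guard are inferred from the sentinel described in the Out of scope section below):

```python
from typing import Optional


def queue_size_from_attributes(sqs_attributes: dict) -> Optional[int]:
    if "ApproximateNumberOfMessages" in sqs_attributes:  # SQS attribute shape
        return int(sqs_attributes["ApproximateNumberOfMessages"])
    elif "num_undelivered_messages" in sqs_attributes:  # new Pub/Sub shape
        size = int(sqs_attributes["num_undelivered_messages"])
        # Guard the -1 sentinel: the delegate can't report depth synchronously,
        # so it returns -1 until Cloud Monitoring is wired up in a follow-up.
        return None if size < 0 else size
    return None
```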
Out of scope (deliberately)

- IAM: operators must supply (via `serviceAccount.annotations`) a GCP SA with `roles/pubsub.editor` on the target project.
- `num_undelivered_messages`: Pub/Sub doesn't expose an undelivered-count attribute synchronously. The delegate currently returns `-1` — wiring Cloud Monitoring's `pubsub.googleapis.com/subscription/num_undelivered_messages` metric belongs in a follow-up.
Test plan

- `python -m py_compile` on every Python file touched
- `helm template charts/model-engine` renders without new errors

Verification commands
Greptile Summary
- Adds `GcpPubSubQueueEndpointResourceDelegate` (topic + pull-subscription per endpoint) and wires it into the three factory sites (`dependencies.py`, `k8s_cache.py`, `start_batch_job_orchestration.py`), fixing `NoCredentialsError` on GCP async endpoint deploys. Previous review concerns (lazy client init, prefix usage, orphaned-resource cleanup, `-1` queue depth guard, `GCP_PROJECT_ID` env-first read) have all been addressed in this revision.
- Remaining issues: the `PUBSUB_TOPIC_PREFIX`/`PUBSUB_SUBSCRIPTION_PREFIX` Helm env vars are injected but not forwarded to the delegate constructor, and `ack_deadline_seconds` lacks a lower-bound clamp of 10 s (Pub/Sub rejects values below 10).
- Helm adds `gcp.project_id`, `gcp.pubsub_topic_prefix`, and `gcp.pubsub_subscription_prefix`; `opentelemetry-exporter-otlp-proto-grpc` is added to `requirements.in` without a version constraint, unlike every other entry.
Two carry-over P1s from prior rounds are unresolved: prefix env vars silently dropped and ack_deadline lower bound missing; otherwise the implementation is solid.
Prior review rounds flagged ack_deadline_seconds missing a 10 s lower bound (Pub/Sub rejects 1–9 s values with INVALID_ARGUMENT) and PUBSUB_TOPIC_PREFIX/PUBSUB_SUBSCRIPTION_PREFIX env vars injected by Helm but not consumed. Both remain unaddressed. Multiple P1s pull the ceiling below 4.
model-engine/model_engine_server/infra/gateways/resources/gcp_pubsub_queue_endpoint_resource_delegate.py (ack_deadline lower bound), model-engine/model_engine_server/api/dependencies.py + k8s_cache.py + start_batch_job_orchestration.py (prefix env vars not forwarded)
Important Files Changed
Comments Outside Diff (1)
`model-engine/model_engine_server/api/dependencies.py`, lines 261-265 (link): `PUBSUB_TOPIC_PREFIX`/`PUBSUB_SUBSCRIPTION_PREFIX` env vars are injected by Helm but never consumed.

The Helm chart injects both `PUBSUB_TOPIC_PREFIX` and `PUBSUB_SUBSCRIPTION_PREFIX` as pod env vars (from `.Values.gcp.pubsub_topic_prefix` / `.Values.gcp.pubsub_subscription_prefix`), but all three factory sites (`dependencies.py`, `k8s_cache.py`, `start_batch_job_orchestration.py`) construct `GcpPubSubQueueEndpointResourceDelegate` with only `project_id`, leaving both prefixes at their hardcoded defaults. Any operator who sets custom Helm prefix values will see them silently dropped; GCP resources will always be named `launch-endpoint-id-{id}` regardless. The same fix is needed in `k8s_cache.py` and `start_batch_job_orchestration.py`.
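A minimal fix sketch for one factory site (the constructor kwarg names follow the `topic_prefix`/`subscription_prefix` args mentioned in the commit list above; the fallback literals are assumptions, and the same change would apply in `k8s_cache.py` and `start_batch_job_orchestration.py`):

```python
import os

queue_delegate = GcpPubSubQueueEndpointResourceDelegate(
    project_id=project_id,
    # Forward the Helm-injected prefixes instead of silently dropping them.
    # The fallback literals here are assumptions, not the code's actual defaults.
    topic_prefix=os.getenv("PUBSUB_TOPIC_PREFIX", "launch-endpoint-id-"),
    subscription_prefix=os.getenv("PUBSUB_SUBSCRIPTION_PREFIX", "launch-endpoint-id-"),
)
```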
Reviews (12): Last reviewed commit: "ci: retrigger integration tests (sync en..."