feat: integrate vLLM artifact transfer by zhengluo-nv · Pull Request #451 · ai-dynamo/modelexpress

zhengluo-nv · 2026-06-23T23:06:21Z

Overview

This PR integrates cache artifact transfer into the vLLM ModelExpress loader so a target replica can reuse runtime-generated JIT caches from an already-warm source replica before vLLM starts model initialization.

The implementation covers three cache families:

torch compile cache: VLLM_CACHE_ROOT/torch_compile_cache
Triton cache: TRITON_CACHE_DIR or ~/.triton/cache
DeepGEMM cache: DG_JIT_CACHE_DIR, DEEP_GEMM_CACHE_DIR, or VLLM_CACHE_ROOT/deep_gemm
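
A minimal sketch of how the three cache roots above could be resolved. It assumes the variables are tried in the order listed, and that an unset `VLLM_CACHE_ROOT` falls back to `~/.cache/vllm`; the helper name `resolve_cache_dirs` is hypothetical and not taken from this PR.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Return the candidate cache directory for each cache family."""
    env = os.environ if env is None else env
    # Assumption: fall back to ~/.cache/vllm when VLLM_CACHE_ROOT is unset.
    vllm_root = Path(env.get("VLLM_CACHE_ROOT") or Path.home() / ".cache" / "vllm")
    triton = env.get("TRITON_CACHE_DIR") or Path.home() / ".triton" / "cache"
    # Assumption: the listed order is the lookup precedence.
    deep_gemm = (
        env.get("DG_JIT_CACHE_DIR")
        or env.get("DEEP_GEMM_CACHE_DIR")
        or vllm_root / "deep_gemm"
    )
    return {
        "torch_compile": vllm_root / "torch_compile_cache",
        "triton": Path(triton),
        "deep_gemm": Path(deep_gemm),
    }
```

Passing a plain dict as `env` keeps the lookup easy to test without touching the process environment.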

Artifact transfer is opt-in through MX_ARTIFACT_TRANSFER=1 and requires the P2P metadata path. If artifact transfer is enabled while MX_P2P_METADATA=0, the loader logs a warning and skips artifact transfer rather than changing the metadata default.
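
The gating rule can be sketched as follows. This is an illustration rather than the loader's code: the function name is hypothetical, and it assumes that only an explicit `MX_P2P_METADATA=0` counts as disabled, since the metadata default is not stated here.

```python
import logging

logger = logging.getLogger("artifact_gate")


def artifact_transfer_enabled(env):
    """Return True only when artifact transfer is requested and P2P metadata is not disabled."""
    if env.get("MX_ARTIFACT_TRANSFER") != "1":
        return False  # opt-in: anything other than "1" leaves the feature off
    if env.get("MX_P2P_METADATA") == "0":
        # Warn and skip instead of silently flipping the metadata setting.
        logger.warning(
            "MX_ARTIFACT_TRANSFER=1 requires P2P metadata; skipping artifact transfer"
        )
        return False
    return True
```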

What changed

Installs compatible vLLM cache artifacts before initialize_model() runs.
Schedules artifact publication after model load, gated by vLLM readiness and a short cache-settle check so generated cache files are present before publishing.
Refactors the heartbeat path into PublisherThread, which can retry initial publication and then heartbeat READY status.
Shares the existing rank-local P2P WorkerGrpcServer for both tensor manifests and artifact manifests/chunk serving, instead of starting one artifact server per artifact source.
Publishes each artifact type once per pod/cache root to avoid multiple TP ranks untarring into the same cache directory.
Adds TIMING logs for artifact prepare, transfer, install, publish, and vLLM artifact install/publish wrapper timing.
Adds CI knobs and tests for artifact source publication and target artifact install.
Updates docs for artifact transfer env vars, P2P metadata requirement, and publisher-thread behavior.
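
For readers unfamiliar with the cache-settle idea, here is a rough sketch of one way to check that a cache directory has stopped changing before it is published. The names `snapshot` and `cache_settled` and the two-second window are assumptions for illustration; the PR's actual check may differ.

```python
import os
import time


def snapshot(cache_dir):
    """Return (file_count, total_bytes, newest_mtime_ns) for a directory tree."""
    count = total = newest = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                st = os.stat(os.path.join(root, name))
            except FileNotFoundError:
                continue  # file vanished mid-walk (e.g. a temp file was renamed)
            count += 1
            total += st.st_size
            newest = max(newest, st.st_mtime_ns)
    return count, total, newest


def cache_settled(cache_dir, settle_seconds=2.0, sleep=time.sleep):
    """True if the directory is non-empty and unchanged across the settle window."""
    before = snapshot(cache_dir)
    if before[0] == 0:
        return False  # nothing generated yet, so there is nothing to publish
    sleep(settle_seconds)
    return snapshot(cache_dir) == before
```

The `sleep` parameter is injectable so a test can simulate a write that lands during the settle window.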

Benchmark

DeepSeek-V4-Pro standalone vLLM on nscale B200:

vLLM 0.23.0
TP=8
--enable-flashinfer-autotune
ModelExpress P2P weights via RDMA
Cache artifacts enabled for Triton + DeepGEMM
torch.compile cache was not used because vLLM reports DeepSeek-V4-Pro does not support torch.compile

Scenario	API ready	K8s Ready	Notes
Cold start replica	8m12s	8m17s	Rerun after node OS page-cache drop; disk load path, no P2P source
P2P weights only replica	7m06s	7m14s	RDMA weights in ~4.0-4.1s/rank at 212-219 Gbps
P2P weights + cache artifacts replica	3m17s	3m17s	Triton install 0.223s, DeepGEMM install 0.276s

Additional timing detail:

Cold start rerun after OS page-cache drop: loader time ~77.8-80.1s, graph capture 109s, vLLM engine init 351.6s
P2P weights only loader time: ~18.2-18.5s
P2P weights + cache loader time: ~15.2-15.7s
P2P weights only graph capture: 108s
P2P weights + cache graph capture: 75s
P2P weights only vLLM engine init: 350.6s
P2P weights + cache vLLM engine init: 122.91s

Impact in this run:

P2P weights only: 8m12s -> 7m06s API ready
P2P weights + caches: 8m12s -> 3m17s API ready
Cache artifacts reduced the later vLLM warmup/init work, not the RDMA weight-transfer time.
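
The savings implied by the API-ready numbers can be checked with a few lines of arithmetic. The snippet only re-derives figures from the table above; `to_seconds` is a helper written for this example.

```python
import re


def to_seconds(text):
    """Parse a duration such as '8m12s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


cold = to_seconds("8m12s")  # cold start, API ready
for label, ready in [("P2P weights only", "7m06s"), ("P2P weights + caches", "3m17s")]:
    saved = cold - to_seconds(ready)
    print(f"{label}: {saved}s saved ({100 * saved / cold:.1f}% of cold start)")
```

This works out to 66s (13.4%) for P2P weights only and 295s (60.0%) for P2P weights plus caches. The 229s gap between the two P2P rows is close to the 227.69s drop in vLLM engine init (350.6s to 122.91s), which is consistent with the note that the caches cut warmup/init work rather than weight-transfer time.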

Correctness sanity:

Source chat request returned Four
Target chat request returned Four

Validation

uv run --project modelexpress_client/python --extra dev pytest modelexpress_client/python/tests/test_vllm_artifacts.py modelexpress_client/python/tests/test_artifact_transfer.py modelexpress_client/python/tests/test_heartbeat.py -q
python3 -m py_compile ci/k8s/client/test_p2p_k8s.py
nscale DeepSeek-V4-Pro TP=8 e2e benchmark above

Summary by CodeRabbit

New Features
- Added vLLM cache artifact transfer for peer-to-peer deployments, enabling automatic installation of torch-compile, Triton, and DeepGEMM caches across workers.
- Introduced artifact readiness checking and health validation for reliable transfer.
- New CI test to validate artifact transfer behavior in Kubernetes P2P jobs.
Documentation
- Updated deployment and architecture documentation with artifact-transfer configuration and workflow details.
Chores
- Upgraded vLLM Docker base image to v0.23.0.

coderabbitai · 2026-06-23T23:21:43Z

Walkthrough

Introduces end-to-end vLLM cache artifact transfer in ModelExpress. HeartbeatThread is replaced by a general PublisherThread supporting deferred publish callbacks and readiness gating. WorkerGrpcServer is refactored to serve multiple artifact sources keyed by mx_source_id. publish_artifact_source now accepts a pre-started server. A new artifacts.py module installs and schedules publication of torch-compile, Triton, and DeepGEMM caches, wired into the loader. CI/CD is extended to assert artifact transfer completion in the vLLM test matrix.

Changes

vLLM Cache Artifact Transfer

Layer / File(s)	Summary
PublisherThread replaces HeartbeatThread `modelexpress_client/python/modelexpress/metadata/heartbeat.py`, `modelexpress_client/python/modelexpress/__init__.py`	`heartbeat.py` is rewritten as `PublisherThread` supporting heartbeat-only, retried-publish, and publish-then-stop modes with optional `ready_fn`/`cleanup_fn`; `HeartbeatThread` becomes an alias. `PublisherThread` is added to the package `__all__`.
WorkerGrpcServer multi-source artifact serving `modelexpress_client/python/modelexpress/metadata/worker_server.py`	`WorkerServiceServicer` replaces single-artifact fields with a locked `_artifact_sources` map and `_select_artifact_source` helper. All artifact RPCs route through the selected source. `WorkerGrpcServer` accepts optional `mx_source_id` and exposes `register_artifact_source`/`unregister_artifact_source`/`set_mx_source_id`.
`artifact_transfer.py` API refactor and timing logs `modelexpress_client/python/modelexpress/metadata/artifact_transfer.py`	`publish_artifact_source` now takes a pre-started `WorkerGrpcServer` instead of a port, registers/unregisters artifact sources on it, and `PublishedArtifactSource.stop()` unregisters instead of stopping the server. `worker_rank` becomes `int\|None`. `[TIMING]` log lines with throughput are added to `prepare_source`, `install`, and `transfer_artifact_from_worker`.
`publish.py` migration to `PublisherThread` `modelexpress_client/python/modelexpress/metadata/publish.py`	`publish_metadata_and_ready` creates `WorkerGrpcServer` with `mx_source_id=None`, defers `_publish_metadata_to_server` into a `publish_fn` callback given to `PublisherThread`, and sets `mx_source_id` on the server post-publish. A `cleanup_fn` stops and removes the server for P2P paths.
New vLLM cache artifacts module `modelexpress_client/python/modelexpress/engines/vllm/artifacts.py`	New module adds `install_vllm_cache_artifacts` (one-time marker-gated installs of torch-compile, Triton, DeepGEMM caches) and `schedule_vllm_cache_artifact_publish` (schedules `PublisherThread` per cache type, readiness-gated by vLLM HTTP health + cache directory stability). Includes NIXL manager init, artifact identity construction, SHA256 marker keys, and version/GPU-arch helpers.
Loader wiring `modelexpress_client/python/modelexpress/engines/vllm/loader.py`	`load_model()` calls `install_vllm_cache_artifacts(ctx)` inside `maybe_enter_vmm_arena` and `schedule_vllm_cache_artifact_publish(ctx)` after registry updates.
Tests: PublisherThread, artifact transfer, and loader publish flow `modelexpress_client/python/tests/test_heartbeat.py`, `modelexpress_client/python/tests/test_artifact_transfer.py`, `modelexpress_client/python/tests/test_vllm_loader.py`	`TestPublisherPublishAndReady` tests first-tick publish+READY, `ready_fn` gating, and `heartbeat_after_publish=False`. Artifact transfer tests add a shared-port tensor+artifact gRPC test, `caplog` `[TIMING]` assertions, and updated `publish_artifact_source` tests using `_FakeWorkerGrpcServer`. Loader tests validate `PublisherThread` start, `publish_fn` retry behavior, P2P gRPC server ordering, and `set_mx_source_id`.
Tests: vLLM cache artifacts module `modelexpress_client/python/tests/test_vllm_artifacts.py`	New test module covers default-off install, P2P metadata disabled warning, NIXL init failure skip, artifact identity fields, distinct transfer source types and cache roots, one-time install marker, publisher scheduling with readiness gating, `_vllm_artifact_ready_fn` health+settle gating, and `_vllm_health_url` derivation.
CI/CD: artifact transfer assertions and K8s manifest `.github/actions/run-mx-p2p-test/action.yml`, `.github/workflows/modelexpress-ci-tests.yml`, `ci/k8s/client/conftest.py`, `ci/k8s/client/test_p2p_k8s.py`, `ci/k8s/client/vllm/manifest-azure.yaml`	Action adds `expected_artifact_sources` input, bash CR-counting helpers, a wait loop, and conditional pytest flags. Workflow sets `expected_artifact_sources=1` for vLLM. `conftest.py` adds `--require-artifact-transfer` / `--expected-artifact-sources` options and fixtures. `test_p2p_k8s.py` adds `_artifact_source_count` and `test_artifact_transfer`. Azure manifest enables `MX_P2P_METADATA`, `MX_ARTIFACT_TRANSFER`, and `MX_ARTIFACT_READY_URL`.
Docs and Dockerfile `docs/ARCHITECTURE.md`, `docs/DEPLOYMENT.md`, `docs/metadata.md`, `examples/p2p_transfer_k8s/client/vllm/Dockerfile`	Architecture and metadata docs replace `HeartbeatThread` with `PublisherThread`; deployment docs add artifact-transfer env vars, update `MX_WORKER_GRPC_PORT` default to `6555`, and add a P2P artifact transfer configuration section. Dockerfile bumps `vllm-openai` base from `v0.17.1` to `v0.23.0`.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Hop, hop! A new PublisherThread sprouts,
Cache artifacts bundled without any doubts.
The worker gRPC serves many at once,
No heartbeat left dangling — no silly stunts!
vLLM caches flow peer-to-peer through the night,
This bunny ships artifacts with pure delight! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.07% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: integrate vLLM artifact transfer' directly and clearly summarizes the main change in the pull request.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/p2p_transfer_k8s/client/vllm/Dockerfile`:
- Line 4: Update the verification claim in docs/ARCHITECTURE.md that currently
documents the adopt_hidden_tensors() fix verification only on v0.17.1 and
v0.19.0. Since v0.23.0 is now established as the canonical vLLM version in the
Dockerfile, update ARCHITECTURE.md to reflect that the fix has been verified on
v0.23.0 and include the context of testing with DeepSeek-V4-Pro as mentioned in
this PR scope. Ensure the documentation reflects v0.23.0 as the current baseline
version while you may retain historical version references if needed for
context.

In `@modelexpress_client/python/modelexpress/engines/vllm/artifacts.py`:
- Around line 349-350: The code uses raw int(os.environ.get(...)) for parsing
environment variables at two locations, which will raise a ValueError if the
environment variables contain malformed values and cause load_model() to fail.
Wrap the int() conversion for both the MX_METADATA_PORT and the second
environment variable (around line 539) in a try-except block that catches
ValueError, logs a warning message indicating the invalid value and that the
default is being used, and then gracefully falls back to the default value
instead of allowing the exception to propagate and break artifact loading.
- Around line 512-525: The is_vllm_server_ready() function calls
urllib.request.urlopen() with a URL that may come from the environment variable
_READY_URL_ENV without validating the URL scheme, which could allow dangerous
protocols like file://, gopher://, etc. Add URL scheme validation in the
_vllm_health_url() function to ensure the returned URL only uses http or https
schemes, and raise an appropriate exception or return a safe default if the
scheme is invalid. This validation should occur before the URL is used in the
urlopen() call within is_vllm_server_ready().
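
A minimal sketch of the two hardening steps suggested for `artifacts.py`: tolerant integer parsing and an http/https scheme check. The helper names `env_int` and `require_http_url` are hypothetical, and the default passed by the caller is a placeholder rather than the project's real default.

```python
import logging
import os
from urllib.parse import urlparse

logger = logging.getLogger("artifact_env")


def env_int(name, default, env=None):
    """Parse an integer env var, falling back to the default on malformed input."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        logger.warning("Invalid %s=%r; using default %d", name, raw, default)
        return default


def require_http_url(url):
    """Return the URL unchanged if it uses http or https, else raise ValueError."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme {scheme!r} in {url!r}")
    return url
```

Raising on a bad scheme is one option; returning a safe default, as the comment also allows, would keep readiness polling alive at the cost of hiding the misconfiguration.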

In `@modelexpress_client/python/modelexpress/metadata/artifact_transfer.py`:
- Around line 394-396: The current guard at the artifact publish location
(checking if worker_grpc_server.port is None) only catches servers that were
never started, not ones that have been stopped. In the WorkerGrpcServer.stop()
method, the _port is not cleared when the server stops, allowing stopped servers
to still return an old port. Fix this by either clearing the _port (and
_servicer if applicable) in the WorkerGrpcServer.stop() method so the port check
correctly identifies stopped servers, or alternatively add an explicit
is_started property to WorkerGrpcServer that properly tracks the server state
and use that property in the guard instead of checking if port is None.
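
Both options from this comment can be illustrated with a small stand-in class. `ServerHandle` and `publish_guard` are hypothetical and do not mirror the real `WorkerGrpcServer`; the point is only that `stop()` clears the fields the guards rely on, and that an explicit `is_started` property makes the check unambiguous.

```python
class ServerHandle:
    """Hypothetical stand-in for a gRPC server wrapper; not the real WorkerGrpcServer."""

    def __init__(self):
        self._port = None
        self._servicer = None

    def start(self, port):
        self._servicer = object()  # placeholder for the real servicer
        self._port = port

    def stop(self):
        # Clear both fields so liveness guards see a stopped server as not started.
        self._servicer = None
        self._port = None

    @property
    def port(self):
        return self._port

    @property
    def is_started(self):
        return self._port is not None and self._servicer is not None


def publish_guard(server):
    """Refuse to publish against a server that was never started or has been stopped."""
    if not server.is_started:
        raise RuntimeError("worker gRPC server is not running")
    return server.port
```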

In `@modelexpress_client/python/modelexpress/metadata/heartbeat.py`:
- Around line 133-139: A race condition exists where stop() can return after the
thread join times out while _tick() is still running, allowing _tick() to
publish a READY status after _mark_stale() has been called. To fix this, add a
check within the _tick() method to prevent publishing or sending READY status if
_stop_event has already been set. This ensures that once stop() has been
invoked, no READY announcements will be sent even if _tick() continues executing
briefly. The same prevention logic should also be applied to any other methods
listed in the comment (around lines 206-241 and 243-257) that may publish status
updates.

In `@modelexpress_client/python/modelexpress/metadata/publish.py`:
- Around line 39-40: The cleanup callback for worker servers is removing entries
from _worker_servers unconditionally by device_id, which can cause a race
condition where a stale publisher cleanup removes a newly registered server for
the same device. Guard the cleanup operation by checking that the current value
in _worker_servers for that device_id still matches the server instance being
cleaned up before removing it. This ensures only the actual stale server is
removed and not a newer registration that arrived after this server was
initially registered.

In `@modelexpress_client/python/modelexpress/metadata/worker_server.py`:
- Around line 84-90: The unregister_artifact_source method removes the entire
source from _artifact_sources dictionary when a single artifact_id is found,
making other manifests under the same mx_source_id unreachable. Instead of
calling pop on _artifact_sources, only remove the specific artifact_id from the
source.manifests dictionary. You may optionally remove the entire source only if
no manifests remain after this deletion to maintain a clean state.
- Around line 349-378: The new methods set_mx_source_id,
register_artifact_source, and unregister_artifact_source use _servicer and _port
as liveness guards to detect if the server is running, but the stop() method is
not clearing these fields when the server shuts down. This allows a stopped
server to still pass these checks and accept operations on dead endpoints.
Locate the stop() method in this class and ensure it explicitly sets both
_servicer and _port to None when the server is stopped, so these fields properly
serve as liveness indicators for the guard checks in the new helper methods.
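
The per-manifest unregister behavior from the first comment on this file can be sketched like this. `ArtifactSource` and `ArtifactRegistry` are simplified, hypothetical stand-ins for the servicer's `_artifact_sources` map.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ArtifactSource:
    manifests: dict = field(default_factory=dict)  # artifact_id -> manifest


class ArtifactRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._sources = {}  # mx_source_id -> ArtifactSource

    def register(self, mx_source_id, artifact_id, manifest):
        with self._lock:
            source = self._sources.setdefault(mx_source_id, ArtifactSource())
            source.manifests[artifact_id] = manifest

    def unregister(self, mx_source_id, artifact_id):
        """Remove one manifest; drop the source only when it has none left."""
        with self._lock:
            source = self._sources.get(mx_source_id)
            if source is None or artifact_id not in source.manifests:
                return False
            del source.manifests[artifact_id]
            if not source.manifests:
                del self._sources[mx_source_id]
            return True

    def artifact_ids(self, mx_source_id):
        with self._lock:
            source = self._sources.get(mx_source_id)
            return sorted(source.manifests) if source else []
```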


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aef8a993-785d-41bc-bfe1-bb0b51229909

📥 Commits

Reviewing files that changed from the base of the PR and between 485dfaf and 443c985.

📒 Files selected for processing (20)

.github/actions/run-mx-p2p-test/action.yml
.github/workflows/modelexpress-ci-tests.yml
ci/k8s/client/conftest.py
ci/k8s/client/test_p2p_k8s.py
ci/k8s/client/vllm/manifest-azure.yaml
docs/ARCHITECTURE.md
docs/DEPLOYMENT.md
docs/metadata.md
examples/p2p_transfer_k8s/client/vllm/Dockerfile
modelexpress_client/python/modelexpress/__init__.py
modelexpress_client/python/modelexpress/engines/vllm/artifacts.py
modelexpress_client/python/modelexpress/engines/vllm/loader.py
modelexpress_client/python/modelexpress/metadata/artifact_transfer.py
modelexpress_client/python/modelexpress/metadata/heartbeat.py
modelexpress_client/python/modelexpress/metadata/publish.py
modelexpress_client/python/modelexpress/metadata/worker_server.py
modelexpress_client/python/tests/test_artifact_transfer.py
modelexpress_client/python/tests/test_heartbeat.py
modelexpress_client/python/tests/test_vllm_artifacts.py
modelexpress_client/python/tests/test_vllm_loader.py

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

nv-hwoo · 2026-06-25T21:51:35Z

A few questions:

Should we update the file name to publisher.py as well?

Should we leave the heartbeat thread as is and add the publisher thread as a subclass? That would keep the simple heartbeat thread for cases that do not involve artifact publishing jobs.

With artifact transfer enabled, does the source become READY only after all the artifacts have finished transferring?

nv-hwoo · 2026-06-25T22:54:13Z

+def _triton_cache_identity(ctx: LoadContext) -> p2p_pb2.SourceIdentity:
+    # Triton cache entries are self-keyed by source/options; share by runtime stack.
+    identity = p2p_pb2.SourceIdentity(
+        mx_source_type=p2p_pb2.MX_SOURCE_TYPE_TRITON_CACHE,
+        backend_framework=p2p_pb2.BACKEND_FRAMEWORK_VLLM,
+        cuda_version=torch.version.cuda or "",
+        triton_version=_triton_version(),
+        gpu_arch=_gpu_arch(ctx.device_id),
+    )
+    _set_extra_if_present(identity, "triton_key", _triton_key())
+    return identity
+
+
+def _deep_gemm_cache_identity(ctx: LoadContext) -> p2p_pb2.SourceIdentity:
+    # DeepGEMM cache entries are self-keyed by JIT source/compiler; share by runtime.
+    identity = p2p_pb2.SourceIdentity(
+        mx_source_type=p2p_pb2.MX_SOURCE_TYPE_DEEP_GEMM_CACHE,
+        backend_framework=p2p_pb2.BACKEND_FRAMEWORK_VLLM,
+        cuda_version=torch.version.cuda or "",
+        gpu_arch=_gpu_arch(ctx.device_id),
+    )
+    _set_extra_if_present(identity, "deep_gemm_jit_key", _deep_gemm_jit_key())
+    return identity


How do these two identities work without specifying model_name, as the torch.compile identity does? It seems the server rejects any identity with an empty model name.

nv-hwoo · 2026-06-26T19:12:08Z

+    # vLLM JIT cache artifacts are pod-scoped, so one successful install per pod
+    # is enough for all local workers.
+    marker_path = _artifact_marker_path(transfer, identity, "install")
+    with _artifact_lock(marker_path):


On success this is intended: other TP ranks block until the cache has been installed once per pod. But when a transfer fails after discovery (the source went STALE, or an RDMA error occurred mid-stream), the install marker is never written, the lock is released, and the next blocked rank re-runs the full transfer from scratch. With N TP ranks this serializes into up to N x 120s of blocked load_model before all ranks give up (for TP=8, that is 8 x 120s = 16 min).

Fix suggestion: Mitigate with a pod-level negative-result marker (short TTL), a hard cap of one transfer attempt per pod, or a shorter timeout on the install path.
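
To make the first option concrete, here is a rough sketch of a pod-level negative-result marker with a short TTL. The marker file names, the 60s TTL, and the `install_once` helper are all hypothetical; the caller is assumed to already hold the pod-level lock, as in the diff above.

```python
import time
from pathlib import Path


def recent_failure(marker, ttl_seconds, now=None):
    """True if a failure marker exists and is younger than the TTL."""
    now = time.time() if now is None else now
    try:
        age = now - marker.stat().st_mtime
    except FileNotFoundError:
        return False
    return age < ttl_seconds


def install_once(marker_dir, transfer, ttl_seconds=60.0):
    """Run transfer() unless this pod installed already or failed recently."""
    ok = Path(marker_dir, "install.ok")
    failed = Path(marker_dir, "install.failed")
    if ok.exists():
        return "cached"
    if recent_failure(failed, ttl_seconds):
        return "skipped"  # another rank just failed; do not wait out the timeout again
    try:
        transfer()
    except Exception:
        failed.touch()  # records the failure time via the file's mtime
        return "failed"
    ok.touch()
    return "installed"
```

With this shape, the first rank pays the transfer timeout and the ranks queued behind it return "skipped" almost immediately, so the worst case should be roughly one timeout per pod rather than one per rank.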

nv-hwoo · 2026-06-26T19:14:36Z

+        self._cleanup()
+        logger.info(f"[Worker {self._worker_rank}] Publisher thread stopped")

    def _on_exit(self) -> None:


Correctness warning: _on_exit (atexit) sets _stop_event then calls _mark_stale() -> _update_status(SOURCE_STATUS_STALE) WITHOUT joining the worker thread -- unlike stop(), which joins first (line 137). The daemon _run/_tick thread can already be past its _stop_event.is_set() checks (lines 251 and 263) and then send _update_status(SOURCE_STATUS_READY) at line 265 after the STALE, leaving the source advertised READY until the server-side reaper TTL. That directly defeats the documented guarantee 'sends UpdateStatus(STALE) for immediate detection without waiting for the reaper timeout' (docstring lines 30-32), which is exactly the P2P failover latency this project cares about. The window is narrow but real and the fix is cheap: join the thread (bounded) in _on_exit, or serialize _mark_stale against _update_status(READY) so STALE always wins.
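
The "STALE always wins" option could look roughly like the sketch below. `StatusGate` is a hypothetical wrapper, not the `PublisherThread` API; the `send` callable stands in for `_update_status`.

```python
import threading


class StatusGate:
    """Serializes status updates so that STALE, once sent, is terminal."""

    def __init__(self, send):
        self._send = send  # callable that delivers a status to the server
        self._lock = threading.Lock()
        self._stale = False

    def send_ready(self):
        with self._lock:
            if self._stale:
                return False  # a late tick must not overwrite STALE
            self._send("READY")
            return True

    def mark_stale(self):
        with self._lock:
            self._stale = True
            self._send("STALE")
```

One trade-off: because the lock is held across the send, an atexit handler calling `mark_stale()` can wait for an in-flight READY update to finish, so the underlying RPC should carry a timeout.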

nv-hwoo · 2026-06-26T19:17:20Z

+        marker_path.unlink(missing_ok=True)
+
+
+def _install_vllm_cache_artifact_once(


Security warning: _install_vllm_cache_artifact_once -> transfer.install(header) unpacks a tar fetched from a peer replica into the live torch_compile / triton / deep_gemm cache dirs, which the engine then loads and executes (TorchInductor generated code + .so, Triton .cubin/.ptx, DeepGEMM JIT kernels). The integrity chain (_crc32c_hex per chunk + artifact_manifest_id sha256 recomputed in _validate_fetched_artifact_manifest) only proves the received bytes match what the source advertised; it does NOT authenticate the source or attest the content. Any replica (or a compromised MX metadata server) able to publish a source with a matching SourceIdentity can therefore ship arbitrary executable code to every target that pulls it -> RCE across the deployment. It is opt-in (MX_ARTIFACT_TRANSFER=1) and identity-matched, but identity-match is not authorization. Document this trust boundary in docs/DEPLOYMENT.md and docs/ARCHITECTURE.md, require the worker gRPC + MX endpoints to be network-isolated to the trusted deployment, and consider artifact signing/attestation.
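
As a starting point for the signing idea, a shared-key HMAC over the artifact bytes would let a target reject payloads from publishers that do not hold the key. This is only a sketch under that assumption: the function names are hypothetical, key distribution is out of scope, and a shared key authenticates key holders rather than attesting the content.

```python
import hashlib
import hmac


def sign_artifact(key, payload):
    """Return a hex HMAC-SHA256 tag over the artifact bytes."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_artifact(key, payload, tag):
    """Constant-time check that the tag was produced with the same key."""
    return hmac.compare_digest(sign_artifact(key, payload), tag)
```

The check would have to run before the tar is unpacked into any live cache directory for it to help.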

zhengluo-nv temporarily deployed to GITLAB June 23, 2026 23:06 — with GitHub Actions Inactive

pull-request-size Bot added the size/XXL label Jun 23, 2026

copy-pr-bot Bot temporarily deployed to automated-release June 23, 2026 23:06 Inactive

github-actions Bot added the feat label Jun 23, 2026

zhengluo-nv marked this pull request as ready for review June 23, 2026 23:11

coderabbitai Bot reviewed Jun 23, 2026


zhengluo-nv self-assigned this Jun 24, 2026

zhengluo-nv had a problem deploying to GITLAB June 24, 2026 17:19 — with GitHub Actions Error

copy-pr-bot Bot had a problem deploying to automated-release June 24, 2026 17:19 Error

zhengluo-nv force-pushed the zheluo/artifact-transfer-pr-4-vllm-loader branch from 11a10c4 to bb0afaa Compare June 24, 2026 17:20

zhengluo-nv temporarily deployed to GITLAB June 24, 2026 17:20 — with GitHub Actions Inactive

copy-pr-bot Bot temporarily deployed to automated-release June 24, 2026 17:20 Inactive

zhengluo-nv temporarily deployed to GITLAB June 24, 2026 19:04 — with GitHub Actions Inactive

copy-pr-bot Bot temporarily deployed to automated-release June 24, 2026 19:04 Inactive

ganeshku1 requested a review from nv-hwoo June 24, 2026 20:17

zhengluo-nv added 4 commits June 24, 2026 15:10

feat: integrate vLLM artifact transfer

e244fc5

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

fix: address vLLM artifact review comments

adfd7be

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

fix: refine vLLM artifact cache identity coverage

e70059e

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

fix: clarify artifact rank discovery semantics

695d385

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

zhengluo-nv force-pushed the zheluo/artifact-transfer-pr-4-vllm-loader branch from cefea86 to 695d385 Compare June 24, 2026 22:10

zhengluo-nv temporarily deployed to GITLAB June 24, 2026 22:10 — with GitHub Actions Inactive

copy-pr-bot Bot temporarily deployed to automated-release June 24, 2026 22:10 Inactive

Merge branch 'main' into zheluo/artifact-transfer-pr-4-vllm-loader

5c122d8

zhengluo-nv temporarily deployed to GITLAB June 25, 2026 21:05 — with GitHub Actions Inactive

copy-pr-bot Bot temporarily deployed to automated-release June 25, 2026 21:05 Inactive

nv-hwoo reviewed Jun 26, 2026

zhengluo-nv wants to merge 5 commits into main from zheluo/artifact-transfer-pr-4-vllm-loader
