test(ci): add fleet scale test for row 20 (~15 workers, CRD + P2P metadata)#458

Draft
tanushriya910 wants to merge 1 commit into main from tanushriyas/fleet-scale-ci-test

tanushriya910 commented Jun 27, 2026

Summary

Adds the small-fleet scale test from ci/TEST_PLAN.md row 20, a realistic-scale run of ~15 vLLM workers that share weights over ModelExpress P2P, all coordinated through the Kubernetes CRD metadata backend.
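A minimal sketch of the source-or-peer decision each worker makes, assuming a metadata store that can atomically claim a single "source" slot. `HypotheticalRegistry`, `claim_source` and `plan_weight_load` are illustrative names, not the mx-server or CRD backend API; the lock stands in for whatever atomicity the real backend provides.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HypotheticalRegistry:
    """Stand-in for the metadata service (NOT the real mx-server API).

    The lock makes "claim the source slot if nobody holds it" atomic, which
    is what stops two workers from both deciding they are the source.
    """
    _lock: threading.Lock = field(default_factory=threading.Lock)
    _source: Optional[str] = None

    def claim_source(self, worker: str) -> str:
        """Return the current source, claiming the slot for `worker` if empty."""
        with self._lock:
            if self._source is None:
                self._source = worker
            return self._source


def plan_weight_load(registry: HypotheticalRegistry, worker: str) -> str:
    """Decide how `worker` obtains weights: load from HF or pull from a peer."""
    source = registry.claim_source(worker)
    if source == worker:
        return "load-from-hf"          # first worker: loads from HF, becomes the source
    return f"p2p-pull-from:{source}"   # later workers: pull from the source via NIXL
```

A real peer would also need to wait for the source to finish loading before pulling; the sketch only shows who takes which role.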

Approach

Raw Deployment: uses a plain Deployment + Service. The first worker to start loads from HF and becomes the source; the rest discover it via mx-server and pull via NIXL.
No pod anti-affinity: workers pack onto a100a MIG slices on shared nodes, which is what makes ~15 workers feasible on a pool with far fewer physical nodes.
NIXL over TCP (NIXL_UCX_TLS=tcp,cuda_copy): MIG slices the GPU but not RDMA. Each a100a node has a single IB NIC, so requesting rdma/ib: 1 per pod would cap schedulable replicas at the node count; TCP sidesteps that. cuda_copy is required alongside tcp so UCX stages GPU tensors through the CPU rather than treating them as host memory (without it, every non-source worker silently falls back to HF disk load).
Wave scaling (1→5→10→15): the action scales the Deployment in waves and waits for the CR count at each step, so a failure localises to a specific wave rather than the full fleet at once.
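The wave-scaling loop can be sketched as follows, assuming two injected hooks: `scale_to` (for example, a wrapper around `kubectl scale`) and `count_crs` (a wrapper around listing the metadata CRs). Both hook names and the timeout and poll defaults are hypothetical, not taken from action.yml.

```python
import time
from typing import Callable, Sequence


def scale_in_waves(
    waves: Sequence[int],
    scale_to: Callable[[int], None],
    count_crs: Callable[[], int],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
) -> None:
    """Scale up wave by wave, blocking until each wave's CRs appear.

    The hooks are injected so the control flow can be exercised without a
    cluster; the real action drives the Deployment and CRD directly.
    """
    for replicas in waves:
        scale_to(replicas)
        deadline = time.monotonic() + timeout_s
        seen = count_crs()
        while seen < replicas:
            if time.monotonic() >= deadline:
                # Failing here pins the problem to this wave, not the whole fleet.
                raise TimeoutError(f"wave {replicas}: saw {seen} CRs after {timeout_s}s")
            time.sleep(poll_s)
            seen = count_crs()
```

Waiting on `seen < replicas` rather than `seen != replicas` means surge pods that briefly push the CR count above the target do not stall a wave.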

What's included

.github/actions/run-mx-fleet-test/action.yml — composite action: deploys mx-server + fleet, waits for the source, scales in waves, runs pytest, cleans up.
ci/k8s/client/vllm/manifest-azure-fleet.yaml — the fleet Deployment + Service manifest.
ci/k8s/client/test_fleet_scale.py — asserts CR count, all CRs Ready, ≥ fleet_size - 1 P2P transfers, and end-to-end inference.
test-fleet-scale job wired into the workflow and the ci-status-check gate.
ci/TEST_PLAN.md row 20 → In CI.
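A sketch of the fleet-level checks, assuming each worker's CR has been flattened into a hypothetical `CRStatus` record with `ready` and `loaded_via_p2p` flags. The real test_fleet_scale.py reads these from the CRD and also sends an inference request, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CRStatus:
    """Hypothetical flattened view of one worker's metadata CR."""
    name: str
    ready: bool
    loaded_via_p2p: bool


def check_fleet(crs: Iterable[CRStatus], fleet_size: int) -> List[str]:
    """Return failure messages; an empty list means the fleet passes."""
    items = list(crs)
    failures: List[str] = []
    # >= rather than == so Deployment surge pods don't fail the check.
    if len(items) < fleet_size:
        failures.append(f"expected >= {fleet_size} CRs, found {len(items)}")
    not_ready = [cr.name for cr in items if not cr.ready]
    if not_ready:
        failures.append(f"CRs not Ready: {not_ready}")
    # Every worker except the HF-loading source should have pulled over P2P.
    p2p = sum(1 for cr in items if cr.loaded_via_p2p)
    if p2p < fleet_size - 1:
        failures.append(f"expected >= {fleet_size - 1} P2P transfers, found {p2p}")
    return failures
```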

Testing

Passing in CI (run).

Adds the small-fleet scale test from TEST_PLAN row 20. Uses a raw
Kubernetes Deployment (no Dynamo), no pod anti-affinity so workers pack
onto a100a MIG slices, and NIXL over TCP+CUDA (NIXL_UCX_TLS=tcp,cuda_copy)
because MIG slices the GPU but not RDMA: each node has one IB NIC, so
requesting rdma/ib: 1 per pod would cap schedulable replicas at the node
count. Plain "tcp" without cuda_copy causes UCX to treat GPU tensors as
host memory, and workers fall back to HF disk load.
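The replica cap described above is simple arithmetic. In this sketch the function name and the example numbers are illustrative: an A100 supports up to seven MIG slices, but the pool's actual MIG profile and node count are assumptions here. Only the NIXL_UCX_TLS value comes from this PR.

```python
def max_schedulable_replicas(
    nodes: int,
    mig_slices_per_node: int,
    ib_nics_per_node: int,
    request_rdma: bool,
) -> int:
    """Upper bound on worker pods the scheduler can place (illustrative).

    Each pod needs one MIG slice. If each pod also requests rdma/ib: 1,
    the per-node IB NIC count becomes a second cap, so with one NIC per
    node the fleet can be no larger than the node count.
    """
    per_node = mig_slices_per_node
    if request_rdma:
        per_node = min(per_node, ib_nics_per_node)
    return nodes * per_node


# Transport list used by this PR: TCP for the wire, cuda_copy so UCX
# handles GPU buffers instead of treating them as host memory.
NIXL_ENV = {"NIXL_UCX_TLS": "tcp,cuda_copy"}
```

With a hypothetical 4 nodes of 7 slices each, requesting RDMA caps the fleet at 4 pods, while the TCP path allows up to 28.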

Scales in waves (1->5->10->15) to localise failures. Asserts at least
fleet_size CRs (>= to tolerate Deployment surge pods), all CRs Ready,
>= 14 P2P transfers, and one inference request served.

tanushriya910 force-pushed the tanushriyas/fleet-scale-ci-test branch from a6787be to 23bad18 on June 27, 2026 01:38
copy-pr-bot deployed to automated-release on June 27, 2026 01:38
copy-pr-bot temporarily deployed to automated-release on June 27, 2026 01:38

codecov commented Jun 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
