test(ci): add fleet scale test for row 20 (~15 workers, CRD + P2P metadata) #458
Draft
tanushriya910 wants to merge 1 commit into
…adata) Adds the small-fleet scale test from TEST_PLAN row 20. Uses a raw Kubernetes Deployment (no Dynamo), no pod anti-affinity so workers pack onto a100a MIG slices, and NIXL over TCP+CUDA (NIXL_UCX_TLS=tcp,cuda_copy) because MIG slices GPU but not RDMA — each node has one IB NIC, so requesting rdma/ib: 1 per pod would cap schedulable replicas to the node count. Plain "tcp" without cuda_copy causes UCX to treat GPU tensors as host memory and fall back to HF disk load. Scales in waves (1->5->10->15) to localise failures. Asserts at least fleet_size CRs (>= to tolerate Deployment surge pods), all CRs Ready, >= 14 P2P transfers, and one inference request served.
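The replica cap described above is simple arithmetic, sketched here for illustration. The pool shape (3 nodes, 7 MIG slices per node, 1 IB NIC per node) is a hypothetical example and not taken from the actual a100a pool; the function name is likewise made up for this sketch.

```python
def max_schedulable_pods(nodes: int, mig_slices_per_node: int,
                         ib_nics_per_node: int, request_rdma: bool) -> int:
    """Upper bound on pods the pool can schedule, one MIG slice per pod."""
    gpu_cap = nodes * mig_slices_per_node
    if not request_rdma:
        # TCP transport: only the MIG slice count limits scheduling.
        return gpu_cap
    # Each pod also requests one rdma/ib device, so the NIC count
    # becomes the tighter per-node limit.
    return min(gpu_cap, nodes * ib_nics_per_node)


# Hypothetical pool: 3 nodes, 7 MIG slices each, 1 IB NIC each.
print(max_schedulable_pods(3, 7, 1, request_rdma=True))   # 3
print(max_schedulable_pods(3, 7, 1, request_rdma=False))  # 21
```

With one NIC per node, requesting rdma/ib: 1 pins the ceiling to the node count regardless of how many MIG slices are free.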
a6787be to 23bad18
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Adds the small-fleet scale test from ci/TEST_PLAN.md row 20 — a realistic scale test that runs ~15 vLLM workers sharing weights over ModelExpress P2P, all coordinated through the Kubernetes CRD metadata backend.
Approach
Raw Deployment — uses a plain Deployment + Service; the first worker to start loads from HF and becomes the source, the rest discover it via mx-server and pull via NIXL.
No pod anti-affinity — workers pack onto a100a MIG slices on shared nodes, which is what makes ~15 workers feasible on a pool with far fewer physical nodes.
NIXL over TCP (NIXL_UCX_TLS=tcp,cuda_copy) — MIG slices the GPU but not RDMA: each a100a node has a single IB NIC, so requesting rdma/ib: 1 per pod would cap schedulable replicas to the node count. TCP sidesteps that. cuda_copy is required alongside tcp so UCX stages GPU tensors through the CPU rather than treating them as host memory (without it, every non-source worker silently falls back to HF disk load).
Wave scaling (1→5→10→15) — the action scales the Deployment in waves and waits for the CR count at each step, so a failure localises to a specific wave rather than the full fleet at once.
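A minimal sketch of the wave loop, for reviewers who want the control flow at a glance. This is not the code in action.yml (which is a composite action); `scale_in_waves`, `set_replicas`, and `count_crs` are hypothetical names, and the two callables stand in for whatever the action uses to scale the Deployment and count CRs. The timeout and poll values are placeholders.

```python
import time
from typing import Callable, Sequence


def scale_in_waves(
    waves: Sequence[int],
    set_replicas: Callable[[int], None],   # hypothetical: scales the Deployment
    count_crs: Callable[[], int],          # hypothetical: counts CRs in the cluster
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Scale to each wave target in turn, waiting for the CR count to catch up."""
    for target in waves:
        set_replicas(target)
        deadline = clock() + timeout_s
        while True:
            seen = count_crs()
            # >= rather than ==, so surge pods do not stall the wave.
            if seen >= target:
                break
            if clock() >= deadline:
                # The failing wave is named, which is the point of scaling in steps.
                raise TimeoutError(
                    f"wave {target}: only {seen} CRs after {timeout_s}s"
                )
            sleep(poll_s)
```

Called with `waves=[1, 5, 10, 15]`, a timeout surfaces as an error naming the wave that stalled, so a failure at 10 is not confused with one at 15.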
What's included
.github/actions/run-mx-fleet-test/action.yml — composite action: deploys mx-server + fleet, waits for the source, scales in waves, runs pytest, cleans up.
ci/k8s/client/vllm/manifest-azure-fleet.yaml — the fleet Deployment + Service manifest.
ci/k8s/client/test_fleet_scale.py — asserts CR count, all CRs Ready, ≥ fleet_size - 1 P2P transfers, and end-to-end inference.
test-fleet-scale job wired into the workflow and the ci-status-check gate.
ci/TEST_PLAN.md row 20 → In CI.
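The count-based assertions listed for test_fleet_scale.py can be sketched as a pure function. This is an illustration under assumptions, not the test file itself: the CR shape (dicts with `name` and `ready` keys), the `check_fleet` helper, and the way the P2P transfer count is obtained are all hypothetical, and the end-to-end inference check is left out.

```python
from typing import Dict, List


def check_fleet(crs: List[Dict], fleet_size: int, p2p_transfers: int) -> List[str]:
    """Return a list of failed checks; an empty list means the fleet passes."""
    problems = []

    # At least fleet_size CRs; >= tolerates Deployment surge pods.
    if len(crs) < fleet_size:
        problems.append(f"expected >= {fleet_size} CRs, found {len(crs)}")

    # Every CR must be Ready (hypothetical 'ready' flag on each CR dict).
    not_ready = [cr["name"] for cr in crs if not cr.get("ready")]
    if not_ready:
        problems.append(f"CRs not Ready: {', '.join(not_ready)}")

    # Every worker except the source should have pulled weights over P2P.
    if p2p_transfers < fleet_size - 1:
        problems.append(
            f"expected >= {fleet_size - 1} P2P transfers, found {p2p_transfers}"
        )
    return problems


# Hypothetical fleet of 15 Ready workers with 14 P2P transfers.
fleet = [{"name": f"worker-{i}", "ready": True} for i in range(15)]
assert check_fleet(fleet, fleet_size=15, p2p_transfers=14) == []
```

Collecting the failures into a list rather than asserting one by one means a single run reports every check that failed.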
Testing
Passing in CI (run).