feat: decoupled replay -- flow recording and independent NS3 replay#291
Open
yanzhenghao wants to merge 32 commits into
- Fix curl global init thread safety: use singleton CurlGlobalManager
- Fix cross-rack detection: use global_rank_rack_map_ instead of gpus_per_server_
- Initialize WorkloadConfig members with default values
- Optimize dependency tracking from O(n²) to O(n) using map lookup
- Add error return values to OxcFlowOutput functions
- Rename static debug counters for clarity
- Add DP workload test file
- Update design document with Mermaid diagrams
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove git submodules (SimCCL/aicb/ns-3-alibabacloud): all code in one repo
- Add ranks field to LogItem CSV output with participating GPU rank IDs
- Add sidecar _rank_mapping.csv with full rank group decomposition
- Add rank_mapper.py: CommGroup-to-RankGenerator token bridge (7 group types)
- Add _fill_ranks() in WorkloadGenerator for automatic rank population
- Add Domain Flow Graph + Domain MsgSize Bar visualization charts
- Add per-rank CSV generation script (generate_per_rank_csv.py)
- Add 15 unit tests (rank_mapper + LogItem serialization)
- Fix LRA gate to support multiple concurrent in_progress features
- Deep Interview spec + Ralplan consensus plan artifacts
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- v3 field mapping: node_ip→node_id, a_node_ip→a_node_id, port_infos→port_id_list, port_name→port_id, server_type→chassis_topo
- merger.py: N:N topology (OXC→spine→leaf fan-out, server→N leaves)
- ns3_emitter.py: bandwidth from port_id (800GE→800Gbps), NPU from chassis_topo
- edg_client.py: spine-aware mock crosses and smart adjustment
- HomePage.tsx: frontend v3 auto-detect (server IP, bandwidth, NPU type)
- lld_to_topology.py: v3 visualization with IP-based cell IDs
- SimAI.conf: /etc→/tmp paths, +800Gbps rate map
- SimAI_training_workload_generator.py: fix get_model_details() model→self.model
- Tests: 99/99 pass, NS3 verified 8/16/32 GPU ALLREDUCE with AIOB workload
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
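A minimal sketch of how a link rate could be derived from a port_id, as in the 800GE→800Gbps mapping above. The port_id layout (a speed token such as `800GE` embedded in the string) and the helper name `bandwidth_from_port_id` are assumptions for illustration, not taken from ns3_emitter.py.

```python
import re
from typing import Optional

# Assumption: the port_id embeds a speed token such as "800GE" or "400GE".
_SPEED_TOKEN = re.compile(r"(\d+)GE")


def bandwidth_from_port_id(port_id: str) -> Optional[str]:
    """Return a rate string such as '800Gbps', or None if no speed token is found."""
    match = _SPEED_TOKEN.search(port_id)
    if match is None:
        return None
    return f"{int(match.group(1))}Gbps"
```

Returning None instead of a default rate keeps an unrecognized port_id visible to the caller rather than silently assigning a bandwidth.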
Replace spine_to_leaves[spine_ip] with spine_port_to_leaf[(spine_ip, spine_port)] for exact per-port edge resolution from LLDP data. No hardcoded formulas. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
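A rough sketch of the per-port lookup described above. Only the key shape `(spine_ip, spine_port)` comes from the commit message; the edge tuple layout and the function name `build_spine_port_to_leaf` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical edge shape: (spine_ip, spine_port, leaf_id), as read from LLDP data.
Edge = Tuple[str, str, str]


def build_spine_port_to_leaf(edges: Iterable[Edge]) -> Dict[Tuple[str, str], str]:
    """Map each (spine_ip, spine_port) pair to the leaf attached to that port."""
    mapping: Dict[Tuple[str, str], str] = {}
    for spine_ip, spine_port, leaf_id in edges:
        mapping[(spine_ip, spine_port)] = leaf_id
    return mapping


edges = [
    ("10.0.1.1", "p0", "leaf-a"),
    ("10.0.1.1", "p1", "leaf-b"),
    ("10.0.1.2", "p0", "leaf-a"),
]
spine_port_to_leaf = build_spine_port_to_leaf(edges)
```

Keying on the port as well as the IP means two ports on the same spine resolve to different leaves, which a per-spine list cannot express without an ordering convention.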
…lient Apply re.sub(r"\(\d+\)$", "", node_id) in _build_edge_maps, resolve_paths, _mock_baseline_crosses, and _smart_adjustment so OXC node_id "IP(0)" matches edge a_node_id "IP(0)" regardless of format. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
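The normalization above can be sketched as a small helper. The regular expression is the one quoted in the commit message; the helper name `normalize_node_id` and the sample IDs are illustrative assumptions.

```python
import re

# Pattern quoted in the commit message: a trailing "(digits)" suffix.
_INDEX_SUFFIX = re.compile(r"\(\d+\)$")


def normalize_node_id(node_id: str) -> str:
    """Strip a trailing "(N)" suffix so suffixed and bare IDs compare equal."""
    return _INDEX_SUFFIX.sub("", node_id)
```

Because the pattern is anchored with `$`, parentheses elsewhere in the ID are left untouched.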
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Migrate lld.json/init_crosses.json from per-session workspace to global
EDG_DATA_ROOT/{topology_dir}/ storage. wizard-store adds zustand/persist
so that EDG graph data survives page refreshes.
- server/config.py: add EDG_DATA_ROOT config
- server/edg/routes.py: _edg_global_dir() + _edg_load() with global-first,
workspace-fallback strategy. init writes to both stores.
- edg-api.ts + EdgPage.tsx: pass topologyDir to baseline-graph/register-task
- wizard-store.ts: persist EDG graph data to localStorage
- feature_list.json: F091 added, F090 marked done
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v3 lld server node_id is a name (superpod#0_server#0), not an IP. Rename across 10 files: frontend types/stores/api/pages + backend routes/merger/tests. npu_match server_ip field preserved (external EDG protocol contract). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- network-store setActiveNetwork now resets wizard store and clears ocs-sim-wizard localStorage to prevent stale graph data leakage
- lld_to_topology.py now detects group_id from lld.json and generates per-group pod XMLs instead of 1 pod per input file (8 pods for 8 groups)
- Fix pre-existing ntype→node_type variable reference bug in generate_pod_xml / generate_pod_xml_with_crosses
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Major changes:
- lld.json: rewrite to 8NPU×8port per server topology (64 leaf, 512 edges)
- ns3_emitter.py: each NPU→all leaves (not round-robin 1:1)
- merger.py: fix IP string sort→numeric sort for leaf ordering
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename (v3 lld uses names not IPs)
- network-store.ts: localStorage migration + network switch reset
- routes.py: global EDG store + server_ids params
- Various OXC/NS3 C++ fixes from previous sessions
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
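To illustrate the string-sort versus numeric-sort issue for leaf ordering, here is a minimal sketch. It assumes leaf identifiers are IPv4 strings; the function name `sort_leaf_ips` is hypothetical and is not the code in merger.py.

```python
import ipaddress
from typing import Iterable, List


def sort_leaf_ips(ips: Iterable[str]) -> List[str]:
    """Order IPv4 strings by numeric value rather than lexicographically."""
    return sorted(ips, key=ipaddress.IPv4Address)


leaves = ["10.0.0.10", "10.0.0.2", "10.0.0.1"]
lexicographic = sorted(leaves)      # "10.0.0.10" sorts before "10.0.0.2"
numeric = sort_leaf_ips(leaves)     # "10.0.0.2" sorts before "10.0.0.10"
```

A plain string sort compares character by character, so any topology with ten or more leaves in one subnet gets a leaf order that does not match the address order.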
- ns3_emitter.py: add unfolded mode (explicit spine/OXC switch nodes) via optional lld param. Leaf→Spine + Spine→OXC replaces Leaf↔Leaf. Backward compatible: no lld = folded mode.
- merger.py: fix IP string sort→numeric sort for leaf ordering
- lld.json: rewrite to 8NPU×8port×8leaf per server topology
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename + localStorage migration
Unfolded topology: 35 nodes (16NPU+16Leaf+2Spine+1OXC), 202 links.
Cross-server path: NPU→Leaf→Spine→OXC→Spine→Leaf→NPU (7 hops).
Single OXC avoids NS3 multi-path routing loops.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ment + WorkloadPage stale EDG cleanup
- ProcessList: add View cmd / View error expandable buttons with return_code + error_message display, color-coded status badges
- ns3_emitter: raise RuntimeError when lld has spine/OXC but unfolding produces zero links, so there is no more silent fallback to folded mode
- WorkloadPage: clear edgTopologyPath/BaselineGraph/AdjustedGraph/Diff on new workload generation to prevent stale EDG topology leak
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… inference, CP, Chakra, tensor graph
Complete AICB workload generator extensibility implementation (F093-F099):
F093: Model registry -- MODEL_REGISTRY dict + _bootstrap.py registration
replaces hardcoded if/elif dispatch in generate_megatron_workload.py
F094: LLaMA MockedModel -- MockedLlama.py (539 lines): GQA + SwiGLU + RMSNorm
pre-norm, reuses MegatronColumn/RowLinear for TP. Supports LLaMA
2/3/4 configs (7B through 70B, dense and MoE).
F095: Parameterized MoE routing -- --n_shared_expert moved from DeepSeek-only
to get_moe_params (all MoE models). MOEMLP shared expert computation.
F096: Qwen3 inference -- MockedQwen3Moe.py (344 lines, 8 classes) +
MockedQwen3Next.py (287 lines, 4 classes + GatedDeltaNet).
F097: Context Parallelism -- CommGroup.cp_group + ContextParallelRing (110
lines) for ring-attention isend/irecv between CP neighbors.
F098: Chakra output format -- ChakraWriter (178 lines) converts AICB LogItem
to MLCommons Chakra JSON schema (COMP_ONLY/COMM_COLL/COMM_SEND/COMM_RECV).
F099: Declarative tensor graph -- tensor_graph package (345 lines): TensorGraph
CSV load/dump, ReplicateGraph layer stacking, ConnectGraph port wiring.
SwiGLU FFN 8-line CSV template as proof-of-concept.
Also: registry.py (119 lines) + _bootstrap.py (103 lines) infrastructure,
--num_kv_heads CLI arg for GQA architectures,
test_registry.py and test_mocked_llama.py test files.
Research: research_aicb_extensibility.md (491 lines) -- STAGE paper analysis,
PARAM/Chakra comparison, 2025-2026 model parallel strategy survey.
21 files, +3662/-130
Co-Authored-By: Claude <noreply@anthropic.com>
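A minimal sketch of the registry pattern that F093 describes, where a name-to-factory dict replaces an if/elif dispatch. Only the name `MODEL_REGISTRY` comes from the commit message; the `register_model` decorator, `build_model`, and `ToyLlama` are hypothetical stand-ins, not the contents of registry.py or _bootstrap.py.

```python
from typing import Callable, Dict

# Name-to-factory table; replaces a chain of if/elif checks on the model name.
MODEL_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_model(name: str):
    """Hypothetical decorator that records a model factory under `name`."""
    def decorator(factory):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = factory
        return factory
    return decorator


def build_model(name: str, **kwargs):
    """Look up the factory by name and call it with the given arguments."""
    try:
        factory = MODEL_REGISTRY[name]
    except KeyError:
        known = sorted(MODEL_REGISTRY)
        raise ValueError(f"unknown model {name!r}; known models: {known}") from None
    return factory(**kwargs)


@register_model("llama")
class ToyLlama:
    """Placeholder model used only to exercise the registry."""

    def __init__(self, num_layers: int = 2):
        self.num_layers = num_layers
```

With this shape, adding a model means registering one more factory instead of editing the dispatch code, and an unknown name fails with a message that lists the registered names.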
…g, Qwen3 inference, CP, Chakra, tensor graph" This reverts commit 8604a3c.
- Fetch simulation progress via fetchProgress API for running processes
- Show progress bar with percentage, layer count, and ETA
- Extract and display workload filename from command line (-w argument)
- Restructure layout into two-row format (status+PID+buttons / progress bar)
- Add formatETA and extractWorkloadName helper functions
Co-Authored-By: Claude <noreply@anthropic.com>
All three use LLaMA-compatible architecture (RMSNorm + SwiGLU + GQA + RoPE), reuse existing MegatronModel workload generator. Verified parameters from deep-research HF config.json analysis. Co-Authored-By: Claude <noreply@anthropic.com>
…letion-based timing
Adds flow recording instrumentation for NS3 decoupled replay:
- MockNcclGroup: flow buffer accumulation, send-time & completion-time recording, deferred finalizeFlowFile() with completion-based relative_delay_ns
- AstraSimNetwork.cc: recordFlowSendTime in sim_send, explicit finalize in main
- Sys.cc: finalizeFlowFile call in destructor (analytical mode safety net)
- common.h: relative_delay_ns field in FlowRecord
- scripts/check-common-h-consistency.sh: CI diff-check for 3 common.h copies
relative_delay_ns = send_time - max(prev completion times), clamped to 0.
No flow-level dependency graph -- causality fully encoded in timing.
Co-Authored-By: Claude <noreply@anthropic.com>
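The timing rule above can be written out as a short sketch. The formula and the clamp to 0 come from the commit message; the function name, the nanosecond integer types, and the behaviour for a flow with no resolved predecessors (using the absolute send time) are assumptions for illustration.

```python
from typing import Sequence


def relative_delay_ns(send_time_ns: int, prev_completion_ns: Sequence[int]) -> int:
    """Delay of a flow relative to the latest completion among its predecessors."""
    if not prev_completion_ns:
        # Assumption: with no predecessors the delay is the absolute send time.
        return send_time_ns
    # send_time - max(prev completion times), clamped to 0.
    return max(0, send_time_ns - max(prev_completion_ns))
```

For example, a flow sent at 1500 ns whose predecessors completed at 1000 ns and 1200 ns gets a delay of 300 ns, while a flow sent before its latest predecessor completed is clamped to 0.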
Independent binary (scratch/decoupled_replay/): 8 files, 2,596 lines
- Zero SimAI linkage (nm check target)
- DepScheduler with prev[] dependency graph + layer_num constraint
- flow_reader.h parses complete 21-field format
- Whitelisted in scratch/.gitignore
CI: scripts/check-common-h-consistency.sh
Co-Authored-By: Claude <noreply@anthropic.com>
Plan and all documentation reference scratch_decoupled_replay. CMakeLists.txt had scratch_decoupled_replay_main which would break nm verification commands. Co-Authored-By: Claude <noreply@anthropic.com>
prev[] contains rank IDs (0,1,2...) but _flow_completion_times is keyed by flow ID (10423,10424...). The naive _flow_completion_times.count(pid) always failed for ring allreduce flows, causing every flow to fall back to relative_delay_ns = absolute send_time. Build a per-rank sorted map of (flow_id, completion_time) pairs, then resolve each prev rank ID to the most recent predecessor flow from that rank via lower_bound. This correctly computes relative_delay_ns as send_time - max(predecessor completion times).
Co-Authored-By: Claude <noreply@anthropic.com>
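The fix described above can be sketched with Python's `bisect` module, which plays the role of `lower_bound`. The record shape, the choice to index flows by source rank, and the reading of "most recent predecessor" as the largest flow ID below the current one are assumptions; the function names are hypothetical and do not mirror MockNcclGroup.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Optional, Tuple

# Hypothetical record shape: (flow_id, src_rank, completion_time_ns).
FlowRecord = Tuple[int, int, int]
RankIndex = DefaultDict[int, List[Tuple[int, int]]]


def build_rank_index(records: Iterable[FlowRecord]) -> RankIndex:
    """Group (flow_id, completion_time_ns) pairs per rank, sorted by flow_id."""
    index: RankIndex = defaultdict(list)
    for flow_id, src_rank, completion_ns in records:
        index[src_rank].append((flow_id, completion_ns))
    for pairs in index.values():
        pairs.sort()
    return index


def resolve_predecessor(index: RankIndex, prev_rank: int, flow_id: int) -> Optional[int]:
    """Completion time of the latest flow from prev_rank with an ID below flow_id."""
    pairs = index.get(prev_rank, [])
    # (flow_id,) sorts before any (flow_id, t), so this is the first entry >= flow_id.
    pos = bisect_left(pairs, (flow_id,))
    if pos == 0:
        return None  # no earlier flow from that rank
    return pairs[pos - 1][1]


records = [(10423, 0, 500), (10424, 1, 650), (10431, 0, 900)]
index = build_rank_index(records)
```

Looking up a rank ID directly in a table keyed by flow ID can never match, which is why the earlier check always missed; resolving the rank to a concrete flow first makes the completion time reachable.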
Cherry-picked reverted commits from submodule reflog:
- feat: decoupled replay Phase 2 (1598bbc)
- fix: GPUType enum, fct_writer format string (f0e19bb)
- refactor: inline SendFlow, remove _QPS_PER_CONNECTION_ (6843092)
- fix: sequential step numbering in main.cc (64a3613)
Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Implements decoupled replay: SimAI captures flow metadata and timing during coupled simulation, then an independent NS3 binary replays from the flow file without linking any SimAI code.
SimAI side
Independent binary
8 files under ns-3-alibabacloud/simulation/scratch/decoupled_replay/. SetConfig() is called explicitly after ReadConf() because the independent binary has no SimAI framework to apply NS3 defaults (QCN, PFC thresholds, CC mode).
Scheduling: layer constraint (hard gate) + relative_delay_ns (soft gate). No flow-level dependency graph. Causality fully encoded in completion-based timing from Phase 1.
Co-Authored-By: Claude noreply@anthropic.com