Environment
- OS: Windows 11
- GPU: NVIDIA RTX 2050
- Backend: ggml-Vulkan (portable driver)
- Model: Qwen3-0.6B
Symptom
Calling ctx.fork() in an inferlet causes all subsequent generation from the forked context to produce incoherent/garbage output, even with a single fork and no concurrency. Plain Context::new() without fork works correctly with the same model.
Root Cause
AuxServer::handle_command_ in driver/portable/src/aux_server.cpp (line 294) implements CopyD2D by round-tripping each KV page through a host buffer via ggml_backend_tensor_get / ggml_backend_tensor_set with a non-zero byte offset (src_off = pair.src * page_bytes). On ggml-Vulkan, partial tensor reads at non-zero offsets appear to return zeros or garbage, silently corrupting the copied KV pages. The comment in the code already acknowledges this is not universally supported across backends.
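For readers without the source open, the copy pattern described above has roughly the following shape. This is a minimal host-only sketch, not the code in aux_server.cpp: FakeTensor, fake_tensor_get, fake_tensor_set, PagePair, and copy_pages_via_host are hypothetical stand-ins, and the two accessors only mirror the (tensor, data, offset, size) argument order of ggml_backend_tensor_get / ggml_backend_tensor_set. The destination offset is assumed to be computed the same way as src_off.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a KV-cache tensor: one flat byte buffer.
struct FakeTensor {
    std::vector<std::uint8_t> bytes;
};

// Stand-in for ggml_backend_tensor_get(tensor, data, offset, size).
inline void fake_tensor_get(const FakeTensor& t, void* data,
                            std::size_t offset, std::size_t size) {
    assert(offset + size <= t.bytes.size());
    std::memcpy(data, t.bytes.data() + offset, size);
}

// Stand-in for ggml_backend_tensor_set(tensor, data, offset, size).
inline void fake_tensor_set(FakeTensor& t, const void* data,
                            std::size_t offset, std::size_t size) {
    assert(offset + size <= t.bytes.size());
    std::memcpy(t.bytes.data() + offset, data, size);
}

// Hypothetical (source page, destination page) index pair.
struct PagePair {
    std::size_t src;
    std::size_t dst;
};

// Copy each page through a host staging buffer. Every read except the one for
// page 0 uses a non-zero byte offset, the case that appears to fail on Vulkan.
inline void copy_pages_via_host(FakeTensor& kv,
                                const std::vector<PagePair>& pairs,
                                std::size_t page_bytes) {
    std::vector<std::uint8_t> staging(page_bytes);
    for (const PagePair& pair : pairs) {
        const std::size_t src_off = pair.src * page_bytes;
        const std::size_t dst_off = pair.dst * page_bytes;  // assumed symmetric
        fake_tensor_get(kv, staging.data(), src_off, page_bytes);
        fake_tensor_set(kv, staging.data(), dst_off, page_bytes);
    }
}
```

On the host stand-in the copy is exact; the report is that the equivalent offset read on ggml-Vulkan does not return the source page's bytes.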
The Vulkan-specific failure is consistent with ctx.fork() working correctly on Metal: ggml_backend_tensor_get with a non-zero offset works on Metal/CUDA but not on Vulkan.
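One way to confirm the offset hypothesis on a given backend is to compare an offset read against the same bytes sliced out of a full read. The helper below is a sketch under assumptions: ReadFn and offset_read_matches_full_read are hypothetical names, the callable is expected to wrap ggml_backend_tensor_get for one fixed tensor, and a full read starting at offset 0 is assumed to be reliable on the backend under test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Callable with the argument order (data, offset, size). On a real backend it
// would wrap ggml_backend_tensor_get for one fixed tensor.
using ReadFn = std::function<void(void* data, std::size_t offset, std::size_t size)>;

// Returns true when reading one page at its byte offset yields the same bytes
// as slicing that page out of a full read that starts at offset 0.
inline bool offset_read_matches_full_read(const ReadFn& read,
                                          std::size_t total_bytes,
                                          std::size_t page_bytes,
                                          std::size_t page) {
    assert(page_bytes > 0);
    const std::size_t off = page * page_bytes;
    if (off + page_bytes > total_bytes) {
        return false;  // requested page lies outside the tensor
    }

    std::vector<std::uint8_t> full(total_bytes);
    read(full.data(), 0, total_bytes);

    std::vector<std::uint8_t> partial(page_bytes);
    read(partial.data(), off, page_bytes);

    return std::memcmp(full.data() + off, partial.data(), page_bytes) == 0;
}
```

If the check passes for page 0 but fails for every other page on Vulkan while passing everywhere on Metal or CUDA, that would support the root cause stated above.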
Workaround
Avoid ctx.fork() entirely: create a fresh context with Context::new() for each branch and replay the prior turns as text. Verified working at num_branches=2 (8 concurrent leaves), 512 tokens/step. Branch: fix/tot-fork-corruption.
Suggested Fix
For the Vulkan backend, either:
- (a) Fall back to CopyD2H + CopyH2D (round-trip through CPU swap pool) instead of the direct CopyD2D path
- (b) Fix ggml_backend_tensor_get offset support in ggml-Vulkan upstream
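Option (a) could be structured along the following lines. This is a host-only model, not a patch: Backend, CopyPath, pick_copy_path, SwapPool, copy_d2h, copy_h2d, and fork_page_via_swap are hypothetical names, and the real CopyD2H / CopyH2D handlers are not reproduced here. It also assumes those handlers do not themselves rely on non-zero-offset reads on Vulkan, which would need to be verified first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical backend tag; the real driver may identify backends differently.
enum class Backend { Metal, Cuda, Vulkan };

// Hypothetical selector between the direct path and the swap-pool round trip.
enum class CopyPath { DirectD2D, ViaSwapPool };

inline CopyPath pick_copy_path(Backend backend) {
    return backend == Backend::Vulkan ? CopyPath::ViaSwapPool
                                      : CopyPath::DirectD2D;
}

// Host-side model of a CPU swap pool: slot i occupies
// [i * page_bytes, (i + 1) * page_bytes) in `slots`.
struct SwapPool {
    std::size_t page_bytes;
    std::vector<std::uint8_t> slots;
};

// Models CopyD2H: device page -> swap slot.
inline void copy_d2h(const std::vector<std::uint8_t>& device, std::size_t page,
                     SwapPool& pool, std::size_t slot) {
    assert((page + 1) * pool.page_bytes <= device.size());
    assert((slot + 1) * pool.page_bytes <= pool.slots.size());
    std::memcpy(pool.slots.data() + slot * pool.page_bytes,
                device.data() + page * pool.page_bytes, pool.page_bytes);
}

// Models CopyH2D: swap slot -> device page.
inline void copy_h2d(const SwapPool& pool, std::size_t slot,
                     std::vector<std::uint8_t>& device, std::size_t page) {
    assert((page + 1) * pool.page_bytes <= device.size());
    assert((slot + 1) * pool.page_bytes <= pool.slots.size());
    std::memcpy(device.data() + page * pool.page_bytes,
                pool.slots.data() + slot * pool.page_bytes, pool.page_bytes);
}

// Two-step replacement for a single CopyD2D of one page.
inline void fork_page_via_swap(std::vector<std::uint8_t>& device, SwapPool& pool,
                               std::size_t src_page, std::size_t dst_page,
                               std::size_t slot) {
    copy_d2h(device, src_page, pool, slot);
    copy_h2d(pool, slot, device, dst_page);
}
```

Keying the choice on the backend leaves the existing CopyD2D path untouched on Metal and CUDA, where fork is reported to work, at the cost of an extra host copy per page on Vulkan.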