Background
The subvm-opt branch (4 commits on top of master, ending in 306b8640 perf(nub-arch-x86): cache per-frame PT + bufs across re-entries) cut sub_vm_recurse from ~30 µs per VM to ~10 µs per entry through:
Page-aligned DataCap storage (db83db56)
Per-Image compiled arena (9d2f76ca)
Per-Image template PT for the arena PD subtree (9758b11f)
Per-frame caching of PT + mem/ctx/stack PageBufs across re-entries (306b8640)
The original plan bundled two further pieces that were deferred because they (a) don't help sub_vm_recurse (the bench guest has zero DataCap mappings) and (b) need correctness work that benefits from its own planning pass. This issue tracks them.
Goal
Two largely orthogonal items, both relevant for real workloads (Cap services with persistent state, multi-Instance images) even though they're invisible on the spawn-recurse microbench: direct DataCap mapping with CoW and HALT auto-mint (A and B below), and global page bits (C).
A. Direct DataCap → PT mapping (RO + CoW + HALT auto-mint)
Today build_frame_from_image (rust/nub-arch-x86/src/call_loop.rs:563-581) walks an image's mappings, resolves each via the per-frame cnode to a Cap::Data, and copies the bytes into overlays: Vec<(u32, Vec<u8>)>. On entry, build_frame_runtime (rust/nub-arch-x86/src/jit_run.rs:305-322) memcpys each overlay into the per-frame mem_buf. For a service with a multi-MiB state region the per-entry cost is dominated by this memcpy even though the cap's pages already live in shared STATE_CACHE_VA memory.
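To make the per-entry cost concrete, the overlay copy described above can be modeled roughly as follows. This is a minimal sketch, not the code in jit_run.rs: the function name apply_overlays, the Overlay alias, and the reading of the u32 as a byte offset into mem_buf are all assumptions made for illustration.

```rust
/// One overlay: (offset into mem_buf, bytes copied out of a Cap::Data).
/// Treating the u32 as a byte offset is an assumption of this sketch.
type Overlay = (u32, Vec<u8>);

/// Copy every overlay into the per-frame memory buffer.
/// Returns the number of bytes copied, or None as soon as an overlay
/// does not fit (earlier overlays may already have been copied).
fn apply_overlays(mem_buf: &mut [u8], overlays: &[Overlay]) -> Option<usize> {
    let mut copied = 0;
    for (offset, bytes) in overlays {
        let start = *offset as usize;
        let end = start.checked_add(bytes.len())?;
        // This copy is the per-entry cost that scales with the state size.
        mem_buf.get_mut(start..end)?.copy_from_slice(bytes);
        copied += bytes.len();
    }
    Some(copied)
}
```

The work is linear in the total overlay size on every entry, which is why a multi-MiB state region dominates the entry cost.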
Goal: map the cap's pages directly into the ring-3 PT.
paging::va_to_pa (rust/nub-arch-x86/src/paging.rs:90-99) currently handles kernel-high and scratch ranges only. Add a third regime for [STATE_CACHE_VA, STATE_CACHE_VA + STATE_CACHE_SIZE) mapped to [STATE_CACHE_GPA, STATE_CACHE_GPA + STATE_CACHE_SIZE).
Refactor KernelFrame.overlays into three lists: ro_mappings: Vec<(u32, CapHash)> (pinned slots, mapped Perm::user_ro()); rw_mappings: Vec<RwMapping> (unpinned slots, mapped RO initially with CoW); and eph_mappings: Vec<(u32, u32)> (Ephemeral, kernel-allocated zero pages, mapped RW).
build_frame_runtime calls pt.map over each cap page with the right perms. Refcount-bump the source pages so eviction can't free them mid-call.
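The third va_to_pa regime from the first step above could look something like the sketch below. Every constant value here is a placeholder, and the Option<u64> return type and the translate helper are assumptions; the real function in paging.rs may have a different signature and different ranges.

```rust
// Placeholder layout values, chosen only so the three ranges do not overlap.
const KERNEL_HIGH_VA: u64 = 0xFFFF_FFFF_8000_0000;
const KERNEL_HIGH_SIZE: u64 = 0x4000_0000;
const KERNEL_HIGH_PA: u64 = 0x0;

const SCRATCH_VA: u64 = 0x0000_0040_0000_0000;
const SCRATCH_SIZE: u64 = 0x0020_0000;
const SCRATCH_PA: u64 = 0x1000_0000;

const STATE_CACHE_VA: u64 = 0x0000_0080_0000_0000;
const STATE_CACHE_SIZE: u64 = 0x1000_0000;
const STATE_CACHE_GPA: u64 = 0x2_0000_0000;

/// Translate `va` if it lies in [va_base, va_base + size); hypothetical helper.
fn translate(va: u64, va_base: u64, size: u64, pa_base: u64) -> Option<u64> {
    let off = va.checked_sub(va_base)?;
    (off < size).then(|| pa_base + off)
}

/// Linear translation over the two existing regimes plus the new
/// state-cache regime. Returns None for addresses outside all three.
pub fn va_to_pa(va: u64) -> Option<u64> {
    translate(va, KERNEL_HIGH_VA, KERNEL_HIGH_SIZE, KERNEL_HIGH_PA)
        .or_else(|| translate(va, SCRATCH_VA, SCRATCH_SIZE, SCRATCH_PA))
        .or_else(|| translate(va, STATE_CACHE_VA, STATE_CACHE_SIZE, STATE_CACHE_GPA))
}
```

With a translation like this in place, build_frame_runtime can turn a cap page's address inside the state cache into the physical address it hands to pt.map.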
B. CoW #PF handler + HALT auto-mint
Extension of (A) for RW mappings: spec §2 says HALT publishes a fresh DataCap containing the modified contents. We currently never re-mint, so guest writes are silently dropped on HALT (the recurse bench doesn't write, so this isn't observable there).
Extend jit_pf_handler (rust/nub-arch-x86/src/jit_run.rs:87-134): on a write fault inside a registered cow_range, allocate a fresh page, memcpy the original into it, point the PTE at the new PA and mark it writable, invlpg, record the page on the frame's dirty_pages list, and retry the instruction. Faults outside cow ranges fall through to the existing JIT-window handling.
Static-atomic plumbing to publish the active frame's cow_ranges to the handler, alongside the existing JIT_CODE_BASE etc.
In pop_and_reflect (rust/nub-arch-x86/src/call_loop.rs:351-367): when popping a frame with non-empty dirty_pages, construct a new Cap::Data whose content is the page-by-page merge of (original cap pages, replaced by dirty pages where present), publish via Cache::put_cap (the in-kernel publish path is already wired — see 0e559e56 feat(nub-arch-x86): in-kernel cap publish via shared CacheDirectory), update the parent's cnode slot to the new hash.
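The copy-on-write and merge steps above can be modeled in user space as follows. This is only a sketch of the data flow: CowMapping and its methods are hypothetical names, base_va is assumed page-aligned, and the PTE update, invlpg, refcounting, and Cache::put_cap call are all left out.

```rust
use std::collections::BTreeMap;

const PAGE_SIZE: usize = 4096;

/// Toy model of one RW mapping: the source cap's pages (never mutated)
/// plus the private copies made on first write. Hypothetical type.
struct CowMapping {
    base_va: u64,
    original: Vec<Vec<u8>>,
    dirty_pages: BTreeMap<usize, Vec<u8>>, // page index -> private copy
}

impl CowMapping {
    fn new(base_va: u64, contents: &[u8]) -> Self {
        // Split the cap contents into zero-padded pages.
        let original: Vec<Vec<u8>> = contents
            .chunks(PAGE_SIZE)
            .map(|c| {
                let mut page = vec![0u8; PAGE_SIZE];
                page[..c.len()].copy_from_slice(c);
                page
            })
            .collect();
        CowMapping { base_va, original, dirty_pages: BTreeMap::new() }
    }

    fn page_index(&self, va: u64) -> Option<usize> {
        let off = va.checked_sub(self.base_va)? as usize;
        let idx = off / PAGE_SIZE;
        (idx < self.original.len()).then_some(idx)
    }

    /// Write-fault path: the first write to a page copies the original,
    /// later writes reuse the copy. None means "not a CoW range", so the
    /// caller would fall through to its other fault handling.
    fn on_write_fault(&mut self, va: u64) -> Option<&mut Vec<u8>> {
        let idx = self.page_index(va)?;
        let src = &self.original[idx];
        Some(self.dirty_pages.entry(idx).or_insert_with(|| src.clone()))
    }

    /// Stand-in for the retried store instruction.
    fn store(&mut self, va: u64, value: u8) -> bool {
        let off = (va.wrapping_sub(self.base_va) as usize) % PAGE_SIZE;
        match self.on_write_fault(va) {
            Some(page) => {
                page[off] = value;
                true
            }
            None => false,
        }
    }

    /// HALT-time merge: original pages, replaced by dirty pages where present.
    fn merged_contents(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.original.len() * PAGE_SIZE);
        for (idx, page) in self.original.iter().enumerate() {
            out.extend_from_slice(self.dirty_pages.get(&idx).unwrap_or(page));
        }
        out
    }
}
```

In this model the merged bytes are what a re-minted Cap::Data would carry, while the original pages stay untouched for any other frame that still maps them.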
C. Global page bits
Per-Image template PT pages (rust/nub-arch-x86/src/paging.rs:191-...TemplatePT) hold leaf PTEs that are identical across every Instance of an Image. Setting the G bit on those PTEs would keep them resident across the per-call mov cr3, ... (rust/nub-arch-x86/src/ring3.rs:119) instead of being TLB-flushed.
Mark the template PT's leaf PTEs global using Perm::user_ro_global() / Perm::user_rx_global() (flag::G = 1 << 8); per-call CTX/MEM/STACK stay non-global.
Correctness gotcha: if Image A's template maps VA range V to PA X, and Image B's template (different image_hash, different arena, different PA X') is then installed at the same VA on a fresh PT, the stale global TLB entries from A still resolve to X. This needs a discipline (e.g. INVLPG over the template VA range on per-call install) or per-PCID global. The ~1 µs CR3-reload saving has to justify that complexity.
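The stale-entry hazard can be reproduced with a toy TLB model. The sketch below is not kernel code: Tlb, TlbEntry, and the pte helper are hypothetical, the page table is a plain map from page-aligned VA to PTE, and the model assumes CR4.PGE is set and PCID is off, so a CR3 reload drops only non-global entries while invlpg drops the entry for one page regardless of its G bit.

```rust
use std::collections::HashMap;

// Architectural x86-64 PTE bits.
const PTE_P: u64 = 1 << 0; // present
const PTE_US: u64 = 1 << 2; // user-accessible
const PTE_G: u64 = 1 << 8; // global
const ADDR_MASK: u64 = 0x000F_FFFF_FFFF_F000;

/// Build a user-mode leaf PTE for `pa`, optionally global.
fn pte(pa: u64, global: bool) -> u64 {
    let g = if global { PTE_G } else { 0 };
    (pa & ADDR_MASK) | PTE_P | PTE_US | g
}

struct TlbEntry {
    pa: u64,
    global: bool,
}

/// Toy TLB keyed by page-aligned virtual address.
#[derive(Default)]
struct Tlb {
    entries: HashMap<u64, TlbEntry>,
}

impl Tlb {
    /// Translate through the TLB, filling from `pt` on a miss.
    fn translate(&mut self, pt: &HashMap<u64, u64>, va: u64) -> Option<u64> {
        let page = va & !0xFFF;
        if let Some(e) = self.entries.get(&page) {
            return Some(e.pa); // hit: the page table is not consulted
        }
        let entry = *pt.get(&page)?;
        if entry & PTE_P == 0 {
            return None;
        }
        let pa = entry & ADDR_MASK;
        self.entries.insert(page, TlbEntry { pa, global: entry & PTE_G != 0 });
        Some(pa)
    }

    /// mov cr3 without PCID: only non-global entries are dropped.
    fn reload_cr3(&mut self) {
        self.entries.retain(|_, e| e.global);
    }

    /// invlpg: drops the entry for one page, global or not.
    fn invlpg(&mut self, va: u64) {
        self.entries.remove(&(va & !0xFFF));
    }
}
```

Translating a VA through Image A's table, reloading CR3, and then translating the same VA through Image B's table still yields A's physical address in this model; only after invlpg on that VA does the lookup reach B's PTE.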
Why deferred from subvm-opt
sub_vm_recurse's bench guest has zero DataCap mappings, so (A) and (B) don't move the headline number. The original perf motivation for (A), eliminating per-call PT construction, was already absorbed by the per-frame PT cache landed in 306b8640. The ~1 µs CR3-reload win from (C) is within bench noise.
These remain real correctness + general-perf wins for production workloads with persistent state and high call rates, but the scope (multi-commit PT refactor, new #PF path, HALT auto-mint semantics, CoW lifecycle) merits a focused effort separate from the spawn-recurse perf push.
Out of scope
PCID. Would let global pages avoid the cross-Image stale-entry problem cleanly, but lands in a separate optimization pass.
Disk-persistent CoW (snapshot dirty pages to the persistent cache before HALT publishes the re-minted DataCap). The in-memory cache rebuild via Cache::put_cap is sufficient for V1.
Multi-threaded host. V1's unsafe impl Sync relies on Hyperlight's serialization, which holds for the current single-threaded model; multi-threaded PT sharing across instances needs the refcounted PDPT design noted in the deferred plan.
Related
subvm-opt (4 commits, not yet PR'd): supersedes nothing; this issue picks up after it merges.
cargo bench -p javm-bench --bench sub_vm_recurse: depth=10 currently ~198 µs (was ~304 µs at master).