nub-arch-x86: direct DataCap → PT mapping (RO + CoW + HALT auto-mint), global PT entries #855

@sorpaas

Background

The subvm-opt branch (4 commits on top of master, ending in 306b8640 perf(nub-arch-x86): cache per-frame PT + bufs across re-entries) cut sub_vm_recurse from ~30 µs per VM to ~10 µs/entry through:

  1. Page-aligned DataCap storage (db83db56)
  2. Per-Image compiled arena (9d2f76ca)
  3. Per-Image template PT for the arena PD subtree (9758b11f)
  4. Per-frame caching of PT + mem/ctx/stack PageBufs across re-entries (306b8640)

The original plan bundled two further pieces that were deferred because they (a) don't help sub_vm_recurse (the bench guest has zero DataCap mappings) and (b) need correctness work that benefits from its own planning pass. This issue tracks them.

Goal

Two largely orthogonal items, direct DataCap mapping (A, with its RW extension B) and global page bits (C). Both are relevant for real workloads (Cap services with persistent state, multi-Instance images) even though they're invisible on the spawn-recurse microbench:

A. Direct DataCap → PT mapping (RO + CoW + HALT auto-mint)

Today build_frame_from_image (rust/nub-arch-x86/src/call_loop.rs:563-581) walks an image's mappings, resolves each via the per-frame cnode to a Cap::Data, and copies the bytes into overlays: Vec<(u32, Vec<u8>)>. On entry, build_frame_runtime (rust/nub-arch-x86/src/jit_run.rs:305-322) memcpys each overlay into the per-frame mem_buf. For a service with a multi-MiB state region the per-entry cost is dominated by this memcpy even though the cap's pages already live in shared STATE_CACHE_VA memory.

Goal: map the cap's pages directly into the ring-3 PT.

  • paging::va_to_pa (rust/nub-arch-x86/src/paging.rs:90-99) currently handles kernel-high and scratch ranges only. Add a third regime for [STATE_CACHE_VA, STATE_CACHE_VA + STATE_CACHE_SIZE) mapped to [STATE_CACHE_GPA, STATE_CACHE_GPA + STATE_CACHE_SIZE).
  • Refactor KernelFrame.overlays into ro_mappings: Vec<(u32, CapHash)> (pinned slots, mapped Perm::user_ro()), rw_mappings: Vec<RwMapping> (unpinned slots, mapped RO initially with CoW), eph_mappings: Vec<(u32, u32)> (Ephemeral, kernel-allocated zero pages, mapped RW).
  • build_frame_runtime calls pt.map over each cap page with the right perms. Refcount-bump the source pages so eviction can't free them mid-call.
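
As a rough illustration of the third va_to_pa regime and the per-page walk that build_frame_runtime would do, here is a minimal sketch. The constant values, the helper names state_cache_va_to_pa and cap_page_mappings, and the 4 KiB page size are assumptions made for the example; the real constants and the pt.map signature live in nub-arch-x86 and are not shown in this issue.

```rust
// Hypothetical values for illustration only; the real constants are defined
// elsewhere in nub-arch-x86.
const STATE_CACHE_VA: u64 = 0xFFFF_9000_0000_0000;
const STATE_CACHE_GPA: u64 = 0x0000_0001_0000_0000;
const STATE_CACHE_SIZE: u64 = 64 << 20; // 64 MiB
const PAGE_SIZE: u64 = 4096;

/// Translate a VA inside [STATE_CACHE_VA, STATE_CACHE_VA + STATE_CACHE_SIZE)
/// to the matching guest-physical address. Returns None for any other VA so
/// the caller can fall through to the existing kernel-high / scratch regimes.
fn state_cache_va_to_pa(va: u64) -> Option<u64> {
    let off = va.checked_sub(STATE_CACHE_VA)?;
    if off < STATE_CACHE_SIZE {
        Some(STATE_CACHE_GPA + off)
    } else {
        None
    }
}

/// For a cap whose bytes start at `cap_va` in the state cache, list the
/// (guest VA, PA) pair for each page that would be handed to `pt.map`.
/// Returns None if any page falls outside the window or the guest VA overflows.
fn cap_page_mappings(guest_base: u32, cap_va: u64, len: u64) -> Option<Vec<(u32, u64)>> {
    let pages = len.div_ceil(PAGE_SIZE);
    (0..pages)
        .map(|i| {
            let pa = state_cache_va_to_pa(cap_va + i * PAGE_SIZE)?;
            let guest_off = u32::try_from(i * PAGE_SIZE).ok()?;
            Some((guest_base.checked_add(guest_off)?, pa))
        })
        .collect()
}
```

The sketch returns Option rather than panicking so that an out-of-window cap can still take the existing copy path; whether the real code should fall back or treat that as a bug is left open here.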

B. CoW #PF handler + HALT auto-mint

Extension of (A) for RW mappings: spec §2 says HALT publishes a fresh DataCap containing the modified contents. We currently never re-mint, so guest writes are silently dropped on HALT (the recurse bench doesn't write, so this isn't observable there).

  • Extend jit_pf_handler (rust/nub-arch-x86/src/jit_run.rs:87-134): on a write fault inside a registered cow_range, allocate a fresh page, memcpy from the original, flip the PTE writable + new PA, invlpg, record the page on the frame's dirty_pages list, retry the instruction. Faults outside cow ranges fall through to the existing JIT-window handling.
  • Static-atomic plumbing to publish the active frame's cow_ranges to the handler, alongside the existing JIT_CODE_BASE etc.
  • In pop_and_reflect (rust/nub-arch-x86/src/call_loop.rs:351-367): when popping a frame with non-empty dirty_pages, construct a new Cap::Data whose content is the page-by-page merge of (original cap pages, replaced by dirty pages where present), publish via Cache::put_cap (the in-kernel publish path is already wired — see 0e559e56 feat(nub-arch-x86): in-kernel cap publish via shared CacheDirectory), update the parent's cnode slot to the new hash.
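
The two pure pieces of the steps above (deciding whether a fault belongs to the CoW path, and merging dirty pages back over the original cap bytes at HALT) can be sketched as follows. CowRange, cow_fault_page, and merge_dirty_pages are hypothetical names for illustration; the actual handler would also allocate the page, rewrite the PTE, and issue invlpg, none of which is modelled here.

```rust
use std::collections::BTreeMap;

const PAGE_SIZE: usize = 4096;

/// Hypothetical: a half-open guest VA range [start, end) registered as CoW.
#[derive(Clone, Copy)]
struct CowRange {
    start: u32,
    end: u32,
}

/// Return the page-aligned VA to copy if this is a write fault inside a
/// registered CoW range. None means "not ours": fall through to the
/// existing JIT-window handling.
fn cow_fault_page(ranges: &[CowRange], fault_va: u32, is_write: bool) -> Option<u32> {
    if !is_write {
        return None;
    }
    ranges
        .iter()
        .find(|r| r.start <= fault_va && fault_va < r.end)
        .map(|_| fault_va & !(PAGE_SIZE as u32 - 1))
}

/// Build the contents of the re-minted cap: the original bytes, with each
/// dirty page (keyed by page index) substituted where present. A trailing
/// partial page is truncated to the original length.
fn merge_dirty_pages(original: &[u8], dirty: &BTreeMap<usize, [u8; PAGE_SIZE]>) -> Vec<u8> {
    let mut out = original.to_vec();
    for (&idx, page) in dirty {
        let start = idx * PAGE_SIZE;
        if start >= out.len() {
            continue; // dirty page beyond the cap; ignored in this sketch
        }
        let end = (start + PAGE_SIZE).min(out.len());
        out[start..end].copy_from_slice(&page[..end - start]);
    }
    out
}
```

Keeping the merged length equal to the original is an assumption of the sketch; the issue does not say whether a re-minted DataCap may grow.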

C. Global page bits

Per-Image template PT pages (rust/nub-arch-x86/src/paging.rs:191-... TemplatePT) hold leaf PTEs that are identical across every Instance of an Image. Setting the G bit on those PTEs would keep them resident across the per-call mov cr3, ... (rust/nub-arch-x86/src/ring3.rs:119) instead of being TLB-flushed.

  • Add Perm::user_ro_global() / Perm::user_rx_global() (flag::G = 1 << 8).
  • Set CR4.PGE at boot in rust/nub-arch-x86/src/main.rs.
  • Mark the template PT's leaf PTEs global (per-call CTX/MEM/STACK stay non-global).
  • Correctness gotcha: if Image A's template covers VA range V at PA X, and Image B's template (different image_hash, different arena, different PA X') is then installed at the same VA on a fresh PT, the stale global TLB entries from A still resolve to X. This needs a discipline (e.g. INVLPG over the template VA range on per-call install) or per-PCID global. The ~1 µs CR3-reload saving has to justify that complexity.
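
For reference, a minimal sketch of what the global permission constructors might look like. The bit positions (P, U/S, G, NX in a PTE, and PGE in CR4) are the architectural x86-64 ones; the Perm newtype, the choice to set NX on read-only pages, and the leaf_pte helper are assumptions, since the real Perm layout in paging.rs is not shown in this issue.

```rust
/// Architectural x86-64 page-table entry flag bits.
mod flag {
    pub const P: u64 = 1 << 0; // present
    pub const US: u64 = 1 << 2; // user-accessible
    pub const G: u64 = 1 << 8; // global: survives non-PCID CR3 reloads
    pub const NX: u64 = 1 << 63; // no-execute
}

/// CR4.PGE (bit 7) must be set for the G bit to have any effect.
const CR4_PGE: u64 = 1 << 7;

/// Hypothetical stand-in for the crate's Perm type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Perm(u64);

impl Perm {
    const fn user_ro() -> Self {
        Perm(flag::P | flag::US | flag::NX)
    }
    const fn user_rx() -> Self {
        Perm(flag::P | flag::US)
    }
    const fn user_ro_global() -> Self {
        Perm(Self::user_ro().0 | flag::G)
    }
    const fn user_rx_global() -> Self {
        Perm(Self::user_rx().0 | flag::G)
    }
    const fn is_global(self) -> bool {
        (self.0 & flag::G) != 0
    }
}

/// Compose a 4 KiB leaf PTE from a page-aligned physical address and perms.
fn leaf_pte(pa: u64, perm: Perm) -> u64 {
    (pa & 0x000F_FFFF_FFFF_F000) | perm.0
}
```

Only the template PT's leaf entries would use the *_global constructors; per-call CTX/MEM/STACK pages would keep the non-global variants so the CR3 reload still flushes them.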

Why deferred from subvm-opt

sub_vm_recurse's bench guest has zero DataCap mappings, so (A) and (B) don't move the headline number. The original perf motivation for (A), eliminating per-call PT construction, was already absorbed by the per-frame PT cache that landed in 306b8640. The ~1 µs CR3-reload win from (C) is within bench noise.

These remain real correctness + general-perf wins for production workloads with persistent state and high call rates, but the scope (multi-commit PT refactor, new #PF path, HALT auto-mint semantics, CoW lifecycle) merits a focused effort separate from the spawn-recurse perf push.

Out of scope

  • PCID. Would let global pages avoid the cross-Image stale-entry problem cleanly, but lands in a separate optimization pass.
  • Disk-persistent CoW (snapshot dirty pages to the persistent cache before HALT publishes the re-minted DataCap). The in-memory cache rebuild via Cache::put_cap is sufficient for V1.
  • Multi-threaded host. V1's unsafe impl Sync relies on Hyperlight's serialization, which holds for the current single-threaded model; multi-threaded PT sharing across instances needs the refcounted PDPT design noted in the deferred plan.

Related

  • Cache compiled JIT code across nub-kernel invocations #844 — JIT compile cache (already landed): the per-Image arena builds on top of this.
  • Branch subvm-opt (4 commits, not yet PR'd) — supersedes nothing; this issue picks up after it merges.
  • Recurse bench: cargo bench -p javm-bench --bench sub_vm_recurse — depth=10 currently ~198 µs (was ~304 µs at master).
