nub-arch-x86: direct DataCap → PT mapping (RO + CoW + HALT auto-mint), global PT entries #855

## Background

The `subvm-opt` branch (4 commits on top of master, ending in `306b8640 perf(nub-arch-x86): cache per-frame PT + bufs across re-entries`) cut `sub_vm_recurse` from ~30 µs per VM to ~10 µs/entry through:

1. Page-aligned `DataCap` storage (`db83db56`)
2. Per-Image compiled arena (`9d2f76ca`)
3. Per-Image template PT for the arena PD subtree (`9758b11f`)
4. Per-frame caching of PT + mem/ctx/stack `PageBuf`s across re-entries (`306b8640`)

The original [plan](https://github.com/jarchain/jar/issues/new) bundled two further pieces that were deferred because they (a) don't help `sub_vm_recurse` (the bench guest has zero DataCap mappings) and (b) need correctness work that benefits from its own planning pass. This issue tracks them.

## Goal

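Two largely orthogonal items: direct `DataCap` mapping with CoW and HALT auto-mint (sections A and B), and global page bits (section C). Both are relevant for real workloads (Cap services with persistent state, multi-Instance images) even though they're invisible on the spawn-recurse microbench: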

### A. Direct `DataCap` → PT mapping (RO + CoW + HALT auto-mint)

Today `build_frame_from_image` ([`rust/nub-arch-x86/src/call_loop.rs:563-581`](rust/nub-arch-x86/src/call_loop.rs#L563-L581)) walks an image's `mappings`, resolves each via the per-frame cnode to a `Cap::Data`, and copies the bytes into `overlays: Vec<(u32, Vec<u8>)>`. On entry, `build_frame_runtime` ([`rust/nub-arch-x86/src/jit_run.rs:305-322`](rust/nub-arch-x86/src/jit_run.rs#L305-L322)) memcpys each overlay into the per-frame `mem_buf`. For a service with a multi-MiB state region the per-entry cost is dominated by this memcpy even though the cap's pages already live in shared `STATE_CACHE_VA` memory.

Goal: map the cap's pages **directly** into the ring-3 PT.

- `paging::va_to_pa` ([`rust/nub-arch-x86/src/paging.rs:90-99`](rust/nub-arch-x86/src/paging.rs#L90-L99)) currently handles kernel-high and scratch ranges only. Add a third regime for `[STATE_CACHE_VA, STATE_CACHE_VA + STATE_CACHE_SIZE)` mapped to `[STATE_CACHE_GPA, STATE_CACHE_GPA + STATE_CACHE_SIZE)`.
- Refactor `KernelFrame.overlays` into three lists:
  - `ro_mappings: Vec<(u32, CapHash)>`: pinned slots, mapped `Perm::user_ro()`.
  - `rw_mappings: Vec<RwMapping>`: unpinned slots, mapped RO initially with CoW.
  - `eph_mappings: Vec<(u32, u32)>`: Ephemeral, kernel-allocated zero pages, mapped RW.
- `build_frame_runtime` calls `pt.map` over each cap page with the right perms. Refcount-bump the source pages so eviction can't free them mid-call.
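The third `va_to_pa` regime could look roughly like the sketch below. Everything except the `STATE_CACHE_*` names is an assumption: the constant values, the `translate` helper, and the `Option<u64>` return type are hypothetical placeholders, and the real function in `paging.rs` may be shaped differently.

```rust
// Hypothetical layout constants, chosen only so the three ranges do not overlap.
// The real values live in the crate and are not reproduced here.
const KERNEL_HIGH_VA: u64 = 0xFFFF_8000_0000_0000;
const KERNEL_HIGH_SIZE: u64 = 0x4000_0000;
const KERNEL_HIGH_PA: u64 = 0;

const SCRATCH_VA: u64 = 0x0000_0040_0000_0000;
const SCRATCH_SIZE: u64 = 0x0020_0000;
const SCRATCH_PA: u64 = 0x4000_0000;

const STATE_CACHE_VA: u64 = 0x0000_0080_0000_0000;
const STATE_CACHE_SIZE: u64 = 0x1000_0000;
const STATE_CACHE_GPA: u64 = 0x8000_0000;

/// Linear translation for one window: `[va_base, va_base + size)` maps to
/// `[pa_base, pa_base + size)`. Returns `None` outside the window.
fn translate(va: u64, va_base: u64, size: u64, pa_base: u64) -> Option<u64> {
    // `checked_sub` rejects addresses below the window without underflow.
    let off = va.checked_sub(va_base)?;
    (off < size).then(|| pa_base + off)
}

/// Sketch of `va_to_pa` with the proposed third regime for the state cache.
pub fn va_to_pa(va: u64) -> Option<u64> {
    translate(va, KERNEL_HIGH_VA, KERNEL_HIGH_SIZE, KERNEL_HIGH_PA)
        .or_else(|| translate(va, SCRATCH_VA, SCRATCH_SIZE, SCRATCH_PA))
        // New: cap pages already resident in the shared state cache.
        .or_else(|| translate(va, STATE_CACHE_VA, STATE_CACHE_SIZE, STATE_CACHE_GPA))
}
```

The upper bound is exclusive, so `STATE_CACHE_VA + STATE_CACHE_SIZE` itself does not translate.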

### B. CoW #PF handler + HALT auto-mint

Extension of (A) for RW mappings: spec §2 says HALT publishes a fresh `DataCap` containing the modified contents. We currently never re-mint, so guest writes are silently dropped on HALT (the recurse bench doesn't write, so this isn't observable there).

- Extend `jit_pf_handler` ([`rust/nub-arch-x86/src/jit_run.rs:87-134`](rust/nub-arch-x86/src/jit_run.rs#L87-L134)): on a write fault inside a registered `cow_range`, allocate a fresh page, memcpy from the original, flip the PTE writable + new PA, `invlpg`, record the page on the frame's `dirty_pages` list, retry the instruction. Faults outside cow ranges fall through to the existing JIT-window handling.
- Static-atomic plumbing to publish the active frame's `cow_ranges` to the handler, alongside the existing `JIT_CODE_BASE` etc.
- In `pop_and_reflect` ([`rust/nub-arch-x86/src/call_loop.rs:351-367`](rust/nub-arch-x86/src/call_loop.rs#L351-L367)): when popping a frame with non-empty `dirty_pages`, construct a new `Cap::Data` whose content is the page-by-page merge of (original cap pages, replaced by dirty pages where present), publish via `Cache::put_cap` (the in-kernel publish path is already wired — see `0e559e56 feat(nub-arch-x86): in-kernel cap publish via shared CacheDirectory`), update the parent's cnode slot to the new hash.
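As a user-space model of the steps above, the sketch below tracks CoW ranges and dirty pages, then performs the page-by-page merge at HALT. `CowFrame`, `on_write_fault`, and `merge_on_halt` are hypothetical names for illustration only; the real handler would also rewrite the PTE and issue `invlpg`, which is only noted in a comment here.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];

/// Hypothetical per-frame CoW state; field names follow the issue text.
pub struct CowFrame {
    /// Guest VA ranges backed by RW mappings (mapped RO until first write).
    pub cow_ranges: Vec<Range<u32>>,
    /// Private page copies, keyed by page-aligned guest VA.
    pub dirty_pages: BTreeMap<u32, Box<Page>>,
}

impl CowFrame {
    /// Models the write-fault path. Returns `true` if the fault hit a CoW
    /// range and was resolved (the caller would retry the instruction), or
    /// `false` so the caller can fall through to other fault handling.
    pub fn on_write_fault(&mut self, fault_va: u32, original: &Page) -> bool {
        if !self.cow_ranges.iter().any(|r| r.contains(&fault_va)) {
            return false;
        }
        let page_va = fault_va & !(PAGE_SIZE as u32 - 1);
        // Copy only on the first write to this page.
        self.dirty_pages
            .entry(page_va)
            .or_insert_with(|| Box::new(*original));
        // A real handler would now point the PTE at the copy, set the
        // writable bit, and invalidate the TLB entry for `page_va`.
        true
    }
}

/// Page-by-page merge at HALT: dirty copies replace original pages where
/// present. `base_va` is the guest VA of `original_pages[0]`.
pub fn merge_on_halt(
    base_va: u32,
    original_pages: &[Page],
    dirty: &BTreeMap<u32, Box<Page>>,
) -> Vec<u8> {
    let mut out = Vec::with_capacity(original_pages.len() * PAGE_SIZE);
    for (i, page) in original_pages.iter().enumerate() {
        let va = base_va + (i * PAGE_SIZE) as u32;
        match dirty.get(&va) {
            Some(d) => out.extend_from_slice(&d[..]),
            None => out.extend_from_slice(page),
        }
    }
    out
}
```

The merged bytes would then be the content of the new `Cap::Data` handed to the publish path; untouched pages come straight from the original cap, so a frame that never writes produces identical content.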

### C. Global page bits

Per-Image template PT pages ([`rust/nub-arch-x86/src/paging.rs:191-...`](rust/nub-arch-x86/src/paging.rs#L191) `TemplatePT`) hold leaf PTEs that are identical across every Instance of an Image. Setting the G bit on those PTEs would keep their TLB entries resident across the per-call `mov cr3, ...` ([`rust/nub-arch-x86/src/ring3.rs:119`](rust/nub-arch-x86/src/ring3.rs#L119)) instead of having them flushed on every CR3 reload.

- Add `Perm::user_ro_global()` / `Perm::user_rx_global()` (`flag::G = 1 << 8`).
- Set CR4.PGE at boot in [`rust/nub-arch-x86/src/main.rs`](rust/nub-arch-x86/src/main.rs).
- Mark the template PT's leaf PTEs global (per-call CTX/MEM/STACK stay non-global).
- **Correctness gotcha**: suppose Image A's template maps VA range V to PA X, and Image B's template (different `image_hash`, different arena, different PA X') is then installed at the same VA on a fresh PT. The stale global TLB entries from A still resolve to X. This needs a discipline (e.g. `INVLPG` over the template VA range on per-call install) or a PCID-based scheme. The ~1 µs CR3-reload saving has to justify that complexity.
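A minimal sketch of the proposed permission constructors follows. The bit positions are the architectural x86-64 ones (PTE bit 8 is G, CR4 bit 7 is PGE), but the `Perm` newtype layout, the `global` and `is_global` helpers, and the exact flag sets behind `user_ro` / `user_rx` are assumptions; the crate's actual `Perm` may be structured differently.

```rust
/// x86-64 page-table entry flag bits (architectural positions).
pub mod flag {
    pub const P: u64 = 1 << 0; // present
    pub const W: u64 = 1 << 1; // writable
    pub const U: u64 = 1 << 2; // user-accessible
    pub const G: u64 = 1 << 8; // global: TLB entry survives CR3 reloads
    pub const NX: u64 = 1 << 63; // no-execute
}

/// CR4.PGE (bit 7) must be set at boot for the G bit to take effect.
pub const CR4_PGE: u64 = 1 << 7;

/// Hypothetical stand-in for the crate's `Perm` type: a raw PTE flag mask.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Perm(pub u64);

impl Perm {
    pub const fn user_ro() -> Self {
        Perm(flag::P | flag::U | flag::NX)
    }

    pub const fn user_rx() -> Self {
        Perm(flag::P | flag::U)
    }

    /// Returns the same permissions with the G bit set.
    pub const fn global(self) -> Self {
        Perm(self.0 | flag::G)
    }

    pub const fn user_ro_global() -> Self {
        Self::user_ro().global()
    }

    pub const fn user_rx_global() -> Self {
        Self::user_rx().global()
    }

    pub const fn is_global(self) -> bool {
        self.0 & flag::G != 0
    }
}
```

Only template leaf PTEs would use the `_global` variants; per-call CTX/MEM/STACK mappings keep the non-global constructors so a CR3 reload still drops their TLB entries.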

## Why deferred from `subvm-opt`

`sub_vm_recurse`'s bench guest has zero DataCap mappings, so (A) and (B) don't move the headline number. The original perf motivation for (A), eliminating per-call PT construction, was already absorbed by the per-frame PT cache landed in `306b8640`. The ~1 µs CR3-reload win from (C) is within bench noise.

These remain real correctness + general-perf wins for production workloads with persistent state and high call rates, but the scope (multi-commit PT refactor, new #PF path, HALT auto-mint semantics, CoW lifecycle) merits a focused effort separate from the spawn-recurse perf push.

## Out of scope

- PCID. Would let global pages avoid the cross-Image stale-entry problem cleanly, but lands in a separate optimization pass.
- Disk-persistent CoW (snapshot dirty pages to the persistent cache before HALT publishes the re-minted DataCap). The in-memory cache rebuild via `Cache::put_cap` is sufficient for V1.
- Multi-threaded host. V1's `unsafe impl Sync` reliance on Hyperlight's serialization holds for the current single-threaded model; multi-threaded PT sharing across instances needs the refcounted PDPT design noted in the deferred plan.

## Related

- #844 — JIT compile cache (already landed): the per-Image arena builds on top of this.
- Branch `subvm-opt` (4 commits, not yet PR'd) — supersedes nothing; this issue picks up after it merges.
- Recurse bench: `cargo bench -p javm-bench --bench sub_vm_recurse` — depth=10 currently ~198 µs (was ~304 µs at master).
