diff --git a/designs/outerCube/Davinci_BCC_ScalarPipeline_v1.md b/designs/outerCube/Davinci_BCC_ScalarPipeline_v1.md
new file mode 100644
index 00000000..75538b9c
--- /dev/null
+++ b/designs/outerCube/Davinci_BCC_ScalarPipeline_v1.md
@@ -0,0 +1,934 @@
+# Davinci-v2 Scalar Pipeline: LinxCore BCC-Style Redesign
+
+> **Document ID**: DSP-001  
+> **Version**: v1.0  
+> **Date**: 2026-05-01  
+> **Status**: Proposal  
+> **Target**: `pyCircuit/designs/outerCube/Davinci_superscalar_v2.md`  
+> **Change Point**: #1 — Adopt LinxCore BCC's scalar frontend/rename/issue/execute pipeline; replace RAT checkpoints with MapQ; use `atag`/`ptag` naming throughout
+
+---
+
+## 1. Motivation
+
+### 1.1 Current Davinci-v2 Issue Model
+
+Davinci-v2 currently uses a **centralized Reservation Station (RS)** model:
+
+| RS | Depth | Issue Width | Functional Units |
+|-----|-------|------------|-----------------|
+| Scalar RS | 32 entries | 6 (4 ALU + 1 MUL + 1 BRU) | 4× ALU, 1× MUL, 1× BRU |
+| LSU RS | 24 entries | 2 (1 load + 1 store) | 1× Load + 1× Store |
+| Vector RS | 24 entries | 1 | VEC-4K-v2 |
+| Cube RS | 4 entries | 1 | outerCube MXU |
+| MTE RS | 16 entries | 2 | MTE Engine |
+
+**Wakeup mechanism**: 6-port CDB broadcast. Each RS entry compares `psrc1`/`psrc2` against CDB tags. Total comparators: `32 × 2 × 6 = 384` (for scalar RS alone).
+
+**Select logic**: Each functional unit selects the oldest-ready entry based on an intra-RS `age[5:0]` field.
+
+**Key limitations**:
+
+1. **Single-cycle rename + dispatch**: Rename and dispatch share a single stage (D2), and Scalar RS dispatch is 4-wide, which limits dispatch throughput in tight scalar loops.
+2. **No cross-IQ age coordination**: Different RSs have no shared understanding of which instruction is oldest across the machine.
+3. **CDB port contention**: 6 CDB ports broadcasting to 100 RS entries creates a wide, high-capacitance comparator bus — this becomes a critical path for wider-issue machines.
+4. **No explicit P1/I1/I2 separation**: The wakeup-to-pick cycle-level constraint is undocumented. Ready logic and pick logic are fused into a single combinational block.
+5. **RAT checkpoints are insufficient for wide OoO**: The 8-slot RAT checkpoint store snapshots SMAP but not the Ready Table, leaving a gap in the recovery model when speculation goes wide.
+
+### 1.2 Why LinxCore BCC's Model Is Better
+
+LinxCore BCC (Block-Ordered Superscalar Core) establishes the canonical pattern for wide-issue OoO with a clean separation of concerns:
+
+| Dimension | Davinci-v2 (current) | LinxCore BCC (target) |
+|-----------|---------------------|----------------------|
+| IQ topology | One centralized RS per unit | **Multiple physical IQs**, type-separated |
+| Age encoding | Intra-RS `age[5:0]` field | **ROB-slot-index-based sub-head age**, wrap-friendly |
+| Pick strategy | Oldest-ready within one RS | **Cross-IQ oldest-ready**, cascaded multi-slot allocation |
+| Wakeup | CDB direct broadcast to RS entries | **Ready Table (centralized ptag bitmap) + per-IQ can_issue** |
+| Rename | D2 single cycle | **D1/D2/D3 three-stage pipeline** |
+| Recovery | RAT checkpoints (SMAP only) | **MapQ + CMAP + SMAP** (3-table model with instruction-precise cut points) |
+| Commit | In-order, no ROB | ROB-based precise retire (LinxCore; Davinci-v2 BCC retains no-ROB) |
+| Flush | Global CAM-clear | **BID-based flush propagation** |
+
+**Key insight**: LinxCore BCC's design scales to wider issue machines because:
+- The Ready Table replaces the `iq_depth × srcs × cdb_ports` tag comparators of CDB wakeup (384 for the current Scalar RS) with indexed reads of a single `pregs`-bit bitmap.
+- Age comparison uses the ROB slot index (already in program order) rather than a separate age field.
+- The issue picker is pure combinational logic with cascaded priority encoding — critical path grows as `O(depth × width)` in logic levels, not wire-bound comparators.
+
+---
+
+## 2. Design Goals
+
+1. **Preserve Davinci-v2's key characteristics**:
+   - 32 architectural GPRs (X0–X31), 128 physical GPRs (P0–P127)
+   - 12-stage scalar pipeline (Fetch-to-WB) as the baseline; the stage separation in Section 3.1 lengthens it
+   - Multi-latency functional units (MUL 4-cycle, DIV 12–20-cycle, LD 4-cycle L1 hit)
+   - Mixed-domain scheduling (Scalar/Vector/Cube/MTE share frontend)
+   - RAT checkpoints replaced by MapQ with instruction-precise recovery
+   - No precise exceptions (AI kernel envelope)
+
+2. **Adopt LinxCore BCC's issue mechanism**:
+   - Multi physical IQs (`alu_iq`, `bru_iq`, `lsu_iq`)
+   - D1/D2/D3 rename pipeline (decode → rename request → rename complete)
+   - Ready Table + age-matrix issue picker
+   - P1(pick) / I1(RF read arbitration) / I2(confirm issue) stages
+   - BID-based flush propagation
+
+3. **Adopt LinxCore's naming conventions**:
+   - GPR architectural register → **atag** (architectural tag)
+   - Physical register → **ptag** (physical tag)
+   - RAT checkpoints → **MapQ** (speculative rename increment log)
+   - Committed rename map → **CMAP**
+   - Speculative rename map → **SMAP**
+
+4. **Design constraints**:
+   - Fetch/Decode/Dispatch width = 4 (unchanged)
+   - Commit width = 4 (unchanged)
+   - Issue width extensible to 8-wide (currently 6-wide)
+   - No precise exceptions (maintain AI kernel envelope)
+
+---
+
+## 3. Pipeline Structure: Davinci-v2 BCC
+
+### 3.1 Complete Stage List
+
+```
+F0 → F1 → F2 → F3 → IB → F4 → D1 → D2 → D3 → S1 → S2 → P1 → I1 → I2 → E1 → … → EX_n → W1
+                                    └── Rename ──┘           └── Issue ──┘        └── Execute ──┘
+```
+
+| Stage | Name | Function |
+|--------|------|---------|
+| F0 | Fetch PC Select | PC mux (redirect/sequential/flush) |
+| F1 | I-Cache Lookup + BTB | Tag+BTB lookup, 4-way set-assoc |
+| F2 | I-Cache Response + Predict | Cache data return, TAGE/BTB prediction |
+| F3 | Stitch + BSTART Annotation | Cross-line stitch, BSTART boundary marking |
+| IB | Instruction Buffer | Depth-8 fetch/decode synchronization buffer |
+| F4 | Decode Handoff Register | D1 input register |
+| **D1** | **Decode + RID/atag Allocation** | Decode 4 instr, allocate RID, atag lookup |
+| **D2** | **Rename Request** | Read SMAP, resolve P/T/U source ptag, allocate ptag, build MapQ entry |
+| **D3** | **Rename Complete + IQ/ROB Dispatch Prep** | Write SMAP, IQ/ROB tail allocation, dispatch routing |
+| **S1** | **Dispatch Preparation** | IQ routing, resource checks (free list, MapQ space) |
+| **S2** | **Dispatch Execute** | IQ entry write, free list update, MapQ commit |
+| **P1** | **Issue Pick** | Ready Table query, age-matrix pick, select oldest-ready entries |
+| **I1** | **Operand Read Planning** | Global RF read-port arbitration |
+| **I2** | **Issue Confirm** | IQ entry deallocation, RF read-port occupancy confirm |
+| E1–EX_n | Execute | Functional unit execution (variable latency) |
+| W1 | Writeback | CDB broadcast, Ready Table update, wakeup |
+
+**Total pipeline depth**: Fetch-to-WB = 17+ cycles (approximately 5 cycles longer than current due to stage separation).
+
+### 3.2 Differences from LinxCore BCC (Davinci-v2-Specific Adaptations)
+
+| LinxCore BCC | Davinci-v2 BCC |
+|--------------|----------------|
+| 16 checkpoints, MapQ-based | **12 MapQ entries**, instruction-precise recovery |
+| Full block structure (BSTART/BSTOP + BID + BROB) | **Deferred** — block model not introduced in this change point |
+| ROB (64 entries) for in-order retire | **No ROB** — refcount + MapQ for speculative recovery |
+| CMAP updated on ROB retire | **No CMAP update** — ptag freed when orphan + refcount=0 |
+| BSTART/BSTOP are architectural markers | **Not introduced** — flat control flow |
+| 8 physical IQs (alu_iq0, shared_iq1, bru_iq, agu_iq0, agu_iq1, std_iq0, std_iq1, cmd_iq) | **3 physical IQs**: scalar_alu_iq, scalar_bru_iq, lsu_iq |
+| T/U separate FIFO rename tracked via CMAP on BSTOP | **Tile RAT** (32→256) remains independent domain |
+| Flush uses `flush_bid` (64-bit) | **Flush uses `flush_tag` (3-bit branch tag)** — no BID yet |
+
+---
+
+## 4. Rename Register Model
+
+### 4.1 The Three-Table Model (CMAP / SMAP / MapQ)
+
+The rename model follows LinxCore BCC, adapted to Davinci-v2's scale and no-ROB constraint:
+
+#### CMAP (Committed Map)
+
+- **Width**: 32 entries × `ptag_w` bits. For 128 physical registers, `ptag_w = 7`.
+- **Content**: `CMAP[atag] → ptag` for the architecturally committed mapping.
+- **Update**: When a ptag becomes **orphan** (no longer the current mapping for any atag) **and** its `refcount` reaches zero, the ptag is returned to the free list. CMAP itself is **not updated on commit** — unlike LinxCore BCC's CMAP (which updates on ROB retire), Davinci-v2 BCC has no retire event. Instead, SMAP is updated at rename time, and CMAP serves as the **flush restoration target**.
+
+```
+CMAP[atag] = committed ptag for atag
+              updated when: atag's old ptag becomes orphan + refcount=0
+              flushed-from on mispredict: SMAP ← CMAP
+```
+
+#### SMAP (Speculative Map)
+
+- **Width**: 32 entries × `ptag_w` bits.
+- **Content**: `SMAP[atag] → ptag` for the rename-time (speculative) mapping.
+- **Update**: Updated speculatively on each D2 rename group, in program order across the 4 dispatch slots.
+- **Flush**: On mispredict, `SMAP ← CMAP` (flash-restore in one cycle). All speculative mappings are discarded.
+- **On start marker** (future: BSTART; currently: N/A): SMAP entries 24–31 (T tile regs) are cleared.
+
+```
+# D2 rename step (per slot, in program order):
+if dst.kind == P:           # GPR destination
+    old_ptag = SMAP[dst.atag]   # will become orphan
+    SMAP[dst.atag] = new_ptag   # rename: atag now maps to new ptag
+    refcount[old_ptag] -= 1       # decremented; freed if orphan + refcount=0
+    refcount[new_ptag] += 1
+    MapQ.push(MapQEntry(atag=dst.atag, old_ptag=old_ptag, new_ptag=new_ptag, rid=slot.rid))
+elif dst.kind == T:          # Tile destination
+    pass                     # Tile RAT handles independently (separate domain)
+```
+
+#### MapQ (Speculative Rename Increment Log)
+
+MapQ is the **replacement for RAT checkpoints**. It records the incremental changes made to SMAP during speculative execution, enabling instruction-precise recovery without a ROB.
+
+**Key insight**: Unlike LinxCore BCC's MapQ (which uses CMAP for commit-side updates), Davinci-v2 BCC uses MapQ purely for **speculative tracking**. Commit-side freeing is handled by the existing refcount mechanism.
+
+```
+MapQ Entry:
+{
+    valid: 1,
+    atag: 6,        # which architectural register
+    old_ptag: 7,    # what it mapped to before this rename
+    new_ptag: 7,    # what it maps to now (speculative)
+    rid: 6,         # rename ID (program order, for cut-point)
+    is_push_t: 1,   # T-stack push (clears downstream T entries)
+    is_push_u: 1,   # U-stack push (clears downstream U entries)
+}
+
+# MapQ behavior:
+# - D2: push MapQ entry for each P-dst rename
+# - Flush: replay MapQ in reverse order to restore SMAP to CMAP state
+#   (iterate MapQ from tail to head, undo each rename)
+# - Commit (future ROB): evict MapQ entries as they become non-speculative
+```
+
+**MapQ depth**: 12 entries (LinxCore BCC uses 16 checkpoints; see Section 3.2). New entries are pushed at the tail; the 12 most recent rename operations can be undone on flush.
+
+**Instruction-precise recovery**: On branch mispredict at `flush_rid`, MapQ is replayed backwards from the youngest entry while `entry.rid > flush_rid`, stopping at the first entry with `entry.rid ≤ flush_rid`. For each entry reversed: `SMAP[atag] ← old_ptag`, `refcount[new_ptag]--`, `refcount[old_ptag]++`. This restores SMAP to the exact state it had at the mispredicted branch's rename time.
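+
+The push/replay behavior described above can be sketched as a small behavioral model. This is a minimal sketch, not the RTL: the helper names `rename` and `replay_flush` are hypothetical, the MapQ is an unbounded Python list rather than a 12-entry ring, and RIDs are assumed not to wrap within the window being replayed.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class MapQEntry:
+    atag: int      # architectural register being renamed
+    old_ptag: int  # mapping before this rename
+    new_ptag: int  # mapping after this rename
+    rid: int       # rename ID (program order)
+
+def rename(smap, refcount, mapq, atag, new_ptag, rid):
+    """D2 rename of one P destination: update SMAP and log the increment."""
+    old_ptag = smap[atag]
+    smap[atag] = new_ptag
+    refcount[old_ptag] -= 1
+    refcount[new_ptag] += 1
+    mapq.append(MapQEntry(atag, old_ptag, new_ptag, rid))
+
+def replay_flush(smap, refcount, mapq, flush_rid):
+    """Undo every logged rename younger than flush_rid, youngest first."""
+    while mapq and mapq[-1].rid > flush_rid:
+        e = mapq.pop()
+        smap[e.atag] = e.old_ptag
+        refcount[e.new_ptag] -= 1
+        refcount[e.old_ptag] += 1
+
+# Worked example: X5 is renamed twice after a branch with rid=10.
+smap = {5: 12}                      # X5 currently maps to P12
+refcount = {12: 1, 40: 0, 41: 0}
+mapq = []
+rename(smap, refcount, mapq, atag=5, new_ptag=40, rid=11)
+rename(smap, refcount, mapq, atag=5, new_ptag=41, rid=12)
+replay_flush(smap, refcount, mapq, flush_rid=10)
+# X5 maps to P12 again and both speculative entries are gone.
+```
+
+Because each entry stores `old_ptag`, undoing entries youngest-first restores the mapping even when the same atag was renamed several times on the wrong path.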
+
+### 4.2 Rename Register State Machine
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  CMAP [32 × ptag_w]                                          │
+│  atag → ptag (committed view)                                │
+│  Updated when: ptag becomes orphan + refcount=0               │
+│  Flush target: SMAP ← CMAP                                   │
+└─────────────────────────────────────────────────────────────┘
+         ↑ restore on flush (full CMAP restore)
+         │
+┌─────────────────────────────────────────────────────────────┐
+│  SMAP [32 × ptag_w]                                          │
+│  atag → ptag (speculative view, active for rename)          │
+│  Updated on each D2 rename group (in program order)          │
+│  Flush: SMAP ← CMAP (full restore via MapQ replay)         │
+└─────────────────────────────────────────────────────────────┘
+         ↑ writeback (ptag becomes ready)
+         │
+┌─────────────────────────────────────────────────────────────┐
+│  Ready Table [128-bit bitmap]                                │
+│  bit[i] = 1: ptag i has a valid value                       │
+│  bit[i] = 0: ptag i is waiting for writeback                 │
+│  Set on: CDB writeback (ptag becomes ready)                  │
+│  Clear on: D2 dispatch (ptag allocated as dst, not yet ready) │
+│  Reset on: flush (all ptags become temporarily untrusted)      │
+└─────────────────────────────────────────────────────────────┘
+         ↑ can_issue check
+         │
+┌─────────────────────────────────────────────────────────────┐
+│  MapQ [12-entry ring buffer]                                 │
+│  Records speculative SMAP increments                          │
+│  Push on: each P-dst rename in D2                           │
+│  Flush replay: reverse from tail while rid > flush_rid       │
+│  Fields: {atag, old_ptag, new_ptag, rid, is_push_t/u}    │
+└─────────────────────────────────────────────────────────────┘
+         ↑ tracked by
+         │
+┌─────────────────────────────────────────────────────────────┐
+│  Refcount [128 × 3 bits]                                    │
+│  refcount[ptag] = number of active mappings to ptag         │
+│  Incremented on: rename allocates ptag as new mapping        │
+│  Decremented on: rename remaps atag away (old ptag orphans) │
+│  Freed when: orphan + refcount == 0                         │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 5. D1 Stage: Decode + RID / atag Allocation
+
+### 5.1 Function
+
+D1 reads the 4-wide fetch bundle from F4 and produces the canonical decoded uop fields. D1 is responsible for:
+
+1. **Decode**: `f4_window` (64-bit raw bits) → opcode, uop_type, operand registers, immediate
+2. **RID allocation**: Allocate a unique **Rename ID** (6-bit) per decoded uop, tracking program order
+3. **atag lookup**: Resolve source architectural register indices to their **atag** form
+
+### 5.2 Source Operand Classification
+
+Each source operand carries its **architectural register index** (atag, 0–31 for GPRs) and its **operand class**:
+
+| Class | Register Set | Davinci-v2 Domain |
+|-------|-------------|-------------------|
+| **P** | X0–X31 (GPR) | Scalar integer, FP |
+| **T** | T0–T31 (Tile) | Vector, Cube, MTE operands |
+| **U** | U0–U3 (Uncore) | Future extension |
+
+**Operand encoding** in D1 output:
+```
+src[i].atag  : 6-bit   # architectural register index (0–31)
+src[i].pclass: 2-bit   # P=0, T=1, U=2, CARG=3
+```
+
+### 5.3 D1 Output Uop Format
+
+```python
+{
+    valid: 1,                    # Fetch bundle slot valid
+    pc: 64,                     # This instruction's PC
+    opcode: 12,                  # Operation code
+    uop_type: 8,                # Micro-operation type
+    src: [                     # 3 source operands
+        { atag: 6, pclass: 2 },  # P/T/U/CARG
+        { atag: 6, pclass: 2 },
+        { atag: 6, pclass: 2 }
+    ],
+    dst: { atag: 6, pclass: 2 },  # Destination operand
+    imm: 64,                   # Immediate value
+    imm_type: 4,               # Immediate type
+    rid: 6,                    # Rename ID (program order)
+    lsid: 32,                  # Load-Store ID (LSU ops only)
+    checkpoint_id: 4,           # MapQ entry ID (for flush recovery)
+    pred_taken: 1,            # Branch prediction direction
+    boundary_type: 3,          # Block boundary type (future)
+    insn_len: 3,             # Instruction length (bytes)
+    insn_raw: 64,              # Raw instruction bits
+}
+```
+
+---
+
+## 6. D2 Stage: Rename Request
+
+### 6.1 Function
+
+D2 receives the D1 decoded uop bundle and performs **register renaming** in a single cycle. This is the core rename logic.
+
+#### P (GPR) Rename
+
+```python
+# Per D2 slot (slots 0..3, processed in program order):
+# "smap_live" accumulates SMAP updates from earlier slots in the same group
+
+for slot in range(4):
+    u = d1_uop[slot]
+    if not u.valid:
+        continue
+    
+    # Step 1: Resolve source ptag from SMAP (live state)
+    if u.src[0].pclass == P:
+        src0_ptag = smap_live[u.src[0].atag]   # atag → ptag lookup
+    elif u.src[0].pclass == T:
+        src0_ptag = tile_rat[u.src[0].atag]   # Tile RAT lookup (independent)
+    elif u.src[0].pclass == CARG:
+        src0_ptag = 0  # CARG resolved by BID, no ptag
+    
+    # (Same for src[1], src[2])
+    
+    # Step 2: Allocate new ptag for destination (if P-dst)
+    if u.dst.pclass == P and u.dst.atag != 0:   # atag=0 is r0 (hardwired zero)
+        old_ptag = smap_live[u.dst.atag]        # will become orphan
+        new_ptag = allocate_from_free_list(free_list)  # lowest-numbered free ptag
+        
+        # Update SMAP (live, for later slots in same group)
+        smap_live[u.dst.atag] = new_ptag
+        
+        # Update refcount
+        refcount[old_ptag] -= 1    # old ptag: one fewer mapping
+        refcount[new_ptag] += 1     # new ptag: one more mapping
+        
+        # Push MapQ entry (for flush recovery)
+        mapq.push(MapQEntry(
+            valid=1,
+            atag=u.dst.atag,
+            old_ptag=old_ptag,
+            new_ptag=new_ptag,
+            rid=u.rid,
+            is_push_t=0,
+            is_push_u=0,
+        ))
+        
+        # If old_ptag becomes orphan (refcount=0) → free old_ptag to free_list
+    
+    # Step 3: T-stack push (Tile RAT independent; increment tile refcount)
+    if u.dst.pclass == T:
+        tile_rat.push(u.dst.atag, new_tile_ptag)
+    
+    # Step 4: U-stack push (U registers; future)
+    if u.dst.pclass == U:
+        ucore_rat.push(u.dst.atag, new_utag)
+```
+
+#### T/U Rename (Tile RAT, Independent Domain)
+
+Tile rename is **independent** of the scalar P rename. Davinci-v2's existing Tile RAT (32 architectural tile regs → 256 physical tile regs) is preserved unchanged. The Tile RAT operates with its own free list and refcount mechanism, entirely separate from the P-reg rename pipeline.
+
+| Domain | Rename Mechanism | Map | Recovery |
+|--------|----------------|-----|---------|
+| **P** (GPR) | Map-based (SMAP) | SMAP / CMAP / MapQ | MapQ replay |
+| **T** (Tile) | Map-based (Tile RAT) | Tile RAT | Tile RAT refcount |
+| **U** (Uncore) | FIFO (future) | — | — |
+
+### 6.2 Intra-Group Bypass
+
+Instructions in the same D2 group that depend on a destination produced by an earlier slot in the same group use **bypass** (not a SMAP lookup):
+
+```
+Slot 0:  ADD  X5, X2, X3     → dst.atag=X5 → ptag=P40
+Slot 1:  MUL  X6, X5, X7   → src.atag=X5 → bypass P40 (not SMAP)
+Slot 2:  SUB  X5, X8, X9   → dst.atag=X5 → ptag=P41
+Slot 3:  ADD  X10, X5, X6  → src.atag=X5 → bypass P41 (newest)
+```
+
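+Implementation: SMAP is updated sequentially within the group. Each slot reads the SMAP state produced by all previous slots (not the SMAP from the previous cycle). Bypass is resolved by comparing `src.atag == earlier_dst.atag`. With 3 sources per slot and 0, 1, 2, and 3 earlier slots respectively, this takes 3 × (0 + 1 + 2 + 3) = 18 atag comparators (6 bits each, per the D1 format); provisioning a uniform 3 sources × 3 earlier slots in all 4 slots gives the upper bound of 36.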
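+
+The sequential-update view of intra-group bypass can be sketched as follows. This is a behavioral sketch under stated assumptions: `rename_group` is a hypothetical helper, the free list is modeled as an iterator, X0 to X31 are assumed to map to P0 to P31 before the group, and the ptags P42 and P43 for X6 and X10 are made up for illustration (the example above only fixes P40 and P41).
+
+```python
+def rename_group(group, smap, next_free):
+    """Rename one dispatch group in program order.
+
+    group:     list of (dst_atag, [src_atags]) tuples, one per slot
+    smap:      atag -> ptag mapping from the previous cycle (not modified)
+    next_free: iterator yielding free ptags (assumed free-list model)
+    """
+    live = dict(smap)                          # per-group working copy of SMAP
+    renamed = []
+    for dst, srcs in group:
+        src_ptags = [live[s] for s in srcs]    # sees earlier slots' writes
+        pdst = next(next_free)
+        live[dst] = pdst                       # visible to later slots
+        renamed.append((pdst, src_ptags))
+    return renamed, live
+
+# The 4-slot example above: ADD X5; MUL X6 <- X5; SUB X5; ADD X10 <- X5, X6.
+smap = {i: i for i in range(32)}
+group = [(5, [2, 3]), (6, [5, 7]), (5, [8, 9]), (10, [5, 6])]
+renamed, smap_after = rename_group(group, smap, iter([40, 42, 41, 43]))
+# Slot 1 reads X5 as P40; slot 3 reads X5 as P41 (the newest mapping).
+```
+
+In hardware the same result comes from the atag comparators selecting the youngest matching earlier destination; the working copy here only models that priority.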
+
+---
+
+## 7. D3 Stage: Rename Complete + Dispatch Prep
+
+### 7.1 Function
+
+D3 is the **rename-complete register boundary**. It holds:
+
+1. **Resolved source ptags**: `src0_ptag`, `src1_ptag`, `src2_ptag` (resolved from SMAP or bypass)
+2. **Resolved destination ptag**: `pdst` (newly allocated)
+3. **Ready query**: From the Ready Table, query `ready_mask[src0_ptag]`, `ready_mask[src1_ptag]`, and `ready_mask[src2_ptag]`; the results initialize each IQ entry's `src_ready` state
+
+D3 also performs the MapQ head advancement and initial free-list management.
+
+### 7.2 D3 Output Format
+
+```python
+{
+    valid: 1,
+    pc: 64,
+    opcode: 12,
+    src0_ptag: 7,       # Resolved source ptag 0
+    src1_ptag: 7,       # Resolved source ptag 1
+    src2_ptag: 7,       # Resolved source ptag 2 (immediate/pc-rel/3rd operand)
+    src_ready: (3,),     # Ready state from Ready Table: {src0_ready, src1_ready, src2_ready}
+    pdst: 7,            # Newly allocated destination ptag
+    dst_atag: 6,        # Destination atag (for refcount on orphan)
+    dst_class: 2,       # P/T/U
+    has_dst: 1,         # Whether this uop writes a register
+    lsid: 32,           # Load-Store ID (LSU ops only)
+    rid: 6,             # Program order ID
+    checkpoint_id: 4,    # MapQ entry ID
+    imm: 64,
+}
+```
+
+---
+
+## 8. S1 / S2 Stages: Dispatch Preparation + Execute
+
+### 8.1 S1 — Dispatch Preparation
+
+S1 receives D3 renamed uops and performs **resource availability checks**:
+
+| Check | Condition | Recovery |
+|-------|-----------|---------|
+| **Free list** | `free_mask` has ≥ N free ptags for dispatched P-dst ops | Stall |
+| **MapQ space** | `mapq.count < mapq_depth - 1` (keep 1 slot for safety) | Stall |
+| **IQ space** | Each IQ type has enough free entries for dispatched slots | Stall |
+
+**IQ Routing** (same opcode-class scheme as LinxCore BCC, reduced to three IQs):
+
+```python
+def classify(uop):
+    if uop.opcode in LSU_OPS:
+        return lsu_iq       # LSU: 32 entries, 2-wide issue
+    elif uop.opcode in BRU_OPS:
+        return bru_iq       # BRU: 16 entries, 1-wide issue
+    else:
+        return alu_iq       # ALU/FSU: 48 entries, 4-wide issue
+```
+
+### 8.2 S2 — Dispatch Execute
+
+S2 performs the actual IQ entry write and free list update:
+
+```python
+# Per dispatch slot (writes to different IQ types proceed in parallel):
+for slot in range(dispatch_w):
+    if not d3_uop[slot].valid:
+        continue
+    iq = s1_iq_route[slot]                  # Target physical IQ chosen in S1
+    entry_idx = iq_alloc.allocate(iq)       # Lowest-numbered free entry
+
+    # IQ entry write:
+    iq[entry_idx].valid = 1
+    iq[entry_idx].src0_ptag = d3_uop[slot].src0_ptag
+    iq[entry_idx].src1_ptag = d3_uop[slot].src1_ptag
+    iq[entry_idx].src2_ptag = d3_uop[slot].src2_ptag
+    iq[entry_idx].pdst = d3_uop[slot].pdst
+    iq[entry_idx].src_ready = d3_uop[slot].src_ready  # From Ready Table at D3
+    iq[entry_idx].rid = d3_uop[slot].rid               # Program order ID (for age)
+    iq[entry_idx].lsid = d3_uop[slot].lsid
+    iq[entry_idx].checkpoint_id = d3_uop[slot].checkpoint_id
+
+# Free list update:
+free_mask &= ~allocated_ptags   # Clear bits for newly allocated ptags
+```
+
+---
+
+## 9. Physical IQ Layout
+
+### 9.1 IQ Topology
+
+Three physical IQs, parameterized from LinxCore BCC:
+
+| Physical IQ | Type | Depth | Issue Width | Functional Units |
+|-------------|------|-------|------------|-----------------|
+| `alu_iq` | ALU / FSU | 48 entries | **4** | 4× ALU + FSU |
+| `bru_iq` | BRU | 16 entries | **1** | 1× BRU (branch resolve) |
+| `lsu_iq` | LSU | 32 entries | **2** | 1× Load + 1× Store |
+
+> **Note**: Issue width increases from current 6-wide (4 ALU + 1 MUL + 1 BRU) to 7-wide (alu_iq × 4 + bru_iq × 1 + lsu_iq × 2). MUL reuses ALU IQ but has 4-cycle pipelined execution. DIV reuses ALU IQ with non-pipelined blocking.
+
+### 9.2 IQ Entry Format
+
+All IQ entries carry the **program order RID** (`rid`, 6-bit) for age-based pick. No separate age field is maintained — the RID is the age encoding.
+
+```python
+# ALU IQ Entry (48 entries × 93 bits)
+{
+    valid: 1,
+    rid: 6,              # Program order ID — used for age comparison in pick
+    op: 12,              # Operation code
+    pc: 64,              # PC (for trace)
+    imm: 64,              # Immediate value
+    src0_ptag: 7,         # Source 0 ptag
+    src1_ptag: 7,         # Source 1 ptag
+    src2_ptag: 7,         # Source 2 ptag (immediate/pc-rel)
+    pdst: 7,              # Destination ptag
+    src_ready: (3,),      # Source ready bits (from Ready Table at dispatch)
+    has_dst: 1,
+    checkpoint_id: 4,       # MapQ entry ID (for flush)
+}
+# ALU IQ total: 48 × 93 ≈ 4,464 bits ≈ 558 bytes
+
+# BRU IQ Entry (16 entries × 95 bits)
+{
+    valid: 1,
+    rid: 6,
+    op: 12,
+    pc: 64,
+    pdst: 7,
+    src0_ptag: 7,
+    src1_ptag: 7,
+    src_ready: (2,),
+    has_dst: 1,
+    checkpoint_id: 4,
+    pred_taken: 1,         # Branch prediction direction
+}
+# BRU IQ total: 16 × 95 ≈ 1,520 bits ≈ 190 bytes
+
+# LSU IQ Entry (32 entries × 128 bits)
+{
+    valid: 1,
+    rid: 6,
+    op: 12,
+    pc: 64,
+    lsid: 32,             # Load-Store ID (memory ordering)
+    src0_ptag: 7,         # Base register ptag
+    src1_ptag: 7,         # Offset register ptag
+    src_ready: (2,),
+    has_dst: 1,
+    checkpoint_id: 4,
+    addr_ready: 1,        # AGU address computation complete
+}
+# LSU IQ total: 32 × 128 ≈ 4,096 bits ≈ 512 bytes
+
+# Total IQ storage: ~558 + 190 + 512 ≈ 1,260 bytes
+```
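+
+As a cross-check on the storage figures, the listed field widths can be summed directly. The sketch below copies the widths from the three entry formats above; the helper names are hypothetical. As listed, the fields sum to more bits per entry than the per-entry figures quoted in the comments, so the totals depend on which fields are actually stored per entry (for instance, whether `pc` or `imm` live in a side structure, which is an assumption and not something this document states).
+
+```python
+ALU_IQ = {"valid": 1, "rid": 6, "op": 12, "pc": 64, "imm": 64,
+          "src0_ptag": 7, "src1_ptag": 7, "src2_ptag": 7, "pdst": 7,
+          "src_ready": 3, "has_dst": 1, "checkpoint_id": 4}
+BRU_IQ = {"valid": 1, "rid": 6, "op": 12, "pc": 64, "pdst": 7,
+          "src0_ptag": 7, "src1_ptag": 7, "src_ready": 2, "has_dst": 1,
+          "checkpoint_id": 4, "pred_taken": 1}
+LSU_IQ = {"valid": 1, "rid": 6, "op": 12, "pc": 64, "lsid": 32,
+          "src0_ptag": 7, "src1_ptag": 7, "src_ready": 2, "has_dst": 1,
+          "checkpoint_id": 4, "addr_ready": 1}
+
+def entry_bits(fields):
+    """Width of one IQ entry: the sum of its field widths."""
+    return sum(fields.values())
+
+def iq_bits(fields, depth):
+    """Total storage of one IQ in bits."""
+    return entry_bits(fields) * depth
+
+widths = {"alu_iq": entry_bits(ALU_IQ),
+          "bru_iq": entry_bits(BRU_IQ),
+          "lsu_iq": entry_bits(LSU_IQ)}
+total_bytes = (iq_bits(ALU_IQ, 48) + iq_bits(BRU_IQ, 16) + iq_bits(LSU_IQ, 32)) // 8
+# widths == {"alu_iq": 183, "bru_iq": 112, "lsu_iq": 137}; total_bytes == 1870
+```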
+
+### 9.3 Comparison with Current Davinci-v2
+
+| Dimension | Current Davinci-v2 RS | Davinci-v2 BCC IQ |
+|-----------|----------------------|-------------------|
+| IQ topology | 1 centralized Scalar RS (32 entries) | 3 physical IQs (alu_iq 48 + bru_iq 16 + lsu_iq 32 = 96 total) |
+| Wakeup comparators | 384 (32×2×6 CDB ports) | **0** (Ready Table bitmap lookup replaces all) |
+| Issue width | 6 | 7 (4+1+2) |
+| IQ storage | ~1.7 KB (RS entries only) | ~1.26 KB |
+| Age field | `age[5:0]` per entry | **RID-based** (no extra field; uses existing rid) |
+| Ready state | Per-entry `rdy1`/`rdy2` bits set by CDB | **Per-entry `src_ready` bitmap** from Ready Table at dispatch |
+
+---
+
+## 10. Ready Table + Age-Matrix Issue Picker
+
+### 10.1 Ready Table
+
+The Ready Table is the core innovation: it eliminates the CDB tag-comparator array, which grows as `O(iq_depth × srcs × cdb_ports)`.
+
+```python
+class ReadyTable:
+    """128-bit bitmap: bit[i] = 1 means ptag i has a valid value."""
+    
+    def __init__(self):
+        self.mask = (1 << 128) - 1  # 128-bit all-ones bitmap (every ptag starts ready)
+    
+    def set(self, ptag: int):
+        """Set ptag to ready (called on CDB writeback)."""
+        self.mask |= (1 << ptag)
+    
+    def clear(self, ptag: int):
+        """Set ptag to not-ready (called on dispatch allocation)."""
+        self.mask &= ~(1 << ptag)
+    
+    def is_ready(self, ptag: int) -> bool:
+        return ((self.mask >> ptag) & 1) == 1
+    
+    def read(self, ptag: int) -> bool:
+        """Combinational read for can_issue computation."""
+        return self.is_ready(ptag)
+```
+
+**Update rules per cycle** (combinational → registered):
+
+```
+ready_next = ready_current
+
+# 1. D2 dispatch: allocated ptag is not yet ready
+for each dispatched dst with pclass == P:
+    ready_next.clear(pdst)      # clear bit in bitmap
+
+# 2. CDB writeback: ptag becomes ready
+for each CDB writeback port (6 ports):
+    if wb.valid:
+        ready_next.set(wb.ptag)  # set bit in bitmap
+
+# 3. Flush: all ptags become temporarily untrusted
+if do_flush:
+    ready_next = ALL_ONES   # reset all bits
+
+Ready Table Register ← ready_next   # Sampled on clock edge
+```
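+
+The update priority above (dispatch clear, then writeback set, then flush override) can be sketched as a pure next-state function. This is a minimal sketch; `ready_next` and its argument names are hypothetical, and the bitmap is modeled as a Python integer.
+
+```python
+NUM_PTAGS = 128
+ALL_ONES = (1 << NUM_PTAGS) - 1
+
+def ready_next(ready, dispatched_pdsts, wb_ptags, do_flush):
+    """Compute the next Ready Table value from the current one."""
+    nxt = ready
+    for p in dispatched_pdsts:      # 1. newly allocated dsts are not ready
+        nxt &= ~(1 << p)
+    for p in wb_ptags:              # 2. CDB writebacks mark ptags ready
+        nxt |= 1 << p
+    if do_flush:                    # 3. flush overrides everything
+        nxt = ALL_ONES
+    return nxt & ALL_ONES
+
+# P40 is allocated, then written back one cycle later while P41 is allocated.
+r1 = ready_next(ALL_ONES, dispatched_pdsts=[40], wb_ptags=[], do_flush=False)
+r2 = ready_next(r1, dispatched_pdsts=[41], wb_ptags=[40], do_flush=False)
+r3 = ready_next(r2, dispatched_pdsts=[], wb_ptags=[], do_flush=True)
+```
+
+Applying the writeback after the dispatch clear means a ptag that is written back in the same cycle it is cleared ends up ready, which matches the ordering of the rules above.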
+
+**The Ready Table is read combinationally by the issue picker** (not registered at read time). This is the key efficiency: each `can_issue` computation is a single bit-test, not a bank of comparators.
+
+### 10.2 Issue Picker (Age-Matrix Pick)
+
+The issue picker is **purely combinational logic** (no state). For each physical IQ:
+
+**Step 1: can_issue computation** (per IQ entry):
+
+```python
+for entry in iq.entries:
+    src0_rdy = ready_table.read(entry.src0_ptag)  # O(1) bit-test
+    src1_rdy = ready_table.read(entry.src1_ptag)  # O(1) bit-test
+    src2_rdy = ready_table.read(entry.src2_ptag)  # O(1) bit-test
+    
+    can_issue[entry] = (
+        entry.valid
+        & src0_rdy
+        & src1_rdy
+        & src2_rdy
+    )
+```
+
+**Step 2: Cascaded age-matrix pick** (one pick per issue lane):
+
+```python
+# ALU IQ: 4-wide issue (alu_w = 4)
+# Slot 0: select oldest can_issue entry
+# Slot 1: select next-oldest (excluding Slot 0's choice)
+# ...
+
+selected = []
+excluded = set()
+
+for lane in range(4):   # alu_w = 4
+    winner = None
+    best_age = 0x40     # One past the max 6-bit age (0x3F), so any ready entry can win
+    
+    for entry in alu_iq.entries:
+        if entry not in excluded and can_issue[entry]:
+            # Sub-head age: smaller = older = higher priority
+            age = (entry.rid - head_rid) & 0x3F  # Wrap-friendly: mod 64
+            if age < best_age:
+                best_age = age
+                winner = entry
+    
+    if winner is not None:
+        selected.append(winner)
+        excluded.add(winner)
+    else:
+        selected.append(None)   # Lane empty
+```
+
+**Age encoding** (wrap-friendly, from LinxCore BCC):
+
+The RID is a 6-bit program-order counter. Sub-head age:
+```
+age = (entry.rid - head_rid) mod 64
+```
+
+This gives:
+- **Wrapped entry** (entry.rid < head_rid): the RID counter wrapped after head was allocated, so age = 64 + entry.rid - head_rid → large value (younger)
+- **Non-wrapped entry** (entry.rid ≥ head_rid): age = entry.rid - head_rid → small value (older)
+
+Therefore, the entry with the **smallest age** is the **oldest instruction**. The mod-64 arithmetic handles the 6-bit wrap correctly without special casing.
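+
+A short numeric example makes the wrap behavior concrete. This is a behavioral sketch: `sub_head_age` and `pick_oldest` are hypothetical helper names, and sorting stands in for the cascaded priority logic rather than describing its hardware structure.
+
+```python
+RID_BITS = 6
+RID_MASK = (1 << RID_BITS) - 1   # 0x3F
+
+def sub_head_age(rid: int, head_rid: int) -> int:
+    """Distance from head in program order, modulo 64 (smaller = older)."""
+    return (rid - head_rid) & RID_MASK
+
+def pick_oldest(ready_rids, head_rid, width):
+    """Return up to `width` ready RIDs, oldest first."""
+    return sorted(ready_rids, key=lambda r: sub_head_age(r, head_rid))[:width]
+
+# head_rid = 60: RIDs 61 and 63 have not wrapped; 2 and 5 were allocated after the wrap.
+ages = {rid: sub_head_age(rid, 60) for rid in (61, 63, 2, 5)}
+picked = pick_oldest([5, 2, 63, 61], head_rid=60, width=2)
+# ages == {61: 1, 63: 3, 2: 6, 5: 9}; picked == [61, 63]
+```
+
+The numerically smaller RIDs 2 and 5 get the larger ages, so they lose to 61 and 63 even though a plain `rid` comparison would rank them first.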
+
+### 10.3 Wakeup Path Timing
+
+```
+Cycle N:
+  W1:  CDB broadcast → ptag P40 is ready
+  W1:  ready_next ← ready | {P40}
+  
+Cycle N+1 (clock edge):
+  Ready Table Register ← ready_next
+  
+Cycle N+1:
+  P1:  can_issue[i] recomputed (Ready Table combinational read → O(1) bit-test)
+  P1:  Age-matrix pick selects winners
+  I1:  RF read-port arbitration
+  
+Cycle N+2:
+  I2:  Issue confirm: IQ entry.valid ← 0 (deallocated)
+  I2:  RF read-port occupancy confirmed
+```
+
+**Total wakeup latency: 2 cycles** (Ready Table register → can_issue visible → pick → RF read).
+
+This is 1 cycle slower than Davinci-v2's current single-cycle CDB→RS comparator wakeup, but:
+- Ready Table lookup is an O(1) bit-test per source, not a bank of per-CDB-port tag comparators
+- The latency does not grow with issue width
+- For AI kernels (compute-bound, not branch-bound), the impact is expected to be small
+
+---
+
+## 11. Multi-Latency Functional Unit Handling
+
+### 11.1 Variable-Latency FU Wakeup
+
+Davinci-v2 has multiple variable-latency functional units:
+
+| Unit | Latency | Treatment |
+|------|---------|-----------|
+| ALU | 1 cycle | Single-cycle, normal wakeup |
+| MUL | 4 cycles | Pipelined; multi-cycle in-flight entries tracked by Ready Table |
+| DIV | 12–20 cycles | Non-pipelined, blocks FU until complete |
+| LD (L1 hit) | 4 cycles | LSU pipeline |
+| LD (L2 hit) | 12 cycles | LSU MSHR |
+| LD (DRAM) | 200–400 cycles | LSU MSHR + external |
+
+**Issue picker does not distinguish latency** — each selected entry enters the execution unit, and the FU manages its own latency pipeline. CDB writeback broadcasts the ptag to the Ready Table regardless of which FU produced it.
+
+**Wakeup for variable-latency ops**: Multiple results from different latency units can be in flight simultaneously. Each IQ entry's `can_issue` is independently determined by the Ready Table. This is correct because each ptag has exactly one producer; the Ready Table tracks ptag readiness, not instruction age.
+
+### 11.2 Load/Store Special Handling
+
+LSU IQ entries have an additional `addr_ready` bit, managed independently from the Ready Table:
+
+```python
+lsu_entry.can_issue = (
+    lsu_entry.valid
+    & lsu_entry.src_ready[0]      # base register ready (Ready Table)
+    & lsu_entry.src_ready[1]      # offset register ready (Ready Table)
+    & lsu_entry.addr_ready        # AGU address computation complete
+)
+```
+
+`addr_ready` is set by the AGU execution stage when address calculation finishes. This is separate from the Ready Table update (which tracks register file data readiness, not address readiness).
+
+---
+
+## 12. MapQ: Flush and Recovery
+
+### 12.1 Flush Trigger Sources
+
+| Source | Condition | Recovery |
+|--------|-----------|---------|
+| Branch mispredict | Branch direction wrong at E1 | MapQ replay + SMAP ← CMAP |
+| FP exception | FP exception detected | Same |
+| TLB miss | Page fault | Same (AI kernels assume no page faults) |
+
+### 12.2 MapQ Replay Recovery
+
+On a branch mispredict at `flush_rid` (the RID of the mispredicted branch):
+
+```
+# MapQ is a 12-entry ring buffer (oldest at head, newest at tail)
+# Replay in reverse order (youngest → oldest)
+
+for entry in mapq.entries.reversed():
+    if entry.rid > flush_rid:    # Entry is younger than the mispredicted branch (compare mod-64 sub-head ages so RID wrap is handled)
+        # Undo this rename:
+        smap[entry.atag] = entry.old_ptag
+        refcount[entry.new_ptag] -= 1
+        refcount[entry.old_ptag] += 1
+        if refcount[entry.new_ptag] == 0 and entry.old_ptag is orphan:
+            free_list.push(entry.new_ptag)
+        entry.valid = 0        # Invalidate MapQ entry
+    else:
+        break                   # Older entries are on the correct path
+
+# After replay: SMAP holds the rename state as of the mispredicted branch
+# All MapQ entries younger than the mispredicted branch are undone
+```
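+
+The replay loop above compares raw RIDs, which only holds while the 6-bit RID has not wrapped. A minimal Python sketch of the same reverse replay using the mod-64 sub-head age follows. `MapQEntry`, `age`, and `replay` are hypothetical names, `head_rid` (the oldest in-flight RID) is an assumed input, and refcount bookkeeping is deliberately left out, so each undone `new_ptag` goes straight back to the free list.
+
+```python
+from dataclasses import dataclass
+
+RID_MOD = 64  # assumption: 6-bit RID, matching the mod-64 sub-head age
+
+
+@dataclass
+class MapQEntry:
+    rid: int       # program-order ID of the renaming instruction
+    atag: int      # architectural register that was renamed
+    old_ptag: int  # mapping that this rename replaced
+    new_ptag: int  # mapping that this rename installed
+
+
+def age(rid: int, head_rid: int) -> int:
+    """Sub-head age: distance from the oldest in-flight RID, wrap-safe."""
+    return (rid - head_rid) % RID_MOD
+
+
+def replay(mapq, smap, free_list, flush_rid, head_rid):
+    """Undo every rename younger than flush_rid, youngest first.
+
+    mapq is a list ordered oldest -> newest (hypothetical layout).
+    """
+    flush_age = age(flush_rid, head_rid)
+    while mapq and age(mapq[-1].rid, head_rid) > flush_age:
+        entry = mapq.pop()                 # youngest remaining entry
+        smap[entry.atag] = entry.old_ptag  # restore the previous mapping
+        free_list.append(entry.new_ptag)   # simplification: no refcount
+    return smap
+```
+
+With `head_rid = 62` and `flush_rid = 63`, entries with RIDs 0 and 1 are younger than the branch even though their raw RIDs are numerically smaller; the age comparison catches them where a plain `rid > flush_rid` test would not.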
+
+**Flush propagation**:
+
+| Component | Flush Action |
+|-----------|-------------|
+| Fetch (F4) | `valid ← 0` (clear fetch bundle) |
+| SMAP | Restored from CMAP via MapQ replay |
+| Ready Table | `mask ← ALL_ONES` (all ptags temporarily untrusted) |
+| Free List | Recomputed from SMAP + refcount |
+| MapQ | Entries younger than `flush_rid` invalidated |
+| Physical IQs | All entries with `checkpoint_id ≥ flush_checkpoint` → `valid ← 0` |
+
+### 12.3 MapQ vs. RAT Checkpoints (Davinci-v2 v1)
+
+| Dimension | RAT Checkpoints (v1) | MapQ (v2 BCC) |
+|-----------|---------------------|----------------|
+| Snapshots | 8 full SMAP copies (224 bits each, 1.75 Kb total) | 12 incremental entries (96 bits each, 1.125 Kb total) |
+| Recovery precision | Checkpoint captures SMAP at branch time; restore is exact | MapQ replays increments; same precision |
+| Flush cost | Full SMAP copy: 32 × 7-bit = 224 bits | MapQ replay: iterate up to 12 entries |
+| Storage efficiency | O(8 × SMAP_size) | O(12 × entry_size) |
+| Old ptag recovery | Via refcount | Via MapQ replay (undoes the rename that orphaned the ptag) |
+| T/U support | Separate Tile RAT | Separate Tile RAT (unchanged) |
+
+---
+
+## 13. Branch-Tag Flush Model (No BID Yet)
+
+Since this change point does not introduce block structure (BSTART/BSTOP/BID/BROB), flush uses **branch_tag** (3-bit, matching Davinci-v2's existing model):
+
+- Each in-flight branch is assigned a unique `branch_tag` from a pool of 8.
+- All IQ entries, MapQ entries, SSB entries, and STQ entries carry the `branch_tag` of the youngest unresolved branch ahead of them.
+- On branch resolve (correct): the tag is freed and propagated as a tag-clear event.
+- On branch mispredict: the tag becomes the **flush key** — every entry tagged with this branch (or any younger branch) is invalidated atomically.
+
+```
+# Branch-tag flush on mispredict:
+flush_tag  = mispredicted_branch.branch_tag
+flush_tags = {flush_tag} ∪ {branch_tag of every branch younger than the mispredicted one}
+
+for each IQ entry e:
+    if e.branch_tag in flush_tags:
+        e.valid = 0
+
+for each MapQ entry e:
+    if e.rid > flush_rid:    # Younger than mispredicted branch
+        e.valid = 0
+
+for each SSB entry e:
+    if e.branch_tag in flush_tags:
+        e.valid = 0
+
+for each STQ entry e:
+    if e.branch_tag in flush_tags:
+        e.valid = 0
+```
+
+The SSB (24 entries) and STQ (8 entries) are preserved from Davinci-v2 v1 and handle speculative memory side effects. Their flush behavior is unchanged: they keep the same tag-based invalidation, which is consistent with the MapQ flush model.
+
+---
+
+## 14. Key Parameters
+
+| Parameter | Davinci-v2 v1 | Davinci-v2 BCC |
+|-----------|---------------|----------------|
+| Fetch width | 4 | 4 |
+| Decode width | 4 | 4 |
+| Dispatch width | 4 | 4 |
+| **Scalar issue width** | **6** | **7** (alu_iq×4 + bru_iq×1 + lsu_iq×2) |
+| Commit width | 4 | 4 |
+| Physical GPRs | 128 (P0–P127) | 128 (P0–P127) |
+| `ptag_w` | 7 | 7 |
+| RAT checkpoints / MapQ depth | 8 | **12** |
+| **IQ topology** | 1 centralized Scalar RS (32 entries) | **3 physical IQs** |
+| Scalar RS/IQ depth | 32 | alu_iq: **48**, bru_iq: **16**, lsu_iq: **32** |
+| LSU RS/IQ depth | 24 | **32** |
+| Wakeup mechanism | CDB broadcast → 384 comparators | **Ready Table (128-bit bitmap) + 0 comparators** |
+| CDB comparators | 384 (32×2×6) | **0** |
+| Rename stages | D1–D2 (1 cycle) | **D1/D2/D3 (3 cycles)** |
+| Age encoding | `age[5:0]` per entry | **RID-based sub-head age (mod 64)** |
+| Fetch-to-WB latency | 12 cycles | **17+ cycles** |
+| IQ storage | ~1.7 KB | **~1.26 KB** |
+
+---
+
+## 15. Difference Summary: Davinci-v2 BCC vs. LinxCore BCC
+
+| Design Decision | LinxCore BCC | Davinci-v2 BCC | Reason |
+|----------------|-------------|----------------|--------|
+| Commit model | ROB-based in-order retire | **No ROB** — writeback → immediate commit | AI kernel envelope, no precise exceptions needed |
+| CMAP update | On ROB retire | **Not updated** — refcount handles freeing | Same reason |
+| T/U rename | FIFO push + CMAP commit on BSTOP | **Tile RAT** (separate domain, unchanged) | T regs are a distinct execution domain |
+| Flush target | `flush_bid` (64-bit BID) | **`flush_tag` (3-bit branch tag)** | No block structure yet |
+| BRU correction | Deferred to BSTOP commit | **Immediate** (EX1 detect → flush) | No BSTOP boundary model |
+| Block model | BSTART/BSTOP + BID + BROB | **Not introduced** (future change point) | Flat control flow for AI kernels |
+| Ready Table size | `pregs = 64` | **`pregs = 128`** (P0–P127) | Davinci-v2 has 128 physical GPRs |
+| Age bits | `rob_w = 6` (ROB slot) | **`rid_w = 6` (program order)** | No ROB; use RID instead |
+| Age wrap | `rob_w`-bit wrap | **6-bit wrap** | Same mechanism |
+| Checkpoint depth | 16 | **12** | 8→12 is the incremental change |
+
+---
+
+## 16. Implementation Phases
+
+### Phase 1: Core infrastructure (low risk)
+1. Split D2 rename into D1/D2/D3 three-stage pipeline
+2. Replace centralized Scalar RS with three physical IQs (alu_iq / bru_iq / lsu_iq)
+3. Implement Ready Table (128-bit bitmap) replacing CDB comparators
+4. Implement age-matrix issue picker using RID-based sub-head age
+5. Replace RAT checkpoints with MapQ (12-entry ring buffer)
+
+### Phase 2: Scheduling enhancements (depends on Phase 1)
+6. Extend issue width to 7-wide (alu_iq × 4 + bru_iq × 1 + lsu_iq × 2)
+7. Expand LSU IQ to 32 entries, ALU IQ to 48 entries
+8. Tune MapQ depth to 12 entries
+
+### Phase 3: Advanced features (future change points)
+9. Introduce BID/BROB block tracking (change point #2)
+10. Precise exception support (change point #3)
+
+---
+
+## 17. Risks and Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| 2-cycle wakeup latency vs. current 1-cycle | Ready Table lookup is O(1) bit-test; latency does not grow with issue width. AI kernels are compute-bound, not branch-bound. |
+| Age wrap handling at RID boundary | Mod-64 sub-head age formula handles wrap automatically; formal verification needed |
+| MapQ replay correctness | Formal proof that reverse replay restores exact SMAP state; test suite with mispredict injection |
+| IQ split increases dispatch complexity | S1/S2 stages encapsulate routing and allocation; parameterizable per IQ type |
+| Ready Table multi-port reads | 3 IQs read Ready Table combinationally (3 bit-tests per entry, not full-port reads); no multi-port SRAM needed |
+
+---
+
+## Appendix A: Pipeline Stage Mapping
+
+| Current Davinci-v2 Stage | New BCC Stage | Change |
+|--------------------------|---------------|--------|
+| F1–F2 (Fetch) | F0→F1→F2→F3→IB→F4 | Added F3 (BSTART annotation) and IB (sync buffer) |
+| D1–D2 (Rename) | **D1** (decode + RID/atag allocation) → **D2** (read SMAP, rename) → **D3** (write SMAP, IQ allocation) | Split into 3 stages |
+| DS (Dispatch) | **S1** (resource check) → **S2** (IQ entry write) | Split into 2 stages |
+| IS (Issue) | **P1** (age-matrix pick) → **I1** (RF read arbitration) → **I2** (confirm issue) | Split into 3 stages |
+| EX (Execute) | E1–EX_n | Unchanged |
+| WB (Writeback) | W1 (+ Ready Table update) | Unchanged (Ready Table update added) |
+
+**Net pipeline depth increase**: 5 cycles (D1/D2/D3 replaces D1/D2; S1/S2 replaces DS; P1/I1/I2 replaces IS).
+
+---
+
+## Appendix B: Key Design Equations
+
+```
+Ready Table lookup latency:   O(1)     (single bit-test per ptag)
+Age-matrix pick latency:      O(depth) per lane (cascaded compare, depth = iq_depth)
+Wakeup-to-pick cycle gap:     2 cycles (Ready Table reg → can_issue → pick)
+Age encoding:                 age = (entry.rid - head_rid) mod 64
+IQ storage:                   ~1.26 KB (alu_iq 558 B + bru_iq 190 B + lsu_iq 512 B)
+CDB comparator reduction:     384 → 0 (Ready Table replaces all)
+MapQ storage:                 12 × 96 b = 1,152 b = 144 bytes
+Total recovery storage:       ~1.4 KB (vs. RAT checkpoints ~2.2 KB)
+```
diff --git a/designs/outerCube/Davinci_BlockROB_v1.md b/designs/outerCube/Davinci_BlockROB_v1.md
new file mode 100644
index 00000000..df61c2bb
--- /dev/null
+++ b/designs/outerCube/Davinci_BlockROB_v1.md
@@ -0,0 +1,908 @@
+# Davinci-v2 Block-ROB — Instruction-Block Precise Exception Support (DSP-003)
+
+> **Document ID**: DSP-003
+> **Version**: v1.0
+> **Date**: 2026-05-02
+> **Status**: Draft
+> **Target**: `pyCircuit/designs/outerCube/Davinci_superscalar_v2.md`
+> **Change Point**: #3 — Add Block-ROB (instruction-block commit) for precise exception support
+> **Dependencies**: PR #8 (BCC scalar pipeline), PR #10 (VTG vector micro-instructions)
+
+---
+
+## Change Log
+
+| Version | Date | Changes |
+|---------|------|---------|
+| v1.0 | 2026-05-02 | Initial draft |
+
+---
+
+## 1. Motivation
+
+### 1.1 What the Current Design Cannot Do
+
+The current Davinci-v2 design (BCC scalar pipeline + VTG vector micro-instructions) implements a **ROB-less out-of-order execution model** with:
+
+- **MapQ** (12-entry ring buffer) for instruction-precise P-reg (scalar GPR) recovery
+- **Tile RAT checkpoints** (8 snapshots) for tile register recovery
+- **Branch-tag tracker** (8 tags + 8x8 ancestry bitmap) for speculation gating
+- **SSB** (24-entry) and **STQ** (8-entry) for speculative memory-side-effect gating
+- **Flash-restore** of RAT checkpoints + branch-tag CAM-clear for mispredict recovery
+
+This model is **explicitly limited** to **run-to-completion kernel execution**:
+
+> *"no precise architectural exceptions and no OS-level interrupts"*
+> — Davinci_superscalar_v2.md §11.1, §11.7
+
+Specifically, the current design cannot:
+
+1. **Identify the faulting instruction** on an exception (page fault, trap, system call)
+2. **Maintain precise architectural state** for interrupts or exceptions mid-execution
+3. **Support hardware breakpoints** or single-step debugging with precision
+4. **Provide in-order memory commit** for I/O semantics
+
+### 1.2 Why Block-ROB (Not a Flat ROB)
+
+Adding a full flat ROB would cost:
+- ~64 entries x ~200 b/entry = ~12.8 Kb
+- Complex priority encoder, age-tracking across 64+ entries
+- Wide commit width logic (complexity O(n) in entry count)
+- Retirement bandwidth bottleneck
+
+A **Block-ROB (BROB)** takes a different approach:
+
+1. **Organize instructions into blocks** — the compiler groups N dynamic instructions into an *instruction block* delimited by a `BSTART` marker at head and a `BSTOP` marker at tail.
+2. **Track blocks, not individual instructions** — the BROB manages block lifetimes (allocate → complete → retire), not per-uop commit.
+3. **Per-block in-order commit** — all instructions inside a block commit in program order, but the BROB only commits **one block at a time**, at block boundaries.
+4. **Precise exceptions at block granularity** — the faulting instruction's block is identified, all younger blocks are squashed.
+5. **Reuse existing infrastructure** — the MapQ already provides instruction-precise P-reg recovery; the SSB/STQ already gate memory effects; the branch-tag tracker already manages speculation depth.
+
+The key insight from LinxCore: **a flat ROB is overkill when block structure is already present in the ISA.** By lifting commit from instruction granularity to block granularity, the BROB achieves:
+
+- **Precise exception support** (block-granularity, sufficient for OS kernel entry)
+- **Store commit within blocks** (in-order, block-bounded)
+- **Reduced complexity** (O(block_count) not O(instruction_count))
+- **Natural fit with VTG** (a VTG micro-block is a natural BROB scheduling unit)
+
+### 1.3 Design Goals
+
+| Goal | Target |
+|------|--------|
+| Precise exception identification | Block-granularity (identify faulting block, squash younger blocks) |
+| Register recovery | Instruction-precise within faulting block via MapQ reverse replay |
+| Memory recovery | In-order block commit; SSB/STQ gates memory effects within block |
+| Store commit | In-order within block; stores retire from SSB only at block commit |
+| Branch mispredict | Continue using MapQ + RAT flash-restore + branch-tag CAM-clear |
+| Hardware cost | ~128-entry BROB + ~32-entry Block SSB + Block STQ + integration logic |
+| Compatibility | Continues to support existing BCC scalar pipeline and VTG micro-instructions |
+
+### 1.4 What Block-ROB Does NOT Provide
+
+Block-granularity precise exception means:
+- **Not instruction-precise**: if two instructions in the same block could fault differently, the block is the unit of identification.
+- **Not general-purpose OS interrupt support**: this is still an AI-kernel-oriented design; only kernel-entry traps (system calls, fatal page faults) are supported.
+- **Not single-step debugging precision**: within a block, instructions may have executed past a breakpoint before it is reported.
+
+This is loosely analogous to how a conventional ROB separates completion from commit: work may finish out of order, but architectural state changes only at the commit point. Here the block boundary is the *commit point*, not the exception point.
+
+---
+
+## 2. Block Structure
+
+### 2.1 Instruction Block Definition
+
+An **instruction block** is a contiguous sequence of decoded micro-operations bounded by:
+
+- **BSTART** (inclusive start): the first uop in the block
+- **BSTOP** (inclusive end): the last uop in the block
+
+The compiler generates block boundaries at natural control-flow join points:
+
+```
+BSTART              # block entry
+  <all scalar uops in block>
+  <all VTG micro-instructions>
+BSTOP               # block exit
+```
+
+**Block size** (default, configurable):
+- Minimum: 4 uops
+- Maximum: 64 uops
+- Typical AI kernel: 16-32 uops
+
+### 2.2 Block Type
+
+Each block carries a **block type** that determines its execution model:
+
+| Block Type | Scalar uops | Engine-backed | Notes |
+|------------|-------------|---------------|-------|
+| `STD` | Yes | No | Pure scalar execution |
+| `VTG` | Yes | VTG micro-instructions | VTG micro-block (one per VTG macro) |
+| `VEC` | No | Full-tile VEC-4K-v2 | T* tile operations |
+| `CUBE` | No | outerCube MXU | CUBE.OPA, CUBE.DRAIN |
+| `MTE` | Yes | Memory Tile Engine | TILE.LD, TILE.ST |
+
+A block may be **hybrid**: scalar uops + one VTG micro-block. A `VTG` block type indicates that the block contains VTG micro-instructions and the BROB must track the GVIQ sub-schedule within the block.
+
+### 2.3 Block ID (BID)
+
+Each block is assigned a **Block ID (BID)** at allocation:
+
+```
+BID[6:0]  -- BROB slot index (0..127)
+BID[63:7] -- Monotonically increasing sequence number (for uniqueness across wraps)
+```
+
+The 7-bit slot index (128 entries) maps directly to the BROB entry. The full-width BID is used for flush ordering: all queues and structures flush by `BID_prefix`, keeping entries with `BID <= flush_BID` and killing entries with `BID > flush_BID`.
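+
+The BID layout and flush rule can be sketched in a few lines of Python. This is an illustration only: it assumes the 7-bit slot index implied by `BROB_ENTRIES = 128` (the `BID_W` value in §3.1) and a sequence number that increments on every allocation; `make_bid`, `bid_slot`, and `survives_flush` are hypothetical helper names.
+
+```python
+BROB_ENTRIES = 128
+SLOT_W = BROB_ENTRIES.bit_length() - 1  # 7 bits for 128 entries
+SLOT_MASK = (1 << SLOT_W) - 1
+
+
+def make_bid(seq: int, slot: int) -> int:
+    """Concatenate {seq, slot}: sequence number above the slot index."""
+    assert 0 <= slot < BROB_ENTRIES
+    return (seq << SLOT_W) | slot
+
+
+def bid_slot(bid: int) -> int:
+    """Low bits of the BID select the BROB entry."""
+    return bid & SLOT_MASK
+
+
+def survives_flush(bid: int, flush_bid: int) -> bool:
+    """Keep entries with BID <= flush_BID; younger entries are killed."""
+    return bid <= flush_bid
+```
+
+Because the sequence number sits in the high bits, a block allocated after a slot wrap (slot 0 again) still compares as younger than the block in slot 127 that preceded it.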
+
+---
+
+## 3. BROB Structure
+
+### 3.1 BROB Parameters
+
+| Parameter | Value | Notes |
+|-----------|-------|-------|
+| `BROB_ENTRIES` | 128 | Power-of-2; matches branch-tag depth |
+| `BROB_ALLOC_PER_CYCLE` | 1 | One block allocated per cycle |
+| `BROB_COMPLETE_PER_CYCLE` | 1 | One block completed per cycle |
+| `BROB_RETIRE_PER_CYCLE` | 1 | One block retired per cycle |
+| `BID_W = log2(BROB_ENTRIES)` | 7 bits | Slot index in low bits of BID |
+| `BLOCKTYPE_W` | 4 bits | Block type encoding |
+
+### 3.2 Per-BROB-Entry State
+
+```
+BROBEntry {
+  valid:            1 b   -- entry is allocated
+  state:            2 b   -- ALLOC | ISSUED | COMPLETE | (implicit: retired via head advance)
+  block_uid:        64 b  -- unique block identifier
+  bid:              64 b  -- full-width BID (slot + sequence)
+  block_type:       4 b   -- STD | VTG | VEC | CUBE | MTE
+  head_rid:         7 b   -- RID of first uop in block (iROB index)
+  tail_rid:         7 b   -- RID of last uop in block
+  n_uops:           7 b   -- number of uops in block (0 at allocation, up to 64)
+  checkpoint_id:    4 b   -- which RAT checkpoint is active for this block
+  needs_scalar:     1 b   -- block has scalar uops (BSTOP must retire)
+  needs_engine:     1 b   -- block has engine ops (engine_done must arrive)
+  engine_done:      1 b   -- engine completion signal received
+  scalar_done:      1 b   -- BSTOP retired from iROB
+  has_exception:    1 b   -- exception detected within this block
+  exception_cause:  16 b  -- trap/exception cause code
+  fault_rid:        7 b   -- RID of faulting uop (if has_exception)
+  redirect_valid:   1 b   -- block requests a redirect on retire
+  redirect_pc:      32 b  -- redirect target PC
+  n_stores:         5 b   -- number of stores in this block
+  n_vtg_ops:        5 b   -- number of VTG micro-instructions
+  gviq_block_start: 7 b   -- GVIQ entry index for first VTG op in block
+}
+```
+
+**State machine:**
+
+```
+FREE --[allocate]--> ALLOC --[issued]--> ISSUED --[complete]--> COMPLETE
+                                                                   |
+                                                        [retire: advance head]
+                                                                   |
+                                                     FREE (slot reusable at tail)
+```
+
+Completion rule:
+```
+complete = scalar_done && (needs_engine ? engine_done : 1)
+```
+
+### 3.3 BROB Pipeline
+
+The BROB sits alongside the existing BCC scalar pipeline:
+
+```
+Fetch --> Decode (D1) --> Rename (D2/D3) --> Dispatch (S1/S2)
+                                       |
+                                       v
+                                  BROB Allocate
+                                       |
+                                       v
+                                  iROB Dispatch (D3)  <-- uops enter iROB here
+                                       |
+                                       v
+                                  Issue / Execute
+                                       |
+                          +-------------+-------------+
+                          |                           |
+                          v                           v
+                   Scalar Done                  Engine Done
+                   (BSTOP retires)              (GVIQ/Vector RS/Cube RS
+                          |                      signals completion)
+                          |                           |
+                          +-------------+-------------+
+                                        |
+                                        v
+                               BROB Complete Check
+                                        |
+                                        v
+                               BROB Retire (advance head)
+                                        |
+                                        v
+                               Squash / Redirect / Commit
+```
+
+---
+
+## 4. Instruction Block Lifecycle
+
+### 4.1 Block Allocation (D2/D3)
+
+At the **BSTART** uop:
+
+```
+1. Allocate BROB entry k from free pool (tail pointer)
+2. Set bid = {seq_num++, k[6:0]}  -- full-width BID
+3. Set block_type from BSTART metadata
+4. Set checkpoint_id = current RAT checkpoint (snapshot for this block)
+5. Set head_rid = RID allocated to this BSTART (its iROB slot)
+6. Set n_uops = 0
+7. Set n_stores = 0, n_vtg_ops = 0
+8. Set needs_scalar = 1, needs_engine = 0
+9. Set scalar_done = 0, engine_done = 0, has_exception = 0
+10. Push MapQ entry: {checkpoint_id, RID of BSTART, ...}
+11. Stamp all uops in block with bid (3 bits in iROB entry)
+12. BSTART retires immediately (no execute, no IQ entry)
+```
+
+### 4.2 Block Population (D3)
+
+For each subsequent uop in the block (until BSTOP):
+
+```
+1. Allocate iROB entry (existing BCC pipeline logic)
+2. Stamp iROB entry with bid (3 bits)
+3. Set iROB entry's brob_slot = k (index into BROB)
+4. Increment n_uops
+5. If is_store: increment n_stores; record SSB slot in block_store_slots[n_stores]
+6. If is_vtg_op: increment n_vtg_ops; record GVIQ entry in block_gviq_entries[n_vtg_ops]
+7. For scalar uops: enter IQ, execute, writeback normally (BCC pipeline)
+8. For VTG ops: enter GVIQ, execute, signal engine_done to BROB
+```
+
+### 4.3 Block Closure (BSTOP)
+
+At the **BSTOP** uop:
+
+```
+1. Set tail_rid = current iROB entry index
+2. Set needs_engine = (n_vtg_ops > 0) || (block_type == VEC) || (block_type == CUBE)
+3. BSTOP enters iROB but does NOT retire immediately
+4. BSTOP's retire is gated by BROB completion (see §4.4)
+```
+
+### 4.4 Scalar Completion (BSTOP Retire Gate)
+
+The iROB's **commit slot step** module is extended with a **BSTOP retire gate**:
+
+```
+BSTOP can retire when ALL of:
+  1. Engine work for BROB entry k is finished (needs_engine ? engine_done : 1)
+  2. No exception is pending in the block
+
+BSTOP retire flow:
+  1. Set scalar_done = 1 in BROB[k]
+  2. Check: complete = scalar_done && (needs_engine ? engine_done : 1)
+  3. If complete && !has_exception: mark BROB[k].state = COMPLETE
+  4. If complete && has_exception: trigger exception reporting (see §5.2)
+  5. Advance BROB head to k+1 (mod 128)
+```
+
+### 4.5 Engine Completion
+
+Engines signal completion to the BROB via a **completion bus**:
+
+```
+Engine completion message: {engine_done_valid, bid, trap_valid, trap_cause}
+  -- bid identifies the block
+  -- On arrival: BROB[bid_slot].engine_done = 1
+  -- If trap_valid: BROB[bid_slot].has_exception = 1, BROB[bid_slot].fault_rid = faulting_rid
+```
+
+The completion bus is shared with the existing TCB (Tile Completion Bus) infrastructure. Each engine (VEC-4K-v2, Cube, MTE LSU) already signals completion on the TCB. The BROB listens to the same bus and matches by `bid`.
+
+### 4.6 Block Retire
+
+Only the **oldest block** (head of BROB) can retire per cycle:
+
+```
+Retire rule (in order of priority):
+  1. If head has has_exception:
+       Report exception (see §5.2)
+       Do NOT commit side effects
+       Squash all younger blocks
+  2. Else if head is COMPLETE:
+       Commit side effects:
+         a. Advance SSB head past all stores in this block (drain allowed)
+         b. Advance STQ head past all tile stores in this block
+         c. Advance head pointer
+         d. Free BROB entry
+  3. Else: stall (wait for completion)
+```
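+
+The retire priority above, combined with the completion rule from §3.2, can be written as a small decision function. This is a sketch under stated assumptions: `BrobEntry`, `Action`, `is_complete`, and `retire_action` are hypothetical names, and only the four flags that the rule reads are modeled.
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+
+
+class Action(Enum):
+    EXCEPTION = "report exception, squash younger blocks"
+    COMMIT = "commit side effects, advance head"
+    STALL = "wait for completion"
+
+
+@dataclass
+class BrobEntry:
+    scalar_done: bool = False
+    engine_done: bool = False
+    needs_engine: bool = False
+    has_exception: bool = False
+
+
+def is_complete(e: BrobEntry) -> bool:
+    # complete = scalar_done && (needs_engine ? engine_done : 1)
+    return e.scalar_done and (e.engine_done if e.needs_engine else True)
+
+
+def retire_action(head: BrobEntry) -> Action:
+    """Decide what the BROB does with its head entry this cycle."""
+    if head.has_exception:   # priority 1: never commit a faulting block
+        return Action.EXCEPTION
+    if is_complete(head):    # priority 2: oldest block is done
+        return Action.COMMIT
+    return Action.STALL      # priority 3: keep waiting
+```
+
+Checking `has_exception` first means a block that is both complete and faulting is reported rather than committed, which matches the ordering of the rule.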
+
+---
+
+## 5. Precise Exception Mechanism
+
+### 5.1 Exception Classification
+
+Two classes of exceptions are handled:
+
+| Class | Source | Handling |
+|--------|--------|----------|
+| **Scalar exception** | ALU trap, divide-by-zero, illegal instruction | Detected at EX1, `trap_valid` set in iROB entry |
+| **Engine exception** | TILE.LD page fault, VTG memory fault | Engine signals via completion bus with `trap_valid` |
+
+### 5.2 Exception Reporting Flow
+
+When an exception is detected inside a block:
+
+```
+Step 1: Detection
+  Scalar: EX1 stage sets iROB[rid].trap_valid = 1
+  Engine: completion bus arrives with trap_valid=1, BROB marks has_exception=1
+
+Step 2: Blocking
+  BROB does NOT retire the block
+  BSTOP retire is gated: "can retire if !has_exception"
+  Block remains at BROB head (blocked)
+
+Step 3: Exception Identification
+  BROB[head].has_exception = 1
+  BROB[head].fault_rid = faulting_rid
+  BROB[head].exception_cause = trap_cause
+
+Step 4: Squash of Younger Blocks
+  flush_bid = BROB[head].bid  -- flush all blocks with bid > flush_bid
+  All BROB entries with bid > flush_bid are set valid = 0
+  All iROB entries with bid > flush_bid are invalidated (branch-tag CAM-clear extended to bid)
+  All GVIQ entries with bid > flush_bid are invalidated
+  All MapQ entries from flush_rid+1 onward are popped (MapQ reverse replay)
+  SSB entries for younger blocks: valid = 0
+  STQ entries for younger blocks: valid = 0
+
+Step 5: Register State Recovery
+  MapQ reverse replay, youngest entry first, back to the faulting RID:
+    for each MapQ entry from tail down to faulting RID:
+      undo SMAP write, restore orphan ptag, pop MapQ
+  Tile RAT: restore from checkpoint[BROB[head].checkpoint_id]
+
+Step 6: Exception Delivery
+  PC = BSTART_pc (of faulting block)
+  Cause = BROB[head].exception_cause
+  The faulting instruction's BSTART is the restart point.
+  The OS/kernel restores context and re-executes the block.
+```
+
+### 5.3 Instruction-Precise Recovery Within a Block
+
+Within a single block, MapQ already provides instruction-precise P-reg recovery:
+
+- MapQ entries are tagged with `RID` (instruction index)
+- On exception, MapQ is replayed in reverse from the **faulting RID** (not from the block boundary)
+- The `fault_rid` is captured from the iROB entry at exception detection
+
+This means: if two scalar uops in the same block could produce different exceptions, the **faulting RID** is precisely identified and MapQ replay recovers register state up to and including the faulting instruction.
+
+### 5.4 Example: Page Fault in Block
+
+```
+Block B: BSTART, uop0, uop1 (TILE.LD), uop2, BSTOP
+
+uop1 executes: TILE.LD triggers page fault on address X.
+  -- EX1: TILE.LD sets iROB[rid1].trap_valid = 1
+  -- LSU marks block B's BROB entry: has_exception = 1, fault_rid = rid1
+  -- BSTOP cannot retire (blocked on has_exception)
+
+Block B is at BROB head, blocked.
+
+Exception delivery:
+  flush_bid = B.bid  -- flush all blocks with bid > B.bid (none in this case)
+  MapQ replay from faulting RID (rid1) backward
+    -- Undo all SMAP writes from uop1, uop0 in reverse order
+  Tile RAT: restore from B.checkpoint_id
+  Report: PC = BSTART_PC, Cause = page_fault, fault_rid = rid1
+```
+
+---
+
+## 6. Store Commit Within Blocks
+
+### 6.1 Block Store Buffer (Block SSB)
+
+Each BROB entry carries a **Block SSB** descriptor that locates the block's stores in the shared Block SSB:
+
+```
+BROBEntry (extended with Block SSB):
+  ...
+  block_ssb_base:    5 b   -- index into global Block SSB RAM
+  n_stores_in_block: 5 b   -- number of stores in this block
+  block_ssb_valid:   1 b   -- stores in this block are in Block SSB
+```
+
+The **Block SSB** is a 32-entry structure shared across all BROB entries (using the `block_ssb_base` offset):
+
+```
+BlockSSBEntry {
+  addr:     40 b   -- cache-line address
+  data:     128 b  -- store data (full scalar register)
+  size:     3 b    -- 1/2/4/8 B
+  valid:     1 b
+  bid:       8 b    -- which block this store belongs to
+  ssb_idx:   5 b    -- index into scalar SSB (for forwarding)
+}
+```
+
+### 6.2 Block SSB Allocation
+
+At D3, when a scalar store uop is detected:
+
+```
+1. Allocate Block SSB entry j from free pool
+2. Set BlockSSB[j] = {addr=0, data=0, size=store_size, valid=1, bid=current_bid}
+3. Record j in BROB[current_bid_slot].block_ssb_entries[n_stores_in_block]
+4. Increment BROB[current_bid_slot].n_stores_in_block
+5. At EX1: compute address, fill BlockSSB[j].addr
+6. At EX2: fill BlockSSB[j].data from GPR writeback
+```
+
+### 6.3 Load Forwarding from Block SSB
+
+Loads within the same block forward from the Block SSB:
+
+```
+On load execution at EX1:
+  for each BlockSSB entry k in same block:
+    if BlockSSB[k].valid && addr_match(BlockSSB[k].addr, load_addr):
+      forward data from BlockSSB[k]
+      -- No need to check bid ordering: Block SSB only contains stores from this block,
+         and loads/stores in the same block are already program-ordered
+```
+
+### 6.4 Store Commit at Block Retire
+
+At block retire (when head is COMPLETE and !has_exception):
+
+```
+1. For each BlockSSB entry j belonging to retiring block:
+     -- Transfer BlockSSB[j] to the scalar SSB
+     SSB[ssb_idx].valid = 1
+     SSB[ssb_idx].addr = BlockSSB[j].addr
+     SSB[ssb_idx].data = BlockSSB[j].data
+     SSB[ssb_idx].btag = 0xFF  -- immediately non-speculative
+     SSB[ssb_idx].drain_rdy = 1   -- can drain immediately
+     BlockSSB[j].valid = 0  -- free Block SSB entry
+2. Advance BROB head
+3. Retire BSTOP: iROB[BSTOP_rid].done = 1
+```
+
+**Key property**: stores are committed in program order (block order = program order), and they become non-speculative at block retire. The scalar SSB drain pump then handles memory-write timing independently.
+
+### 6.5 MTE Tile Stores Within Blocks
+
+Tile stores (TILE.ST, TILE.SCATTER) are handled via the **Block STQ**:
+
+```
+BlockSTQEntry (per block):
+  base_addr:    40 b   -- base memory address
+  tile_phys:     8 b   -- source physical tile
+  size_log2:     3 b   -- tile size
+  valid:         1 b
+  bid:           8 b
+```
+
+At block retire:
+```
+For each BlockSTQ entry k in retiring block:
+  -- Transfer to STQ
+  STQ[stq_idx].valid = 1
+  STQ[stq_idx].btag = 0xFF  -- immediately non-speculative
+  STQ[stq_idx].drain_rdy = 1  -- can drain immediately
+```
+
+---
+
+## 7. Integration with Existing Infrastructure
+
+### 7.1 MapQ (Instruction-Precise P-Reg Recovery)
+
+MapQ is **fully reused** with no modification:
+
+- Each renamed destination still pushes a MapQ entry with `{arch_reg, old_ptag, new_ptag, RID, checkpoint_id}`
+- On exception: MapQ is replayed in reverse from `fault_rid` (captured from iROB entry)
+- On branch mispredict: MapQ is replayed from `flush_rid` (unchanged from BCC design)
+
+The BROB does not interfere with MapQ. MapQ provides instruction-precise recovery within a block; the BROB provides block-granularity exception identification.
+
+### 7.2 Branch-Tag Tracker
+
+The existing 8-entry branch-tag tracker is **fully reused**:
+
+- Each block is assigned a branch tag at BSTART (same as any branch)
+- The block's `bid` is stamped in all iROB entries within the block
+- On branch mispredict: the branch-tag tracker flushes all entries with `btag` matching the mispredicted branch or its descendants
+- BROB entries are flushed using the same rule: `bid > flush_bid`
+
+### 7.3 RAT Checkpoints
+
+The 8-entry RAT checkpoint store is **extended** so that each block references a checkpoint:
+
+- Each block's `checkpoint_id` references a RAT checkpoint
+- The checkpoint captures the full SMAP state at block entry
+- On exception: Tile RAT is restored from `BROB[head].checkpoint_id`
+- On branch mispredict: both Scalar RAT and Tile RAT are restored (unchanged)
+
+### 7.4 SSB and STQ (Scalar Memory and Tile Memory)
+
+The existing SSB and STQ are **extended with bid tagging**:
+
+```
+SSBEntry (extended):
+  ...
+  bid:     8 b   -- which block this store belongs to
+
+STQEntry (extended):
+  ...
+  bid:     8 b   -- which block this tile store belongs to
+```
+
+At D2 (store allocation):
+```
+SSB[idx].bid = current_bid  -- stamp with block ID
+STQ[idx].bid = current_bid
+```
+
+At block retire:
+```
+SSB entries with bid == retiring_bid:
+  SSB[idx].btag = 0xFF  -- become non-speculative
+  SSB[idx].drain_rdy = 1
+STQ entries with bid == retiring_bid:
+  STQ[idx].btag = 0xFF
+  STQ[idx].drain_rdy = 1
+```
+
+### 7.5 VTG Integration (GVIQ)
+
+The VTG micro-instruction path integrates with the BROB as follows:
+
+```
+Block allocation at BSTART (if block_type == VTG):
+  BROB[k].needs_engine = 1
+  BROB[k].gviq_block_start = current GVIQ head index
+
+During block execution:
+  GVIQ entries are stamped with bid
+  VTG micro-instructions execute via GVIQ rotation scheduler
+  Each completed VTG micro-op signals completion to GVIQ
+  GVIQ tracks: engine_done for the block = (all VTG ops in block completed)
+
+Engine completion at block boundary:
+  GVIQ checks: all VTG ops in block have completed
+  If yes: signal engine_done to BROB with bid
+  If no: engine_done remains 0, BSTOP cannot retire
+```
+
+The VTG path is **compatible** with the BROB because:
+- VTG micro-instructions already carry `block_id` (the VTG block identifier)
+- The GVIQ already tracks completion state per block
+- The BROB extends this to gate block retirement
+
+### 7.6 GVIQ Block Scheduling
+
+Within a VTG block, the GVIQ rotation scheduler operates as described in PR #10. The BROB integration adds one rule:
+
+```
+GVIQ issue gate (extended):
+  old rules: src_ready, iter_nonzero, VEC ALU available
+  NEW: block_complete = (BROB[bid_slot].engine_done || !BROB[bid_slot].needs_engine)
+  If !block_complete: VTG micro-op cannot retire from GVIQ
+```
+
+---
+
+## 8. Pipeline Integration
+
+### 8.1 Extended Pipeline Stages
+
+The Davinci-v2 BCC pipeline is extended with BROB stages:
+
+```
+F0 --> F1 --> F2 --> F3 --> IB --> F4 --> D1 --> D2 --> D3
+                                                          |
+                                                          v
+                                                     BROB Alloc
+                                                          |
+                                                          v
+S1 --> S2 --> P1 --> I1 --> I2 --> E1 --> EX_n --> W1
+                                                          |
+                                                   +------+------+
+                                                   |             |
+                                                   v             v
+                                             Scalar Done   Engine Done
+                                                   |             |
+                                                   +------+------+
+                                                          |
+                                                          v
+                                                    BROB Complete
+                                                          |
+                                                          v
+                                                    BROB Retire
+```
+
+### 8.2 BSTART / BSTOP in the Pipeline
+
+```
+BSTART uop:
+  D2/D3: Allocate BROB entry
+  S1:    Allocate iROB entry, stamp bid
+  EX:    Bypasses all execute (no IQ entry)
+  WB:    Bypasses writeback
+  Retire: Immediate (no gate) -- block allocation is the "commit" of BSTART
+
+BSTOP uop:
+  D2/D3: Close block (set tail_rid, needs_engine flags)
+  S1:    Allocate iROB entry, stamp bid
+  EX:    Bypasses execute
+  WB:    Bypasses writeback
+  Retire: GATED on BROB[bid_slot].COMPLETE
+```
+
+---
+
+## 9. Flush and Recovery
+
+### 9.1 Flush Trigger
+
+Flush is triggered by:
+1. **Branch mispredict** (existing branch-tag mechanism)
+2. **Exception in block** (new BROB mechanism)
+
+### 9.2 Flush Protocol
+
+```
+flush_bid = BROB[head].bid   -- flush all blocks with bid > flush_bid
+
+In parallel (1 cycle):
+  a) iROB: invalidate all entries with bid > flush_bid
+  b) BROB: set valid = 0 for entries with bid > flush_bid
+  c) GVIQ: invalidate entries with bid > flush_bid
+  d) IQ entries: invalidate entries with bid > flush_bid (branch-tag CAM-clear extended)
+  e) SSB: set valid = 0 for entries with bid > flush_bid
+  f) STQ: set valid = 0 for entries with bid > flush_bid
+  g) Block SSB: invalidate entries with bid > flush_bid
+  h) Block STQ: invalidate entries with bid > flush_bid
+  i) MapQ: pop entries from flush_rid+1 backward (undo SMAP writes)
+  j) Tile RAT: restore from checkpoint[BROB[flush_bid_slot].checkpoint_id]
+  k) Scalar RAT: flash-restore from checkpoint (unchanged)
+  l) Branch-tag tracker: free tags for flushed blocks
+  m) BROB tail: roll back to flush_bid+1 (reclaim flushed entries)
+```
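Steps (a) through (h) all apply the same `bid > flush_bid` filter to different structures. A minimal sketch of that filter follows, assuming `bid` values compare as plain integers (plausible if the full sequence-carrying BID is used, but an assumption here) and using a hypothetical dict layout for entries.

```python
def flush_younger(entries, flush_bid):
    """Invalidate every valid entry whose bid is younger than flush_bid.

    entries: list of dicts with 'bid' and 'valid' keys (hypothetical layout).
    Returns the number of entries invalidated.
    """
    flushed = 0
    for entry in entries:
        if entry["valid"] and entry["bid"] > flush_bid:
            entry["valid"] = False
            flushed += 1
    return flushed


# Toy contents for three of the structures named in the protocol.
structures = {
    "iROB": [{"bid": 4, "valid": True}, {"bid": 5, "valid": True}, {"bid": 6, "valid": True}],
    "SSB": [{"bid": 6, "valid": True}, {"bid": 7, "valid": True}],
    "GVIQ": [{"bid": 5, "valid": True}],
}
flush_bid = 5
counts = {name: flush_younger(entries, flush_bid) for name, entries in structures.items()}
```

In hardware these comparisons run in parallel in one cycle; the sequential loop here only models the outcome, not the timing.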
+
+### 9.3 Exception Delivery Protocol
+
+```
+On exception in block B at BROB head:
+  1. Block B is NOT retired (remains at head, blocked)
+  2. Flush all blocks with bid > B.bid
+  3. MapQ reverse replay from fault_rid backward
+  4. Tile RAT restore from B.checkpoint_id
+  5. Deliver exception:
+       EPC = BSTART_PC
+       Cause = B.exception_cause
+       TVAL = fault_rid_value (if applicable)
+  6. OS/kernel handler restores context and re-executes block B
+```
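Step 3 (MapQ reverse replay) can be illustrated with a short model. It assumes each MapQ entry records the RID, the `atag`, and the old and new `ptag`; that field layout is hypothetical, and RID wraparound is ignored for brevity.

```python
def mapq_reverse_replay(smap, mapq, oldest_rid_to_undo):
    """Pop MapQ entries youngest-first and restore the previous SMAP mapping.

    Stops at the first entry older than oldest_rid_to_undo.
    Returns the ptags released by the undone renames.
    """
    freed = []
    while mapq and mapq[-1]["rid"] >= oldest_rid_to_undo:
        entry = mapq.pop()
        smap[entry["atag"]] = entry["old_ptag"]  # undo the speculative write
        freed.append(entry["new_ptag"])
    return freed


# Speculative map after renaming ADD r1 and LD r4 (ptag numbers are made up).
smap = {1: 40, 4: 41}
mapq = [
    {"rid": 10, "atag": 1, "old_ptag": 12, "new_ptag": 40},  # ADD r1, ...
    {"rid": 11, "atag": 4, "old_ptag": 17, "new_ptag": 41},  # LD r4, ...
]
freed = mapq_reverse_replay(smap, mapq, oldest_rid_to_undo=10)
```

After the replay, `smap` again maps r1 and r4 to their pre-block ptags, which is the state the handler needs in order to re-execute the block from BSTART.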
+
+---
+
+## 10. Hardware Cost
+
+### 10.1 New Structures
+
+| Block | Size | Gate Count Estimate |
+|-------|------|-------------------|
+| BROB (128 entries) | 128 x ~120 b = ~15 Kb | ~150 K (state + control FSM) |
+| Block SSB (32 entries) | 32 x ~200 b = ~6.4 Kb | ~60 K |
+| Block STQ (16 entries) | 16 x ~100 b = ~1.6 Kb | ~20 K |
+| BID tagging in iROB entries (3 b x 64) | ~192 b | ~2 K |
+| BID tagging in GVIQ entries (3 b x 32) | ~96 b | ~1 K |
+| BID tagging in IQ entries (3 b x 96) | ~288 b | ~3 K |
+| BID tagging in SSB/STQ (8 b x 24/8) | ~256 b | ~2 K |
+| BROB allocate FSM + complete check | | ~20 K |
+| Exception delivery logic | | ~10 K |
+| **Total new structures** | ~24 Kb | **~268 K gate** |
+
+### 10.2 Comparison
+
+| Design | Exception Support | Cost |
+|--------|-----------------|------|
+| Davinci-v2 original (no ROB) | None (kernel-only) | ~110 K gate (SSB/STQ/branch-tag) |
+| Davinci-v2 + Block-ROB | Block-granularity precise | ~378 K gate total |
+| Flat 64-entry ROB (hypothetical) | Instruction-precise | ~500+ K gate (estimation) |
+
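+Block-ROB delivers block-granularity precise exceptions, with `fault_rid` identifying the faulting instruction inside the block, at roughly 75% of the estimated gate cost of a flat 64-entry ROB (~378 K vs. ~500+ K gates).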
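The totals in §10.1 and §10.2 can be cross-checked with a few lines of arithmetic. The figures below are copied from the two tables; only the dictionary keys are shortened.

```python
# Gate-count estimates from the §10.1 table, in thousands of gates.
new_structures_kgates = {
    "BROB": 150,
    "Block SSB": 60,
    "Block STQ": 20,
    "BID tags (iROB)": 2,
    "BID tags (GVIQ)": 1,
    "BID tags (IQ)": 3,
    "BID tags (SSB/STQ)": 2,
    "BROB allocate FSM + complete check": 20,
    "Exception delivery logic": 10,
}

baseline_kgates = 110   # Davinci-v2 original (SSB/STQ/branch-tag), from §10.2
flat_rob_kgates = 500   # lower bound of the hypothetical flat-ROB estimate

new_total = sum(new_structures_kgates.values())
with_brob = baseline_kgates + new_total
cost_ratio = with_brob / flat_rob_kgates
```

Because the flat-ROB figure is given as "500+", the computed ratio is an upper bound on the relative cost.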
+
+---
+
+## 11. Integration with Davinci_superscalar_v2.md
+
+### 11.1 Sections to Update
+
+| Section | Update |
+|---------|--------|
+| §1 Key Parameters | Add BROB, Block SSB, Block STQ parameters |
+| §3 Block Diagram | Add BROB, Block SSB, Block STQ blocks |
+| §4 Pipeline | Add BROB Alloc stage; BSTOP retire gate |
+| §6 Decode & Rename | BSTART/BSTOP handling, BID allocation |
+| §7 Dispatch & Issue | Block SSB allocation, GVIQ block scheduling |
+| §8.1 Scalar Unit | BSTOP retire gate in commit logic |
+| §10 OoO Model | Add BROB to core principles |
+| §10.3 Rename | Block checkpoint_id integration |
+| §10.6 Branch Recovery | BROB flush protocol |
+| §11 Speculation Recovery | Replace exception "out of scope" with Block-ROB model |
+| §11.7 | Remove "precise exceptions out of scope" statement |
+| §11.8 Comparison | Add Block-ROB to comparison table |
+
+### 11.2 New Sections to Add
+
+| New Section | Content |
+|-------------|---------|
+| §11.X BROB — Block Reorder Buffer | Full BROB specification |
+| §11.X Block SSB — In-Block Store Commit | Block SSB allocation, commit protocol |
+| §11.X Precise Exception Delivery | Exception flow, MapQ replay, OS entry |
+
+---
+
+## 12. Open Questions
+
+| ID | Question | Priority |
+|----|----------|----------|
+| OQ-1 | Should VTG micro-blocks be treated as a separate block or sub-block? A VTG macro contains multiple VTG micro-ops; should the BROB track at the VTG macro level or at the individual VTG micro-op level? | High |
+| OQ-2 | How does BROB interact with the existing VTG GVIQ rotation scheduler? Should GVIQ issue be gated by `BROB[bid_slot].engine_done`? | High |
+| OQ-3 | For mixed blocks (scalar + VTG + VEC), does `engine_done` require ALL engines to complete, or can scalar and engine complete in parallel before retire? | High |
+| OQ-4 | How many checkpoint entries are needed? Each block uses one checkpoint. With 128 BROB entries and 8 checkpoints, the maximum nesting depth is 8. Is this sufficient? | Medium |
+| OQ-5 | Should block retirement commit SSB entries immediately (drain_rdy=1) or queue them for the existing SSB drain pump? | Medium |
+| OQ-6 | How does the BROB handle interrupts (external interrupts, not exceptions)? Interrupts are asynchronous and may arrive mid-block. | Low |
+| OQ-7 | Should there be a "block fence" instruction that forces a block boundary for interrupt latency reasons? | Low |
+
+---
+
+## Appendix A: Comparison with LinxCore Block-ROB
+
+| Aspect | LinxCore Block-ROB | Davinci-v2 Block-ROB |
+|--------|--------------------|----------------------|
+| ISA mandate | LinxISA requires BSTART/BSTOP at every control-flow join | Optional; compiler generates blocks at natural join points |
+| Block types | 10 types (STD, FP, SYS, MPAR, MSEQ, VPAR, VSEQ, TMA, CUBE, TEPL) | 5 types (STD, VTG, VEC, CUBE, MTE) |
+| Engine completion binding | `scalar_done && engine_done` via signal-level wire | Completion bus + bid matching |
+| iROB width | Banked (8-entry banks, 64 total) | Unbanked (64 entries, shared) |
+| Block size | Variable (macro-op boundary) | 4-64 uops (configurable) |
+| MapQ analog | None (instruction-precise via per-uop checkpoint) | MapQ provides instruction-precise P-reg recovery within block |
+| Store commit | In-order within block; drain through committed-store path | Block SSB transfers to SSB at block retire |
+| Branch correction | Epoch-gated deferred correction | Unchanged from BCC (MapQ reverse replay) |
+| Exception granularity | Block-granularity | Block-granularity with fault_rid for instruction precision |
+| BID encoding | 64-bit BID (slot + sequence) | 64-bit BID (slot + sequence) |
+
+---
+
+## Appendix B: Worked Example
+
+### B.1 Scalar Block with Exception
+
+```
+Block B: BSTART, u0 (ADD r1, r2, r3), u1 (LD r4, [r5]), u2 (MUL r6, r4, r7), BSTOP
+
+Assume u1 triggers a TLB miss / page fault.
+
+Step 1: BSTART at D2
+  BROB tail = 5, allocate entry BROB[5]
+  bid = {seq=17, slot=5} = 0x11_05
+  checkpoint_id = 3
+  head_rid = iROB head
+
+Step 2: u0, u1, u2 dispatch through iROB
+  Each stamped with bid[7:0] = 5 (3 bits)
+  u0 executes normally, writes back
+  u1 (LD r4, [r5]) executes: TLB miss detected
+    LSU sets iROB[rid1].trap_valid = 1
+    LSU sets BROB[5].has_exception = 1
+    LSU sets BROB[5].fault_rid = rid1
+    LSU sets BROB[5].exception_cause = PAGE_FAULT (0x0F)
+
+Step 3: BSTOP reaches iROB head
+  BSTOP retire is gated: !BROB[5].has_exception = FALSE
+  BSTOP cannot retire. Block B is blocked at head.
+
+Step 4: Exception delivery (next cycle)
+  flush_bid = BROB[5].bid  -- no younger blocks in this case
+  MapQ replay from fault_rid (rid1) backward:
+    undo SMAP writes from u1 (LD r4), u0 (ADD r1)
+    Restore ptags: r4_old --> SMAP[r4], r1_old --> SMAP[r1]
+    Pop MapQ entries down to BSTART
+  Tile RAT restore from checkpoint_id = 3
+  Deliver exception:
+    EPC = BSTART_PC
+    Cause = PAGE_FAULT
+    TVAL = faulting address from r5
+  OS handler restores context, re-executes from BSTART
+
+Step 5: After handler returns
+  Block B is re-fetched and re-executed
+  BROB[5] is re-allocated (seq number incremented)
+```
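The BID value in Step 1 follows from concatenating the sequence number and the slot index. A minimal sketch, assuming the slot occupies the low 8 bits and the sequence the upper 56 bits (consistent with `0x11_05` and `bid[7:0] = 5` above); the helper names are hypothetical.

```python
SLOT_BITS = 8
SEQ_BITS = 56


def pack_bid(seq, slot):
    """Concatenate {seq, slot} into a 64-bit block ID."""
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot out of range")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("seq out of range")
    return (seq << SLOT_BITS) | slot


def unpack_bid(bid):
    """Split a block ID back into (seq, slot)."""
    return bid >> SLOT_BITS, bid & ((1 << SLOT_BITS) - 1)


bid = pack_bid(seq=17, slot=5)        # the value from Step 1
realloc_bid = pack_bid(seq=18, slot=5)  # Step 5: same slot, incremented seq
```

With this layout the re-allocated block in Step 5 compares as younger than the original even though it reuses slot 5.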
+
+### B.2 Branch Mispredict Within Block
+
+```
+Block B: BSTART, u0 (ADD r1, r2, r3), u1 (setc.eq r4, r1, r0), u2 (BEQ target), BSTOP
+
+Assume u2 (BEQ) is predicted taken but actually not taken (mispredict at EX1).
+
+Step 1: u2 mispredicts at EX1
+  Branch-tag tracker: mispredict signal for tag t
+  MapQ reverse replay from flush_rid = u2's RID
+    undo SMAP writes from u2, u1, u0
+  RAT flash-restore from checkpoint_id = 3
+  Branch-tag CAM-clear: invalidate iROB entries with btag = t or descendant
+  SSB invalidation: entries with btag = t or descendant
+  Fetch redirect to correct PC (fall-through)
+
+Step 2: BROB[5] is NOT directly affected by the branch-tag mechanism
+  The branch mispredict squashes all entries with btag = t
+  BROB entries are stamped with bid, not btag
+  BROB[5] is not invalidated by the branch-tag mechanism
+  Block B is squashed because its iROB entries were invalidated
+
+Step 3: Block B squashed
+  BROB[5].valid = 0  (flushed as part of the mispredict)
+  MapQ already replayed
+  BSTOP never retired; BROB tail advances past BROB[5]
+
+Step 4: New block B' starts at redirect target
+  BSTART' allocated at BROB[6]
+  BSTOP' closes the correct block
+```
+
+**Key point**: the BROB is **not** invalidated by the branch-tag mechanism. Blocks are invalidated because their constituent iROB entries are invalidated (the uops are gone). The BROB entry is invalidated as a side effect of the fetch redirect.
+
+### B.3 VTG Block with Multiple VTG Micro-Ops
+
+```
+Block B: BSTART, u0 (VTG_VADD T0.g0, T1.g0, T2.g0), u1 (VTG_VMUL T3.g1, T0.g0, T2.g1), BSTOP
+
+Block type: VTG
+needs_engine = 1 (GVIQ engine)
+n_vtg_ops = 2
+
+Step 1: BSTART at D2
+  BROB[5] allocated, needs_engine = 1, engine_done = 0
+  GVIQ entries allocated for u0, u1
+
+Step 2: Scalar side completes immediately
+  BSTART completes at allocation; the block contains no other scalar uops
+  scalar_done = 1 (trivially satisfied; BSTOP itself retires in Step 5)
+
+Step 3: VTG micro-ops execute via GVIQ rotation
+  u0 (VTG_VADD): GVIQ picks, executes, signals completion to BROB[5]
+    engine_done_partial |= (u0 completed)
+  u1 (VTG_VMUL): GVIQ picks, executes, signals completion to BROB[5]
+    engine_done_partial |= (u1 completed)
+
+Step 4: BROB completion check
+  After both VTG ops complete:
+    BROB[5].engine_done = 1
+    complete = scalar_done && engine_done = 1 && 1 = 1
+    BROB[5].state = COMPLETE
+
+Step 5: Block retire
+  If no exception: commit VTG writes (Group Write Adapter RMW for each VTG op)
+  Advance BROB head
+  Retire BSTOP
+```
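Steps 3 to 5 above amount to counting engine completions against `n_vtg_ops` and then evaluating the completion condition. The sketch below is one way to model that; the `vtg_ops_done` counter is hypothetical, and combining `scalar_done` with the `needs_engine` bypass from §7.6 into a single expression is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BrobEntry:
    needs_engine: bool
    n_vtg_ops: int = 0
    vtg_ops_done: int = 0      # hypothetical per-block completion counter
    scalar_done: bool = False
    engine_done: bool = False

    def vtg_op_completed(self):
        """Called once per completed VTG micro-op of this block."""
        self.vtg_ops_done += 1
        if self.vtg_ops_done == self.n_vtg_ops:
            self.engine_done = True

    @property
    def complete(self):
        # Blocks without engine work only wait for the scalar side.
        return self.scalar_done and (self.engine_done or not self.needs_engine)


entry = BrobEntry(needs_engine=True, n_vtg_ops=2, scalar_done=True)
entry.vtg_op_completed()          # u0 (VTG_VADD) finishes
after_first = entry.complete
entry.vtg_op_completed()          # u1 (VTG_VMUL) finishes
after_second = entry.complete

scalar_only = BrobEntry(needs_engine=False, scalar_done=True)
```

BSTOP stays gated after the first completion and becomes eligible to retire only after the second.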
diff --git a/designs/outerCube/Davinci_superscalar_v2.md b/designs/outerCube/Davinci_superscalar_v2.md
new file mode 100644
index 00000000..a9067dbc
--- /dev/null
+++ b/designs/outerCube/Davinci_superscalar_v2.md
@@ -0,0 +1,3787 @@
+# Davinci Out-of-Order Processor Core — v2
+
+## 1. Overview & Design Philosophy
+
+The Davinci-v2 core is a **single-threaded, 4-wide, out-of-order, speculatively-executing** processor targeting AI inference, HPC, and dense linear algebra workloads. It executes a unified instruction stream containing four instruction domains — scalar, vector, cube (matrix), and memory-tile-engine (MTE) — on a shared front-end with distributed back-end execution units.
+
+**v2 inherits the v1 baseline ([`Davinci_supersclar.md`](Davinci_supersclar.md)) and adds three architectural enhancements:**
+
+1. **TRegFile-4K with per-port `is_transpose` reads** (see [`tregfile4k.md`](tregfile4k.md) §7). Each read port now carries a 1-bit `is_transpose` flag latched at the epoch boundary alongside `reg_idx`. When asserted, the port delivers the **chunk-grid transpose** of the addressed 4 KB tile at full **512 B/cy** for the entire 8-cycle epoch — bank-conflict-free under the diagonal skew, with no SRAM duplication, no extra latency, and no dedicated transpose buffer. This eliminates the v1 `TILE.TRANSPOSE` predecessor for most use cases and enables the vector unit's per-beat tilelet-transpose mechanism (§8.3).
+2. **Vector unit upgraded to VEC-4K-v2** (see [`vector4k_v2.md`](vector4k_v2.md)). Major changes vs. v1's vector unit:
+   - Up to **3 tile operands** per instruction: two value tiles (`A`, `B`) plus one **per-element bitmask tile** (`C`).
+   - Up to **2 tile results** per instruction (`D0`, `D1`) — value+index, quotient+remainder, etc.
+   - **Per-element predication / masking** on every elementwise op, every reduction, and every gather/sort — at zero fetch-phase cost in the common case.
+   - **Tile-register metadata** (32 b: `shape.x`, `shape.y`, `format`) carried alongside each 4 KB tile.
+   - **SRAM-based staging registers** (`SA`, `SB`, `SC`) decouple TRegFile fetch cadence from compute pipeline; per-beat microcode dispatches `{src, strip, tilelet_xpose}` per ALU operand.
+   - **Restored narrow formats**: FP8 (E4M3/E5M2), FP4 (MXFP4/HiFP4) joining FP32, FP16, BF16.
+   - **Three new PTO instructions** natively enabled by the v2 datapath: `TINV` (matrix inverse up to 128×128 FP32 / 16-tile range), `TROWRANGE_MUL` (column-wise product over a dynamic row sub-range), `TMRGSORT` (full-tile mergesort over any `N = 2^p` up to 8192 via a reconfigurable 256-lane shuffle + compare-swap primitive).
+3. **Speculative out-of-order execution** with a **ROB-less recovery scheme** that nonetheless guarantees architectural state is never corrupted by a misspeculated path (§11). The mechanism extends the v1 RAT-checkpoint + reference-counting infrastructure with a **branch-tagged speculative store buffer** for scalar memory and a **speculative tile-store queue** for MTE memory writes — both of which gate visible side effects until the producing branch tag becomes non-speculative. Section 11 walks through why this is sufficient without a Reorder Buffer, what it costs in area / latency, and which workloads it can and cannot serve correctly.
+
+> **Design discipline:** The v2 core assumes **run-to-completion kernel execution** with **no OS-level interrupts** — the same envelope as v1. The new v2.3 **Block-ROB (BROB)** adds **block-granularity precise exception support**, enabling the core to identify the faulting instruction block and recover precisely when an exception (trap, page fault, illegal instruction) does occur. The new speculation-recovery mechanism handles **branch mispredictions** and **variable-latency tile ops**; Section 11.7 enumerates the remaining "non-recoverable" classes (asynchronous page faults, signaling NaNs, ECC errors observed mid-kernel) and the kernel-level conventions that bound them.
+
+### 1.1 Key Parameters (v2 deltas in **bold**)
+
+| Parameter | Value |
+|-----------|-------|
+| Scalar ISA width | **64-bit** RISC (ARM / RISC-V style), unchanged |
+| Architectural GPRs | **32** (X0–X31), 64-bit |
+| Physical GPRs | **128** (P0–P127), 64-bit |
+| Architectural tile regs | **32** (T0–T31), 4 KB each |
+| Physical tile regs | **256** (PT0–PT255) in TRegFile-4K |
+| **TRegFile-4K read ports** | **8R, each with `is_transpose` bit** (§9.2) |
+| **Per-tile metadata** | **32 b** (shape.x, shape.y, format) §9.2.1 |
+| Fetch / decode width | **4** instructions / cycle |
+| Scalar issue width | **7** (4 ALU + 1 MUL/DIV + 1 BRU from alu_iq; 2 LSU from lsu_iq) |
+| **Vector issue width** | **1 VEC-4K-v2 instruction / cycle** (§8.3) |
+| Cube issue width | **1** CUBE instruction / cycle |
+| MTE issue width | **2** TILE.LD/ST per cycle |
+| Pipeline depth (scalar) | **17+** stages (fetch-to-writeback, with D1/D2/D3 rename and P1/I1/I2 issue separation) |
+| Branch predictor | Hybrid TAGE + BTB + RAS |
+| **MapQ depth** | **12** entries (speculative rename increment log; instruction-precise recovery via reverse replay; replaces RAT checkpoint snapshots) |
+| **Branch tag width** | **3 b** (matches checkpoint count); attached to every in-flight RS / store-buffer / tile-store-queue entry |
+| Physical IQ entries | Scalar ALU: **48**, Scalar BRU: **16**, LSU: **32**, Vector: **24**, Cube: **4**, MTE: **16** |
+| **GVIQ (Grouped Vector IQ) entries** | **32** entries; 1 VTG micro-op / cycle; entry prefix: block_id + pc_index + group_id + iter0..iter3 |
+| **VTG (Vector Thread Group) count** | **16** × 256 B VTGs / tile (`G256`) or **8** × 512 B VTGs / tile (`G512`) |
+| **Micro-instruction buffer depth** | **16** entries; shared by all VTGs in a tile group; max **64** micro-ops per block |
+| **SIMD lanes per VTG beat** | **128** (FP32) · **256** (FP16/BF16) · **512** (FP8) · **1024** (FP4) |
+| **Speculative store buffer entries** | **24** (was 16 in v1; widened to absorb branch-tag gating §11.4) |
+| **Speculative tile-store queue** | **8** entries (branch-tag-gated, MTE-side §11.5) |
+| L1-I cache | 64 KB, 4-way, 64 B line |
+| L1-D cache | 64 KB, 4-way, 64 B line, non-blocking (8 MSHRs) |
+| L2 cache (core-private) | 512 KB, 8-way, 64 B line |
+| Cube MXU | 4096 base MACs, 8 banks, dual-mode A/B |
+| Clock target | ≥ **1.5 GHz** (5 nm) |
+| **Peak FP32 throughput (vec)** | **0.77 TFLOPS** (1 tile / 8 cy at 1.5 GHz, 128-lane FMA) |
+| **Peak FP4 throughput (vec)** | **6.14 TFLOPS** (4× SIMD per group) |
+| Peak FP16 throughput (cube) | **12.3 TFLOPS** |
+| Peak FP8 throughput (cube) | **24.6 TOPS** |
+| Peak MXFP4 throughput (cube) | **98.3 TOPS** |
+| **BROB entries** | **128** (Block Reorder Buffer; tracks instruction block lifetimes for precise exceptions; §11.11) |
+| **Block SSB entries** | **32** (in-block scalar store buffer; §11.11) |
+| **Block STQ entries** | **16** (in-block tile-store buffer; §11.11) |
+| **BID width** | **8 b** slot index + 56 b sequence (64 b total); §11.11 |
+
+### BCC-Style Scalar Pipeline Deltas (v2 BCC overlay)
+
+The following parameters supersede the corresponding v1/v2 entries above when the BCC scalar pipeline is enabled.
+
+| Parameter | Value |
+|-----------|-------|
+| **Scalar rename pipeline** | D1 (decode + RID/atag allocation) → D2 (SMAP read + ptag allocation + MapQ push) → D3 (rename complete + Ready Table init) |
+| **atag** | Architectural register index (0–31 for GPRs), replaces "architectural GPR" terminology |
+| **ptag** | Physical register index (P0–P127), replaces "physical GPR" / "P-reg" terminology |
+| **Rename tables** | CMAP (committed map, 32×7 b) + SMAP (speculative map, 32×7 b) + MapQ (12-entry ring buffer) |
+| **Physical IQ topology** | 3 separate physical IQs: `alu_iq` (48 entries, 4-wide issue), `bru_iq` (16 entries, 1-wide), `lsu_iq` (32 entries, 2-wide) |
+| **Wakeup mechanism** | **Ready Table** (128-bit bitmap; O(1) ptag lookup) — replaces CDB comparator arrays |
+| **Issue picker** | **Age-matrix cascaded pick** using RID-based sub-head age: `age = (entry.rid − head_rid) mod 64`; purely combinational, no per-entry age field |
+| **Issue stages** | P1 (age-matrix pick) → I1 (RF read-port arbitration) → I2 (confirm issue + IQ deallocation) |
+| **Wakeup latency** | 2 cycles (Ready Table register → can_issue → pick → RF read) |
+| **Flush model** | MapQ reverse replay from `flush_rid` → SMAP restored to exact CMAP state; branch-tag CAM-clear on physical IQs |
+| **VTG execution mode** | Full-tile `T*` + VTG `V*` micro-ops share VEC-4K-v2 ALU; GVIQ (32 entries) handles VTG separately from Vector RS (24 entries) |
+| **VTG scheduling** | GVIQ rotation scheduler picks oldest-ready VTG; `block_id` → micro-instruction buffer lookup; Group Read/Write Adapters select VTG sub-ranges |
+| **SIMD lane count** | 128 lanes/beat (FP32), 256 (FP16/BF16), 512 (FP8), 1024 (FP4) |
+
+---
+
+### BCC-Style Vector Pipeline Deltas (v2.2 VTG overlay)
+
+The following parameters describe the VTG vector micro-instruction overlay on top of the VEC-4K-v2 datapath.
+
+| Parameter | Value |
+|-----------|-------|
+| **VTG modes** | `G256` (16×256 B VTGs/tile) · `G512` (8×512 B VTGs/tile) |
+| **GVIQ depth** | 32 entries, 1-wide issue |
+| **Micro-instruction buffer** | 16 entries (2-way set assoc), max 64 micro-ops/block |
+| **VTG Metadata Table** | 16 entries / physical tile |
+| **VTG Ready Table** | 256-bit bitmap (one bit per PT0..PT255) |
+| **Loop counters per GVIQ entry** | 4 × 16-bit (`iter0..iter3`) |
+| **Group adapters** | Group Read Adapter (TRegFile → VTG sub-range), Group Write Adapter (VTG → TRegFile) |
+| **Paired G256 issue** | Optional: 2 independent 256 B VTGs share one 512 B SIMD group beat |
+
+---
+
+## 2. ISA Summary
+
+The v2 ISA is a strict superset of v1: every v1 opcode encodes identically and behaves identically. v2 adds:
+
+- **Masked variants** of every elementwise vector op, every reduction, and every gather (encoded by a bit in `funct7`).
+- **Three new PTO instructions** (§2.2.6).
+- **A new tile-metadata setter** `TSETMETA` (§2.2.7).
+- **Branch hint bits** in the conditional-branch encoding for static prediction override (§5.2.4).
+
+### 2.1 Scalar ISA
+
+> **(v1 → v2: content unchanged; the following is copied in full from v1 §2.1.)**
+
+A 64-bit RISC instruction set with ARM / RISC-V style operations.
+
+| Category | Instructions | Operands | Latency (cycles) |
+|----------|-------------|----------|-------------------|
+| Integer ALU | ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, MOV | 2 src GPR, 1 dst GPR | 1 |
+| Immediate ALU | ADDI, ANDI, ORI, XORI, SLLI, SRLI, SRAI, LUI | 1 src GPR + imm, 1 dst GPR | 1 |
+| Multiply | MUL, MULH, MULHU | 2 src GPR, 1 dst GPR | 4 (pipelined) |
+| Divide | DIV, DIVU, REM, REMU | 2 src GPR, 1 dst GPR | 12–20 (non-pipelined) |
+| Compare & branch | BEQ, BNE, BLT, BGE, BLTU, BGEU | 2 src GPR + offset | 1 (resolve) |
+| Jump | JAL, JALR | 1 src GPR + offset, 1 dst GPR | 1 |
+| Load | LB, LH, LW, LD, LBU, LHU, LWU | 1 src GPR + offset, 1 dst GPR | 4 (L1 hit) |
+| Store | SB, SH, SW, SD | 2 src GPR + offset | 4 (L1 hit) |
+| System | FENCE, NOP, HALT | — | varies |
+
+**Architectural registers:** X0 (hardwired zero) through X31, plus a program counter (PC). Condition flags are not used; branches compare register values directly (RISC-V style).
+
+**Encoding (32-bit):**
+
+```
+  31       25 24  20 19  15 14  12 11   7 6     0
+ ┌──────────┬──────┬──────┬──────┬──────┬────────┐
+ │  funct7  │  rs2 │  rs1 │funct3│  rd  │ opcode │  R-type
+ └──────────┴──────┴──────┴──────┴──────┴────────┘
+
+ ┌─────────────────┬──────┬──────┬──────┬────────┐
+ │    imm[11:0]    │  rs1 │funct3│  rd  │ opcode │  I-type
+ └─────────────────┴──────┴──────┴──────┴────────┘
+```
+
+**v2 delta (§5.2.4):** A new optional 1-bit `H` (hint) field in the conditional-branch funct3 encoding lets the compiler suggest static taken/not-taken when the dynamic predictor has no entry. The predictor still has the final say once it has trained: `H` is consulted only on a TAGE/BTB miss. v1 software runs on v2 unmodified because the `H` bit defaults to 0 (no hint) when assembled by a v1-targeted compiler.
+
+### 2.2 Vector ISA — VEC-4K-v2
+
+The vector unit consumes **4 KB tile registers** (`T0–T31`, renamed by the Tile RAT to physical tile slots `PT0–PT255` in TRegFile-4K). All vector instructions are 32-bit fixed-width, with the same R/S/T/U-type encoding skeleton as v1 (§2.2.2 of v1).
+
+**v2 changes:**
+
+#### 2.2.1 Tile metadata (`shape.x`, `shape.y`, `format`)
+
+Every physical tile register carries a **32-bit metadata word** alongside its 4 KB payload (§9.2.1):
+
+```
+  ┌────────────┬────────────┬───────────┬─────────────────────┐
+  │ shape.x    │ shape.y    │ format    │ flags / reserved    │
+  │ [13:0]     │ [27:14]    │ [31:28]   │                     │
+  └────────────┴────────────┴───────────┴─────────────────────┘
+```
+
+| Field | Width | Range | Meaning |
+|-------|-------|-------|---------|
+| `shape.x` | 14 b | 1 … 8192 | Number of **columns `C`** (logical row length). Power-of-two only. |
+| `shape.y` | 14 b | 1 … 8192 | Number of **rows `R`**. Power-of-two only. |
+| `format`  | 4 b  | see below | Logical element format. |
+| `flags`   | (overlay) | — | `arg_tile`, `scalar_tile`, `prefetch_hint` (microcode-encoded). |
+
+Legality: `shape.x · shape.y · E = 4096`, where `E` is bytes/element from the format table:
+
+| `format` code | Logical name | `E` (bytes) | Elements/tile |
+|----------------|--------------|-------------|---------------|
+| `0b0000` | FP32 / INT32 | 4 | 1024 |
+| `0b0001` | FP16 / BF16  | 2 | 2048 |
+| `0b0010` | FP8 (E4M3 / E5M2) | 1 | 4096 |
+| `0b0011` | FP4 (MXFP4 / HiFP4) | 0.5 | 8192 |
+| `0b01xx`–`0b11xx` | reserved | — | — |
+
+Metadata is written **only** by the producing instruction (implicit at retire) or the explicit `TSETMETA` op (§2.2.7). It cannot change while the tile is the source of any in-flight fetch.
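The legality rule and the field layout can be expressed as a short checker. This sketch assumes the fields store `shape.x` and `shape.y` directly (not minus one) at the bit positions shown in the diagram; the function names are hypothetical.

```python
TILE_BYTES = 4096
FORMAT_BYTES = {0b0000: 4, 0b0001: 2, 0b0010: 1, 0b0011: 0.5}  # bytes/element


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def is_legal(shape_x, shape_y, fmt):
    """Check shape.x * shape.y * E == 4096 with power-of-two dimensions."""
    elem_bytes = FORMAT_BYTES.get(fmt)
    if elem_bytes is None:  # reserved format code
        return False
    if not (is_power_of_two(shape_x) and is_power_of_two(shape_y)):
        return False
    return shape_x * shape_y * elem_bytes == TILE_BYTES


def pack_metadata(shape_x, shape_y, fmt):
    """shape.x -> [13:0], shape.y -> [27:14], format -> [31:28]."""
    if not is_legal(shape_x, shape_y, fmt):
        raise ValueError("illegal tile shape/format combination")
    return shape_x | (shape_y << 14) | (fmt << 28)


def unpack_metadata(word):
    return word & 0x3FFF, (word >> 14) & 0x3FFF, (word >> 28) & 0xF


fp32_square = pack_metadata(32, 32, 0b0000)   # 32 x 32 FP32 = 4096 B
fp4_row = pack_metadata(8192, 1, 0b0011)      # 8192 x 1 FP4 = 4096 B
```

A 32 × 32 FP16 tile, by contrast, fills only 2048 B and is rejected by `is_legal`.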
+
+> **Why metadata?** It lets a single tile op service every shape/format without opcode explosion. The vector unit's stage (A) align/unpack (§8.3) and stage (B) reduction (§8.3) both consult `format` and `shape` from the tile-metadata word read at the first strip of each operand. Microcode programs (§8.3.4) are keyed by `(opcode, format, W-regime, R-regime)`.
+
+#### 2.2.2 Operand model (3 source, 2 destination)
+
+| Operand | Role | Tile-RAT entry | Storage |
+|---------|------|----------------|---------|
+| **A** | Value tile (primary, mandatory) | source | TRegFile read port R0 → `SA` staging |
+| **B** | Value tile (secondary, optional) | source | TRegFile read port R4 → `SB` staging |
+| **C** | **Dual role:** `c_role = MASK` → per-element bitmask (1 b/element); `c_role = VALUE` → **third value tile** for native 3-source FMA family (§2.2.6a) | source (when `has_mask = 1` **or** `c_role = VALUE`) | TRegFile read port R1 (v2.1: 3rd VEC-side binding) → `SC` staging |
+| **D0** | Result tile (primary) | destination | Write port `W0` |
+| **D1** | Result tile (secondary, optional) | destination | Write port `W4` |
+
+The 32-bit instruction word reserves:
+- a `c_role` bit (0 = `MASK`, 1 = `VALUE`),
+- a `has_mask` bit (1 if `C` is fetched **and** `c_role = MASK`),
+- a `retire_mask[1:0]` field (which of `D0`, `D1` are written), and
+- per-operand `is_transpose_{A,B,C}` bits forwarded to the TRegFile read ports (§9.2).
+
+Tile register fields stay 5 bits (T0–T31). When `c_role = VALUE` and `N_val = 3` (e.g. `VFMA`, `VFNMA`, `VLERP`), `C` is fetched as a full 4 KB value tile through the dedicated VEC read port R1 — see §2.2.6a and [`vector4k_v2.md`](vector4k_v2.md) §3.1, §7.6 for the rationale and 3-port binding.
+
+> **Why a third VEC-side TRegFile read port?** TRegFile-4K has 8 physical read ports. v1 and v2.0 used only 2 (R0/R4) for VEC, since operand `C` was strictly a small mask. v2.1 binds **R1 = Port C** so that all three value tiles of a 3-source `VFMA` can be fetched **in parallel within one 8 cy epoch** — same cadence as a binary op. The alternative (sequential 2-epoch fetch on R0/R4) would halve `VFMA` throughput. Bandwidth cost: 0 SRAM, 0 bank-conflict pressure (the diagonal skew already supports 8 conflict-free read ports per [`tregfile4k.md`](tregfile4k.md) §4); only a binding allocation. R1 is idle and clock-gated when no `c_role = VALUE` op is in flight.
+
+#### 2.2.3 Encoding (32-bit)
+
+```
+  R-type (3-source, 2-dest):
+  31      26 25  21 20  16 15   12 11      6 5     0
+ ┌────────┬──────┬──────┬──────┬───────┬────────┐
+ │ funct6 │  Tc  │  Tb  │  Ta  │ Td0/d1│ opcode │
+ │ + xpA  │ (5b) │ (5b) │ (5b) │ (6b)  │ VEC    │
+ │ + xpB  │      │      │      │       │        │
+ │ + xpC  │      │      │      │       │        │
+ │ +mask  │      │      │      │       │        │
+ │ +crole │      │      │      │       │        │  ← v2.1: c_role bit (MASK/VALUE)
+ │ +rmask │      │      │      │       │        │
+ └────────┴──────┴──────┴──────┴───────┴────────┘
+   funct6 packs 6 bits split between op-extension (3 b),
+   has_mask (1 b), is_xpose_A (1 b), is_xpose_B (1 b);
+   is_xpose_C, c_role, and retire_mask travel in the
+   immediate slot of S-/T-types or in a fixed funct7
+   bit pattern.
+```
+
+**Backward compatibility:** v1 vector instructions decode as `has_mask = 0`, `c_role = MASK`, `retire_mask = 2'b01`, `is_xpose_{A,B,C} = 0` — i.e. unmasked, single-result, no-transpose, no-3rd-tile — and produce bit-exact v1 results. `c_role = VALUE` is only generated by a v2.1-aware compiler emitting `VFMA` / `VFNMA` / `VLERP`; v1 binaries cannot express it.
+
+#### 2.2.6 VTG Vector Micro-Instructions (SIMD-Group Execution)
+
+> **(v2.2 BCC vector overlay — Change Point #2)**
+
+In addition to full-tile `T*` vector instructions, Davinci-v2 supports **VTG (Vector Thread Group) vector micro-instructions** — a warp-grouped execution model where one 4 KB tile is partitioned into multiple SIMD-group-sized scheduling units.
+
+**Key concepts:**
+
+| Concept | Definition |
+|---------|------------|
+| **SIMD group** | The 128-lane execution unit inside VEC-4K-v2; one VTG maps to 1 or 2 SIMD group beats (512 B or 256 B). Not software-visible. |
+| **VTG (Vector Thread Group)** | A warp-like scheduling context: 256 B (`G256` mode, 16 VTGs/tile) or 512 B (`G512` mode, 8 VTGs/tile). Each VTG has its own `group_id`, `thread_id`, `iter0..iter3` loop counters, and `active_lanes` in the GVIQ entry prefix. |
+| **Micro-instruction buffer** | A pre-allocated buffer in the vector ALU, shared by all VTGs in the same tile group. Stores the decoded micro-op list keyed by `block_id`. No re-decode at issue time. |
+
+**Architectural naming:**
+
+```text
+VADD.F32     T4.g2, T4.g0, T4.g1, T4.p0    ; tile T4, groups g0+g1 -> g2, pred p0
+VMULS.F16    T7.g5, T7.g5, X12, T7.p1       ; scalar broadcast mul: X12 to all lanes
+VLD.F32      T8.g0, [Xbase + Xoff], T8.p0   ; vector load into VTG g0
+VST.F32      T8.g2, [Xbase + Xoff], T8.p0    ; vector store from VTG g2
+```
+
+**Instruction families (38 total):**
+
+| Category | Instructions | Notes |
+|----------|-------------|-------|
+| Elementwise ALU | `VADD`, `VSUB`, `VMUL`, `VDIV`, `VMIN`, `VMAX`, `VABS`, `VNEG` | Standard lane-wise ops under predicate |
+| Scalar-broadcast ALU | `VADDS`, `VMULS`, `VMAXS` | Scalar from GPR broadcast via SX/SY staging |
+| Compare & Select | `VCMP.{LT/LE/GT/GE/EQ/NE}`, `VSEL`, `VMERGE` | Compare writes predicate VTG; `VSEL` selects between sources |
+| Conversion | `VCVT.dtype.stype`, `VROUND`, `VTRUNC` | Type conversion with saturation/rounding |
+| Math | `VSQRT`, `VEXP`, `VLOG`, `VRELU` | Non-linear elementwise ops |
+| Predicate | `PLT`, `PAND`, `POR`, `PXOR`, `PNOT` | Predicate generation and logic |
+| Memory | `VLD`, `VST`, `VLDSTRIDE`, `VSTSTRIDE`, `PGATHER` | VTG load/store; inactive lanes do not fault |
+| Reduction / Wide | `VREDUCE_ADD`, `VREDUCE_MAX`, `WADD` | Scalar reduction output; wide 2-VTG result |
+
+**GVIQ entry prefix** (before operand fields):
+
+```text
+{ block_id[11:0], pc_index[7:0], group_id[3:0], thread_id[7:0],
+  iter0[15:0], iter1[15:0], iter2[15:0], iter3[15:0],
+  active_lanes[15:0], active_group_mask[15:0] }
+```
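+
+The listed prefix fields sum to 128 bits (12 + 8 + 4 + 8 + 4×16 + 16 + 16). The following is a minimal Python sketch of packing and unpacking such a prefix. It assumes the fields are laid out MSB-first in the order listed; that bit ordering and the helper names `pack_gviq_prefix` / `unpack_gviq_prefix` are illustrative assumptions, not part of this spec.
+
+```python
+# Field widths are taken from the GVIQ entry prefix listing.
+# MSB-first placement in listing order is an assumption made for illustration.
+GVIQ_FIELDS = [
+    ("block_id", 12), ("pc_index", 8), ("group_id", 4), ("thread_id", 8),
+    ("iter0", 16), ("iter1", 16), ("iter2", 16), ("iter3", 16),
+    ("active_lanes", 16), ("active_group_mask", 16),
+]
+GVIQ_PREFIX_BITS = sum(width for _, width in GVIQ_FIELDS)
+
+
+def pack_gviq_prefix(fields):
+    """Pack a dict of field values into one integer, first field in the top bits."""
+    word = 0
+    for name, width in GVIQ_FIELDS:
+        value = fields.get(name, 0)
+        if not 0 <= value < (1 << width):
+            raise ValueError(f"{name} does not fit in {width} bits")
+        word = (word << width) | value
+    return word
+
+
+def unpack_gviq_prefix(word):
+    """Inverse of pack_gviq_prefix: peel fields off from the low bits upward."""
+    out = {}
+    for name, width in reversed(GVIQ_FIELDS):
+        out[name] = word & ((1 << width) - 1)
+        word >>= width
+    return out
+```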
+
+**Full-tile vs. VTG coexistence:**
+
+| Execution mode | ISA prefix | Operand unit | Scheduling unit | Typical use |
+|---------------|-----------|-------------|----------------|-------------|
+| Full-tile `T*` | `T` | 4 KB tile | One tile per VEC op | Large matrix GEMM, full-tile reductions |
+| VTG `V*` micro-op | `V` | 256 B or 512 B VTG | One VTG per GVIQ entry; VTG operates behind VEC-4K-v2 staging (SA/SB/SC) | Strip-mined elementwise, inner loops. VTG latency: **25-32 cy minimum** (8-15 cy prologue + 1 cy compute + 16 cy RMW writeback). VTG reuses VEC-4K-v2 ALU and TRegFile ports (R0/R4/W0) via VEC-domain arbitration. |
+
+The two execution paths share the same VEC-4K-v2 ALU and TRegFile ports (R0/R4 for reads, W0 for writeback). They arbitrate for VEC staging and TRegFile ports via the VEC-domain arbiter. Full-tile VEC ops have higher priority than VTG micro-ops. A VTG micro-op uses the **micro-instruction buffer** (pre-decoded VEC beat-word sequences) rather than the Vector RS opcode field.
+
+**TSPLIT / TJOIN metadata transitions:** The compiler may emit `TSPLIT` (declare tile as NxVTG sub-units) or `TJOIN` (declare VTGs form a coherent tile view) as metadata-only transitions. No physical instruction encoding is required: the metadata is set via tile metadata fields at rename time.
+
+#### 2.2.4 Masked / predicated variants
+
+Every elementwise op (`VADD`, `VMUL`, `VFMA`, `VSEL`, …), every reduction (`VROWSUM`, `VCOLMAX`, …), and every gather (`VGATHER`, `VGATHERB`) has a masked counterpart. Conventions:
+
+| Encoding | Behaviour |
+|----------|-----------|
+| `has_mask = 0` | Full-tile op; `C` not fetched; lane gate ties to `IMM_ALL_ONES` (every lane participates). Identical to v1 semantics. |
+| `has_mask = 1` | `C` fetched (≤ 2 strips piggybacking on an idle read-port cycle, §8.3.6); per-lane gate `out[lane] = M[lane] ? alu_core_out[lane] : identity[lane]`. The "identity" depends on the op (see [`vector4k_v2.md`](vector4k_v2.md) §5.8): preserves operand A for `TSEL`-style, leaves accumulator unchanged for masked `ACCUM`, etc. |
+
+**Cycle cost.** Masked variants pay **0 extra fetch cycles** in the common case (mask piggybacks on idle port within the value-tile epoch). End-to-end latency is therefore the same as the unmasked variant.
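+
+The per-lane gate from the table can be sketched in a few lines of Python. This is a behavioural illustration only: `masked_lane_gate` and `masked_add` are hypothetical names, and choosing operand A as the identity for a masked add is an assumption (the actual per-op identity is defined in `vector4k_v2.md` §5.8).
+
+```python
+def masked_lane_gate(mask, alu_core_out, identity):
+    """Per-lane gate: take the ALU result where the mask bit is set, else the identity."""
+    if not (len(mask) == len(alu_core_out) == len(identity)):
+        raise ValueError("mask, alu_core_out and identity must have the same lane count")
+    return [alu if m else ident for m, alu, ident in zip(mask, alu_core_out, identity)]
+
+
+def masked_add(a, b, mask=None):
+    """Lane-wise add; has_mask = 0 is modelled as an all-ones mask.
+
+    Using operand A as the identity value is an assumption for this sketch.
+    """
+    if mask is None:
+        mask = [1] * len(a)
+    alu = [x + y for x, y in zip(a, b)]
+    return masked_lane_gate(mask, alu, identity=a)
+```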
+
+#### 2.2.5 Per-operand transpose (`is_xpose_*`) and per-beat tilelet transpose
+
+Two orthogonal transpose mechanisms, both reusing the chunk-grid transpose algorithm at 64 B sub-chunk granularity:
+
+| Mechanism | Granularity | Control | Where |
+|-----------|-------------|---------|-------|
+| **TRegFile read-port `is_transpose`** | Whole tile, set once at fetch time | `is_xpose_{A,B,C}` bits in the issue packet | TRegFile-4K read port ([`tregfile4k.md`](tregfile4k.md) §7) — costs zero VEC-side hardware |
+| **Staging-side `tilelet_xpose`** | Per 512 B tilelet, per-beat | One bit per operand slot in each microcode beat | Staging register (`SA`, `SB`, `SC`) read datapath — costs ~30 K gate per staging register |
+
+The two combine: a tile may be fetched in row-mode and then read in col-mode beat-by-beat from staging, or fetched pre-transposed (col-mode) and re-read in row-mode. Microcode picks the combination per instruction.
+
+**TRegFile-side rule R2** ([`tregfile4k.md`](tregfile4k.md) §6): the two physical read ports active in any 8-cycle epoch must share the same `is_transpose`. For a 2-value-operand instruction with `is_xpose_A ≠ is_xpose_B`, microcode splits the fetch into two epochs (16 cy instead of 8 cy) — this is the only scheduling cost of the new flag (§8.3.6).
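+
+As a rough model of rule R2, the fetch cost can be estimated by counting how many distinct `is_transpose` settings the value operands use. The sketch below assumes each distinct setting costs exactly one 8-cycle epoch; the function name `fetch_cycles` is hypothetical.
+
+```python
+EPOCH_CYCLES = 8  # one TRegFile fetch epoch
+
+
+def fetch_cycles(is_xpose_flags):
+    """Estimate fetch cycles for the value operands of one instruction.
+
+    Operands sharing an is_transpose setting are fetched in the same epoch
+    (rule R2); each additional distinct setting is assumed to add one epoch.
+    """
+    if not is_xpose_flags:
+        raise ValueError("need at least one value operand")
+    return EPOCH_CYCLES * len(set(is_xpose_flags))
+```
+
+For example, `fetch_cycles([0, 0])` models the uniform 2-operand case and `fetch_cycles([0, 1])` the split-epoch case described above.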
+
+#### 2.2.6a Native 3-source ternary FMA family (`VFMA`, `VFNMA`, `VLERP`), v2.1 addition
+
+A new family of vector instructions that consume **three independent value tiles** and produce one (or two) result tiles, enabled by the operand-`C` dual-role mechanism (§2.2.2) and the 3rd VEC-side TRegFile read port (R1, [`vector4k_v2.md`](vector4k_v2.md) §3.1, §7.6). v0.16 of [`vector4k_v2.md`](vector4k_v2.md) only supported ternary FMA via the **accumulator feedback path** (`VFMA_ACC D = A·B + Acc`), which is suitable for GEMM-epilogue / FMA-accumulate kernels but **fails** for the canonical FMA pattern `D = A·B + C` where the third operand is **not** the previous accumulator.
+
+| Mnemonic | Operands | Semantics | Encoding | Cycle budget (typical, uniform `is_transpose`) |
+|----------|----------|-----------|----------|-------------------------------------------------|
+| **VFMA** | `Td0, Ta, Tb, Tc` | `Td0 = Ta · Tb + Tc` (single-rounding IEEE-754 FMA, all formats) | R-type, `c_role = VALUE`, `has_mask = 0` | **8 cy fetch (3-port parallel) + 8 compute beats + 1 cy fall-through ≈ 10–12 cy end-to-end** — same as `VADD`/`VMUL` |
+| **VFNMA** | `Td0, Ta, Tb, Tc` | `Td0 = -(Ta · Tb) + Tc` | R-type, `c_role = VALUE`, `funct6.fnma = 1` | same as `VFMA` |
+| **VLERP** | `Td0 [, Td1], Ta, Tb, Tc` | `Td0 = Ta · (1 − Tc) + Tb · Tc` (linear interpolation; optional `Td1 = Tb − Ta` retired in same op) | R-type, `c_role = VALUE`, `funct6.lerp = 1` | 2 fused beats per strip (8 strip × 2 beat = 16 compute beats), `~18 cy` end-to-end |
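+
+The lane-wise semantics in the table can be written out as a short Python reference model. It is a sketch under stated assumptions: tiles are plain lists of floats, Python arithmetic rounds after each operation (so the single-rounding property of the hardware FMA is not modelled), and computing `VLERP` as `Ta + (Tb − Ta)·Tc` is an algebraically equivalent form chosen here because it yields the optional `Td1` for free, not a confirmed description of the two fused beats.
+
+```python
+def vfma(ta, tb, tc):
+    """Td0 = Ta * Tb + Tc, lane by lane."""
+    return [a * b + c for a, b, c in zip(ta, tb, tc)]
+
+
+def vfnma(ta, tb, tc):
+    """Td0 = -(Ta * Tb) + Tc, lane by lane."""
+    return [-(a * b) + c for a, b, c in zip(ta, tb, tc)]
+
+
+def vlerp(ta, tb, tc):
+    """Return (Td0, Td1) with Td0 = Ta*(1 - Tc) + Tb*Tc and Td1 = Tb - Ta.
+
+    Td0 is evaluated as Ta + Td1*Tc, which is algebraically the same expression.
+    """
+    td1 = [b - a for a, b in zip(ta, tb)]
+    td0 = [a + d * c for a, d, c in zip(ta, td1, tc)]
+    return td0, td1
+```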
+
+**Why this family is needed.** From [`FMA指令场景说明.md`](FMA指令场景说明.md):
+
+| Real-world kernel | FMA form | Notes |
+|-------------------|----------|-------|
+| **LayerNorm / RMSNorm final affine** | `y = γ·x̂ + β` | The dominant FMA in transformer normalisation. `γ`, `x̂`, `β` are three independent tile registers — none is the accumulator. Without `VFMA`, every LayerNorm pays 2× cost (`VMUL` + `VADD`). |
+| **Welford incremental update — mean** | `μ_new = δ·inv_n + μ_old` | Streaming variance estimator at the heart of LayerNorm reductions. |
+| **Welford incremental update — M2** | `M2_new = δ·δ_2 + M2_old` | Single-rounding FMA preserves precision against catastrophic cancellation on small variance terms (matters for FP16 / BF16 / FP8). |
+| **Welford state merge** | `μ = δ·factor + μ_A`; `M2 = M2_A + δ·(δ·factor_m2) + M2_B` | Distributed-norm cross-thread merges. |
+| **Activation polynomials** | `gelu`, `swiglu` polynomial / Padé approximations | Multiple FMAs over independent tile inputs. |
+| **Trigonometric polynomials** | `sin(x) ≈ x·(c₁ + x²·(c₃ + x²·c₅))` | Horner-form FMAs. |
+
+**Justification (decisive advantages of FMA over emulated `MUL` + `ADD`):**
+
+1. **Throughput doubling** — one fused instruction instead of two halves the FMA-bound pipeline depth and the issue/RS occupancy.
+2. **Precision preservation** — IEEE-754 FMA performs a *single* rounding after the infinite-precision `A·B` intermediate, eliminating the second-rounding error of `(A·B) + C`. This matters for FP16 / BF16 / FP8 normalisation kernels that re-feed the result into subsequent reductions.
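+
+The precision argument can be illustrated with a small Python experiment. It uses binary64 as a stand-in for the narrower formats named above, and emulates a fused multiply-add with exact rational arithmetic followed by one rounding; `fma_single_rounding` is a hypothetical helper written for this illustration, not a hardware model.
+
+```python
+from fractions import Fraction
+
+
+def fma_single_rounding(a, b, c):
+    """Emulate a fused multiply-add on binary64: exact a*b + c, then one rounding."""
+    return float(Fraction(a) * Fraction(b) + Fraction(c))
+
+
+a = 1.0 + 2.0 ** -30
+b = 1.0 - 2.0 ** -30
+c = -1.0
+
+# Separate multiply and add: the exact product 1 - 2**-60 rounds to 1.0,
+# so the small term is lost before the addition happens.
+two_roundings = a * b + c
+
+# Fused: the exact product survives until the single final rounding.
+one_rounding = fma_single_rounding(a, b, c)
+
+print(two_roundings, one_rounding)
+```
+
+In this case the two-step result is exactly `0.0`, while the fused result keeps the `-2**-60` term, which is the kind of cancellation error the Welford rows above refer to.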
+
+**Hardware delta (vs. v0.16).** The stage (B) per-lane FMA core, microcode beat machinery, and 8-port TRegFile already supported `A·B + Z`. The only structural changes are:
+
+| Block | Δ |
+|-------|---|
+| `MUX_Z` per-lane input MUX (already 6:1) — one source retargeted to `SC` value-mode read | ~0 (same gate count) |
+| `SC` staging — add 512 B/cy value-mode read path alongside the existing 1-bit-mask read path (sub-bank tree reused from [`vector4k_v2.md`](vector4k_v2.md) §4.2.1) | **~5 K gate** |
+| TRegFile read port R1 binding to VEC | **0** (allocation only — TRegFile-4K already has 8R) |
+| Issue-time `c_role` bit through Tile RAT / RS / dispatch | **~1 K gate** (control-path widening) |
+| **Total v2.1 hardware add** | **~6 K gate (~0.2 % of VEC-4K-v2 area)** |
+
+**Pipeline timing** (Davinci-v2.1 vector pipeline, [`vector4k_v2.md`](vector4k_v2.md) §6.2):
+
+| Op | `N_val` | `c_role` | `is_transpose` mix | Fetch | Compute | End-to-end | Throughput |
+|----|--------:|----------|---------------------|------:|--------:|-----------:|------------|
+| **VFMA** (typical) | 3 | VALUE | uniform | **8 cy** | 8 beats | **~10–12 cy** | **1 tile / 8 cy** |
+| VFMA (one xp odd-out) | 3 | VALUE | one-mismatched | 16 cy | 8 beats | ~18 cy | 1 tile / 16 cy |
+| VFMA (all xp different — degenerate) | 3 | VALUE | all distinct | 24 cy | 8 beats | ~26 cy | 1 tile / 24 cy |
+
+**Backward compatibility.** A v1 / v2.0 binary emits `c_role = MASK` exclusively; the new instructions are decoded only when the v2.1-aware compiler sets `c_role = VALUE`. Old binaries see no behaviour change and the R1 read port stays idle and clock-gated.
+
+#### 2.2.6 New PTO instructions
+
+Three instructions native to v2's unified ALU + Acc feedback + microcode pipeline ([`vector4k_v2.md`](vector4k_v2.md) §7.5):
+
+| Mnemonic | Operands | Semantics | Cycle budget |
+|----------|----------|-----------|--------------|
+| **TINV** | `Tdst+, Tsrc+, num_tiles` | Square matrix inverse via in-tile Gauss–Jordan with Newton–Raphson reciprocal refinement. Up to **128×128 FP32 (16 tiles)**, 64×64 FP8 (1 tile), 32×32 FP32 (1 tile). | ≈ 2·N²·`S_row` + N·`S_col` + 3N beats. **33 K beats / ~33 µs for 128×128 FP32 @ 1 GHz**. |
+| **TROWRANGE_MUL** | `Tdst, Tsrc, Xstart, Xend [, Tmask]` | Column-wise product over dynamic row sub-range `[Xstart, Xend)`. `out[c] = ∏_{r=Xstart}^{Xend−1} Tsrc[r, c]`. Optional mask further filters elements. | `1 + S_active + 1 ≤ 10` beats; ~18 cy end-to-end. |
+| **TMRGSORT** | `Td0 (values), Td1 (indices), Tsrc, N` | Full-tile bitonic sort over any `N = 2^p` up to **8192** (FP4 tile). Emits sorted values to `D0` and permutation indices to `D1`. Optional mask = partial-sort. | `p(p+1)/2 × ⌈N/256⌉` beats. **220 beats for N=1024 FP32, 36 beats for N=256.** |
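+
+A behavioural sketch of the `TROWRANGE_MUL` semantics in Python follows. It assumes that masked-off elements are simply skipped (contributing the multiplicative identity) and that an empty range yields 1 for every column; both are assumptions for illustration, as is the function name.
+
+```python
+def trowrange_mul(tsrc, xstart, xend, tmask=None):
+    """Column-wise product of tsrc[r][c] for r in [xstart, xend).
+
+    tsrc is a list of rows; tmask, if given, has the same shape and a falsy
+    entry excludes that element from the product (assumed behaviour).
+    """
+    cols = len(tsrc[0])
+    out = [1.0] * cols
+    for r in range(xstart, xend):
+        for c in range(cols):
+            if tmask is None or tmask[r][c]:
+                out[c] *= tsrc[r][c]
+    return out
+```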
+
+Additional encoding notes:
+
+- **TINV multi-tile** uses a 3-bit `log₂(num_tiles)` field in `funct7` to select `num_tiles ∈ {1, 2, 4, 8, 16}` (five values, `log₂(num_tiles) ∈ {0..4}`). Operand register fields then encode the **base** tile of each consecutive range.
+- **TROWRANGE_MUL** sources `Xstart`, `Xend` from scalar registers via the staging slots `SX`, `SY` ([`vector4k_v2.md`](vector4k_v2.md) §4.3) — these are read at issue time from the scalar GPR file, costing **0 vector-side cycles**.
+- **TMRGSORT** uses `N` from a 4-bit immediate field encoding `log₂(N) ∈ {5..13}` (32..8192).
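+
+The `TMRGSORT` beat formula `p(p+1)/2 × ⌈N/256⌉` is easy to check numerically. The helper below is a hypothetical calculator for that formula only; it does not model fetch or writeback cycles.
+
+```python
+def tmrgsort_beats(n):
+    """Beat count p(p+1)/2 * ceil(N/256) for a bitonic sort over N = 2**p elements."""
+    if n < 2 or n & (n - 1):
+        raise ValueError("N must be a power of two")
+    p = n.bit_length() - 1                 # log2(N)
+    stages = p * (p + 1) // 2              # compare-exchange stages of a bitonic network
+    beats_per_stage = (n + 255) // 256     # ceil(N / 256)
+    return stages * beats_per_stage
+```
+
+`tmrgsort_beats(1024)` and `tmrgsort_beats(256)` reproduce the 220-beat and 36-beat figures quoted in the table, and `tmrgsort_beats(8192)` gives 2912.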
+
+#### 2.2.7 `TSETMETA` (tile metadata setter)
+
+```
+  TSETMETA Td, shape.x_imm, shape.y_imm, format_imm
+```
+
+A single-cycle, tile-RAT-only instruction that **rewrites the metadata word** of the destination tile's *current* physical mapping without touching its 4 KB payload. Handled at the D2 rename stage similarly to `TILE.MOVE`: no RS entry, no execute stage. The new metadata becomes visible to subsequent instructions consuming `Td`.
+
+Use cases: reshaping a tile produced by `CUBE.DRAIN` (which writes payload but not shape), changing format after `VCVT`, or installing scalar-broadcast metadata before a `VEXPAND`.
+
+#### 2.2.8 Updated vector instruction list (highlights)
+
+The **95-instruction v1 vector ISA** carries forward, with three changes:
+
+1. Each instruction gets a "masked" variant (no new mnemonic — encoded by `has_mask`).
+2. `TSORT32` and `TMRGSORT` from v1 are subsumed by the new `TMRGSORT` (§2.2.6). v1's `TSORT32` mnemonic remains as an alias for `TMRGSORT N=32`.
+3. **v2.1 addition:** A new family of native 3-source ternary FMA instructions (`VFMA`, `VFNMA`, `VLERP`) is added under Category O (§2.2.6a), motivated by LayerNorm / Welford / activation / trig kernels (see [`FMA指令场景说明.md`](FMA指令场景说明.md)).
+
+Categories A–M of v1 §2.2.3 are unchanged in semantics. Two new categories are added:
+
+**Category N — Numerical / Reconfigurable Compute (new in v2)**
+
+| Mnemonic | Operands | Semantics | Latency |
+|----------|----------|-----------|---------|
+| TINV | Tdst+, Tsrc+, num_tiles | Matrix inverse (Gauss–Jordan + NR refine) | ~2 K – 33 K beats |
+| TROWRANGE_MUL | Tdst, Tsrc, Xstart, Xend [, Tmask] | Range product per column | ≤ 10 beats |
+| TMRGSORT | Td0, Td1, Tsrc, log2N [, Tmask] | Bitonic sort, value+index dual retire | 15 – 2 912 beats (N = 32 … 8192) |
+| TSETMETA | Td, shape.x, shape.y, format | Rewrite tile metadata in-place | 0 (rename-only) |
+
+**Category O — Native 3-source Ternary FMA family (new in v2.1; §2.2.6a)**
+
+| Mnemonic | Operands | Semantics | Latency (typical) |
+|----------|----------|-----------|--------------------|
+| **VFMA** | `Td0, Ta, Tb, Tc` | `Td0 = Ta · Tb + Tc` (single-rounding IEEE-754 FMA) | **~10–12 cy** end-to-end (8 cy fetch + 8 compute beats); throughput **1 tile / 8 cy** |
+| **VFNMA** | `Td0, Ta, Tb, Tc` | `Td0 = -(Ta · Tb) + Tc` | same as `VFMA` |
+| **VLERP** | `Td0 [, Td1], Ta, Tb, Tc` | `Td0 = Ta·(1−Tc) + Tb·Tc`; optional `Td1 = Tb − Ta` retired in same instruction | ~18 cy end-to-end (8 cy fetch + 16 compute beats) |
+
+All three issue with `c_role = VALUE`. Mixed `is_transpose_{A,B,C}` adds 8 cy per odd-out (one mismatch → 16 cy fetch; all three different → 24 cy fetch — degenerate). Common kernels (LayerNorm `γ·x̂ + β`, Welford updates) all use uniform `is_transpose` and hit the **8 cy** path.
+
+### 2.3 Cube ISA
+
+> **(v1 → v2: content unchanged; the following is copied in full from v1 §2.3.)**
+
+Tile-level instructions that drive the outerCube MXU. Each `CUBE.OPA` consumes tile registers and executes all K-loop OPA steps internally.
+
+| Instruction | Operands | Function |
+|-------------|----------|----------|
+| CUBE.CFG | mode, fmt [, Mactive] | Set operating mode (A/B) and data format |
+| CUBE.OPA | zd, Ta, Tb, Rn | Outer product accumulate: iterate over Nb B-tiles |
+| CUBE.DRAIN | zd, Tc | Drain accumulator buffer to tile register(s) |
+| CUBE.ZERO | zd | Zero accumulator buffer (1 cycle) |
+| CUBE.WAIT | zd | Stall until pending drain completes |
+
+Supported formats: FP16, BF16, FP8 (E4M3/E5M2), MXFP4, HiFP4. All accumulate into FP32.
+
+Full cube ISA specification: see [`outerCube.md`](outerCube.md) §6. The Tile RAT renames cube operands exactly as in v1; the outerCube MXU itself is unmodified between v1 and v2.
+
+### 2.4 MTE ISA (Memory Tile Engine)
+
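+> **(v1 → v2: ISA encoding and semantics are 100% compatible. §2.4.1 / §2.4.2 / §2.4.3 below are copied in full from v1 §2.4. The v2 implementation-level additions are listed at the end of this section.)**
+
+The MTE bridges three domains (memory, the TRegFile-4K, and the scalar GPR file) over two paths: **memory ↔ TRegFile-4K** (bulk tile transfers) and **scalar GPR ↔ TRegFile-4K** (single-element access). All MTE instructions flow through both the Scalar RAT and Tile RAT at rename.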
+
+#### 2.4.1 Bulk Tile Transfer Instructions
+
+| Instruction | Operands | Function |
+|-------------|----------|----------|
+| TILE.LD | Td, [Rbase] | Contiguous load: 4 KB from address Rbase → tile Td |
+| TILE.LD | Td, [Rbase], Rs | Strided load: rows at stride Rs → tile Td |
+| TILE.ST | [Rbase], Ts | Contiguous store: tile Ts → 4 KB at address Rbase |
+| TILE.ST | [Rbase], Ts, Rs | Strided store: tile Ts → rows at stride Rs |
+| TILE.GATHER | Td, [Rbase], Tidx | Gather: indexed load using index tile (element offsets in Tidx) |
+| TILE.SCATTER | [Rbase], Ts, Tidx | Scatter: indexed store using index tile (element offsets in Tidx) |
+| TILE.ZERO | Td | Zero tile register Td |
+| TILE.COPY | Td, Ts | Copy tile Ts → Td (allocates new physical tile, copies data) |
+
+#### 2.4.2 Tile Manipulation Instructions
+
+| Instruction | Operands | Function |
+|-------------|----------|----------|
+| TILE.MOVE | Td, Ts | Move tile Ts → Td (rename-only, zero-copy; see move elimination below) |
+| TILE.TRANSPOSE | Td, Ts, fmt | Transpose tile Ts with element format fmt → tile Td |
+
+**TILE.MOVE Td, Ts** — Logically copies tile Ts to Td, but is implemented as **move elimination** at the rename stage: the Tile RAT entry for Td is simply updated to point to the same physical tile as Ts. No data is copied, no physical tile is allocated from the free list, and no execute stage is needed. The instruction completes in **zero cycles** (handled entirely at D2 rename).
+
+```
+  Rename (D2) for TILE.MOVE Td, Ts:
+    1. Read Tile RAT[Ts] → PT_src (current physical tile for Ts)
+    2. Read Tile RAT[Td] → PT_old (old physical tile for Td, becomes orphan)
+    3. Write Tile RAT[Td] ← PT_src  (Td now aliases same physical tile as Ts)
+    4. Increment refcount(PT_src)     (one more architectural name maps to it)
+    5. Mark PT_old as orphan and decrement refcount(PT_old); if refcount(PT_old)==0 → free to tile free list
+    6. No RS entry allocated; no execute stage; instruction retires at D2
+    7. Ready bit for Td inherits ready state of PT_src
+```
+
+After TILE.MOVE, Td and Ts share the same physical tile. This is safe under rename: the next instruction that writes to either Td or Ts will allocate a fresh physical tile at that point, naturally "splitting" the alias. TILE.MOVE is critical for avoiding unnecessary 4 KB copies in tile register spill/fill sequences and data routing between pipeline stages.
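+
+The rename steps and the alias-splitting behaviour can be sketched as a toy reference-counted Tile RAT in Python. This is a simplification under stated assumptions: the class and method names are hypothetical, a physical tile is returned to the free list as soon as its count reaches zero, and speculation recovery (which would normally delay freeing until commit) is ignored.
+
+```python
+class TileRat:
+    """Toy Tile RAT with per-physical-tile reference counts (illustrative only)."""
+
+    def __init__(self, num_arch, num_phys):
+        # Identity mapping at reset; the remaining physical tiles start on the free list.
+        self.rat = list(range(num_arch))
+        self.refcount = [1] * num_arch + [0] * (num_phys - num_arch)
+        self.free_list = list(range(num_arch, num_phys))
+
+    def _release(self, pt):
+        """Drop one architectural reference; free the tile when none remain."""
+        self.refcount[pt] -= 1
+        if self.refcount[pt] == 0:
+            self.free_list.append(pt)
+
+    def tile_move(self, td, ts):
+        """Move elimination: Td aliases the physical tile of Ts; nothing is allocated."""
+        pt_src, pt_old = self.rat[ts], self.rat[td]
+        if pt_src == pt_old:
+            return  # already aliased, nothing to do
+        self.rat[td] = pt_src
+        self.refcount[pt_src] += 1
+        self._release(pt_old)
+
+    def write_dest(self, td):
+        """A later writer of Td allocates a fresh tile, which splits any alias."""
+        pt_new = self.free_list.pop(0)
+        self._release(self.rat[td])
+        self.rat[td] = pt_new
+        self.refcount[pt_new] = 1
+        return pt_new
+```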
+
+**TILE.TRANSPOSE Td, Ts, fmt** — Reads tile Ts, transposes the 2D element matrix according to the element format `fmt`, and writes the result to tile Td. The transpose treats the 4 KB tile as a 2D matrix with dimensions determined by the element width:
+
+| fmt (funct3) | Element width | Tile layout (rows × cols) | Transpose block |
+|-------------|--------------|---------------------------|-----------------|
+| 000 (FP64) | 8 B | 64 × 8 | 8 × 8 (8 blocks of 8 rows) |
+| 001 (FP32) | 4 B | 64 × 16 | 16 × 16 (4 blocks of 16 rows) |
+| 010 (FP16) | 2 B | 64 × 32 | 32 × 32 (2 blocks of 32 rows) |
+| 011 (BF16) | 2 B | 64 × 32 | 32 × 32 (2 blocks of 32 rows) |
+| 100 (FP8) | 1 B | 64 × 64 | 64 × 64 (1 block, full tile) |
+| 101 (INT32) | 4 B | 64 × 16 | 16 × 16 (4 blocks of 16 rows) |
+| 110 (INT16) | 2 B | 64 × 32 | 32 × 32 (2 blocks of 32 rows) |
+| 111 (INT8) | 1 B | 64 × 64 | 64 × 64 (1 block, full tile) |
+
+The transpose operates on **square sub-blocks** whose dimension equals the number of elements per 512-bit row. For FP8/INT8 the entire tile is one 64×64 block and transposes in-place. For FP16/BF16/INT16, the 64 rows are split into two 32-row halves, each transposed as a 32×32 block. In v1, the MTE unit contained a dedicated **transpose buffer** (4 KB SRAM) that accumulated rows during the read epoch and emitted transposed rows during the write epoch. In v2 this buffer shrinks to 512 B (see §8.5.1).
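+
+The geometry in the table follows directly from the element width, which the Python sketch below makes explicit. It assumes a tile is held as 64 rows of `cols` elements and that each transposed block stays in its own row range; the function names are hypothetical and the model is purely functional (no transpose buffer or epoch timing).
+
+```python
+TILE_BYTES = 4096
+ROW_BYTES = 64  # one 512-bit row
+
+
+def transpose_geometry(elem_bytes):
+    """Return (rows, cols, blocks) for a 4 KB tile with the given element width."""
+    rows = TILE_BYTES // ROW_BYTES      # always 64
+    cols = ROW_BYTES // elem_bytes      # elements per 512-bit row
+    blocks = rows // cols               # number of square cols x cols blocks
+    return rows, cols, blocks
+
+
+def transpose_tile(tile, elem_bytes):
+    """Transpose each square sub-block of a tile given as a list of rows."""
+    _, cols, blocks = transpose_geometry(elem_bytes)
+    out = []
+    for blk in range(blocks):
+        block = tile[blk * cols:(blk + 1) * cols]
+        # zip(*block) yields the columns of the block, i.e. the transposed rows.
+        out.extend(list(col) for col in zip(*block))
+    return out
+```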
+
+```
+  TILE.TRANSPOSE encoding (32-bit):
+  ┌──────────┬──────┬──────┬──────┬──────┬────────┐
+  │  funct7  │ 00000│  Ts  │ fmt  │  Td  │ opcode │
+  │ 0100010  │ (5b) │ (5b) │(3b)  │ (5b) │ 10xxxxx│
+  └──────────┴──────┴──────┴──────┴──────┴────────┘
+```
+
+#### 2.4.3 Scalar ↔ Tile Element Access Instructions
+
+| Instruction | Operands | Function |
+|-------------|----------|----------|
+| TILE.GET | Rd, Ts, Ridx | Read single element: element at index Ridx in tile Ts → scalar GPR Rd |
+| TILE.PUT | Td, Rs, Ridx | Write single element: scalar GPR Rs → element at index Ridx in tile Td |
+
+**TILE.GET Rd, Ts, Ridx** — Reads one element from tile Ts at the position specified by scalar register Ridx. The element is zero-extended to 64 bits and written to scalar destination GPR Rd. The element data type (FP16, FP32, FP64, INT8, etc.) is encoded in the instruction's `funct3` field, which determines element width and the extraction offset within the 512-bit row. Ridx encodes a linear element index: `row = Ridx / elements_per_row`, `col = Ridx % elements_per_row`.
+
+**TILE.PUT Td, Rs, Ridx** — Writes the lower bits of scalar GPR Rs into tile Td at the element position specified by Ridx. This is a **read-modify-write** operation on the tile: the rename stage treats Td as both source (old mapping, read) and destination (new physical tile, write). The MTE unit copies the source physical tile to the destination physical tile, then overwrites the single element. The element data type is encoded in `funct3`.
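+
+A functional sketch of the two element-access operations in Python is given below. It assumes the tile is a flat 4096-byte buffer with rows packed contiguously and elements stored little-endian; those layout details and the function names are assumptions for illustration.
+
+```python
+ROW_BYTES = 64  # one 512-bit row
+
+
+def _element_offset(ridx, elem_bytes):
+    """Byte offset of linear element index ridx: row = ridx / per_row, col = ridx % per_row."""
+    per_row = ROW_BYTES // elem_bytes
+    row, col = divmod(ridx, per_row)
+    return row * ROW_BYTES + col * elem_bytes
+
+
+def tile_get(tile, ridx, elem_bytes):
+    """Read one element and return it zero-extended (as a non-negative int)."""
+    offset = _element_offset(ridx, elem_bytes)
+    return int.from_bytes(tile[offset:offset + elem_bytes], "little")
+
+
+def tile_put(old_tile, rs, ridx, elem_bytes):
+    """Read-modify-write: copy the old physical tile, then overwrite one element."""
+    offset = _element_offset(ridx, elem_bytes)
+    new_tile = bytearray(old_tile)                 # models the new physical tile
+    low_bits = rs & ((1 << (8 * elem_bytes)) - 1)  # only the lower bits of Rs are written
+    new_tile[offset:offset + elem_bytes] = low_bits.to_bytes(elem_bytes, "little")
+    return new_tile
+```
+
+Returning a new buffer from `tile_put` mirrors the rename behaviour described above: the old physical tile is left untouched and the modified copy becomes the new mapping of `Td`.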
+
+```
+  TILE.GET encoding (32-bit):
+  ┌──────────┬──────┬──────┬──────┬──────┬────────┐
+  │  funct7  │ Ridx │  Ts  │funct3│  Rd  │ opcode │
+  │ 0100000  │ (5b) │ (5b) │ type │ (5b) │ 10xxxxx│
+  └──────────┴──────┴──────┴──────┴──────┴────────┘
+       Ts: architectural tile register (T0–T31)
+       Ridx: scalar GPR holding element index
+       Rd: scalar GPR destination
+       funct3: element type (000=FP64, 001=FP32, 010=FP16, 011=BF16, 100=FP8, 101=INT32, 110=INT16, 111=INT8)
+
+  TILE.PUT encoding (32-bit):
+  ┌──────────┬──────┬──────┬──────┬──────┬────────┐
+  │  funct7  │  Rs  │ Ridx │funct3│  Td  │ opcode │
+  │ 0100001  │ (5b) │ (5b) │ type │ (5b) │ 10xxxxx│
+  └──────────┴──────┴──────┴──────┴──────┴────────┘
+       Td: architectural tile register (T0–T31) — read-modify-write
+       Rs: scalar GPR holding element value
+       Ridx: scalar GPR holding element index
+       funct3: element type
+```
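+
+The two encodings can be assembled with a small Python helper. This is a sketch assuming the fields occupy the word left to right in the order drawn (7 + 5 + 5 + 3 + 5 + 7 bits, most significant first); the low five opcode bits are shown as `xxxxx` above, so the caller must supply the full 7-bit opcode. All function names are hypothetical.
+
+```python
+ELEM_TYPE = {"FP64": 0b000, "FP32": 0b001, "FP16": 0b010, "BF16": 0b011,
+             "FP8": 0b100, "INT32": 0b101, "INT16": 0b110, "INT8": 0b111}
+
+
+def _pack(fields):
+    """Concatenate (value, width) pairs MSB-first into one 32-bit word."""
+    word = 0
+    for value, width in fields:
+        if not 0 <= value < (1 << width):
+            raise ValueError(f"value {value} does not fit in {width} bits")
+        word = (word << width) | value
+    return word
+
+
+def encode_tile_get(rd, ts, ridx, elem_type, opcode):
+    """TILE.GET Rd, Ts, Ridx: funct7 | Ridx | Ts | funct3 | Rd | opcode."""
+    if opcode >> 5 != 0b10:
+        raise ValueError("MTE opcodes have opcode[6:5] = 10")
+    return _pack([(0b0100000, 7), (ridx, 5), (ts, 5),
+                  (ELEM_TYPE[elem_type], 3), (rd, 5), (opcode, 7)])
+
+
+def encode_tile_put(td, rs, ridx, elem_type, opcode):
+    """TILE.PUT Td, Rs, Ridx: funct7 | Rs | Ridx | funct3 | Td | opcode."""
+    if opcode >> 5 != 0b10:
+        raise ValueError("MTE opcodes have opcode[6:5] = 10")
+    return _pack([(0b0100001, 7), (rs, 5), (ridx, 5),
+                  (ELEM_TYPE[elem_type], 3), (td, 5), (opcode, 7)])
+```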
+
+Every MTE instruction flows through both the **Scalar RAT** (for address/data operands) and the **Tile RAT** (for tile operands) at the D2 rename stage:
+
+| Instruction | Scalar RAT | Tile RAT source(s) | Tile RAT destination | Result bus |
+|-------------|-----------|---------------------|----------------------|------------|
+| TILE.LD Td, [Rbase] | Rbase → P-reg lookup | — | Td → allocate new PT | TCB |
+| TILE.LD Td, [Rbase], Rs | Rbase, Rs → P-reg lookups | — | Td → allocate new PT | TCB |
+| TILE.ST [Rbase], Ts | Rbase → P-reg lookup | Ts → PT lookup | — | — |
+| TILE.ST [Rbase], Ts, Rs | Rbase, Rs → P-reg lookups | Ts → PT lookup | — | — |
+| TILE.GATHER Td, [Rbase], Tidx | Rbase → P-reg lookup | Tidx → PT lookup | Td → allocate new PT | TCB |
+| TILE.SCATTER [Rbase], Ts, Tidx | Rbase → P-reg lookup | Ts, Tidx → PT lookups | — | — |
+| TILE.ZERO Td | — | — | Td → allocate new PT | TCB |
+| TILE.COPY Td, Ts | — | Ts → PT lookup | Td → allocate new PT | TCB |
+| **TILE.MOVE Td, Ts** | — | Ts → PT lookup | **Td → alias PT(Ts)** (no alloc) | **— (rename-only)** |
+| **TILE.TRANSPOSE Td, Ts, fmt** | — | Ts → PT lookup | Td → allocate new PT | TCB |
+| **TILE.GET Rd, Ts, Ridx** | Ridx → P-reg lookup; **Rd → allocate new P-reg** | Ts → PT lookup | — | **CDB** (scalar) |
+| **TILE.PUT Td, Rs, Ridx** | Rs, Ridx → P-reg lookups | **Td → PT lookup (old)** | **Td → allocate new PT** | TCB |
+
+Key observations:
+- **TILE.MOVE** is handled entirely at D2 rename (**move elimination**): Tile RAT[Td] is pointed to the same physical tile as Ts. No free-list allocation, no RS entry, no execute stage, no result bus. Zero-cycle latency.
+- **TILE.TRANSPOSE** allocates a new physical tile and requires a full read-then-transpose-then-write pass through the MTE's transpose buffer.
+- **TILE.GET** produces a **scalar GPR result** (broadcast on CDB), while consuming a tile source. It requires both a Tile RAT source lookup and a Scalar RAT destination allocation.
+- **TILE.PUT** is a **read-modify-write** on the tile: the rename stage looks up the old physical tile mapping as a source AND allocates a new physical tile as a destination. The MTE unit copies the old tile contents to the new tile, then overwrites the single element.
+
+After rename, MTE RS entries carry physical scalar register tags (from Scalar RAT) and physical tile tags (from Tile RAT). The MTE unit maintains a large outstanding request buffer to maximize memory-level parallelism.
+
+#### 2.4.4 v2 implementation-level additions (invisible to software)
+
+1. **`TILE.TRANSPOSE` becomes a software-optional accelerator.** With per-port `is_transpose` on the TRegFile read (§9.2) and per-beat `tilelet_xpose` in the vector unit (§8.3), most "pre-transpose then consume" patterns become single-instruction with `is_xpose_*` set on the consuming op. `TILE.TRANSPOSE` is retained for cases that need a *materialized* transposed tile reused many times across instructions that themselves don't carry the bit; its physical staging buffer shrinks from 4 KB → 512 B (§8.5.1).
+2. **All bulk tile stores (`TILE.ST`, `TILE.SCATTER`) acquire a branch tag at dispatch** and are gated through the **Speculative Tile-Store Queue** (STQ, §11.5) until their tag becomes non-speculative. Invisible at the ISA level; adds 0–6 cycles of latency to a tile store on the speculative path.
+
+### 2.5 Instruction Domain Identification
+
+> **(v1 → v2: content unchanged; the following is copied in full from v1 §2.5.)**
+
+The 7-bit opcode field encodes the instruction domain:
+
+| Opcode[6:5] | Domain | Decode path |
+|-------------|--------|-------------|
+| 00, 01 | Scalar | Scalar rename → Scalar RS |
+| 10 | Vector / MTE | Tile RAT rename → Vector RS or MTE RS |
+| 11 | Cube | Tile RAT rename → Cube RS |
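+
+The domain split in the table amounts to a two-bit lookup on `opcode[6:5]`. A minimal Python sketch, with the function name and the returned labels chosen here for illustration:
+
+```python
+def decode_domain(opcode):
+    """Map a 7-bit opcode to its instruction domain using opcode[6:5]."""
+    if not 0 <= opcode < 128:
+        raise ValueError("opcode must be a 7-bit value")
+    top = opcode >> 5
+    if top in (0b00, 0b01):
+        return "scalar"        # Scalar rename, then Scalar RS
+    if top == 0b10:
+        return "vector_mte"    # Tile RAT rename, then Vector RS or MTE RS
+    return "cube"              # Tile RAT rename, then Cube RS
+```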
+
+---
+
+## 3. Top-Level Block Diagram
+
+```
+ ┌──────────────────────────────────────────────────────────────────────────────────────────────┐
+ │  DAVINCI-v2 CORE (BCC Scalar Pipeline)                                                        │
+ │                                                                                              │
+ │  ┌─────────────────────────────── FRONT-END ──────────────────────────────────────────────┐  │
+ │  │   ┌──────────┐    ┌───────────┐    ┌──────────────┐   ┌──────────────┐              │  │
+ │  │   │  Branch   │───▶│  Fetch    │───▶│  IB (8 entries)│───▶│  F4 Register  │              │  │
+ │  │   │ Predictor │    │  (F0-F3)  │    │  4-wide sync  │   │  D1 handoff  │              │  │
+ │  │   │ TAGE+BTB  │    └──────────┘    └──────────────┘   └──────┬───────┘              │  │
+ │  │   │ +RAS      │                                                 │                        │  │
+ │  │   └──────────┘                                                 ▼                        │  │
+ │  │                              ┌─────────────────────────────────────────────┐             │  │
+ │  │                              │  D1: Decode + RID / atag Allocation        │             │  │
+ │  │                              │  - 4-wide decode (domain, opcode, operands)│             │  │
+ │  │                              │  - RID allocation (6-bit program order)      │             │  │
+ │  │                              └─────────────────┬───────────────────────┘             │  │
+ │  │                                                ▼                                       │  │
+ │  │                              ┌─────────────────────────────────────────────┐             │  │
+ │  │                              │  D2: Rename Request                         │             │  │
+ │  │                              │  - SMAP read for source ptags               │             │  │
+ │  │                              │  - ptag allocation from free list            │             │  │
+ │  │                              │  - MapQ push (speculative increment log)    │             │  │
+ │  │                              │  - SMAP live update (intra-group bypass)    │             │  │
+ │  │                              │  - Tile RAT / Tile-Meta RAT unchanged        │             │  │
+ │  │                              └─────────────────┬───────────────────────┘             │  │
+ │  │                                                ▼                                       │  │
+ │  │                              ┌─────────────────────────────────────────────┐             │  │
+ │  │                              │  D3: Rename Complete + Dispatch Prep        │             │  │
+ │  │                              │  - SMAP write (committed state)             │             │  │
+ │  │                              │  - Ready Table init (source ready bits)     │             │  │
+ │  │                              │  - IQ routing (alu_iq / bru_iq / lsu_iq)   │             │  │
+ │  │                              └─────────────────┬───────────────────────┘             │  │
+ │  └────────────────────────────────────────────────┼────────────────────────────────────────┘  │
+ │                                                   │ renamed muops + RID + ptags + MapQ entry      │
+ │  ┌────────────────────────────────────────────────┼────────────────────────────────────────┐  │
+ │  │                                                ▼                                        │  │
+ │  │  ┌──────────────────────────────────┐  ┌──────────────────────────────────────────┐  │  │
+ │  │  │  S1: Dispatch Preparation         │  │  S2: Dispatch Execute                    │  │  │
+ │  │  │  - Free list vacancy check        │  │  - IQ entry write (alu_iq / bru_iq /   │  │  │
+ │  │  │  - MapQ space check             │  │    lsu_iq)                             │  │  │
+ │  │  │  - IQ vacancy per type            │  │  - Free list update                    │  │  │
+ │  │  └──────────────────────────────────┘  │  - MapQ head advance                   │  │  │
+ │  │                                           └───────────────────┬──────────────────┘  │  │
+ │  │                                                               ▼                      │  │
+ │  │  ┌──────────────────────────────────┐  ┌──────────────────────────────────────────┐  │  │
+ │  │  │  P1: Issue Pick                  │  │  I1: RF Read Planning   │ I2: Confirm  │  │  │
+ │  │  │  - Ready Table bitmap query       │  │  - RF port arbitration  │  - IQ entry  │  │  │
+ │  │  │    (O(1) bit-test per ptag)     │  │  - 7-wide across IQs   │    dealloc   │  │  │
+ │  │  │  - Age-matrix cascaded pick       │  │                        │  - Port conf  │  │  │
+ │  │  │    (RID-based sub-head age)       │  │                        │              │  │  │
+ │  │  └──────────────────────────────────┘  └──────────────────────────────────────────┘  │  │
+ │  └───────────────────────────────────────────────────────────────────────────────────┘  │
+ │                                                                                              │
+ │  ┌─────────────────────────── EXECUTE ────────────────────────────────────────────────────┐  │
+ │  │       ▼              ▼             ▼             ▼              ▼                      │  │
+ │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐              │  │
+ │  │  │ 4x ALU   │  │  Load /  │  │  VEC-    │  │outerCube │  │   MTE    │              │  │
+ │  │  │ 1x MUL   │  │  Store   │  │  4K-v2   │  │   MXU    │  │  Engine  │              │  │
+ │  │  │ 1x BRU   │  │  Unit    │  │ 3R/2W    │  │(4096 MAC)│  │(LD/ST/  │              │  │
+ │  │  │ (alu_iq) │  │  + SSB   │  │ tiles    │  │          │  │G/S/MOVE)│              │  │
+ │  │  │          │  │ (lsu_iq) │  │(vec_iq)  │  │          │  │+ STQ(8) │              │  │
+ │  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘              │  │
+ │  └────────┼─────────────┼─────────────┼─────────────┼──────────────┼────────────────────┘  │
+ │           │             │             │             │              │                          │
+ │  ┌────────┼──────────── COMPLETE ─────┼─────────────┼──────────────┼────────────────────┐    │
+ │  │  CDB (6 ports, scalar)  +  TCB (4 ports, tile) + Ready Table (128-bit bitmap)      │    │
+ │  │  Ready Table: set bit[i] = ptag i ready on CDB writeback; clear on ptag alloc      │    │
+ │  └────────────────────────────────────────────────────────────────────────────────────────┘    │
+ │                                                                                              │
+ │  ┌──────────────── REGISTER FILES ───────────────────────────────────────────────────────┐  │
+ │  │  ┌──────────────────┐  ┌─────────────────────────────────────────────────────────────┐│  │
+ │  │  │ Scalar Physical   │  │ TRegFile-4K (with per-port is_transpose)                  ││  │
+ │  │  │ Register File     │  │ 256x4KB = 1MB; 8R+8W @ 512B/cy/port                      ││  │
+ │  │  │ 128x64b          │  │ 8-cycle epoch calendar                                    ││  │
+ │  │  │ 12R+6W ports     │  │ 32 b metadata per physical tile                          ││  │
+ │  │  └──────────────────┘  └─────────────────────────────────────────────────────────────┘│  │
+ │  └──────────────────────────────────────────────────────────────────────────────────────────┘  │
+ │                                                                                              │
+ │  ┌──────────────── RENAME STATE ─────────────────────────────────────────────────────────┐  │
+ │  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────────────────────┐           │  │
+ │  │  │ CMAP [32x7b]   │  │ SMAP [32x7b]   │  │ MapQ [12-entry ring buffer]   │           │  │
+ │  │  │ Committed map    │  │ Speculative map │  │ {atag, old_ptag, new_ptag,    │           │  │
+ │  │  │ Flush target     │  │ Active rename   │  │  rid, is_push_t/u}            │           │  │
+ │  │  └────────────────┘  └────────────────┘  └────────────────────────────────┘           │  │
+ │  └──────────────────────────────────────────────────────────────────────────────────────────┘  │
+ └──────────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+**BCC scalar pipeline deltas (highlighted):**
+
+- **Three-stage rename pipeline**: D1 (decode + RID/atag allocation) -> D2 (SMAP read + ptag allocation + MapQ push) -> D3 (SMAP write + Ready Table init + IQ routing)
+- **Three physical IQs**: `alu_iq` (48 entries, 4-wide), `bru_iq` (16 entries, 1-wide), `lsu_iq` (32 entries, 2-wide) — replace centralized Scalar RS
+- **Ready Table** (128-bit bitmap): replaces CDB comparator arrays for scalar wakeup; O(1) ptag lookup per source
+- **Age-matrix issue picker**: purely combinational cascaded pick using RID-based sub-head age (no per-entry age field)
+- **P1/I1/I2 issue stages**: explicit pick (P1) / RF-read arbitration (I1) / confirm (I2) separation
+- **MapQ** (12-entry ring buffer): replaces RAT checkpoint snapshots; instruction-precise recovery via reverse replay
+- **atag / ptag naming**: architectural registers = atag (0-31), physical registers = ptag (P0-P127)
+- **CMAP + SMAP + MapQ**: three-table rename model (committed / speculative / incremental log)
+- Branch-tag allocator at rename (unchanged from v2).
+- Tile Metadata RAT (32 b per physical tile) co-located with the Tile RAT (unchanged).
+- VEC-4K-v2 unit; 3R / 2W tile interface (unchanged).
+- Speculative Store Buffer (SSB, 24 entries) gates scalar stores by branch tag (unchanged).
+- Speculative Tile-Store Queue (STQ, 8 entries) gates MTE bulk stores by branch tag (unchanged).
+
+---
+
+## 4. Pipeline Overview
+
+The Davinci-v2 scalar pipeline is extended to **17+ stages** with the BCC-style scalar frontend. Branch-tag administration adds zero cycles — tags are allocated at D2 (alongside MapQ entry push) and propagated forward as one extra metadata field per IQ entry. The Tile RAT, Vector RS, Cube RS, MTE RS, and memory subsystems are unchanged by the BCC scalar pipeline change.
+
+```
+F0 → F1 → F2 → F3 → IB → F4 → D1 → D2 → D3 → S1 → S2 → P1 → I1 → I2 → E1 → … → EX_n → W1
+                                     ├── Rename ──┤           ├── Issue ──┤           ├── Execute ──┤
+```
+
+**v2.3 Block-ROB addition:**
+
+```
+  +--------------------------------------------------------------------------------------+
+  |  BROB -- Block Reorder Buffer (new in v2.3)                                              |
+  |  128-entry; tracks block lifetimes; scalar_done && engine_done gates retire              |
+  |  Block SSB (32) + Block STQ (16) for in-block store commit                            |
+  |  Provides block-granularity precise exception identification                            |
+  +--------------------------------------------------------------------------------------+
+```
+
+### 4.1 Complete Stage List (BCC Scalar Pipeline)
+
+| Stage | Name | Function |
+|-------|------|---------|
+| F0 | Fetch PC Select | PC mux (redirect/sequential/flush) |
+| F1 | I-Cache Lookup + BTB | Tag+BTB lookup, 4-way set-assoc |
+| F2 | I-Cache Response + Predict | Cache data return, TAGE/BTB prediction |
+| F3 | Stitch + BSTART Annotation | Cross-line stitch, BSTART boundary marking (deferred; N/A in this change) |
+| IB | Instruction Buffer | Depth-8 fetch/decode synchronization buffer |
+| F4 | Decode Handoff Register | D1 input register |
+| **D1** | **Decode + RID / atag Allocation** | Decode 4 instr, allocate RID (6-bit program-order ID), resolve atag for sources |
+| **D2** | **Rename Request** | Read SMAP, resolve source ptag, allocate ptag, push MapQ entry, T/U stack push |
+| **D3** | **Rename Complete + Dispatch Prep** | Write SMAP, resolve source readiness from Ready Table, IQ routing assignment |
+| **S1** | **Dispatch Preparation** | IQ resource checks (free list, MapQ space, IQ vacancy per type) |
+| **S2** | **Dispatch Execute** | IQ entry write, free list update, MapQ head advance |
+| **P1** | **Issue Pick** | Ready Table combinational query, age-matrix cascaded pick (RID-based sub-head age) |
+| **I1** | **Operand Read Planning** | Global physical RF read-port arbitration |
+| **I2** | **Issue Confirm** | IQ entry deallocation, RF read-port occupancy confirm |
+| E1–EX_n | Execute | Functional unit execution (variable latency) |
+| W1 | Writeback | CDB/TCB broadcast, Ready Table update, wakeup |
+| **BROB** | **16. BROB Retire** *(v2.3 Block-ROB only)* | Block completion check. If complete (`scalar_done && engine_done`): advance BROB head (commit Block SSB/STQ to SSB/STQ). If exception: deliver and squash. If incomplete: stall. Off the critical execution path. |
+
+**Total pipeline depth**: Fetch-to-WB = **17+ cycles** (~5 cycles longer than the 12-stage v1/v2 baseline, due to D1/D2/D3 rename split and P1/I1/I2 issue separation).
+
+### 4.B Pipeline Timing — Scalar ALU Instruction (BCC Scalar Pipeline)
+
+```
+  Cycle: 0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18
+  ────── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ──── ────
+  i0:    F0   F1   F2   F3   IB   F4   D1   D2   D3   S1   S2   P1   I1   I2   E1   W1
+  i1:         F0   F1   F2   F3   IB   F4   D1   D2   D3   S1   S2   P1   I1   I2   E1   W1
+  i2:              F0   F1   F2   F3   IB   F4   D1   D2   D3   S1   S2   P1   I1   I2   E1   W1
+  i3:                   F0   F1   F2   F3   IB   F4   D1   D2   D3   S1   S2   P1   I1   I2   E1   W1
+         └──── 6-cycle fetch-to-D1 ──┘
+                                       └─ Rename ─┘
+                                                      └─ S ─┘   └── P/I ───┘
+```
+
+**Note:** Pipeline timing is shown for a single-cycle ALU operation. The 17+ stage pipeline means an instruction takes ~11 cycles from D1 to WB. The additional latency does not reduce throughput — all 4 slots of each pipeline stage are occupied every cycle, maintaining 4-wide dispatch rate. Variable-latency ops (MUL, DIV, LD) follow the same stage layout but occupy EX for additional cycles.
+
+### 4.C Execution Latencies by Domain (unchanged from v2)
+
+| Domain | Operation | Stages | Latency (cycles) | Pipelined |
+|--------|-----------|--------|-------------------|-----------|
+| Scalar | ALU (add, logic, shift) | EX1 | **1** | yes |
+| Scalar | MUL | EX1–EX4 | **4** | yes |
+| Scalar | DIV | EX1–EX(12–20) | **12–20** | no |
+| Scalar | Branch resolve | EX1 | **1** | yes |
+| LSU | Load (L1 hit) | EX1–EX4 | **4** | yes |
+| LSU | Load (L2 hit) | EX1–EX(12) | **12** | yes |
+| LSU | Store | EX1–EX4 | **4** (addr+data) | yes |
+| Vector | VADD/VMUL/VFMA (full tile, elementwise) | 2 epochs (16 cy) | **16** (8 read + 8 write, compute hidden) | epoch-pipelined |
+| Vector | Reduce (VROWSUM/VCOLSUM/...) | 1 epoch + reduce | **16** (8 read + reduce + 8 write) | no |
+| Vector (v2) | TINV (128×128 FP32) | multi-epoch | **~33 K beats** (~22 µs @ 1.5 GHz) | no |
+| Vector (v2) | TMRGSORT (1024 FP32) | multi-epoch | **~220 cy** | no |
+| Cube | CUBE.OPA (N steps) | 19 + N | **N + 18** (first tile) | epoch-pipelined |
+| MTE | TILE.LD (contiguous, L2 hit) | mem + 1 write epoch | **72** (64 mem + 8 TRegFile write) | yes (across ports) |
+| MTE | TILE.ST (contiguous, L2) | 1 read epoch + mem | **72** (8 TRegFile read + 64 mem write) | yes (across ports) |
+| MTE | TILE.COPY | 2 epochs | **16** (8 read + 8 write) | epoch-pipelined |
+| MTE | TILE.ZERO | 1 write epoch | **8** (write zeros, no read) | yes |
+| MTE | TILE.GATHER (L2 hit) | mem + 1 write epoch | **72–128** (variable mem + 8 TRegFile write) | partially |
+| MTE | TILE.SCATTER (L2) | 1 read epoch + mem | **72–128** (8 TRegFile read + variable mem) | partially |
+| MTE | TILE.MOVE (rename-only) | — (D2) | **0** (move elimination, no execute) | — |
+| MTE | TILE.TRANSPOSE | 2 epochs | **16** (8 read + 8 write via transpose buffer) | no |
+| MTE | TILE.GET (element → GPR) | 1 read epoch + extract | **9** (8 TRegFile read epoch + 1 extract) | no (port occupied 8 cy) |
+| MTE | TILE.PUT (GPR → element, RMW) | 2 epochs | **16** (8 read + 8 write), **8** with copy elision | no |
+
+---
+
+### 4.3 Per-stage Actions (BCC Scalar Pipeline)
+
+| Stage | Function |
+|-------|---------|
+| **D1** | Decode 4 instr; allocate RID (6-bit program-order ID); resolve atag for each source operand; classify each operand as P (GPR, 32 regs), T (tile, 32 regs), or U (uncore) |
+| **D2** | Read SMAP for source ptags; allocate new ptag for P-dst; push MapQ entry; update SMAP (live for intra-group bypass); T/U stack push for tile/uncore operands |
+| **D3** | Write `smap_live` back to SMAP; clear Ready Table bits for newly allocated ptags and carry the source-ready bits from the Ready Table query; assign IQ routing (alu_iq / bru_iq / lsu_iq) |
+| **S1** | Check free list vacancy, MapQ space, IQ vacancy per routing target |
+| **S2** | Write IQ entries; advance free list head; advance MapQ head |
+| **P1** | Ready Table combinational query; age-matrix cascaded pick selects oldest-ready entries per IQ |
+| **I1** | Physical RF read-port arbitration across 7 issue slots (alu_iq x 4 + bru_iq x 1 + lsu_iq x 2) |
+| **I2** | Confirm IQ entry deallocation; confirm RF port occupancy |
+| **E1-EX_n** | Functional unit execution; CDB/TCB broadcast at W1 |
+| **W1** | Ready Table bitmap update: set ptag bits on CDB writeback. The bits were cleared at ptag allocation during rename (D2 to W1 = 8 cycles for a single-cycle op) |
+
+### 4.4 Pipeline Timing Notes
+
+- All execution latencies (Section 4.C) are measured from the E1 stage, not from fetch. The extended pipeline does not change the functional unit latencies.
+- Variable-latency ops (MUL, DIV, LD) follow the same D1/D2/D3/S1/S2/P1/I1/I2/E1 stage layout but occupy E1-EX_n for additional cycles.
+- The Ready Table is updated on the clock edge at W1 (D2 to W1 = 8 cycles for a single-cycle op). A renamed ptag is cleared from the Ready Table when it is allocated at rename and is set again on the CDB writeback cycle.
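+
+The cycle counts quoted in this section can be cross-checked against the stage list in §4.1. A minimal sketch, assuming a single-cycle ALU op (one E1 stage) and one stage per cycle; `STAGES` and `stage_distance` are illustrative names, not part of the design:
+
+```python
+# Stage order from the §4.1 stage list, with a single execute stage (E1).
+STAGES = ["F0", "F1", "F2", "F3", "IB", "F4", "D1", "D2", "D3",
+          "S1", "S2", "P1", "I1", "I2", "E1", "W1"]
+
+
+def stage_distance(src, dst, stages=STAGES):
+    """Number of cycles between two stages when each stage takes one cycle."""
+    return stages.index(dst) - stages.index(src)
+
+
+print(stage_distance("D2", "W1"))  # rename (D2) to writeback (W1)
+print(stage_distance("F0", "D1"))  # fetch-to-D1 distance
+```
+
+Under these assumptions the sketch gives 8 for D2 to W1 and 6 for F0 to D1, which agrees with the figures quoted in this section.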
+
+### 4.5 Branch Misprediction Penalty (BCC Scalar Pipeline)
+
+| Step | v1 | v2 BCC |
+|------|----|--------|
+| Detection (EX1) | 1 cy | 1 cy |
+| MapQ reverse replay + SMAP <- CMAP | -- | 1 cy (parallel with flush) |
+| RAT flash-restore (CMAP snapshot) | 1 cy | 1 cy |
+| SSB / STQ flush | -- | concurrent with RAT restore |
+| Ready Table reset | -- | concurrent (mask=ALL_ONES) |
+| Physical IQ CAM-clear | -- | concurrent (branch_tag match) |
+| Front-end refill | ~6 cy | ~6 cy |
+| **Total mispredict penalty** | **6 cy** | **6-7 cy** |
+
+The MapQ replay and Ready Table reset run in parallel with the RAT restore and branch-tag CAM-clear -- all within the single recovery cycle.
+
+---
+
+## 5. Front-End: Fetch & Branch Prediction
+
+> **(v1 → v2: Subsections 5.A / 5.B / 5.C are copied verbatim from v1 §5.1 / §5.2 / §5.3 with no content changes. The v2 additions are §5.1 Branch-tag allocator and §5.2 Static hint bit.)**
+
+### 5.A Fetch Unit (v1 §5.1, unchanged)
+
+The fetch unit delivers up to **4 aligned instructions per cycle** from the L1 instruction cache.
+
+| Parameter | Value |
+|-----------|-------|
+| Fetch width | **4** instructions / cycle (16 bytes) |
+| Fetch alignment | 16-byte aligned fetch block |
+| Instruction buffer | **16** entries (4-cycle decoupling) |
+| L1-I cache | **64 KB**, 4-way set-associative, 64 B line |
+| L1-I latency | **2** cycles (F1 + F2) |
+| I-TLB | 64 entries, fully associative |
+
+**Fetch pipeline:**
+
+```
+  F1: PC → I-TLB + L1-I tag lookup + BTB lookup + TAGE index
+  F2: L1-I data return (4 instructions) + TAGE prediction + RAS check
+      → push into instruction buffer (up to 16 entries)
+      → if predicted-taken: redirect PC at end of F2
+```
+
+### 5.B Branch Predictor (v1 §5.2, unchanged)
+
+The branch predictor uses a **hybrid scheme** combining three components.
+
+#### 5.B.1 TAGE Predictor (Conditional Branches)
+
+| Parameter | Value |
+|-----------|-------|
+| Base predictor | 4K-entry bimodal (2-bit saturating counters) |
+| Tagged tables | 5 tables: T1(512), T2(512), T3(1K), T4(1K), T5(1K) |
+| History lengths | 4, 8, 16, 32, 64 (geometric series) |
+| Tag width | 8–12 bits per entry |
+| Total storage | ~20 KB |
+| Prediction accuracy | ~95% (typical workloads) |
+
+#### 5.B.2 Branch Target Buffer (BTB)
+
+| Parameter | Value |
+|-----------|-------|
+| Entries | **2048** |
+| Associativity | 4-way set-associative |
+| Tag | partial PC (upper bits) |
+| Target | full 64-bit target address |
+| Hit latency | 1 cycle (available end of F1) |
+
+#### 5.B.3 Return Address Stack (RAS)
+
+| Parameter | Value |
+|-----------|-------|
+| Depth | **16** entries |
+| Push | on JAL/JALR to link register |
+| Pop | on JALR from link register (return pattern) |
+| Speculative management | checkpoint RAS top-of-stack pointer with RAT checkpoints |
+
+### 5.C Fetch Redirect Priorities (v1 §5.3, unchanged)
+
+```
+  Priority (highest to lowest):
+    1. Branch mispredict redirect (from EX1)  — flush + restart
+    2. BTB/TAGE taken-branch redirect (from F2) — next-cycle redirect
+    3. Sequential PC+16 (default)
+```
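+
+The priority order above can be read as a three-input PC mux. The following is a minimal sketch; the signal names (`mispredict_valid`, `predict_taken`, and the two target arguments) are hypothetical and do not appear in the design:
+
+```python
+FETCH_BYTES = 16  # one aligned 4-instruction fetch block
+
+
+def next_fetch_pc(pc, mispredict_valid=False, mispredict_target=0,
+                  predict_taken=False, predict_target=0):
+    """Select the next fetch PC using the priority order listed above."""
+    if mispredict_valid:          # 1. mispredict redirect from EX1
+        return mispredict_target
+    if predict_taken:             # 2. BTB/TAGE taken-branch redirect from F2
+        return predict_target
+    return pc + FETCH_BYTES       # 3. sequential PC+16 (default)
+```
+
+When both redirects are asserted in the same cycle, the EX1 mispredict target wins because it is tested first.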
+
+---
+
+### 5.1 Branch-tag allocator (v2 addition)
+
+A small hardware counter at D2 allocates a 3-bit branch_tag for each newly-decoded branch, drawn from the same 8-slot pool used by the v1 RAT-checkpoint store. Allocation policy is round-robin among free slots; when the pool is empty, the rename stage stalls (same condition as v1's checkpoint-pool exhaustion).
+
+The branch_tag is then attached to:
+
+- The branch's own RS entry.
+- All RS entries dispatched **after** the branch and **before** any older branch resolves.
+- All SSB entries created in the same window.
+- All STQ entries created in the same window.
+- Free-list pointers in the checkpoint snapshot.
+
+When the branch resolves correctly, the tag is freed and propagated as a "tag-clear" event to all consumers (RS / SSB / STQ). When it mispredicts, the tag becomes the "flush key" — every entry tagged with this branch (or any *younger* branch tag) is invalidated atomically (§11.4.4).
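+
+The allocation policy described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `BranchTagPool`, the `next_ptr` round-robin pointer, and returning `None` to signal a rename stall are illustrative choices, not part of the design. The flush path (invalidating tagged entries) is not modeled.
+
+```python
+class BranchTagPool:
+    """8-slot pool of 3-bit branch tags with round-robin allocation."""
+
+    def __init__(self, slots=8):
+        self.slots = slots
+        self.busy = [False] * slots
+        self.next_ptr = 0  # round-robin search start (illustrative)
+
+    def allocate(self):
+        """Return a free tag, or None when the rename stage must stall."""
+        for offset in range(self.slots):
+            tag = (self.next_ptr + offset) % self.slots
+            if not self.busy[tag]:
+                self.busy[tag] = True
+                self.next_ptr = (tag + 1) % self.slots
+                return tag
+        return None  # pool empty: stall rename
+
+    def free(self, tag):
+        """Release a tag once its branch has resolved."""
+        self.busy[tag] = False
+```
+
+With all eight tags in flight, `allocate()` returns `None` until some branch resolves and calls `free()`.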
+
+### 5.2 Static hint bit (v2 addition)
+
+The compiler may set the conditional-branch funct3's `H` bit (1 = predict taken on TAGE/BTB miss). The hint is consulted only on a predictor cold-miss; once TAGE has trained on the branch, dynamic prediction wins.
+
+---
+
+## 6. Decode & Rename
+
+> **BCC scalar pipeline change: §6.A–§6.D replace the v1/v2 scalar RAT with the three-table model. §6.E (Tile RAT) and §6.F (Tile Metadata RAT) are unchanged from v2.**
+
+### 6.A Decode Stage (D1)
+
+D1 processes **4 instructions per cycle**, allocating a **Rename ID (RID, 6-bit)** per instruction and resolving the architectural register indices (atag) for each source operand.
+
+| Function | Detail |
+|----------|--------|
+| Decode width | **4** instructions / cycle |
+| RID allocation | Unique 6-bit program-order ID per decoded instruction |
+| Domain classification | Opcode[6:5] -> scalar, vector, cube, MTE |
+| atag resolution | Source atag: architectural register index (0-31 for GPRs) |
+| Operand classification | `pclass`: P=GPR (32 regs), T=tile (32 regs), U=uncore, CARG=compile-time arg |
+| Immediate extraction | Sign-extend and format-dependent extraction |
+| Branch detection | Identify branch instructions for branch_tag allocation |
+
+**D1 output uop format:**
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `valid` | 1 b | Fetch bundle slot valid |
+| `pc` | 64 b | Instruction PC |
+| `opcode` | 12 b | Operation code |
+| `src[i].atag` | 6 b | Source i architectural register index |
+| `src[i].pclass` | 2 b | P=0, T=1, U=2, CARG=3 |
+| `dst.atag` | 6 b | Destination architectural register index |
+| `dst.pclass` | 2 b | Destination operand class |
+| `rid` | 6 b | Rename ID (program order; used for age-based issue pick) |
+| `checkpoint_id` | 4 b | MapQ entry ID (for flush recovery) |
+| `imm` | 64 b | Immediate value |
+
+### 6.B Rename Pipeline: D1 -> D2 -> D3
+
+#### 6.B.1 D2: Rename Request
+
+D2 performs register renaming in a single cycle, operating on the D1 output bundle in program order (slot 0 first, slot 3 last). Within a bundle, later slots can bypass the newly allocated ptag of earlier slots.
+
+**P-register (GPR) rename:**
+
+```
+# Per D2 slot (processed in program order; smap_live accumulates updates):
+smap_live = SMAP.copy()   # initial state from SMAP
+
+for slot in range(4):
+    u = d1_uop[slot]
+    if not u.valid: continue
+
+    # Step 1: Resolve source ptag from SMAP (live state)
+    if u.src[0].pclass == P:
+        src0_ptag = smap_live[u.src[0].atag]   # SMAP lookup or bypass from earlier slot
+    # ... same for src[1], src[2]
+
+    # Step 2: Allocate new ptag for P-destination
+    if u.dst.pclass == P and u.dst.atag != 0:   # atag=0 is r0 (hardwired zero)
+        old_ptag = smap_live[u.dst.atag]        # will become orphan
+        new_ptag = allocate_from_free_list(free_list)  # lowest-numbered free ptag
+
+        # Update SMAP (live for later slots in same group)
+        smap_live[u.dst.atag] = new_ptag
+
+        # Update refcount
+        refcount[old_ptag] -= 1
+        refcount[new_ptag] += 1
+        if refcount[old_ptag] == 0:
+            free_list.push(old_ptag)   # orphan freed immediately
+
+        # Push MapQ entry (for flush recovery)
+        mapq.push(MapQEntry(
+            valid=1,
+            atag=u.dst.atag,
+            old_ptag=old_ptag,
+            new_ptag=new_ptag,
+            rid=u.rid,
+            is_push_t=0,
+            is_push_u=0,
+        ))
+```
+
+**Intra-group bypass**: If a source atag matches a destination atag allocated in an earlier slot of the same group, the ptag is forwarded directly from `smap_live` without an SMAP lookup. Worst-case comparator cost (slot 3): 3 sources x 3 earlier slots = 9 comparators of 6 bits each, i.e. 54 bit-compares.
+
+**T/U operands**: Handled by the Tile RAT independently (unchanged from v2).
+
+**D2 uop output** (carried to D3 in a pipeline register):
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `src0_ptag` | 7 b | Resolved source ptag 0 |
+| `src1_ptag` | 7 b | Resolved source ptag 1 |
+| `src2_ptag` | 7 b | Resolved source ptag 2 (immediate/pc-rel) |
+| `pdst` | 7 b | Newly allocated destination ptag |
+| `dst_atag` | 6 b | Destination atag (for MapQ entry) |
+| `dst_class` | 2 b | P / T / U |
+| `has_dst` | 1 b | Whether this uop writes a register |
+| `src_ready` | 3 b | Ready Table init: {src0_rdy, src1_rdy, src2_rdy} from Ready Table query |
+
+#### 6.B.2 D3: Rename Complete + Dispatch Prep
+
+D3 is the **rename-complete boundary**. It finalizes the SMAP write and initializes the Ready Table state for each dispatched instruction:
+
+```
+# D3: finalize SMAP write
+SMAP = smap_live   # atomic write of live state to SMAP
+
+# Initialize Ready Table for each dispatched entry
+for slot in range(4):
+    u = d3_uop[slot]
+    if u.dst_class == P and u.has_dst:
+        # Clear the newly allocated ptag from Ready Table
+        # (not ready until CDB writeback)
+        ready_table.clear(u.pdst)
+
+    # Source ready bits were pre-computed from Ready Table query in D2
+    # (combinational; registered into IQ entry at S2)
+```
+
+D3 also assigns IQ routing:
+
+| Op type | Target IQ | Issue width |
+|---------|---------|-------------|
+| ALU (ADD/SUB/AND/OR/XOR/SLT/SLL/SRL/SRA/MUL) | `alu_iq` | 4-wide |
+| FSU (floating-point scalar ops) | `alu_iq` | 4-wide |
+| BRU (branch/jump) | `bru_iq` | 1-wide |
+| LSU (load/store) | `lsu_iq` | 2-wide |
+
+#### 6.B.3 Rename Register State Machine
+
+Three parallel structures manage P-reg rename:
+
+```
+CMAP [32 x 7b]:  atag -> committed ptag
+  - Updated when: ptag becomes orphan + refcount=0 (freed)
+  - Flush target: SMAP <- CMAP (full restore)
+
+SMAP [32 x 7b]:  atag -> speculative ptag (active rename view)
+  - Updated on each D2 group (in program order)
+  - Flush: SMAP <- CMAP via MapQ reverse replay
+
+MapQ [12-entry ring buffer]:
+  Fields: {atag, old_ptag, new_ptag, rid, is_push_t, is_push_u}
+  - D2: push entry for each P-dst rename
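+  - Flush: reverse replay from tail toward head while rid > flush_rid (stop at the first entry with rid <= flush_rid)
+  - Undo per entry: SMAP[atag] <- old_ptag; refcount[new_ptag]--; refcount[old_ptag]++
+  - After replay: SMAP reflects the rename state as of flush_rid (SMAP == CMAP when every uncommitted entry is undone)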
+```
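+
+The MapQ push and reverse-replay rules above can be exercised with a small model. This is a sketch under stated assumptions: the `MapQ` class, its tuple layout, and the `flush` signature are illustrative; RID wrap-around and the `is_push_t` / `is_push_u` fields are ignored for brevity.
+
+```python
+from collections import deque
+
+
+class MapQ:
+    """Incremental rename log (sketch; ignores RID wrap-around)."""
+
+    def __init__(self, depth=12):
+        self.depth = depth
+        self.entries = deque()  # head (oldest) on the left, tail on the right
+
+    def push(self, atag, old_ptag, new_ptag, rid):
+        if len(self.entries) >= self.depth:
+            raise RuntimeError("MapQ full: rename must stall")
+        self.entries.append((atag, old_ptag, new_ptag, rid))
+
+    def flush(self, smap, refcount, flush_rid):
+        """Undo every entry younger than flush_rid, youngest first."""
+        while self.entries and self.entries[-1][3] > flush_rid:
+            atag, old_ptag, new_ptag, _rid = self.entries.pop()
+            smap[atag] = old_ptag          # restore the previous mapping
+            refcount[new_ptag] -= 1
+            refcount[old_ptag] += 1
+```
+
+For example, after two renames of the same atag (rid 1, then rid 2), `flush(..., flush_rid=1)` undoes only the rid-2 entry, so the mapping created by rid 1 survives.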
+
+#### 6.B.4 Checkpoint Extensions (BCC Scalar Pipeline)
+
+RAT checkpoints are replaced by MapQ for P-reg rename recovery. The branch-tag allocator and SSB/STQ head-pointer snapshot remain unchanged from v2.
+
+| Component | v2 (RAT checkpoints) | BCC (MapQ) |
+|-----------|---------------------|------------|
+| P-reg recovery | 8 x 224-bit RAT snapshots | 12-entry incremental MapQ |
+| T-reg recovery | 8 x 256-bit Tile RAT snapshots | Unchanged (Tile RAT independent) |
+| Flush precision | Checkpoint at branch time | Instruction-precise via rid cut |
+| SSB/STQ recovery | Head pointer snapshot | Unchanged |
+
+### 6.C Free Lists
+
+| Parameter | Value |
+|-----------|-------|
+| **Scalar free list** | FIFO, 96 entries (128 ptags minus 32 atags) |
+| Scalar dequeue rate | up to 4 per cycle (P-destinations at D2) |
+| Scalar enqueue rate | up to 4 per cycle (orphan + refcount=0) |
+| **Tile free list** | FIFO, 224 entries (256 ptags minus 32 atags) — unchanged |
+
+**Stall condition**: If the free list cannot supply enough ptags for the current D2 group, the rename pipeline stalls at D1.
+
+### 6.D Intra-Group Bypass Logic
+
+When 4 instructions are renamed simultaneously, later instructions in the group may depend on earlier ones. Hardware **priority-encoded comparators** detect these intra-group dependencies:
+
+```
+  Scalar example:
+    D2 slot 0:  X5 -> P40  (destination)
+    D2 slot 1:  reads X5  -> comparator detects match -> bypass P40
+    D2 slot 2:  X5 -> P41  (re-definition)
+    D2 slot 3:  reads X5  -> comparator detects slot 2 match -> bypass P41
+
+    4 slots x 2 sources x 3 older slots = 24 comparators (7-bit each)
+    + 8 bypass MUXes (select forwarded ptag vs SMAP read)
+
+  Tile example (unchanged from v2):
+    D2 slot 0:  TILE.LD T10  -> PT200  (destination)
+    D2 slot 1:  VADD dst=T10 -> PT201  (re-definition)
+    D2 slot 2:  reads T10    -> comparator detects slot 1 match -> bypass PT201
+
+    4 slots x 3 tile sources x 3 older slots = 36 comparators (8-bit each)
+    + 12 bypass MUXes
+```
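+
+The scalar example above can be reproduced in software. In this minimal sketch the priority encoding is modeled by scanning older slots in program order so that the nearest older writer wins; the function name `rename_sources` and the tuple encoding of a rename group are hypothetical.
+
+```python
+def rename_sources(group, smap):
+    """Resolve source ptags for one rename group with intra-group bypass.
+
+    group: list of (src_atags, dst) tuples in program order, where dst is
+           None or an (atag, new_ptag) pair.
+    smap:  dict atag -> ptag as seen at the start of the group (not mutated).
+    """
+    resolved = []
+    for slot, (src_atags, _dst) in enumerate(group):
+        slot_ptags = []
+        for atag in src_atags:
+            ptag = smap[atag]              # default: SMAP read
+            for older in range(slot):      # later matches overwrite earlier ones
+                dst = group[older][1]
+                if dst is not None and dst[0] == atag:
+                    ptag = dst[1]          # bypass the forwarded ptag
+            slot_ptags.append(ptag)
+        resolved.append(slot_ptags)
+    return resolved
+
+
+# Slots 0..3 from the scalar example: X5 -> P40, read X5, X5 -> P41, read X5.
+group = [((), (5, 40)), ((5,), None), ((), (5, 41)), ((5,), None)]
+print(rename_sources(group, {5: 7}))
+```
+
+Slot 1 resolves X5 to P40 and slot 3 resolves it to P41, as in the example; the starting mapping `{5: 7}` is an arbitrary placeholder.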
+
+### 6.E Tile RAT (unchanged from v2, §6.2)
+
+All tile-consuming domains (vector, cube, MTE) share a single **Tile RAT** that renames 32 architectural tile registers (T0-T31) to 256 physical tile slots (PT0-PT255) in TRegFile-4K. The Tile RAT is **independent** of the P-reg rename pipeline (SMAP/CMAP/MapQ) and is **unchanged** from v2.
+
+| Parameter | Value |
+|-----------|-------|
+| Architectural tile registers | 32 (T0-T31) |
+| Physical tile registers | **256** (PT0-PT255), 4 KB each in TRegFile-4K |
+| Tile RAT storage | 32 entries x 8 bits = **256 bits** |
+| Read ports | **8** (up to 3 source tiles x 4 decode slots, shared/muxed) |
+| Write ports | **4** (1 destination tile x 4 decode slots) |
+
+### 6.F Tile Metadata RAT (unchanged from v2, §6.1)
+
+A **256 x 32 b SRAM** stores the metadata word `(shape.x, shape.y, format)` per physical tile. Access pattern unchanged from v2 §6.1.
+
+---
+
+## 7. Dispatch & Issue
+
+### 7.1 Dispatch (S1 / S2)
+
+After rename (D1->D2->D3), each uop is dispatched through the **S1 / S2** two-stage dispatch.
+
+**S1 -- Dispatch Preparation**: Checks resource availability before writing IQ entries.
+
+**S2 -- Dispatch Execute**: Writes IQ entries and updates free lists.
+
+**S1 resource checks:**
+
+| Check | Condition | Recovery |
+|-------|-----------|----------|
+| Scalar free list | `free_mask` has >= N free ptags for dispatched P-dst ops | Stall at D1 |
+| MapQ space | `mapq.count < 11` (keep 1 slot safety margin) | Stall at D1 |
+| `alu_iq` space | Free entries >= N, where N = alu-class uops in the current dispatch group | Stall at S1 |
+| `bru_iq` space | Free entries >= N, where N = bru-class uops in the current dispatch group | Stall at S1 |
+| `lsu_iq` space | Free entries >= N, where N = lsu-class uops in the current dispatch group | Stall at S1 |
+| Tile free list | >= N free tile ptags | Stall at D1 |
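+
+The scalar rows of the table above can be combined into one stall decision. This is a minimal sketch: the function name, its arguments, and the string return values are illustrative, and the tile free-list check is omitted.
+
+```python
+def s1_dispatch_check(group, free_ptags, mapq_count, iq_free):
+    """Return None if the group can dispatch, else the stage that stalls.
+
+    group:      list of (iq_name, has_p_dst) pairs for the valid uops.
+    free_ptags: number of free scalar ptags.
+    mapq_count: current MapQ occupancy.
+    iq_free:    dict iq_name -> free entries.
+    """
+    p_dsts = sum(1 for _iq, has_dst in group if has_dst)
+    if free_ptags < p_dsts:
+        return "D1"                       # scalar free list exhausted
+    if mapq_count >= 11:
+        return "D1"                       # MapQ safety margin reached
+    for iq_name, free in iq_free.items():
+        need = sum(1 for iq, _dst in group if iq == iq_name)
+        if free < need:
+            return "S1"                   # routed IQ lacks vacancy
+    return None
+```
+
+A group with two ALU ops, one load, and one branch dispatches only if the free list, the MapQ, and all three IQs have room in the same cycle.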
+
+**IQ routing (from D3 assignment):**
+
+| Op type | Target IQ | Issue width |
+|---------|---------|-------------|
+| ALU (ADD/SUB/AND/OR/XOR/SLT/SLL/SRL/SRA/MUL) | `alu_iq` | 4-wide |
+| FSU (floating-point scalar ops) | `alu_iq` | 4-wide |
+| BRU (branch/jump) | `bru_iq` | 1-wide |
+| LSU (load/store) | `lsu_iq` | 2-wide |
+| Vector ops | Vector RS | 1-wide |
+| Cube ops | Cube RS | 1-wide |
+| MTE ops | MTE RS | 2-wide |
+
+**S2 dispatch execute:**
+
+```
+for slot in dispatched_slots:
+    u = d3_uop[slot]
+    iq_type = s1_iq_route[slot]               # alu_iq / bru_iq / lsu_iq
+    entry_idx = iq_alloc.allocate(iq_type)    # free entry in the routed IQ
+    e = iq[iq_type][entry_idx]
+
+    e.valid = 1
+    e.src0_ptag = u.src0_ptag
+    e.src1_ptag = u.src1_ptag
+    e.src2_ptag = u.src2_ptag                 # alu_iq entries only
+    e.pdst = u.pdst
+    e.src_ready = u.src_ready
+    e.rid = u.rid
+    e.lsid = u.lsid                           # lsu_iq entries only
+    e.checkpoint_id = u.checkpoint_id
+    e.branch_tag = current_branch_tag
+
+free_mask &= ~allocated_ptags
+```
+
+### 7.2 Physical IQ Entry Formats
+
+#### ALU IQ Entry (48 entries, ~95 bits each)
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `valid` | 1 b | Entry occupied |
+| `rid` | 6 b | Rename ID (program order, used for age) |
+| `op` | 12 b | Operation code |
+| `imm` | 64 b | Immediate value |
+| `src0_ptag` | 7 b | Source 0 ptag |
+| `src1_ptag` | 7 b | Source 1 ptag |
+| `src2_ptag` | 7 b | Source 2 ptag |
+| `pdst` | 7 b | Destination ptag |
+| `has_dst` | 1 b | Whether this uop writes a register |
+| `src_ready` | 3 b | Ready bits: {src0_rdy, src1_rdy, src2_rdy} |
+| `checkpoint_id` | 4 b | MapQ entry ID |
+| `branch_tag` | 3 b | Speculation branch tag |
+
+**Total: 48 x ~95 b approx 570 B**
+
+#### BRU IQ Entry (16 entries, ~120 bits each)
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `valid` | 1 b | Entry occupied |
+| `rid` | 6 b | Rename ID (program order) |
+| `op` | 12 b | Operation code |
+| `pc` | 64 b | Instruction PC |
+| `src0_ptag` | 7 b | Source 0 ptag |
+| `src1_ptag` | 7 b | Source 1 ptag |
+| `pdst` | 7 b | Destination ptag (branch target register) |
+| `has_dst` | 1 b | Whether this uop writes a register |
+| `src_ready` | 2 b | Ready bits: {src0_rdy, src1_rdy} |
+| `checkpoint_id` | 4 b | MapQ entry ID |
+| `branch_tag` | 3 b | Speculation branch tag |
+| `pred_taken` | 1 b | Branch prediction direction |
+
+**Total: 16 x ~120 b approx 240 B**
+
+#### LSU IQ Entry (32 entries, ~104 bits each)
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `valid` | 1 b | Entry occupied |
+| `rid` | 6 b | Rename ID (program order) |
+| `op` | 12 b | Operation code |
+| `lsid` | 32 b | Load-Store ID |
+| `src0_ptag` | 7 b | Base register ptag |
+| `src1_ptag` | 7 b | Offset register ptag |
+| `pdst` | 7 b | Destination ptag |
+| `has_dst` | 1 b | Whether this uop writes a register |
+| `src_ready` | 2 b | Ready bits: {src0_rdy, src1_rdy} |
+| `checkpoint_id` | 4 b | MapQ entry ID |
+| `branch_tag` | 3 b | Speculation branch tag |
+| `addr_ready` | 1 b | AGU address computation complete |
+
+**Total: 32 x ~104 b approx 416 B**
+
+**Key difference from v1 RS entries**: No per-entry `age` field -- age is derived from the `rid` (6-bit Rename ID, program order). No CDB-updated per-entry `rdy1`/`rdy2` bits -- `src_ready` holds only the snapshot taken at rename; live ready state is maintained in the **Ready Table** and checked at issue time.
+
+### 7.3 Ready Table
+
+The Ready Table is a **128-bit bitmap** that tracks which ptags have valid values. It replaces the `O(iq_depth x issue_w x pregs)` CDB comparator array with `O(1)` bit-tests.
+
+```
+Ready Table: bit[i] = 1 means ptag i has a valid value
+
+set(ptag):    mask |= (1 << ptag)     # Called on CDB writeback
+clear(ptag):  mask &= ~(1 << ptag)    # Called on ptag allocation at D2
+is_ready(i):  return (mask >> i) & 1
+read(i):       return is_ready(i)       # Combinational read for can_issue
+```
+
+**Update rules per cycle:**
+
+| Event | Action |
+|-------|--------|
+| D2 dispatch: ptag allocated | `ready.clear(pdst)` -- set bit=0 |
+| CDB writeback | `ready.set(wb.ptag)` -- set bit=1 |
+| Flush | `ready.mask <= ALL_ONES` -- conservative reset |
+
+**Can_issue computation (P1, combinational):**
+
+```
+for each IQ entry e:
+    src0_rdy = ready_table.read(e.src0_ptag)   # O(1) bit-test
+    src1_rdy = ready_table.read(e.src1_ptag)
+    src2_rdy = ready_table.read(e.src2_ptag)
+    e.can_issue = e.valid & src0_rdy & src1_rdy & src2_rdy
+```
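+
+The bitmap operations and the `can_issue` test above fit in a few lines of runnable code. This is a minimal sketch that holds the 128-bit mask in a Python integer; the class and function names are illustrative.
+
+```python
+class ReadyTable:
+    """128-bit ready bitmap held in a Python int."""
+
+    def __init__(self, nregs=128):
+        self.all_ones = (1 << nregs) - 1
+        self.mask = self.all_ones          # same state as the flush reset
+
+    def set(self, ptag):                   # CDB writeback
+        self.mask |= 1 << ptag
+
+    def clear(self, ptag):                 # ptag allocation at rename
+        self.mask &= ~(1 << ptag) & self.all_ones
+
+    def is_ready(self, ptag):
+        return bool((self.mask >> ptag) & 1)
+
+
+def can_issue(valid, src_ptags, table):
+    """An entry may issue when it is valid and every source bit is set."""
+    return bool(valid) and all(table.is_ready(p) for p in src_ptags)
+```
+
+An entry waiting on P40 becomes issuable as soon as `set(40)` is applied, with no per-entry tag comparison.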
+
+**Wakeup timing:**
+
+| Cycle | Event |
+|-------|-------|
+| N (W1) | CDB broadcasts that ptag P40 is ready |
+| N (W1, end) | `ready_next = ready \| {P40}` computed |
+| Clock edge N -> N+1 | Ready Table register <= ready_next |
+| N+1 (P1) | `can_issue` recomputed with new Ready Table |
+| N+1 (P1) | Age-matrix pick selects winners |
+| N+2 (I1) | RF read-port arbitration |
+| N+3 (I2) | Issue confirm; IQ entry deallocated |
+
+**Total wakeup latency: 2 cycles** (Ready Table register -> can_issue -> pick -> RF read).
+
+### 7.4 Age-Matrix Issue Picker
+
+The issue picker is **purely combinational logic** (no state). For each physical IQ, a cascaded priority encoder selects the oldest-ready entries.
+
+**Age encoding**: RID is a 6-bit program-order counter. Sub-head age:
+```
+age = (entry.rid - head_rid) mod 64
+```
+Smaller age = older instruction = higher priority. Mod-64 arithmetic handles wrap correctly.
+
+**Cascaded pick (alu_iq, 4-wide):**
+
+```
+selected = []
+excluded = set()
+for lane in range(4):   # alu_w = 4
+    winner = None
+    best_age = 64       # Sentinel above the maximum age (63), so the youngest entry can still win
+    for entry in alu_iq.entries:
+        if entry not in excluded and entry.can_issue:
+            age = (entry.rid - head_rid) & 0x3F   # mod-64 wrap-friendly
+            if age < best_age:
+                best_age = age
+                winner = entry
+    if winner:
+        selected.append(winner)
+        excluded.add(winner)
+    else:
+        selected.append(None)   # Lane empty
+```
+
+**Per-IQ issue widths:**
+
+| IQ | Issue width | Description |
+|----|-------------|-------------|
+| `alu_iq` | 4 | 4x ALU + FSU |
+| `bru_iq` | 1 | 1x BRU |
+| `lsu_iq` | 2 | 1x Load + 1x Store |
+| Vector RS | 1 | 1x VEC-4K-v2 op |
+| Cube RS | 1 | 1x outerCube op |
+| MTE RS | 2 | 2x MTE ops |
+
+### 7.5 Issue Stages: P1 / I1 / I2
+
+#### P1 -- Issue Pick
+
+The P1 stage performs the **Ready Table query and age-matrix cascaded pick** (described in §7.3 and §7.4). Results are registered at the end of P1.
+
+#### I1 -- Operand Read Planning
+
+The I1 stage arbitrates **physical RF read-port access** across all 7 issue slots (alu_iq x 4 + bru_iq x 1 + lsu_iq x 2). The scalar RF has 12 read ports (8 from rename + 4 from issue). Port conflicts are resolved by priority.
+
+#### I2 -- Issue Confirm
+
+The I2 stage **deallocates IQ entries** for the selected instructions and confirms RF read-port occupancy. The physical RF read operation itself begins in I2 (registered input to RF) with data available at the start of E1.
+
+#### 7.4 GVIQ — Grouped Vector Issue Queue (VTG Micro-Instructions)
+
+> **(v2.2 BCC vector overlay — Change Point #2)**
+
+The GVIQ holds VTG vector micro-instructions waiting for their source VTGs to become ready. It is **separate from** the existing Vector RS (24 entries, full-tile `T*` ops). The two paths share the VEC-4K-v2 ALU, TRegFile ports (R0/R4 for reads, W0 for writeback), and SA/SB/SC staging registers.
+
+**VEC-domain arbitration:** The VEC-4K-v2 ALU is 1-wide. The VEC-domain arbiter grants ALU access based on readiness and priority. Full-tile VEC ops (from Vector RS) have **higher priority** than VTG micro-ops (from GVIQ) because they are coarser-grain and hold the prologue longer.
+
+**GVIQ entry** (before operand fields):
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `valid` | 1 b | Entry is live |
+| `block_id` | 12 b | Index into micro-instruction buffer (shared by all VTGs in this block) |
+| `pc_index` | 8 b | Current micro-instruction within block (0..63) |
+| `tile_group` | 5 b | Architectural tile T0..T31 |
+| `phys_tile` | 8 b | Physical tile PT0..PT255 (after Tile RAT rename) |
+| `group_id` | 4 b | VTG index: 0..15 (`G256`) or 0..7 (`G512`) |
+| `group_mode` | 1 b | `G256` (0) or `G512` (1) |
+| `thread_id` | 8 b | Scheduler context (usually = `group_id`) |
+| `iter0..iter3` | 4x16 b | Loop counters for VTG's current iteration |
+| `active_lanes` | 16 b | Active lane count or mask |
+| `active_group_mask` | 16 b | Which VTG groups are active in this block |
+
+**GVIQ operand fields** (after prefix):
+
+| Field | Width | Description |
+|-------|-------|-------------|
+| `src0_ptag` | 8 b | Physical tile tag for source VTG 0 |
+| `src1_ptag` | 8 b | Physical tile tag for source VTG 1 |
+| `src2_ptag` | 8 b | Physical tile tag for source VTG 2 |
+| `pred_ptag` | 8 b | Physical tile tag for predicate VTG |
+| `dst_ptag` | 8 b | Physical tile tag for destination VTG |
+| `has_dst` | 1 b | Whether this micro-op writes a VTG |
+| `src_ready` | 4 b | VTG-ready bits: src0/1/2/pred ready |
+| `vtg_ready` | 1 b | All source VTGs ready + loop counters ready |
+| `branch_tag` | 3 b | Branch tag for speculation gating |
+
+**Micro-instruction buffer** (in vector ALU):
+
+The micro-instruction buffer is a **set-associative buffer** (16 entries, 2-way) keyed by `block_id`. Each entry stores the pre-decoded micro-op list for a tile group:
+
+```text
+BufferEntry {
+  valid:     1 b
+  block_id:  12 b  [tag]
+  pc_limit:  8 b   [max pc_index]
+  micro_ops: array[64] of MicroOpEntry
+}
+
+MicroOpEntry {
+  opcode:     12 b   // VADD / VMUL / VCMP / VLD / VST / ...
+  elem_type:  4 b   // FP32 / FP16 / FP8 / FP4 / ...
+  pred_mode:  1 b   // 0=zeroing, 1=merging
+  src0_ref:   VTGRef | Scalar | Imm
+  src1_ref:   VTGRef | Scalar | Imm | None
+  src2_ref:   VTGRef | Scalar | Imm | None
+  dst_ref:    VTGRef
+  pred_ref:   VTGRef | implicit_all_true
+  imm:        32 b  // immediate or address offset
+}
+```
+
+At P1/I1 issue time, the GVIQ winner's `{block_id, pc_index}` drives a single-cycle buffer lookup. The `MicroOpEntry` drives VEC staging control, ALU opcode, and writeback routing — no re-decode needed.
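+
+The lookup can be sketched as follows. The text fixes only the geometry (16 entries, 2-way, keyed by `block_id`); the set-index function (`block_id % 8`), the replacement choice, and the helper names `install` / `lookup` are assumptions made for illustration.
+
+```python
+NUM_SETS, NUM_WAYS = 8, 2            # 16 entries, 2-way
+
+# buffer[set][way] is None or {"block_id", "pc_limit", "micro_ops"}
+buffer = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
+
+def install(block_id, micro_ops):
+    """Fill a free way of the block's set; falls back to way 0 (assumed policy)."""
+    ways = buffer[block_id % NUM_SETS]
+    way = next((w for w, e in enumerate(ways) if e is None), 0)
+    ways[way] = {"block_id": block_id,
+                 "pc_limit": len(micro_ops) - 1,
+                 "micro_ops": micro_ops}
+
+def lookup(block_id, pc_index):
+    """Return the micro-op, or None on a tag miss or pc_index beyond pc_limit."""
+    for entry in buffer[block_id % NUM_SETS]:
+        if entry is not None and entry["block_id"] == block_id:
+            if pc_index <= entry["pc_limit"]:
+                return entry["micro_ops"][pc_index]
+    return None
+
+install(0x012, ["VLD", "VADD", "VST"])
+install(0x01A, ["VMUL"])             # maps to the same set as 0x012
+hit = lookup(0x012, 1)               # "VADD"
+miss = lookup(0x022, 0)              # same set, no matching tag
+past_end = lookup(0x01A, 1)          # pc_index > pc_limit
+```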
+
+**VTG wakeup:** VTG readiness is tracked by a **256-bit VTG Ready Table** (one bit per physical tile PT0..PT255). When a VTG micro-op writes back, its `dst_ptag` sets the corresponding bit. When a VTG micro-op is dispatched, the bit for its `dst_ptag` is cleared, mirroring the scalar Ready Table rule (`ready.clear(pdst)` at D2).
+
+**VTG rotation scheduling:**
+
+```python
+while any_vtg_active(block):                   # loop until every VTG in the block has finished
+    winner = gviq.pick_oldest_ready()          # age = (rid - head_rid) mod 64
+    micro_op = buffer.lookup(winner.block_id, winner.pc_index)
+    SA = TRegFile.read(winner.src0_ptag)       # full 4 KB tile
+    SB = TRegFile.read(winner.src1_ptag)
+    SA_vtg = select_vtg(SA, winner.group_id, winner.group_mode)   # 256/512 B sub-range
+    SB_vtg = select_vtg(SB, winner.group_id, winner.group_mode)
+    result = vec_alu.execute(micro_op, SA_vtg, SB_vtg)
+    TRegFile.write_vtg(winner.dst_ptag, winner.group_id, result)
+    winner.pc_index += 1
+    if loop_end(winner):                       # end of loop body: count down iterN, jump back
+        winner.iterN -= 1
+        winner.pc_index = loop_start
+    if all_iters_done(winner):
+        winner.valid = 0
+```
+
+**Issue rules:**
+
+| Rule | Description |
+|------|-------------|
+| GVIQ-1 | `pc_index <= pc_limit` for the given `block_id` |
+| GVIQ-2 | All source VTG `src_ready` bits set |
+| GVIQ-3 | Active loop counter (`iter*`) non-zero |
+| GVIQ-4 | GVIQ is 1-wide: one VTG micro-op per cycle |
+| GVIQ-5 | VEC-4K-v2 ALU is single-ported per VTG: one VTG per VEC beat |
+| GVIQ-6 | Paired `G256` issue (optional): two independent 256 B VTGs share one 512 B SIMD group beat if `{opcode, elem_type, pred_mode}` match |
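+
+Rules GVIQ-1 to GVIQ-3 reduce to a per-entry predicate, sketched below. The `GviqEntry` dataclass, the bit order of `src_ready`, and the convention that unused sources are pre-marked ready are assumptions; GVIQ-4 to GVIQ-6 concern the picker and arbiter and are not modelled here.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class GviqEntry:
+    valid: bool
+    pc_index: int
+    src_ready: int      # 4 bits, assumed order: bit0=src0, bit1=src1, bit2=src2, bit3=pred
+    active_iter: int    # value of the currently active iter* counter
+
+ALL_SRC_READY = 0b1111
+
+def can_issue(entry, pc_limit):
+    in_range = entry.pc_index <= pc_limit             # GVIQ-1
+    srcs_ready = entry.src_ready == ALL_SRC_READY     # GVIQ-2
+    iter_live = entry.active_iter != 0                # GVIQ-3
+    return entry.valid and in_range and srcs_ready and iter_live
+
+ok = can_issue(GviqEntry(True, 5, 0b1111, 3), pc_limit=63)
+waiting = can_issue(GviqEntry(True, 5, 0b1011, 3), pc_limit=63)   # src2 not ready
+done = can_issue(GviqEntry(True, 5, 0b1111, 0), pc_limit=63)      # loop counter exhausted
+```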
+
+### 7.6 Dispatch Stall Conditions
+
+| Condition | Recovery |
+|-----------|----------|
+| Target IQ is full | Wait for entry to be freed |
+| Scalar free list empty | Wait for refcount-driven free |
+| MapQ space exhausted | Wait for MapQ eviction on branch resolve |
+| All branch-tag slots occupied | Wait for an in-flight branch to resolve |
+| SSB full | Wait for SSB entry to drain |
+| STQ full | Wait for STQ entry to drain |
+
+---
+
+## 8. Execution Units
+
+### 8.1 Scalar Unit
+
+> **(v1 → v2: content unchanged; the following is copied in full from v1 §8.1.)**
+
+The scalar unit contains **6 functional units** sharing the Scalar RS.
+
+#### 8.1.1 ALU (×4) — (v1 §8.1.1, unchanged)
+
+Four identical single-cycle ALUs handle integer arithmetic, logic, shift, and compare operations.
+
+| Parameter | Value |
+|-----------|-------|
+| Count | **4** symmetric ALUs |
+| Operations | ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU, LUI, AUIPC |
+| Latency | **1** cycle |
+| Throughput | **4** ops / cycle |
+| Input width | 64-bit |
+
+#### 8.1.2 MUL/DIV Unit (×1) — (v1 §8.1.2, unchanged)
+
+| Parameter | Value |
+|-----------|-------|
+| MUL latency | **4** cycles (pipelined, 1 MUL issued/cycle) |
+| MUL operations | MUL, MULH, MULHU, MULHSU, MULW |
+| DIV latency | **12–20** cycles (non-pipelined, blocks MUL during execution) |
+| DIV operations | DIV, DIVU, REM, REMU, DIVW, DIVUW |
+
+#### 8.1.3 Branch Unit (×1) — (v1 §8.1.3, unchanged)
+
+| Parameter | Value |
+|-----------|-------|
+| Latency | **1** cycle (compare + resolve) |
+| Operations | BEQ, BNE, BLT, BGE, BLTU, BGEU, JAL, JALR |
+| On correct prediction | Deallocate checkpoint; no pipeline impact |
+| On mispredict | Flash-restore RAT; flush pipeline stages F1–IS; redirect fetch |
+| Mispredict penalty | **6** cycles (front-end refill) |
+
+### 8.2 Load/Store Unit (LSU)
+
+> **(v1 → v2: the §8.2.1 architecture and §8.2.2 parameters are copied in full from v1. v2 replaces v1's §8.2.3 simplified commit with the SSB (§8.2.3).)**
+
+The LSU handles all scalar memory operations with a **simplified** design enabled by the no-exception guarantee. The LSU pipeline is identical to v1 in terms of address calculation, TLB access, cache lookup, and L1-D MSHRs. The **store path** is the only structural change: stores no longer commit directly to L1-D; instead, they pass through a **Speculative Store Buffer** (SSB) that gates them by branch tag.
+
+#### 8.2.A Architecture (v1 §8.2.1, unchanged)
+
+```
+  ┌──────────────────────────────────────────────────────────┐
+  │  Load/Store Unit                                          │
+  │                                                          │
+  │  LSU RS (24 entries) ──┬──▶ Load Pipeline  (EX1–EX4)    │
+  │                        └──▶ Store Pipeline (EX1–EX4)    │
+  │                                                          │
+  │  ┌─────────────────┐    ┌─────────────────┐             │
+  │  │ Load Queue       │    │ Store Buffer →  │             │
+  │  │ (16 entries)     │    │  SSB (24 ent.)  │  ← v2       │
+  │  │ addr + tag       │    │ addr + data +   │             │
+  │  │                  │    │ branch_tag      │             │
+  │  └────────┬────────┘    └────────┬────────┘             │
+  │           │  store-to-load       │                       │
+  │           │◀─ forwarding ────────┘                       │
+  │           ▼                      ▼                       │
+  │      ┌──────────────────────────────┐                    │
+  │      │        L1-D Cache (64 KB)    │                    │
+  │      │    4-way, 64B line, 8 MSHRs  │                    │
+  │      └──────────────────────────────┘                    │
+  └──────────────────────────────────────────────────────────┘
+```
+
+#### 8.2.B Key Parameters (v1 §8.2.2, unchanged except store buffer entries)
+
+| Parameter | v1 | v2 |
+|-----------|----|----|
+| Load pipeline latency | **4** cycles (address calc + TLB + cache access + align) | **4** cycles (unchanged) |
+| Store pipeline latency | **4** cycles (address calc + TLB + write to store buffer) | **4** cycles (unchanged) |
+| Load queue entries | **16** | **16** (unchanged) |
+| Store buffer entries | **16** | **24** (now SSB §11.4) |
+| Store-to-load forwarding | Full forwarding when address and size match | Full forwarding (now from SSB §11.4.3) |
+| L1-D MSHRs | **8** (non-blocking, 8 outstanding misses) | **8** (unchanged) |
+| D-TLB | 64 entries, fully associative | unchanged |
+
+#### 8.2.1 Speculative Store Buffer (SSB) — overview
+
+```
+  ┌────────────────────────────────────────────────────────────┐
+  │  Speculative Store Buffer  (SSB, 24 entries)              │
+  │                                                            │
+  │  Each entry:                                               │
+  │   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐       │
+  │   │valid │ btag │ addr │ data │ size │ alloc│ drain│       │
+  │   │ (1b) │ (3b) │ (40b)│(128b)│ (3b) │_age  │_rdy  │       │
+  │   │      │      │      │      │      │ (6b) │ (1b) │       │
+  │   └──────┴──────┴──────┴──────┴──────┴──────┴──────┘       │
+  │                                                            │
+  │   Total: 24 × ~182 b ≈ ~550 B                              │
+  └────────────────────────────────────────────────────────────┘
+```
+
+| Field | Width | Purpose |
+|-------|-------|---------|
+| `valid` | 1 b | Slot is occupied |
+| `btag` | 3 b | Branch tag inherited from the producing store µop |
+| `addr` | 40 b | Physical address (post-TLB) |
+| `data` | 128 b | Up to 16 B of store data (covers byte/half/word/dword) |
+| `size` | 3 b | 1/2/4/8/16 B store width |
+| `alloc_age` | 6 b | Sequence number for in-order drain |
+| `drain_rdy` | 1 b | Set when `btag = 0xFF` (non-speculative); the entry can drain to L1-D |
+
+**Capacity:** 24 entries — a 50% increase over v1's 16-entry "store buffer". The increase is driven by the speculation window: at maximum, all 8 branch tags can have their stores in flight, and each branch may generate several stores. With branch-prediction accuracy ~95% and a typical kernel mix of 20–30% memory ops, 24 entries provide ~10 cycles of buffering at peak issue (2 stores/cycle into a smaller pool would risk dispatch stall).
+
+#### 8.2.2 SSB drain policy
+
+```
+  Tag-clear from speculation tracker (when branch resolves correctly):
+    For each SSB entry e:
+      if e.btag == cleared_tag:
+         e.btag ← (next-older unresolved branch tag, or 0xFF if none)
+         if e.btag == 0xFF: e.drain_rdy ← 1
+
+  Drain to L1-D (1 store/cycle, oldest-first among drain_rdy entries):
+    Pick oldest e with valid && drain_rdy
+    Issue write to L1-D pipeline (with 4-cy occupancy, like v1 store)
+    On completion: e.valid ← 0; SSB entry returned to free pool
+
+  Mispredict (tag invalidation):
+    For each SSB entry e:
+      if e.btag is younger than (or equal to) mispredicted_tag:
+         e.valid ← 0   (entry invalidated; never reaches L1-D)
+```
+
+The SSB head pointer is the **next allocation slot** (FIFO discipline for in-flight ordering); the head is snapshotted into each branch checkpoint at D2.
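+
+A minimal sketch of the three drain-policy cases, assuming the unresolved branches are kept as a list ordered oldest to youngest and SSB entries are plain dicts. The function names and the list representation are hypothetical; `alloc_age` wrap-around and the 4-cycle L1-D occupancy are ignored.
+
+```python
+NONSPEC = 0xFF   # sentinel from the text: no unresolved branch guards this store
+
+def resolve_correct(ssb, unresolved, tag):
+    """Branch `tag` resolved correctly: re-tag its stores to the next-older branch."""
+    idx = unresolved.index(tag)
+    older = unresolved[idx - 1] if idx > 0 else NONSPEC
+    unresolved.remove(tag)
+    for e in ssb:
+        if e["valid"] and e["btag"] == tag:
+            e["btag"] = older
+            e["drain_rdy"] = older == NONSPEC
+
+def mispredict(ssb, unresolved, tag):
+    """Branch `tag` mispredicted: drop stores tagged with it or any younger branch."""
+    idx = unresolved.index(tag)
+    squashed = set(unresolved[idx:])
+    del unresolved[idx:]
+    for e in ssb:
+        if e["btag"] in squashed:
+            e["valid"] = False
+
+def drain_one(ssb):
+    """Free and return the oldest drain-ready store, or None if there is none."""
+    ready = [e for e in ssb if e["valid"] and e["drain_rdy"]]
+    if not ready:
+        return None
+    e = min(ready, key=lambda x: x["alloc_age"])
+    e["valid"] = False
+    return e
+
+unresolved = [1, 2]                                   # branch 1 is older than branch 2
+ssb = [
+    {"valid": True, "btag": 1, "addr": 0x100, "alloc_age": 0, "drain_rdy": False},
+    {"valid": True, "btag": 2, "addr": 0x108, "alloc_age": 1, "drain_rdy": False},
+]
+resolve_correct(ssb, unresolved, 1)   # store at 0x100 becomes non-speculative
+mispredict(ssb, unresolved, 2)        # store at 0x108 is squashed
+drained = drain_one(ssb)              # the store at 0x100
+```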
+
+#### 8.2.3 Store-to-load forwarding from SSB
+
+Loads still forward from the SSB on address match (same semantics as v1's store-to-load forwarding). Only loads whose branch tag is the **same as or younger than** the store's branch tag are eligible to forward; a load on a different speculation path must NOT forward from that store and instead goes to L1-D.
+
+```
+  Load (addr, btag_load) forwards from SSB entry e iff:
+    e.valid && e.addr == addr && e.size matches &&
+    e.btag is "ancestor" of btag_load
+            ──────────────────────
+            i.e. e.btag is in the chain of unresolved branches
+            that btag_load also depends on.
+```
+
+Implementation: the speculation tracker (§11.3) maintains an 8 × 8 ancestry bitmap; load-forwarding eligibility is one bitmap-lookup + AND.
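+
+One way to picture the bitmap check is sketched below. How rows are populated (`allocate_tag`), the choice to count a tag as its own ancestor, and the treatment of non-speculative stores and loads are assumptions; address and size matching are left out.
+
+```python
+NUM_TAGS = 8
+NONSPEC = 0xFF
+
+# ancestors[t] is an 8-bit row: bit a is set when tag a is an unresolved branch
+# that tag t depends on (each tag is treated as its own ancestor here).
+ancestors = [0] * NUM_TAGS
+
+def allocate_tag(tag, parent):
+    """Assumed bookkeeping at branch dispatch: inherit the parent's row, add self."""
+    row = 0 if parent == NONSPEC else ancestors[parent]
+    ancestors[tag] = row | (1 << tag)
+
+def can_forward(store_btag, load_btag):
+    if store_btag == NONSPEC:      # non-speculative store: visible to every load (assumption)
+        return True
+    if load_btag == NONSPEC:       # non-speculative load never sees speculative stores
+        return False
+    return bool(ancestors[load_btag] & (1 << store_btag))   # one lookup + AND
+
+allocate_tag(0, NONSPEC)
+allocate_tag(1, 0)
+allocate_tag(2, 1)
+older_store = can_forward(store_btag=1, load_btag=2)     # True: tag 1 is an ancestor of tag 2
+younger_store = can_forward(store_btag=2, load_btag=1)   # False: store is on a younger path
+```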
+
+#### 8.2.4 SSB area
+
+| Block | Area |
+|-------|------|
+| 24 × 182 b flip-flop array | ~24 × 1.8 K gate ≈ ~45 K gate |
+| Address CAM (24 × 40 b for forwarding) | ~30 K gate |
+| Branch-tag ancestry bitmap (8 × 8 b) | ~1 K gate |
+| Drain FSM | ~2 K gate |
+| **Total** | **~80 K gate** (~0.02 mm² @ 5 nm) |
+
+### 8.3 Vector Unit — VEC-4K-v2
+
+The vector unit is a full re-architecting of v1's vector unit, specified in detail in [`vector4k_v2.md`](vector4k_v2.md). This section summarizes the integration into the Davinci-v2 core; refer to the standalone document for full datapath, microcode, and worked microcode examples.
+
+#### 8.3.1 High-level structure
+
+```
+  ┌──────────────────────────────────────────────────────────────────────┐
+  │  VEC-4K-v2 Unit                                                       │
+  │                                                                      │
+  │  TRegFile-4K read ports R0, R4 (with is_transpose)                   │
+  │       │  512 B/cy each                                               │
+  │       ▼                                                              │
+  │  ┌──────────┐  ┌──────────┐  ┌──────────┐                            │
+  │  │ SA       │  │ SB       │  │ SC       │  ← staging registers       │
+  │  │ 4 KB     │  │ 4 KB     │  │ 4 KB     │     (24 × 1R1W SRAM        │
+  │  │ SRAM     │  │ SRAM     │  │ SRAM     │      macros total, §9.1    │
+  │  │ +meta    │  │ +meta    │  │ +meta    │      of vector4k_v2.md)    │
+  │  └─────┬────┘  └─────┬────┘  └────┬─────┘                            │
+  │        │ 512 B/cy    │ 512 B/cy   │ 1-bit-per-lane mask              │
+  │        ▼             ▼            ▼                                  │
+  │  ┌───────────────────────────────────────────────────────────┐       │
+  │  │  Stage (A): align / unpack / permute (per operand slot)    │       │
+  │  └─────────────────────────────┬───────────────────────────────┘       │
+  │                                ▼                                      │
+  │  ┌───────────────────────────────────────────────────────────┐       │
+  │  │  Stage (B): 128 compute groups (FP32-equivalent FMA core   │       │
+  │  │             with intra-group format widener), Acc feedback,│       │
+  │  │             per-lane mask gate, 256-lane shuffle+CAS       │       │
+  │  │             primitive (for TMRGSORT)                        │       │
+  │  └───────────────────────────────────────────────────────────┘       │
+  │                                                                       │
+  │  ┌───────────────────────────────────────────────────────────┐       │
+  │  │  Acc (256 × 32 b × 2 ping-pong, parity-banked LO/HI)       │       │
+  │  └───────────────────────────────────────────────────────────┘       │
+  │                                                                       │
+  │  ┌──────────────┐    ┌──────────────┐                                 │
+  │  │ Pack → D0    │    │ Pack → D1    │                                 │
+  │  │ → W0 (512B/cy)│    │ → W4 (512B/cy)│                               │
+  │  └──────────────┘    └──────────────┘                                 │
+  │                                                                       │
+  │  Microcode ROM: ~64 b × 64 beats × 256 programs ≈ 128 KB              │
+  └──────────────────────────────────────────────────────────────────────┘
+```
+
+#### 8.3.2 Key parameters
+
+| Parameter | Value | Note |
+|-----------|-------|------|
+| Compute datapath width | 512 B / 128-lane FP32-equivalent | unchanged from v1 |
+| TRegFile read ports used | R0 (Port A), R4 (Port B) | per epoch; 8 cy/epoch |
+| TRegFile write ports used | W0 (D0), W4 (D1) | per epoch |
+| Staging: SA, SB, SC payload | 4 KB each; 1R1W SRAM (recommended) | 24 × `512 b × 16 × 1R1W` macros |
+| Compute groups | 128 (format-independent) | Per-format SIMD width: FP32×1, FP16×2, FP8×4, FP4×4(×2 sub-beats) |
+| Accumulator | 256 × 32 b × 2 ping-pong, LO/HI parity-banked | `N_run = 512` |
+| Microcode ROM | ~128 KB, i.e. ~1 Mb (64 b × ~64 beats × ~256 programs) | regenerable per ISA version |
+
+#### 8.3.3 Tile metadata flow
+
+Each VEC-4K-v2 instruction at D2 rename consults the **Tile Metadata RAT** (§6.1) for each source tile and forwards `(shape.x, shape.y, format)` to the VEC RS entry. At issue, the metadata is propagated into the `SOP` staging register ([`vector4k_v2.md`](vector4k_v2.md) §4.4) where it stays stable for the entire compute phase. The destination tile's metadata is **derived** from the operation's semantics and `retire_format_*` fields (e.g. `VCVT` produces metadata with `format = retire_format_0`, same shape as source).
+
+For backward-compatibility with v1 binaries, a tile that has never been written by a v2 instruction (and therefore has no explicit metadata) defaults to `(shape.x, shape.y, format) = (16, 64, FP32)` — the canonical v1 4 KB tile interpretation.
+
+#### 8.3.4 Microcode beat machine
+
+Compute is driven beat-by-beat from a microcode ROM (`SOP.ucode_base`, `SOP.ucode_len`). Each beat word (~64 b) names per ALU operand slot:
+
+- `src` ∈ {SA, SB, ACC_READ_LO, ACC_READ_HI, SX_broadcast, SY_broadcast, IMM_ZERO}
+- `s` (strip index 0..7)
+- `tilelet_xpose` (per-beat, primary transpose mechanism)
+- `mask_src` ∈ {SC_mask, IMM_ALL_ONES, IMM_FROM_SOP}
+- `alu_op` ∈ {ADD, SUB, MUL, FMA, FNMA, MAX, MIN, CMP, AND, XOR, PASS_A, PASS_B, SELECT, RECIP, RSQRT, SHUFFLE_CAS_UP, SHUFFLE_CAS_DOWN, …}
+- `acc_op` ∈ {NONE, INIT, ACCUM, MERGE_STAGE, READOUT}
+- `wr_en_D0`, `wr_en_D1`
+
+Microcode is keyed by `(opcode, format, W-regime, R-regime)`; the ROM is regenerable in software per ISA version (no RTL change).
+
+#### 8.3.5 Operand-fetch prologue
+
+Operand fetch is **variable-length** ([`vector4k_v2.md`](vector4k_v2.md) §6):
+
+| `N_val` | `is_xpose` mix | `has_mask` | `T_fetch` (best/worst) |
+|--------:|----------------|-----------:|------------------------:|
+| 1 | any | 0 or 1 | **8 / 15 cy** |
+| 2 | uniform | 0 or 1 | **8 / 15 cy** |
+| 2 | mixed (R2 penalty) | 0 or 1 | **16 / 23 cy** |
+
+The mask fetch (1–2 strips of `SC`) piggybacks on an idle port cycle within a value-tile epoch and never extends the prologue in the common case.
+
+#### 8.3.6 Per-operand `is_transpose` and per-beat `tilelet_xpose`
+
+VEC-4K-v2 forwards each operand's `is_xpose` bit (latched at D2 rename, present in the RS entry §7.2) to the assigned TRegFile read port at issue time. The TRegFile delivers either row-mode or col-mode strips; the staged content reflects the requested mode.
+
+Per-beat `tilelet_xpose` (microcode bit per operand slot per beat) re-transposes the staged tile inside the staging register's diagonal-skew read datapath at no scheduling cost. Most reduction kernels use `tilelet_xpose` exclusively (TRegFile-side bit defaults to 0); the TRegFile-side bit is reserved for cases where the same transposed view is reused many times.
+
+#### 8.3.7 Latency table (selected ops)
+
+| Op | Format | Shape | Latency |
+|----|--------|-------|---------|
+| VADD / VMUL / VFMA_ACC (binary or Acc-feedback ternary) | any | 1024..8192 elements | 16 cy (8 fetch + 8 retire); throughput 1/8 cy |
+| **VFMA / VFNMA** (native 3-source, `c_role=VALUE`, §2.2.6a) | any | 1024..8192 elements | **16 cy (8 fetch on R0/R4/R1 + 8 retire); throughput 1/8 cy** — same as binary thanks to 3-port parallel fetch |
+| **VLERP** (native 3-source, dual retire D0/D1, §2.2.6a) | any | 1024..8192 elements | **24 cy (8 fetch + 16 retire); throughput 1/16 cy** |
+| VADD masked | any | any | same as unmasked (mask piggybacks) |
+| VFMA_ACC with `is_xpose_A ≠ is_xpose_B` | any | any | **24 cy** (16 fetch + 8 retire); throughput 1/16 cy |
+| **VFMA with one mismatched `is_xpose_*`** | any | any | **24 cy** (16 fetch + 8 retire); throughput 1/16 cy |
+| **VFMA with all three `is_xpose_*` distinct** (degenerate) | any | any | **32 cy** (24 fetch + 8 retire); throughput 1/24 cy |
+| VROWSUM (wide, R=8 C=128 FP32) | FP32 | 8×128 | 16 cy fetch + 13 compute + 8 retire = **37 cy** (recommended baseline, no cross-group tree) |
+| VROWSUM (alt config with cross-group tree) | FP32 | 8×128 | 16 + 9 + 8 = **33 cy** |
+| VCOLSUM (wide) | FP32 | 8×128 | 8 + 9 + 8 = **25 cy** |
+| TMRGSORT N=256 FP32 | FP32 | 256 elements | 8 + 36 + 8 = **52 cy** |
+| TMRGSORT N=1024 FP32 (1 tile) | FP32 | 1024 elements | 8 + 220 + 8 = **236 cy** |
+| TINV 32×32 FP32 (1 tile) | FP32 | 32×32 | 8 + ~2 200 + 8 = **~2.2 K cy** |
+| TINV 128×128 FP32 (16 tiles) | FP32 | 128×128 | ~120 cy fetch + ~33 K + 8 retire = **~33 K cy** |
+| TROWRANGE_MUL 8 strips | FP32 | 8×128 | 8 + 10 + 8 = **26 cy** |
+
+#### 8.3.8 TRegFile-4K port usage
+
+A single VEC-4K-v2 instruction occupies:
+
+- 1–2 TRegFile read ports for `T_fetch` cycles (R0 + optional R4)
+- 1–2 TRegFile write ports for the 8-cycle retire epoch (W0 + optional W4 if `retire_mask = 2'b11`)
+
+**Epoch-pipelined throughput** is preserved: the retire epoch of instruction N overlaps with the fetch epoch of instruction N+1 on independent ports. The per-port allocation table (§9.2.4) is unchanged from v1 in topology — VEC-4K-v2 happens to use the same R0/R4 + W0/W4 binding as v1.
+
+#### 8.3.9 Speculation handling
+
+VEC-4K-v2 instructions are speculation-safe **inherently**: their only "external" side effect is a tile write to TRegFile-4K, which is renamed and reference-counted (§10.5). On branch mispredict:
+
+- VEC RS entries with `btag` in the misprediction's flush set are invalidated.
+- In-flight compute beats are flushed within 1 cycle (the staging-register state is overwritten by the next instruction's operand fetch).
+- Destination physical tiles (allocated at D2) are returned to the tile free list via free-list-head-pointer restore from the checkpoint.
+
+#### 8.3.10 VTG Vector Micro-Instruction Execution Mode (v2.2)
+
+> **(Change Point #2 — VTG / SIMD-group overlay)**
+
+The VEC-4K-v2 datapath supports two execution modes. **VTG operates behind VEC-4K-v2's staging registers** -- VTGs share SA/SB/SC with the full-tile path and do not introduce new staging structures or new TRegFile ports. VTG operands are sub-ranges of tiles that VEC-4K-v2 has already staged in SA/SB/SC via its 8-cycle operand-fetch prologue. VTG reads from SA/SB at the **ALU input mux** after the prologue completes.
+
+| Mode | Execution unit | VTG size | SIMD groups | TRegFile access |
+|------|---------------|----------|-------------|----------------|
+| **Full-tile `T*`** | One VEC tile op | 4 KB | Full epoch (8 cycles) | Full 4 KB read/write |
+| **VTG `V*` micro-op** | One VTG micro-op | 256 B or 512 B | 1 or 2 SIMD group beats | VTG sub-range via Group Read/Write Adapters. **VTG latency: 25-32 cy minimum** (8-15 cy prologue + 1 cy compute + 16 cy RMW writeback). VTG operates behind VEC-4K-v2 staging (SA/SB/SC) and reuses TRegFile ports (R0/R4/W0). |
+
+**VEC staging reuse for VTG micro-ops:**
+
+| Staging Register | VTG Micro-Op Use |
+|-----------------|-----------------|
+| `SA` | VEC-4K-v2 operand A staging (4 KB, filled by R0). VTG reads 256/512 B sub-range from SA at ALU input mux via Group Read Adapter. |
+| `SB` | VEC-4K-v2 operand B staging (4 KB, filled by R4). VTG reads 256/512 B sub-range from SB at ALU input mux. Scalar broadcast via SX/SY unchanged. |
+| `SC` | Predicate VTG (or third source VTG for wide ops) |
+| `SX / SY` | Scalar operand broadcast / loop counter broadcast |
+| `SOP` | VEC beat word from micro-instruction buffer (pre-decoded VEC beat-word sequence generated by VTG microassembler at decode time). One `VECBeatWord` drives VEC ALU for one cycle. |
+
+**SIMD lane mapping:**
+
+| Element type | Lanes per 512 B VTG | Lanes per 256 B VTG |
+|-------------|---------------------|---------------------|
+| FP32 / INT32 | **128** | 64 |
+| FP16 / BF16 | **256** | 128 |
+| FP8 | **512** | 256 |
+| FP4 | **1024** | 512 |
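+
+The lane counts follow directly from the VTG size divided by the element width. The check below reproduces the table; the `ELEM_BITS` mapping simply restates the element widths implied by the type names.
+
+```python
+ELEM_BITS = {"FP32": 32, "INT32": 32, "FP16": 16, "BF16": 16, "FP8": 8, "FP4": 4}
+
+def lanes(vtg_bytes, elem_type):
+    """Number of SIMD lanes in a VTG of vtg_bytes bytes."""
+    return vtg_bytes * 8 // ELEM_BITS[elem_type]
+
+table = {t: (lanes(512, t), lanes(256, t)) for t in ("FP32", "FP16", "FP8", "FP4")}
+# table == {"FP32": (128, 64), "FP16": (256, 128), "FP8": (512, 256), "FP4": (1024, 512)}
+```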
+
+**Group Read Adapter** (ALU input mux -- after prologue):
+
+```text
+input:  SA_full[4096 B] (from VEC prologue), SB_full[4096 B], group_id, group_mode
+G256:   vtg_A[256 B] = SA_full[group_id * 256 : (group_id+1) * 256]
+        vtg_B[256 B] = SB_full[group_id * 256 : (group_id+1) * 256]
+G512:   vtg_A[512 B] = SA_full[group_id * 512 : (group_id+1) * 512]
+        vtg_B[512 B] = SB_full[group_id * 512 : (group_id+1) * 512]
+output: vtg_A, vtg_B -> VEC ALU operand mux
+Note:  Group Read Adapter reads from SA/SB (4 KB), NOT directly from TRegFile.
+      VEC prologue fills SA/SB over 8-cycle epoch before this mux activates.
+```
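+
+The same sub-range selection, written as a runnable Python sketch over a 4 KB byte string. `select_vtg` matches the helper name used in the rotation-scheduling pseudocode; the range check and the test pattern are illustrative additions.
+
+```python
+TILE_BYTES = 4096
+VTG_BYTES = {"G256": 256, "G512": 512}
+
+def select_vtg(tile, group_id, group_mode):
+    """Return the 256 B or 512 B slice of a staged 4 KB tile for one VTG."""
+    size = VTG_BYTES[group_mode]
+    if len(tile) != TILE_BYTES or not 0 <= group_id < TILE_BYTES // size:
+        raise ValueError("tile must be 4 KB and group_id must be in range")
+    start = group_id * size
+    return tile[start:start + size]
+
+# Test pattern: every byte holds the index of its 256 B chunk (0..15).
+sa_full = bytes(i // 256 for i in range(TILE_BYTES))
+vtg_a = select_vtg(sa_full, 3, "G256")    # bytes 768..1023
+vtg_b = select_vtg(sa_full, 7, "G512")    # bytes 3584..4095
+```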
+
+**Group Write Adapter** (VEC result -> TRegFile, full-tile RMW):
+
+```text
+input:  vtg_result[256/512 B], dst_ptag, group_id, group_mode
+
+// Step 1: Read full current tile (occupies W0 for 8 cycles)
+TRegFile.submit_read(dst_ptag)       -- occupies W0 for 8-cycle epoch --
+wait 8 cycles
+old_tile = TRegFile.read_data        -- 4 KB --
+
+// Step 2: Merge VTG result into correct sub-range
+if group_mode == G256:
+    start = group_id * 256; end = start + 256
+else:  -- G512 --
+    start = group_id * 512; end = start + 512
+new_tile = old_tile
+new_tile[start:end] = vtg_result     -- merge sub-range --
+
+// Step 3: Write merged tile back (occupies W0 for 8 cycles)
+TRegFile.submit_write(dst_ptag, new_tile)
+wait 8 cycles
+TRegFile.write_complete()
+
+-- Total RMW latency: 16 cycles minimum (8 read + 8 write) --
+// update VTG_metadata[dst_ptag][group_id] = {valid=1, defined=1, dirty=1}
+```
+
+> **Note:** TRegFile has no partial-write mechanism. All writes are full-tile, 512 B/cy x 8 cycles. The Group Write Adapter must read the current tile, merge the VTG sub-range, and write the full tile back. W0 is occupied for the full 16-cycle RMW sequence, blocking other tile writes.
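+
+The merge step of the read-modify-write can be sketched in Python as below. Only the data movement is modelled; port occupancy is reduced to a constant, and the function name `merge_vtg` is hypothetical.
+
+```python
+TILE_BYTES = 4096
+RMW_CYCLES = 8 + 8      # full-tile read epoch + full-tile write epoch
+
+def merge_vtg(old_tile, vtg_result, group_id, group_mode):
+    """Return a new 4 KB tile with one VTG sub-range replaced by vtg_result."""
+    size = 256 if group_mode == "G256" else 512
+    if len(old_tile) != TILE_BYTES or len(vtg_result) != size:
+        raise ValueError("unexpected tile or VTG size")
+    if not 0 <= group_id < TILE_BYTES // size:
+        raise ValueError("group_id out of range")
+    start = group_id * size
+    new_tile = bytearray(old_tile)
+    new_tile[start:start + size] = vtg_result     # all other bytes are preserved
+    return bytes(new_tile)
+
+old_tile = bytes(TILE_BYTES)                      # all zeros
+new_tile = merge_vtg(old_tile, b"\xAA" * 256, group_id=15, group_mode="G256")
+```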
+
+**Micro-instruction buffer integration:** At decode time, the VTG microassembler generates a pre-decoded `VECBeatWord` sequence for each VTG micro-op and writes it into the buffer. At P1/I1, `beat_word = buffer.lookup(block_id, pc_index)` drives VEC ALU for one cycle. The buffer lookup happens in parallel with the GVIQ pick -- both are 1-cycle combinational operations. The beat word is the same format as VEC-4K-v2's SOP beat word, enabling seamless integration with the existing VEC datapath without new microcode structures.
+
+**Paired `G256` issue (optional):** When two 256 B VTGs have matching `{opcode, elem_type, pred_mode}`, the VTG rotation scheduler may issue them together, filling the full 512 B SIMD group beat. GVIQ-6 and the VEC-domain arbiter must still resolve port conflicts (both VTGs need W0 for writeback).
+
+**No special speculation hardware is needed inside the vector unit.** The flush converges to a quiescent state in `T_fetch + max_beat_count` cycles in the worst case (a long-running `TINV` or `TMRGSORT` taking the full hit), but this is bounded by the number of in-flight vector ops (≤ 24 RS entries) and does not stall the front-end's recovery.
+
+### 8.4 Cube Unit (outerCube MXU)
+
+> **(v1 → v2: §8.4.1 / §8.4.2 are copied in full from v1. The v2 delta is at the end of this section: the cube can now consume TRegFile-4K col-mode reads.)**
+
+The cube unit is the outerCube Matrix Unit, a large-scale outer-product accumulation engine. Full specification is in [`outerCube.md`](outerCube.md).
+
+#### 8.4.1 Summary (v1 §8.4.1, unchanged)
+
+| Parameter | Value |
+|-----------|-------|
+| Base MAC units | **4096** (8 banks × 8 rows × 64 columns) |
+| Modes | **Mode A** (K-parallel, 8-bank reduction) / **Mode B** (M-parallel, independent) |
+| Formats | FP16, BF16, FP8 (E4M3/E5M2), MXFP4, HiFP4 |
+| MAC scaling | FP16: 4096 / FP8: 8192 / MXFP4: 32768 MACs/cycle |
+| Accumulator | 32-bit FP32, ping-pong (2 × 16 KB = 32 KB) |
+| Pipeline | 19 stages: 8 (OF) + 1 (MUL) + 1 (RED) + 1 (ACC) + 8 (AD) |
+| Staging SRAM | A double-buffer (8 KB) + B double-buffer (32 KB) = 40 KB baseline |
+| Peak FP16 @ 1.5 GHz | **12.3 TFLOPS** |
+| Peak FP8 @ 1.5 GHz | **24.6 TOPS** |
+| Peak MXFP4 @ 1.5 GHz | **98.3 TOPS** |
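+
+The peak figures follow from the MACs/cycle row, counting one MAC as two operations at 1.5 GHz. A quick check, with names chosen for illustration:
+
+```python
+CLOCK_HZ = 1.5e9
+MACS_PER_CYCLE = {"FP16": 4096, "FP8": 8192, "MXFP4": 32768}
+
+def peak_tera_ops(macs_per_cycle, clock_hz=CLOCK_HZ):
+    """Peak throughput in tera-operations per second (1 MAC = 2 operations)."""
+    return macs_per_cycle * 2 * clock_hz / 1e12
+
+peaks = {fmt: round(peak_tera_ops(m), 1) for fmt, m in MACS_PER_CYCLE.items()}
+# peaks == {"FP16": 12.3, "FP8": 24.6, "MXFP4": 98.3}
+```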
+
+#### 8.4.2 Cube Instruction Dispatch (v1 §8.4.2, unchanged)
+
+Cube instructions (CUBE.OPA, CUBE.DRAIN, etc.) are dispatched to the **Cube RS** (4 entries) after Tile RAT rename. Each CUBE.OPA is a long-running instruction that occupies the MXU for many cycles (N + 18, where N = Nb × S OPA steps). While the MXU is busy, the Cube RS holds subsequent cube instructions until the current one completes.
+
+A CUBE.OPA may reference a range of architectural tile registers (e.g., T[Tb]..T[Tb+Na−1]). At dispatch, the Tile RAT translates each architectural tile index to a physical tile index. For multi-tile operands, the cube RS stores a base physical tile index plus a **tile address table** (up to 16 entries) holding the physical indices of all tiles in the range; the cube pipeline controller uses these physical indices to program TRegFile-4K port addresses.
+
+The cube unit reads tile data from TRegFile-4K ports R0 (A operand) and R1–R4 (B operand), and drains results via W0 (C output). Port interactions are managed by the cube pipeline controller, which issues epoch-aligned physical tile addresses to TRegFile-4K's pending registers (see [`tregfile4k.md`](tregfile4k.md) §3).
+
+#### 8.4.3 v2 delta (not visible to hardware)
+
+The cube unit benefits indirectly from the TRegFile-4K `is_transpose` enhancement: software can now feed the cube either row-major or col-major B-operand tiles by setting `is_xpose` on the cube's B-operand tile-RAT entries (the cube pipeline controller propagates the bit to TRegFile read ports R1–R4), eliminating the need for `TILE.TRANSPOSE` predecessors in many GEMM kernels. The cube ALU, accumulator, and pipeline are otherwise unchanged.
+
+### 8.5 MTE Unit
+
+> **(v1 → v2: §8.5.A / §8.5.B / §8.5.C / §8.5.D are copied in full from v1 §8.5.1 / §8.5.2 / §8.5.3 / §8.5.4 / §8.5.5. The v2 delta is concentrated in §8.5.1 (TRANSPOSE reduction) and §8.5.2 (STQ).)**
+
+The MTE unit is the **bridge between three domains**: memory ↔ TRegFile-4K (bulk tile transfers) and scalar GPR ↔ TRegFile-4K (single-element access via TILE.GET/TILE.PUT). All MTE instructions go through full **dual-RAT rename** at D2: scalar operands are renamed via the Scalar RAT, and tile operands are renamed via the Tile RAT. Instructions that produce a new tile (TILE.LD, TILE.ZERO, TILE.COPY, TILE.GATHER, TILE.PUT) allocate a fresh physical tile from the tile free list. TILE.GET produces a scalar GPR result and broadcasts on the CDB.
+
+#### 8.5.A Architecture (v1 §8.5.1, unchanged)
+
+```
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  Memory Tile Engine (MTE)                                        │
+  │                                                                  │
+  │  MTE RS (16 entries) ──┬──▶ Load Tile Pipeline                  │
+  │                        ├──▶ Store Tile Pipeline ──▶ STQ (v2)    │
+  │                        ├──▶ Gather Pipeline                     │
+  │                        ├──▶ Scatter Pipeline ──▶ STQ (v2)       │
+  │                        ├──▶ TILE.GET Pipeline (tile→GPR)        │
+  │                        └──▶ TILE.PUT Pipeline (GPR→tile, RMW)   │
+  │                                                                  │
+  │  ┌──────────────────────────────┐                                │
+  │  │ Outstanding Request Buffer   │  Tracks up to 32 in-flight    │
+  │  │ (32 entries)                 │  tile transfers for MLP        │
+  │  └──────────────┬───────────────┘                                │
+  │                 │                                                │
+  │  ┌──────────────▼───────────────┐  ┌─────────────────────────┐  │
+  │  │ Address Generation Unit      │  │ Data Assembly / Scatter  │  │
+  │  │ (contiguous, strided, index) │  │ (pack / unpack for G/S)  │  │
+  │  └──────────────┬───────────────┘  └──────────┬──────────────┘  │
+  │                 │                              │                 │
+  │                 ▼                              ▼                 │
+  │  ┌──────────────────────────────────────────────────┐           │
+  │  │  L2 / Memory Interface (high-bandwidth path)     │           │
+  │  │  64 B/cy (1 cache line/cy) sustained              │           │
+  │  └──────────────────────────────────────────────────┘           │
+  │                 │                              │                 │
+  │                 ▼                              ▼                 │
+  │  ┌──────────────────────────────────────────────────┐           │
+  │  │  TRegFile-4K Write Ports (W1–W7 for TILE.LD)    │           │
+  │  │  TRegFile-4K Read Ports (R5–R7 for TILE.ST)     │           │
+  │  └──────────────────────────────────────────────────┘           │
+  │                                                                  │
+  │  ┌──────────────────────────────────────────────────┐           │
+  │  │  Scalar GPR ↔ Tile Element Path                  │           │
+  │  │  TILE.GET: TRegFile read port → extract → CDB    │           │
+  │  │  TILE.PUT: CDB snoop → tile copy + insert → write│           │
+  │  └──────────────────────────────────────────────────┘           │
+  └──────────────────────────────────────────────────────────────────┘
+```
+
+#### 8.5.B Key Parameters (v1 §8.5.2, unchanged)
+
+| Parameter | Value |
+|-----------|-------|
+| TILE.LD TRegFile write | **8 cycles** per write port (512 B/cy × 8 cy = 4 KB) |
+| TILE.LD total latency (L2 hit) | **72 cycles** (64 cy memory fetch + 8 cy TRegFile write epoch) |
+| TILE.ST TRegFile read | **8 cycles** per read port (512 B/cy × 8 cy = 4 KB) |
+| TILE.ST total latency (L2) | **72 cycles** (8 cy TRegFile read epoch + 64 cy memory write) |
+| Available write ports | W1–W7 (**7** ports, minus ports used by cube drain) |
+| Available read ports | R5–R7 (**3** ports, minus ports used by cube operands) |
+| Max concurrent TILE.LD | up to **7** (1 per write port), limited by memory BW |
+| Max concurrent TILE.ST | up to **3** (1 per read port) |
+| Outstanding request buffer | **32** entries (supports deep memory-level parallelism) |
+| Gather/scatter | Uses index tile (Tidx) for non-contiguous access patterns |
+| L2 → MTE bandwidth | **64 B/cy** (1 cache line/cy) → 1 tile in **64 cycles** from L2 |
+| TILE.COPY / TILE.TRANSPOSE latency | **16 cycles** (8 cy TRegFile read epoch + 8 cy write epoch) |
+| TILE.ZERO latency | **8 cycles** (1 write epoch, no read needed) |
+| **TILE.GET latency** | **9 cycles** (8 cy TRegFile read epoch + 1 cy element extract → CDB) |
+| **TILE.PUT latency** | **16 cycles** (8 cy read epoch + 8 cy write epoch); **8 cy** with copy elision |
+| TILE.GET throughput | **1 per 8 cycles** (read port occupied for full epoch even for single element) |
+| TILE.PUT throughput | **1 per 16 cycles** (read + write port, 2 epochs); **1 per 8 cy** with elision |
+
+#### 8.5.C MTE Rename → Issue → Execute Flow (Bulk Transfer) — (v1 §8.5.3, unchanged)
+
+```
+  D2 (Rename):
+    TILE.LD T10, [X5]
+      Scalar RAT: X5 → P40 (physical scalar for base address)
+      Tile RAT:   T10 → PT200 (allocate new physical tile from tile free list)
+                  old mapping PT10 marked orphan
+      Tile RAT ready[PT200] ← 0
+
+  DS (Dispatch):
+    MTE RS entry: {op=TILE.LD, pscalar=P40, srdy=<from Scalar RAT>, ptdst=PT200, ckpt=...}
+
+  IS (Issue):
+    Wait for pscalar P40 ready (CDB wakeup from scalar ALU)
+    → read base address from scalar physical RF
+
+  EX (Execute — memory fetch + 1 TRegFile write epoch):
+    Memory phase (≈64 cycles from L2):
+        MTE Address Gen: compute contiguous address range from base address
+        MTE Data Path:   request 64 cache lines from L2 (64 B/cy)
+        MTE Buffer:      accumulate 4 KB in outstanding request buffer
+    TRegFile write epoch (8 cycles):
+        Reserve write port, program reg_idx = PT200
+        Write 512 B/cy × 8 cy = 4 KB to physical tile slot PT200
+    Total TILE.LD latency (L2 hit): 64 + 8 = **72 cycles**
+
+  Complete:
+    Tile RAT ready[PT200] ← 1
+    TCB broadcast: PT200
+    → wake dependent instructions in Vector RS, Cube RS, MTE RS
+    Decrement tile refcount for any source tiles
+```
+
+MTE bulk operations incur both **memory latency** and **TRegFile epoch latency**. For TILE.LD, the MTE first fetches 4 KB from memory (64 cache lines at 64 B/cy = 64 cycles from L2), buffers the data, then writes to TRegFile-4K in one 8-cycle write epoch using the **physical tile index** (from Tile RAT) as the `reg_idx` address — total latency: **memory + 8 cycles**. For TILE.ST, the MTE first reads the tile from TRegFile in one 8-cycle read epoch, then writes the data to memory — total: **8 cycles + memory**. The MTE controller issues physical `reg_idx` addresses to port pending registers and sequences data transfer across each 8-cycle epoch.
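
The latency figures above follow directly from the tile size and the two bandwidths. The short Python sketch below reproduces the arithmetic; the constant and function names are illustrative, and the model assumes an L2 hit at full sustained bandwidth with no port contention.

```python
TILE_BYTES = 4096            # one 4 KB tile
L2_BYTES_PER_CYCLE = 64      # 1 cache line per cycle (from the parameter table)
PORT_BYTES_PER_CYCLE = 512   # TRegFile-4K port width per cycle


def memory_phase_cycles():
    """Cycles to move one tile across the L2 interface at full bandwidth."""
    return TILE_BYTES // L2_BYTES_PER_CYCLE


def epoch_cycles():
    """Cycles in one TRegFile-4K read or write epoch."""
    return TILE_BYTES // PORT_BYTES_PER_CYCLE


def bulk_latency():
    """Total for TILE.LD (fetch, then write epoch) or TILE.ST (read epoch,
    then memory write); both sum the same two phases in this model."""
    return memory_phase_cycles() + epoch_cycles()


print(memory_phase_cycles(), epoch_cycles(), bulk_latency())
```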
+
+#### 8.5.D TILE.GET / TILE.PUT Execution Flow (Element Access) — (v1 §8.5.4, unchanged)
+
+**TILE.GET Rd, Ts, Ridx** — scalar ← tile element:
+
+```
+  D2 (Rename):
+    Scalar RAT: Ridx → P50 (lookup index);  Rd → P60 (allocate new scalar dest)
+    Tile RAT:   Ts → PT180 (lookup source tile)
+
+  DS (Dispatch):
+    MTE RS entry: {op=TILE.GET, pscalar=P50(Ridx), srdy, pdst=P60(Rd), ptsrc1=PT180(Ts), trdy}
+
+  IS (Issue):
+    Wait for P50 ready (CDB wakeup) AND PT180 ready (TCB wakeup)
+    → read index value from scalar RF; compute row_group = row / 8, row_off, col
+
+  EX (Execute, 9 cycles):
+    Cycles 1–8: TRegFile read epoch — reserve read port for physical tile PT180
+                port reads 512 B/cy × 8 cy (full tile streamed out);
+                capture the 512-B chunk at cycle (row_group+1) containing target row
+    Cycle 9:    extract element from captured 512-B chunk based on col and
+                funct3 (element type), zero-extend to 64 bits
+
+  Complete:
+    CDB broadcast: (tag=P60, data=element_value)
+    → wakeup dependent scalar RS entries; write to scalar physical RF
+    Decrement tile refcount for PT180
+```
+
+**TILE.PUT Td, Rs, Ridx** — tile element ← scalar (read-modify-write):
+
+```
+  D2 (Rename):
+    Scalar RAT: Rs → P70 (lookup data), Ridx → P71 (lookup index)
+    Tile RAT:   Td old mapping → PT180 (source, for tile copy)
+                Td new mapping → PT210 (allocate from tile free list)
+                PT180 marked orphan; ready[PT210] ← 0
+
+  DS (Dispatch):
+    MTE RS entry: {op=TILE.PUT, pscalar=P70(Rs), pscalar2=P71(Ridx),
+                   ptsrc1=PT180(Td_old), ptdst=PT210(Td_new)}
+
+  IS (Issue):
+    Wait for P70, P71 ready (CDB) AND PT180 ready (TCB)
+
+  EX (Execute, 16 cycles — 2 full TRegFile epochs):
+    Read epoch (cycles 1–8):
+        Reserve read port for physical tile PT180
+        Read 512 B/cy × 8 cy = 4 KB (full source tile)
+        Buffer tile data in MTE internal SRAM; overwrite target element
+        at (row, col) derived from Ridx with scalar value from Rs
+    Write epoch (cycles 9–16):
+        Reserve write port for physical tile PT210
+        Write modified tile 512 B/cy × 8 cy = 4 KB to PT210
+
+    Copy elision optimisation (8 cycles):
+        When PT180 refcount=0 and is orphaned at rename, the copy is
+        skipped. PT210 reuses PT180's storage. Only the target element
+        is overwritten in-place during a single write epoch (8 cy).
+
+  Complete:
+    Tile RAT ready[PT210] ← 1
+    TCB broadcast: PT210
+    → wake dependent tile-domain RS entries
+    Decrement tile refcount for PT180; if orphan and refcount=0 → free PT180
+```
+
+TILE.GET occupies a TRegFile read port for a full 8-cycle epoch (even though only one 512-B chunk is needed), plus 1 cycle for element extraction — **9 cycles** total. TILE.PUT requires two full epochs (8 cy read + 8 cy write = **16 cycles**) because it is a read-modify-write on the tile. With copy elision (PT_old orphaned, refcount=0), the read epoch is skipped and only the write epoch is needed — reducing latency to **8 cycles**.
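
As a rough illustration of the element path, the sketch below models TILE.GET as "stream eight 512 B strips, capture one, extract one element" and TILE.PUT latency as a function of copy elision. It is a simplified model under stated assumptions: the element is addressed by a flat byte offset (the text derives it from `row_group`, `row_off`, and `col`), byte order is taken to be little-endian, and all names are hypothetical.

```python
STRIP_BYTES = 512   # bytes delivered by a read port per cycle
TILE_BYTES = 4096   # one 4 KB tile, streamed over an 8-cycle epoch


def tile_get(tile, byte_offset, elem_bytes):
    """Model TILE.GET: return (capture_cycle, zero-extended element value).

    tile is a 4096-byte buffer; byte_offset locates the element in the tile.
    """
    if len(tile) != TILE_BYTES:
        raise ValueError("tile must be 4 KB")
    strip = byte_offset // STRIP_BYTES          # which strip holds the element
    within = byte_offset % STRIP_BYTES          # offset inside that strip
    if byte_offset < 0 or within + elem_bytes > STRIP_BYTES or strip >= 8:
        raise ValueError("element must lie inside one strip of the tile")
    chunk = tile[strip * STRIP_BYTES:(strip + 1) * STRIP_BYTES]
    # int.from_bytes yields a non-negative value, i.e. zero extension.
    value = int.from_bytes(chunk[within:within + elem_bytes], "little")
    return strip + 1, value                     # strips arrive on cycles 1..8


def tile_put_latency(copy_elided):
    """16 cycles for read epoch + write epoch; 8 when the read is elided."""
    return 8 if copy_elided else 16


demo = bytearray(TILE_BYTES)
demo[1540:1544] = (0xDEADBEEF).to_bytes(4, "little")
capture_cycle, element = tile_get(bytes(demo), 1540, 4)
print(capture_cycle, hex(element))
```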
+
+#### 8.5.E TILE.MOVE (Move Elimination) — (v1 §8.5.5, unchanged)
+
+**TILE.MOVE Td, Ts** — Handled entirely at the D2 rename stage with **zero-cycle latency**:
+
+```
+  D2 (Rename):
+    TILE.MOVE T5, T10
+      Tile RAT[T10] → PT180 (source physical tile)
+      Tile RAT[T5]  → PT50  (old destination mapping, marked orphan)
+      Tile RAT[T5]  ← PT180 (Td now aliases same physical tile as Ts)
+      refcount(PT180) += 1   (extra architectural name)
+      ready[T5] = ready[PT180]  (inherit readiness)
+      → No RS entry allocated. No execute. No TCB broadcast.
+      → Instruction completes immediately at D2.
+
+  If PT50 is orphan and refcount==0 → free PT50 to tile free list
+```
+
+TILE.MOVE does not consume any execute-stage resources, TRegFile-4K ports, or memory bandwidth. It is the preferred way to "rename" tiles between software pipeline stages (e.g., double-buffering schemes where the next iteration's input tiles become the current iteration's operand tiles). Because Td and Ts share the same physical tile after TILE.MOVE, the next write to either architectural register will naturally allocate a new physical tile at rename time.
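
The rename-only behaviour can be modelled with a small table plus a reference count. The following Python sketch is an assumption-laden illustration, not the hardware structure: the class `TileRat`, its fields, and the choice to treat `refcount` as "outstanding references beyond the original mapping" are all hypothetical simplifications of the flow shown above.

```python
class TileRat:
    """Toy Tile RAT with move elimination (illustrative only)."""

    def __init__(self, mapping, free_list=()):
        self.mapping = dict(mapping)      # architectural tile -> physical tile
        self.free_list = list(free_list)  # physical tiles available for reuse
        self.refcount = {}                # physical tile -> outstanding references
        self.orphan = set()               # physical tiles with a stale mapping

    def move(self, td, ts):
        """TILE.MOVE Td, Ts: alias Td to Ts's physical tile at rename."""
        src = self.mapping[ts]
        old = self.mapping[td]
        if old == src:
            return                        # already aliased; nothing to do
        self.orphan.add(old)              # old destination mapping is orphaned
        self.mapping[td] = src            # Td now names the same physical tile
        self.refcount[src] = self.refcount.get(src, 0) + 1
        self._maybe_free(old)

    def _maybe_free(self, pt):
        """Free a tile once it is orphaned and nothing references it."""
        if pt in self.orphan and self.refcount.get(pt, 0) == 0:
            self.orphan.discard(pt)
            self.free_list.append(pt)


rat = TileRat({5: 50, 10: 180})
rat.move(td=5, ts=10)
print(rat.mapping, rat.refcount, rat.free_list)
```

No reservation-station entry or port reservation appears anywhere in the model, which mirrors the zero-cycle claim: the whole operation is a table update.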
+
+---
+
+**v2 delta (§8.5.1 / §8.5.2 / §8.5.3 / §8.5.4 below):**
+
+#### 8.5.1 `TILE.TRANSPOSE` — reduced footprint
+
+Because TRegFile-4K can deliver col-mode reads directly (§9.2), most "transpose then consume" patterns are subsumed by the consumer's `is_xpose` bit. The dedicated 4 KB MTE transpose buffer of v1 shrinks to a small **512 B element-level fixup buffer** used only for the non-aligned `W ∈ {128, 256, 1024, 2048, 4096}` regimes that [`tregfile4k.md`](tregfile4k.md) §7.5 leaves to downstream consumers. (For these regimes, the chunk-grid transpose at 64 B granularity is not element-level valid, and `TILE.TRANSPOSE` materializes an element-correct transpose in a new physical tile.)
+
+| MTE TILE.TRANSPOSE behaviour | v1 | v2 |
+|------------------------------|-----|-----|
+| 4 KB transpose buffer | required | replaced by 512 B fixup SRAM |
+| Latency | 16 cy | 16 cy (unchanged) |
+| Use case | universal | rare — only when materializing a transposed tile for reuse across many instructions that don't carry `is_xpose` |
+
+#### 8.5.2 Speculative Tile-Store Queue (STQ)
+
+`TILE.ST` and `TILE.SCATTER` allocate an STQ entry at dispatch (alongside the regular MTE RS entry):
+
+```
+  STQ entry (8 entries total):
+    valid (1b) | btag (3b) | base_addr (40b) | tile_phys_idx (8b) |
+    stride (40b) | scatter_idx_phys (8b) | size_log2 (3b) | drain_rdy (1b)
+    + meta_v (1b)
+```
+
+Total STQ size: 8 × ~110 b = 880 b ≈ 110 B. Drain logic mirrors the SSB:
+
+- **Tag clear** (branch resolves correctly): `btag` is updated; if it reaches `0xFF`, `drain_rdy` is set, and the STQ controller can issue the actual memory write.
+- **Mispredict**: STQ entries with `btag` younger or equal to the mispredicted tag are invalidated. The corresponding tile data — still resident in TRegFile-4K — is freed via the normal Tile RAT refcount path; no memory write was issued.
+- **Drain**: oldest `drain_rdy` entry begins streaming the tile from TRegFile through the MTE memory pipeline. Drain is overlapped with subsequent MTE operations.
+
+**Why only 8 entries?** Bulk tile stores are infrequent compared to scalar stores: a typical kernel issues 1 `TILE.ST` per ~10–50 scalar stores. The 8 entries provide ~64 cycles of buffering at peak issue (1 TILE.ST/4 cy), enough to absorb a burst at the end of a kernel without causing dispatch stall.
+
+**Why STQ is separate from SSB:** the TILE.ST data payload (4 KB) cannot reasonably be captured in the SSB's flip-flop register — it must remain resident in TRegFile-4K. The STQ stores only the *intent* (address + tile-pointer + branch_tag) and triggers the memory write on commit.
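
A behavioural sketch of the allocate / clear / squash / drain cycle is given below. It is deliberately simplified and rests on assumptions not in the text: branch tags are modelled as monotonically increasing integers (so "younger or equal" becomes `>=`), the ancestry bitmap is omitted, and draining is strictly in order from the head as a conservative reading of "oldest `drain_rdy` entry". Class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StqEntry:
    btag: int               # branch tag the store is speculative on
    base_addr: int          # destination address in memory
    tile_phys_idx: int      # tile stays in TRegFile-4K; only its index is kept
    drain_rdy: bool = False


class TileStoreQueue:
    CAPACITY = 8            # entry count from the text

    def __init__(self):
        self.entries = []   # oldest entry first

    def allocate(self, btag, base_addr, tile_phys_idx):
        """Reserve an entry at dispatch; False means dispatch must stall."""
        if len(self.entries) >= self.CAPACITY:
            return False
        self.entries.append(StqEntry(btag, base_addr, tile_phys_idx))
        return True

    def resolve(self, btag):
        """Branch resolved correctly: stores tagged with it may drain."""
        for entry in self.entries:
            if entry.btag == btag:
                entry.drain_rdy = True

    def mispredict(self, btag):
        """Squash entries younger than or equal to the mispredicted tag."""
        self.entries = [e for e in self.entries if e.btag < btag]

    def drain(self):
        """Pop the head entry if it is non-speculative, else return None."""
        if self.entries and self.entries[0].drain_rdy:
            return self.entries.pop(0)
        return None
```

Note that a squash never touches memory in this model: the entry is simply dropped, which matches the statement that no memory write was issued for a speculative store.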
+
+#### 8.5.3 STQ area
+
+| Block | Area |
+|-------|------|
+| 8 × 110 b flip-flop array | ~10 K gate |
+| Branch-tag ancestry check (shared with SSB) | ~0 (reuses SSB's bitmap) |
+| Drain controller FSM | ~2 K gate |
+| **Total** | **~12 K gate** (~0.003 mm² @ 5 nm) |
+
+#### 8.5.4 MTE RS entry per instruction (v2 update)
+
+The `TILE.ST` and `TILE.SCATTER` entries gain a 4 b STQ index. Other MTE instructions are unaffected.
+
+| Instruction | STQ allocation | Notes |
+|-------------|----------------|-------|
+| TILE.LD, TILE.GATHER, TILE.ZERO | — | tile-write only, refcount-managed; no memory side effect, fully recoverable via Tile RAT |
+| TILE.ST, TILE.SCATTER | **STQ slot** | held in STQ until non-speculative |
+| TILE.COPY, TILE.MOVE, TILE.TRANSPOSE | — | tile-internal, fully recoverable |
+| TILE.GET | — | scalar GPR result; recoverable via Scalar RAT + ref-count |
+| TILE.PUT | — | tile-write (RMW); recoverable via Tile RAT |
+
+---
+
+## 9. Register Files
+
+### 9.1 Scalar GPR Physical Register File
+
+> **(v1 → v2: content unchanged; the following is copied in full from v1 §9.1.)**
+
+| Parameter | Value |
+|-----------|-------|
+| Physical registers | **128** (P0–P127), 64-bit each |
+| Total storage | 128 × 8 B = **1 KB** |
+| Read ports | **12** (8 from rename lookup + 4 from issue/execute) |
+| Write ports | **6** (4 ALU + 1 MUL/LSU + 1 TILE.GET), matched to CDB ports |
+| Implementation | Flip-flop array (small enough for full-speed multi-port) |
+| Bypass network | 6-source → 12-sink forwarding MUXes |
+
+**Bypass network:** When a result is broadcast on the CDB in the same cycle that an issuing instruction reads the physical RF, the bypass network forwards the CDB data directly to the execution unit input, avoiding a 1-cycle read-after-write penalty.
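
The forwarding decision can be sketched as a tag compare that takes priority over the register-file read. This is a minimal behavioural model with hypothetical names (`read_operand`, `phys_rf`, `cdb_broadcasts`); it ignores timing and port limits.

```python
def read_operand(ptag, phys_rf, cdb_broadcasts):
    """Return the operand value for physical register ptag.

    phys_rf: list of register values indexed by physical tag.
    cdb_broadcasts: (tag, data) pairs present on the CDB this cycle.
    """
    for tag, data in cdb_broadcasts:
        if tag == ptag:
            return data        # bypass: same-cycle result is forwarded
    return phys_rf[ptag]       # no match: use the value already written


rf = [0] * 128
rf[60] = 1
print(read_operand(60, rf, [(60, 99)]), read_operand(60, rf, [(61, 5)]))
```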
+
+**Register lifecycle:**
+
+```
+  Allocate:  free list dequeue → assigned as destination at D2
+  Write:     execution unit writes result at WB stage
+  Read:      issuing instructions read at IS stage (or snoop from CDB)
+  Orphan:    a later instruction remaps the same architectural register
+  Free:      orphan AND reference count = 0 → return to free list
+```
+
+### 9.2 TRegFile-4K (with per-port `is_transpose`)
+
+The TRegFile-4K is the physical tile register file for vector, cube, and MTE. Full specification: [`tregfile4k.md`](tregfile4k.md). v2 highlights below.
+
+#### 9.2.1 Tile metadata storage
+
+A new **256 × 32 b SRAM** sits alongside the 1 MB tile data array, holding `(shape.x, shape.y, format)` per physical tile. Read ports: 4 (decode) + 1 (TCB completion). Write ports: 2 (retire + `TSETMETA`). Storage: 1024 B = ~10 K gate.
+
+The metadata is **physically distinct** from the 4 KB tile payload but is read together at the **first strip** of an operand fetch (§4.4 of [`vector4k_v2.md`](vector4k_v2.md)) so that the consumer's microcode program can be selected based on `(format, R, C)` before the second strip arrives.
+
+#### 9.2.2 Per-port `is_transpose` flag
+
+Each of the 8 read ports (R0–R7) accepts a 1-bit `is_transpose` flag double-registered alongside the 8-bit `reg_idx`. The flag is latched at the epoch boundary and held constant for the entire 8-cycle epoch.
+
+| `is_transpose` | Strip delivery | Bank pattern per cycle |
+|----------------|----------------|------------------------|
+| 0 (row-mode) | strip `s` = chunk-grid row `s` (contiguous bytes `s·512 .. s·512+511`) | 8 banks of one group |
+| 1 (col-mode) | strip `s` = chunk-grid column `s` (8 × 64 B chunks across all 8 groups along the wrapped diagonal) | 1 bank per group |
+
+Both modes deliver **512 B/cy** through the 8-cycle epoch — same throughput, no extra latency. The diagonal skew layout `bank_id = 8·g + ((l + g) mod 8)` is what makes col-mode bank-conflict-free.
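
The conflict-free claim can be checked mechanically. The sketch below assumes `g` is the chunk-grid row (bank group) and `l` is the chunk position within that row; the text gives the formula without defining the symbols, so this interpretation and the helper names are assumptions.

```python
GROUPS = 8            # bank groups, one per chunk-grid row
BANKS_PER_GROUP = 8   # banks inside each group


def bank_id(g, l):
    """Diagonal skew map from the text: bank_id = 8*g + ((l + g) mod 8)."""
    return 8 * g + ((l + g) % 8)


def strip_banks(s, is_transpose):
    """Banks touched when a port reads strip s in row-mode or col-mode."""
    if is_transpose:
        # col-mode: chunk-grid column s, one chunk from every group
        return [bank_id(g, s) for g in range(GROUPS)]
    # row-mode: chunk-grid row s, all eight chunks of one group
    return [bank_id(s, l) for l in range(BANKS_PER_GROUP)]


for s in range(8):
    row = strip_banks(s, False)
    col = strip_banks(s, True)
    assert len(set(row)) == 8 and {b // 8 for b in row} == {s}
    assert len(set(col)) == 8 and {b // 8 for b in col} == set(range(8))

all_banks = {bank_id(g, l) for g in range(8) for l in range(8)}
print(len(all_banks))
```

Every strip touches eight distinct banks in either mode, so under this reading a single port never sees a bank conflict within a cycle.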
+
+**Scheduling rule R2** ([`tregfile4k.md`](tregfile4k.md) §6): the 8 active read ports of any 8-cycle epoch must share the same `is_transpose` value. Mixed-mode reads in the same epoch collide on the 1R-port SRAM banks. This rule is enforced at the dispatch / port-allocation stage; if a vector instruction's operand `is_xpose_A ≠ is_xpose_B`, the operand-fetch prologue automatically splits into two epochs (16 cy instead of 8 cy, §8.3.6).
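
One way to picture rule R2 is as a grouping step at port allocation: requests are partitioned by `is_transpose`, and each partition gets its own epoch. The sketch below is a hedged illustration; `plan_read_epochs` and the `(port, is_transpose)` request tuple are hypothetical, and a real allocator would also consider port availability.

```python
EPOCH_CYCLES = 8  # one TRegFile-4K epoch


def plan_read_epochs(requests):
    """Split (port, is_transpose) requests into epochs sharing a single mode."""
    row_mode = [r for r in requests if not r[1]]
    col_mode = [r for r in requests if r[1]]
    return [group for group in (row_mode, col_mode) if group]


def prologue_cycles(requests):
    """Operand-fetch prologue length under rule R2."""
    return EPOCH_CYCLES * len(plan_read_epochs(requests))


same_mode = [("R0", False), ("R4", False)]
mixed_mode = [("R0", False), ("R4", True)]
print(prologue_cycles(same_mode), prologue_cycles(mixed_mode))
```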
+
+#### 9.2.3 Hardware delta vs. v1 TRegFile-4K
+
+| Component | v1 | v2 |
+|-----------|----|----|
+| 64 SRAM banks (256×512b each) | yes | yes (unchanged) |
+| Diagonal skew bank map | (introduced together) | yes |
+| Per-port pending+active address registers | 1 reg_idx × 2 | (1 reg_idx + 1 is_transpose) × 2 — adds **1 b × 2 × 8 ports = 16 FF** |
+| Read-port datapath: bank-select mux | 1 option (row-mode) | **2 options (row OR col)** — small 2-to-1 mux per port + col-mode address generator (`bank_i = 8·i + (p+cy+i) mod 8`) |
+| Read-port output rotator | 8-way 64 B (always active in row-mode) | **8-way 64 B, active only in row-mode**; bypassed in col-mode |
+| Metadata SRAM (256 × 32 b, 4R/2W) | — | **+10 K gate** |
+| Write-side | unchanged | unchanged |
+| Latency / throughput per port | 1 reg_idx / 8 cy | 1 reg_idx + 1 is_transpose / 8 cy (same epoch) |
+
+**Total v2 delta**: ~12 K gate (mostly metadata SRAM + col-mode address generator). The transpose capability adds **zero SRAM duplication** and **zero latency** to the basic read path.
+
+#### 9.2.4 Port allocation
+
+Port assignment across vector, cube, and MTE units is **identical to v1 §9.2** (table reproduced below for reference). The introduction of `is_transpose` does not change which port serves which client; it only changes the data delivery order on each read port.
+
+| Port | Cube active — MXFP4/HiFP4 | Cube active — FP16/BF16/FP8 | Cube idle |
+|------|----------------------------|------------------------------|-----------|
+| R0 | Cube A (1 tile/epoch) | Cube A (1 tile/epoch) | VEC-4K-v2 / MTE — free |
+| R1–R4 | Cube B operands | Cube B (R1–R2) | VEC-4K-v2 / MTE — free |
+| R5–R7 | Vector / MTE | Vector / MTE | Vector / MTE — free |
+| W0 | Cube C drain | Cube C drain | VEC-4K-v2 / MTE — free |
+| W1–W7 | Vector / MTE | Vector / MTE | Vector / MTE — free |
+
+VEC-4K-v2 binding: **R0 (Port A, with `is_xpose_A`)**, **R4 (Port B, with `is_xpose_B`)**, **W0 (D0)**, **W4 (D1)**. Mask `C` rides on whichever value-tile read port is idle (1–2 strips per fetch).
+
+---
+
+#### 8.3.11 Worked Examples: TSOFTMAX (Full-Tile ROM) and TSOFTMAX_VTG (VTG Variant)
+
+This section walks through two concrete instantiations of the same algorithm: (A) as a **full-tile VEC-4K-v2 instruction** driven by the microcode ROM, and (B) as a **VTG micro-instruction** driven by the micro-instruction buffer. Both execute the same five-pass TSOFTMAX algorithm; the difference is the scheduling context and where the beat-word sequence comes from.
+
+##### 8.3.11.1 Algorithm: TSOFTMAX Along the Row Axis
+
+For a tile shaped **8 × 128 FP32** (W = 512 B, R = 8), softmax along each row is:
+
+```
+Pass 1 -- row_max[i] = max(input[i][*])                              [col-reduce]
+Pass 2 -- diff[i][j]  = exp(input[i][j] - row_max[i])              [elementwise SUB + EXP]
+Pass 3 -- row_sum[i]  = SIGMA_j diff[i][j]                              [col-reduce]
+Pass 4 -- inv_sum[i]   = 1.0 / row_sum[i]                           [scalar RECIP]
+Pass 5 -- output[i][j] = diff[i][j] * inv_sum[i]                   [elementwise MUL]
+```
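
For reference, the five passes map onto the following plain-Python model. It is a numerical reference only (double precision, no strips, no accumulators), written to make the pass structure explicit; the function name is an illustrative choice.

```python
import math


def tsoftmax_reference(tile):
    """Row-wise softmax over a list of rows, following the five passes."""
    out = []
    for row in tile:
        row_max = max(row)                              # Pass 1: reduce MAX
        diff = [math.exp(x - row_max) for x in row]     # Pass 2: SUB + EXP
        row_sum = sum(diff)                             # Pass 3: reduce ADD
        inv_sum = 1.0 / row_sum                         # Pass 4: RECIP
        out.append([d * inv_sum for d in diff])         # Pass 5: MUL
    return out


# 8 x 128 example tile with small repeating values.
example = [[float((r * 128 + c) % 7) for c in range(128)] for r in range(8)]
result = tsoftmax_reference(example)
print(len(result), len(result[0]))
```

Subtracting `row_max` before the exponential keeps every exponent non-positive, which is why Pass 1 precedes Pass 2.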
+
+##### 8.3.11.2 ROM Entry for Full-Tile TSOFTMAX
+
+The VEC-4K-v2 microcode ROM is keyed by `(opcode, format, W-regime, R-regime)`. The TSOFTMAX ROM entry for FP32, W = 512 B (one strip), R = 8 is:
+
+```
+ROM[TSOFTMAX, FP32, W=512B, R=8] --> {
+  ucode_base:  <addr>
+  ucode_len:   42    <-- 9+8+9+1+8+7 = 42 beats
+  shape:       (R=8, C=128, E=4, format=FP32)
+  N_strips:    8
+  elem_per_strip: 128  (512 B / 4 B)
+}
+```
+
+**VECBeatWord format** (from `vector4k_v2.md` §5.4):
+
+```
+VECBeatWord {
+  src_A, src_B, src_Z : 3x3 b   <-- {SA, SB, SX, SY, ACC_LO, ACC_HI, IMM_ZERO, --}
+  s_A, s_B              : 2x3 b   <-- strip index 0..7
+  xp_A, xp_B            : 2x1 b   <-- 0 (row-mode; TSOFTMAX is purely row-axis)
+  mask_src               : 2 b     <-- {SC_mask, IMM_ALL_ONES, IMM_FROM_SOP}
+  mask_strip             : 3 b
+  alu_op                 : 5 b     <-- {ADD, SUB, MUL, MAX, PASS_A, RECIP, ...}
+  acc_op                 : 3 b     <-- {NONE, INIT, ACCUM, MERGE_STAGE, READOUT}
+  acc_slot               : 4 b     <-- 0..15
+  wr_en_D0, wr_en_D1    : 1 b each
+  wr_strip_D0, wr_strip_D1 : 3 b each
+}
+```
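
The field widths listed above sum to 42 bits. The sketch below packs and unpacks such a word to make the layout concrete. The field order (first field in the least-significant bits) and the helper names are assumptions; the referenced specification may order or pad the fields differently.

```python
# (name, width in bits), in the order the fields are listed above.
FIELDS = [
    ("src_A", 3), ("src_B", 3), ("src_Z", 3),
    ("s_A", 3), ("s_B", 3),
    ("xp_A", 1), ("xp_B", 1),
    ("mask_src", 2), ("mask_strip", 3),
    ("alu_op", 5), ("acc_op", 3), ("acc_slot", 4),
    ("wr_en_D0", 1), ("wr_en_D1", 1),
    ("wr_strip_D0", 3), ("wr_strip_D1", 3),
]
WORD_BITS = sum(width for _, width in FIELDS)


def pack(values):
    """Pack a {field: value} dict into one integer; missing fields are 0."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Inverse of pack: split an integer back into its fields."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


beat = {"s_A": 3, "alu_op": 7, "acc_slot": 1, "wr_en_D0": 1, "wr_strip_D0": 3}
print(WORD_BITS, unpack(pack(beat))["alu_op"])
```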
+
+##### 8.3.11.3 Beat-Word Sequence (Full-Tile TSOFTMAX, 42 Beats)
+
+```
+-- === PASS 1: row_max = max(input[i][*]) ===
+-- Beats 0-8: col-reduce via MAX on accumulator slot 0
+
+beat  0: INIT   src_A=SA, s_A=strip0,
+              src_B=ACC_LO, alu_op=PASS_A,
+              acc_op=INIT, acc_slot=0
+              -- acc[0] <-- SA[strip0] (128 FP32 elements)
+
+beats 1-7: ACCUM src_A=SA, s_A=strip[1..7],
+              src_B=ACC_LO, alu_op=MAX,
+              acc_op=ACCUM, acc_slot=0
+              -- acc[0] <-- MAX(acc[0], SA[strip_j]) pairwise
+
+beat  8: READOUT src_A=ACC_LO, alu_op=PASS_A,
+              acc_op=READOUT, acc_slot=0
+              -- broadcast row_max to all 128 lanes via SX
+
+-- === PASS 2: diff = exp(input - row_max) ===
+-- Beats 9-16: elementwise SUB, write to scratch tile T_scratch
+
+beats 9-16: src_A=SA, s_A=strip[0..7],
+              src_B=SX, alu_op=SUB,
+              mask_src=IMM_ALL_ONES,
+              acc_op=NONE,
+              wr_en_D0=1, wr_strip_D0=strip[0..7]
+              -- diff_strip[j] = SA[strip_j] - SX(row_max)
+              -- written to T_scratch via D0, one strip per beat
+              -- actual ROM folds SUB+EXP into a single EXP beat
+              -- with a preceding subtract (two beats per strip)
+
+-- === PASS 3: row_sum = SIGMA_j diff[i][j] ===
+-- Beats 17-25: col-reduce via ADD on accumulator slot 1
+
+beat 17: INIT   src_A=T_scratch, s_A=strip0,
+              src_B=ACC_LO, alu_op=PASS_A,
+              acc_op=INIT, acc_slot=1
+              -- acc[1] <-- T_scratch[strip0]
+
+beats 18-24: ACCUM src_A=T_scratch, s_A=strip[1..7],
+              src_B=ACC_LO, alu_op=ADD,
+              acc_op=ACCUM, acc_slot=1
+              -- acc[1] <-- ADD(acc[1], T_scratch[strip_j]) pairwise
+
+beat 25: READOUT src_A=ACC_LO, alu_op=PASS_A,
+              acc_op=READOUT, acc_slot=1
+              -- broadcast row_sum to all 128 lanes via SX
+
+-- === PASS 4: inv_sum = 1.0 / row_sum ===
+-- Beat 26: RECIP
+
+beat 26: src_A=SX, alu_op=RECIP,
+          acc_op=READOUT, acc_slot=1
+          -- inv_sum = RECIP(row_sum), broadcast via SY
+
+-- === PASS 5: output = diff * inv_sum ===
+-- Beats 27-34: elementwise MUL, retire to D0
+
+beats 27-34: src_A=T_scratch, s_A=strip[0..7],
+              src_B=SY, alu_op=MUL,
+              mask_src=IMM_ALL_ONES,
+              acc_op=NONE,
+              wr_en_D0=1, wr_strip_D0=strip[0..7]
+              -- out[strip_j] = T_scratch[strip_j] x SY(inv_sum)
+              -- retired to D0, one strip per beat
+
+-- === Finalize ===
+-- Beats 35-41: flush pending retire
+beats 35-41: wr_en_D0=1, wr_strip_D0=strip[0..7]
+
+Pipeline timing (full-tile TSOFTMAX):
+  Fetch prologue:     8 cycles  (TRegFile R0 + R4, 1 epoch)
+  Compute:           42 cycles  (5 passes)
+  Retire:             8 cycles  (W0, 1 epoch)
+  End-to-end:        ~58 cycles
+```
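
The beat ranges and the end-to-end figure can be cross-checked with a few lines of Python. The tuple layout and names below are illustrative; the ranges themselves are copied from the listing above, and the end-to-end sum assumes no overlap between prologue, compute, and retire.

```python
# (label, first beat, last beat), copied from the beat-word listing.
PASSES = [
    ("pass1_row_max", 0, 8),
    ("pass2_sub_exp", 9, 16),
    ("pass3_row_sum", 17, 25),
    ("pass4_recip", 26, 26),
    ("pass5_mul", 27, 34),
    ("finalize", 35, 41),
]
PROLOGUE_CYCLES = 8   # operand fetch, one epoch
RETIRE_CYCLES = 8     # writeback, one epoch


def beat_count(passes):
    """Total beats, checking that the ranges are contiguous from beat 0."""
    expected_first = 0
    for _, first, last in passes:
        if first != expected_first or last < first:
            raise ValueError("beat ranges must be contiguous")
        expected_first = last + 1
    return expected_first


compute_cycles = beat_count(PASSES)
end_to_end = PROLOGUE_CYCLES + compute_cycles + RETIRE_CYCLES
print(compute_cycles, end_to_end)
```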
+
+##### 8.3.11.4 VTG Variant: TSOFTMAX_VTG
+
+The VTG variant operates on **one VTG at a time** (one 256 B sub-range of the tile in G256 mode). It **reuses the same ROM entry** -- the microassembler parameterizes the beat-word template for the VTG's group context at decode time and caches it in the micro-instruction buffer.
+
+**VTG GVIQ entry at dispatch (D1/D2):**
+
+```
+TSOFTMAX_F32 Td.gN, Ts.gM
+  gviq.push({
+    block_id:    allocate_micro_block(),   -- 12 b
+    pc_index:   0,
+    tile_group:  Td,           -- architectural tile
+    phys_tile:   PTd,           -- renamed via Tile RAT
+    group_id:    N,            -- VTG index gN
+    group_mode:   G256,         -- 256 B per VTG, 16 VTGs per tile
+    thread_id:   0,
+    iter0..iter3: loop counters,
+    active_lanes: 64,          -- 256 B / 4 B per FP32
+    src0_ptag:   PTs,          -- source tile renamed via Tile RAT
+    src1_ptag:   PTs_scratch,  -- scratch tile renamed
+    dst_ptag:    PTd,          -- destination tile renamed
+    vtg_ready:   0,
+    branch_tag:  current_btag
+  })
+```
+
+**Microassembler at decode (D1/D2):**
+
+```
+-- Consult Tile Metadata RAT for source tile Ts --
+shape   = TileMetadataRAT[PTs].shape      -- (R=8, C=128)
+format  = TileMetadataRAT[PTs].format     -- = FP32
+W       = shape.C x E(format)              -- = 512 B
+R       = shape.R                          -- = 8
+
+-- Parameterize ROM entry for this VTG --
+-- G256 mode: tile split into 16 VTGs x 256 B each
+-- W=256B means: only 4 strips active (256 B / 512 B per strip)
+-- R=4 means:    4 rows, 4 strips
+-- Result: 26 beats instead of 42
+rom_key = (TSOFTMAX, format=FP32, W-regime=256B, R-regime=4)
+rom_entry = ROM.lookup(rom_key)
+
+for i in 0..25:
+    bw = rom_entry.beat_words[i]
+    bw.group_id       = N
+    bw.group_mode     = G256
+    bw.dst_ptag       = PTd
+    bw.src0_ptag      = PTs
+    bw.src1_ptag      = PTs_scratch
+    bw.output_range   = (N x 256, (N+1) x 256)
+    buffer.write(block_id=gviq_entry.block_id, pc_index=i, beat_word=bw)  -- reuse the block allocated at dispatch, not a new one per beat
+```
+
+**GVIQ issue and execution (P1/I1):**
+
+```
+-- VTG rotation scheduler (one VTG at a time) --
+winner = gviq.pick_oldest_ready()     -- age = (rid - head_rid) mod 64
+bw     = buffer.lookup(winner.block_id, winner.pc_index)
+
+-- Wait for VEC prologue to fill SA/SB with full 4 KB tile --
+-- (VEC-4K-v2 operand-fetch prologue: 8 cycles)
+-- VTG sub-range selector: byte-mux from SA/SB (4 KB) to 256 B --
+SA_full  = TRegFile.read(winner.src0_ptag)    -- 8 cy epoch
+SB_full  = TRegFile.read(winner.src1_ptag)  -- 8 cy epoch (scratch tile)
+
+-- At I2 (after prologue): select VTG sub-range at ALU input mux --
+vtg_A = SA_full[winner.group_id x 256 : (winner.group_id+1) x 256]
+vtg_B = SB_full[winner.group_id x 256 : (winner.group_id+1) x 256]
+
+-- Drive VEC ALU with this beat word (1 cycle) --
+result = VEC_alu.execute(bw, vtg_A, vtg_B)
+
+-- Group Write Adapter: full-tile RMW --
+--   Step 1: read old tile (8 cy)
+--   Step 2: merge VTG sub-range (combinational)
+--   Step 3: write merged tile (8 cy)
+--   Total: 16 cy minimum
+submit_group_write(winner.dst_ptag, winner.group_id, result, 256)
+
+winner.pc_index++
+if winner.pc_index > 25:
+    winner.valid = 0       -- retire after beat 25
+```
+
+**Comparison: Full-Tile vs. VTG TSOFTMAX:**
+
+| Aspect | Full-Tile TSOFTMAX | VTG TSOFTMAX_VTG |
+|--------|---------------------|-------------------|
+| Input size | 4 KB (1024 FP32) | 256 B (64 FP32) per VTG |
+| ROM key | `(TSOFTMAX, FP32, W=512B, R=8)` | `(TSOFTMAX, FP32, W=256B, R=4)` |
+| Beat count | 42 | 26 |
+| VTG count | 1 (single tile) | 16 (loop over all VTGs via GVIQ) |
+| Prologue | 8 cy (full epoch) | 8 cy (shared with VEC) |
+| Writeback | 8 cy (full tile write) | 16 cy (full-tile RMW) |
+| Total per op | ~58 cy | ~50 cy + 16 cy RMW = ~66 cy per VTG |
+| Throughput | 1 tile / 58 cy | 1 VTG / 66 cy; 16 VTGs sequentially via GVIQ |
+
+**Key architectural points illustrated by this example:**
+
+1. **ROM is the source of truth.** Both full-tile and VTG TSOFTMAX execute beat-word sequences that originate from the same ROM entry. VTG microassembler parameterizes the template at decode time and caches it in the micro-instruction buffer; full-tile does a ROM lookup at issue time. No separate VTG microcode path is needed.
+
+2. **Prologue is shared.** VTG does not introduce a new operand-fetch path. It submits a tile read request through the same R0/R4 ports as VEC-4K-v2, and the 8-cycle prologue fills SA/SB. VTG then reads from the already-staged data at the ALU input mux.
+
+3. **Group Write Adapter RMW is the writeback tax.** Every VTG write must perform a full-tile read-modify-write: 8 cy to read the old tile, merge the VTG sub-range, 8 cy to write the merged tile. This 16-cycle overhead is amortized across all 16 VTGs.
+
+4. **GVIQ rotation schedules one VTG at a time.** After TSOFTMAX_VTG finishes one VTG (beat 25), the GVIQ scheduler picks the next ready VTG (or the same VTG's next iteration if iterN > 1). Loop counters in the GVIQ entry prefix drive strip-mined iterations without re-entering the GVIQ.
+
+5. **`format` from ROM = `format` from Tile Metadata RAT.** The microassembler reads `format` from the Tile Metadata RAT at decode time to select the correct ROM entry. No separate VTG `elem_type` field is needed -- confirming the metadata overlay design (§9.2.5).
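
The read-modify-write in point 3 amounts to splicing one sub-range into an otherwise unchanged tile. The sketch below shows that merge step and the associated cycle cost; `group_write` and `VTG_WRITE_CYCLES` are hypothetical names, and the model treats the tile as a flat byte string.

```python
TILE_BYTES = 4096
VTG_WRITE_CYCLES = 8 + 8   # read old tile (8 cy) + write merged tile (8 cy)


def group_write(old_tile, group_id, data, size=256):
    """Return a new tile with one VTG sub-range replaced by data."""
    if len(old_tile) != TILE_BYTES or len(data) != size:
        raise ValueError("unexpected tile or VTG size")
    start = group_id * size
    if start < 0 or start + size > TILE_BYTES:
        raise ValueError("group_id out of range for this mode")
    # Bytes outside [start, start + size) are carried over unchanged.
    return old_tile[:start] + data + old_tile[start + size:]


merged = group_write(bytes(TILE_BYTES), group_id=9, data=b"\xff" * 256)
print(len(merged), VTG_WRITE_CYCLES)
```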
+
+---
+
+#### 9.2.5 VTG Sub-Unit and VTG Metadata Table (v2.2)
+
+> **(Change Point #2 -- hardware-revised)**
+
+Each 4 KB tile register is partitionable into **Vector Thread Groups (VTGs)** for SIMD-group execution. The VTG metadata **overlays the Tile Metadata RAT entry** (from §6.1) rather than being a separate table. The Tile Metadata RAT provides `shape.x`, `shape.y`, `format`; VTG-specific fields are added as extensions.
+
+**Unified metadata structure** (overlays Tile Metadata RAT):
+
+```text
+TileMetadataEntry (extended, 46+ b per physical tile):
+  -- From Tile Metadata RAT (§6.1):
+  shape.x:   14 b   -- columns C
+  shape.y:   14 b   -- rows R
+  format:     4 b   -- FP32/FP16/FP8/FP4 (same encoding as VTG elem_type -- NOT duplicated)
+  flags:      4 b   -- arg_tile, scalar_tile, prefetch_hint
+
+  -- VTG additions (overlay on Tile Metadata RAT):
+  group_mode:  1 b   -- G256=0, G512=1
+  pred_granule: 2 b  -- 8/16/32-bit lane grouping
+  -- Per-VTG validity (16 entries per tile, G256 mode):
+  vtg_meta[16]: {
+    valid:    1 b
+    defined:  1 b
+    dirty:    1 b
+    kind:     3 b   -- VEC | PRED | WIDE_LO | WIDE_HI | ALIGN_LD | SCRATCH | UNDEF
+  }
+```
+
+> **Note (v1.1 fix):** `elem_type` is NOT a separate VTG field -- it is the **same `format` field** from the Tile Metadata RAT. VTG uses `format` directly. `active_bytes` is computed from `shape.x × shape.y × E` and the VTG's position in the tile.
+
+**VTG byte mapping** (`G256` mode, 16 VTGs / tile):
+
+| VTG | Byte range | VTG | Byte range |
+|-----|-----------|-----|-----------|
+| `g0` | `[0, 255]` | `g8` | `[2048, 2303]` |
+| `g1` | `[256, 511]` | `g9` | `[2304, 2559]` |
+| `g2` | `[512, 767]` | `g10` | `[2560, 2815]` |
+| `g3` | `[768, 1023]` | `g11` | `[2816, 3071]` |
+| `g4` | `[1024, 1279]` | `g12` | `[3072, 3327]` |
+| `g5` | `[1280, 1535]` | `g13` | `[3328, 3583]` |
+| `g6` | `[1536, 1791]` | `g14` | `[3584, 3839]` |
+| `g7` | `[1792, 2047]` | `g15` | `[3840, 4095]` |
+
+In `G512` mode: `g0`=`[0,511]`, `g1`=`[512,1023]`, ..., `g7`=`[3584,4095]`.
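
Both mappings follow one rule: the byte range of `gN` is `N × size` through `(N + 1) × size − 1`. A small Python helper makes this explicit; the function name and the string mode keys are illustrative choices.

```python
TILE_BYTES = 4096
VTG_SIZE = {"G256": 256, "G512": 512}   # bytes per VTG in each group mode


def vtg_byte_range(group_id, mode="G256"):
    """Return the inclusive (first_byte, last_byte) range of VTG g<group_id>."""
    size = VTG_SIZE[mode]
    count = TILE_BYTES // size           # 16 VTGs in G256, 8 in G512
    if not 0 <= group_id < count:
        raise ValueError(f"{mode} has VTGs g0..g{count - 1}")
    first = group_id * size
    return first, first + size - 1


print(vtg_byte_range(9), vtg_byte_range(7, "G512"))
```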
+
+**VTG Metadata Table** (16 entries per physical tile):
+
+```text
+VTGMeta {
+  valid:        1 b,   // VTG contains defined data
+  kind:         3 b,   // VEC | PRED | WIDE_LO | WIDE_HI | ALIGN_LD | SCRATCH | UNDEF
+  group_mode:   1 b,   // G256=0, G512=1
+  elem_type:    4 b,   // FP32/FP16/FP8/FP4/INT32/...
+  active_bytes: 10 b,  // 0..256 (G256) or 0..512 (G512)
+  pred_granule: 2 b,  // 8/16/32-bit lane grouping
+  pred_mode:    1 b,   // 0=zeroing, 1=merging (default)
+  defined:      1 b,
+  dirty:        1 b,
+}
+```
+
+The VTG Metadata Table is read by the Group Read/Write Adapters to determine VTG validity and predicate granularity. It is updated on every VTG write.
+
+**Rename:** The Tile RAT maps architectural tile `Tg` -> physical tile `PT`. VTG `group_id` is a sub-location index into the renamed `PT`. A VTG micro-instruction writing `Tg.gN` may either update `PT.gN` in place (if the tile is uniquely owned and has no older readers) or allocate a fresh physical tile and merge the unchanged VTGs into it (copy-on-write, the conservative v1 policy).
+
+**Writeback rename:** VTG writeback performs a full-tile read-modify-write (16 cy minimum, see §8.3.10). The destination `ptag` remains unchanged -- only the VTG sub-range content is modified. No Tile RAT update is needed at writeback; the `dirty` bit in `vtg_meta[dst_ptag][group_id]` is set.
+
+---
+
+## 10. Out-of-Order Execution Model — Foundations
+
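+> **(v1 → v2: The foundational part of this chapter is copied in full from v1 §10. The v2 additions are the Tile Metadata RAT (referenced in §10.3) and the pointer from §10.6 to the extended speculative recovery mechanism in §11.)**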
+
+The Davinci-v2 core implements a **ROB-less out-of-order** execution model. Because the core does not need to maintain precise architectural state (no interrupts, no exceptions in run-to-completion kernels), it dispenses with the Reorder Buffer entirely. This section describes how instructions flow through the core and how correctness is maintained.
+
+### 10.1 Core Principles (BCC Scalar Pipeline)
+
+1. **OoO dispatch, OoO execution, OoO completion.** An instruction's result is committed to the physical register file as soon as execution completes. There is no in-order retirement stage.
+2. **False dependencies (WAW, WAR) eliminated by register renaming.** The SMAP (speculative map) maps each atag to a unique ptag, so no instruction ever overwrites another's live data. The Tile RAT (32->256) and Tile Metadata RAT are unchanged from v2.
+3. **True dependencies (RAW) resolved by Ready Table tag-based wakeup.** Instructions wait for source ptag readiness via the Ready Table bitmap (O(1) lookup per ptag). Tile-domain instructions wait for Tile RAT ready bits signaling physical tile completion.
+4. **Branch recovery via MapQ reverse replay.** On mispredict, the SMAP is restored to CMAP state by replaying MapQ entries in reverse order from the flush_rid. All younger instructions are flushed from physical IQs via branch_tag CAM-clear.
+5. **Physical registers freed by reference counting.** No ROB means no retirement-based freeing; instead, a ptag is freed when it is both *orphaned* (no longer mapped by SMAP) and its refcount reaches zero.
+6. **Ready Table provides O(1) ptag readiness.** The 128-bit Ready Table bitmap replaces the 384-entry CDB comparator array for scalar wakeup. Each ptag's readiness is a single bit-test.
+7. **Speculative memory side effects gated by branch tag.** Speculative scalar stores live in the SSB and speculative bulk tile stores live in the STQ until their `branch_tag` resolves to non-speculative. See Section 11.
+8. **(v2.3 Block-ROB addition) Block-granularity precise exceptions via BROB.** The Block Reorder Buffer (BROB) tracks instruction blocks (BSTART to BSTOP) and provides precise exception identification: when a fault is detected, the faulting block is identified, younger blocks are squashed, and register state is recovered via MapQ reverse replay from the faulting RID. See §11.11.
+
+### 10.2 Instruction Lifecycle (BCC Scalar Pipeline)
+
+```
+  ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐
+  │Fetch │──>│Decode│──>│Rename│──>│ Disp │──>│Issue │──>│ Exec │──>│  WB  │
+  │F0-F4 │   │  D1  │   │D2  D3│   │S1  S2│   │P1I1I2│   │ E1-n │   │  W1  │
+  └──────┘   └──────┘   └──────┘   └──────┘   └──────┘   └──────┘   └──────┘
+                           │                                           │
+                    D1: RID/atag + BROB entry alloc (BSTART)    W1: Ready Table update
+                    D2: SMAP read + ptag alloc + MapQ push          + CDB / TCB broadcast
+                    D3: SMAP write + RT init + BID stamp            + Block SSB/STQ transfer
+                                                                BROB Retire (off critical path)
+```
+
+Detailed per-stage actions:
+
+| Stage | Actions |
+|-------|---------|
+| **Fetch (F0-F4)** | PC -> I-cache + branch predictor; receive 4 instructions; stitch + IB buffer; F4 handoff |
+| **Decode (D1)** | Decode opcode, identify domain, allocate RID (6-bit program order), resolve atag for sources, classify operands (P/T/U) |
+| **Rename (D2)** | Read SMAP for source ptags; allocate new ptag from free list; update SMAP (live); push MapQ entry; T/U stack push for tile operands. If branch: allocate branch_tag from 8-slot pool. |
+| **Rename (D3)** | Write SMAP (committed state); initialize Ready Table source-ready bits; assign IQ routing |
+| **Dispatch (S1)** | Check free list vacancy, MapQ space, IQ vacancy per routing target |
+| **Dispatch (S2)** | Write IQ entries (alu_iq / bru_iq / lsu_iq); update free list; advance MapQ head |
+| **Issue (P1)** | Ready Table bitmap query (O(1) per ptag); age-matrix cascaded pick selects oldest-ready per IQ |
+| **Issue (I1)** | Physical RF read-port arbitration across 7 issue slots |
+| **Issue (I2)** | Confirm IQ entry deallocation; confirm RF port occupancy |
+| **Execute (E1-EX_n)** | Compute result (variable latency). Scalar stores deposit into SSB; MTE bulk stores deposit into STQ. |
+| **Writeback (W1)** | CDB/TCB broadcast; write to physical RF; Ready Table update (set ready bit); wakeup dependents; free orphans |
+
+### 10.3 Rename Register State (atag / ptag / SMAP / CMAP / MapQ)
+
+The BCC scalar pipeline replaces the Scalar RAT with a **three-table model**: CMAP (committed map), SMAP (speculative map), and MapQ (an incremental log of speculative renames). The Tile RAT and Tile Metadata RAT are unchanged from v2.
+
+**Rename example (4-wide, 3-stage D1/D2/D3):**
+
+```
+  Instruction stream:
+    i0:  ADD  X5, X2, X3
+    i1:  MUL  X6, X5, X7
+    i2:  SUB  X5, X8, X9
+    i3:  ADD  X10, X5, X6
+
+  Before D1:
+    CMAP: X2->P2, X3->P3, X5->P5, X6->P41, X7->P7, X8->P8, X9->P9
+    SMAP: (matches CMAP initially)
+    Ready Table: all allocated ptags ready
+
+  D1 (Decode + RID/atag allocation):
+    i0: RID=0, atag=X5,X2,X3;  i1: RID=1, atag=X6,X5,X7; ...
+
+  D2 (Rename Request):
+    i0: src0=P2, src1=P3, dst=P80 (new);  SMAP[X5]<=P80;  P5 orphan;  MapQ<= {X5, P5, P80, RID=0}
+    i1: src0=P80 (bypass), src1=P7, dst=P81 (new);  SMAP[X6]<=P81;  MapQ<= {X6, P41, P81, RID=1}
+    i2: src0=P8, src1=P9, dst=P82 (new);  SMAP[X5]<=P82;  P80 orphan;  MapQ<= {X5, P80, P82, RID=2}
+    i3: src0=P82 (bypass), src1=P81 (bypass), dst=P83 (new);  SMAP[X10]<=P83
+
+  D3 (Rename Complete):
+    SMAP committed: X2->P2, X3->P3, X5->P82, X6->P81, X7->P7, X8->P8, X9->P9, X10->P83
+
+  Ready Table at S2 dispatch:
+    Clear bits: P80, P81, P82, P83 (not yet ready)
+
+  On CDB writeback (e.g. i0 completes):
+    Ready Table.set(P80)  -> P80 ready
+    -> P1 can_issue recomputed -> P1 picks i1 (src0=P80 now ready)
+
+  On orphan detection (i2 renamed X5->P82; i0's result P80 now orphan):
+    orphan[P80] <= 1;  when i1 (the reader of P80) is done: refcount[P80]--
+    -> if orphan[P80] && refcount[P80]==0: free_list.push(P80)
+```
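+
+The scalar example above can be replayed in software. The following is a hedged Python sketch, not the RTL: `rename` is a hypothetical helper, the free list is assumed to hand out P80..P83 in order, and a MapQ entry is pushed for every destination (the listing above omits the push for i3, whose old mapping is not given, so it is recorded as `None`).
+
+```python
+from collections import deque
+
+def rename(instrs, smap, free_list):
+    """Rename a group in program order; returns (renamed, mapq)."""
+    renamed, mapq = [], []
+    for rid, (op, dst, src0, src1) in enumerate(instrs):
+        # Sources read the live SMAP first, so a mapping written by an older
+        # instruction of the same group is seen (the "bypass" cases above).
+        psrcs = (smap[src0], smap[src1])
+        new_ptag = free_list.popleft()
+        old_ptag = smap.get(dst)      # None when dst has no listed mapping
+        smap[dst] = new_ptag          # old_ptag is now an orphan
+        mapq.append((dst, old_ptag, new_ptag, rid))
+        renamed.append((op, new_ptag, *psrcs))
+    return renamed, mapq
+
+smap = {"X2": "P2", "X3": "P3", "X5": "P5", "X6": "P41",
+        "X7": "P7", "X8": "P8", "X9": "P9"}
+free_list = deque(f"P{n}" for n in range(80, 84))  # assumed allocation order
+instrs = [("ADD", "X5", "X2", "X3"), ("MUL", "X6", "X5", "X7"),
+          ("SUB", "X5", "X8", "X9"), ("ADD", "X10", "X5", "X6")]
+renamed, mapq = rename(instrs, smap, free_list)
+```
+
+After the call, `smap["X5"]` is `"P82"` and `renamed[3]` reads P82 and P81, which is the D3 state shown in the listing.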
+
+**Tile RAT rename example** (unchanged from v2, shown for completeness):
+
+```
+  Instruction stream:
+    i0:  TILE.LD  T10, [X5]
+    i1:  TILE.LD  T20, [X6]
+    i2:  VADD     T10, T10, T20
+    i3:  TILE.ST  [X7], T10
+
+  Before rename:
+    Tile RAT: T10->PT10, T20->PT20
+
+  Rename (D2, unchanged from v2):
+    i0: scalar src=<from SMAP>, tile dst=PT100 (new);  Tile RAT: T10->PT100;  PT10 orphaned
+    i1: scalar src=<from SMAP>, tile dst=PT101 (new);  Tile RAT: T20->PT101;  PT20 orphaned
+    i2: tile src1=PT100 (bypass), tile src2=PT101 (bypass), tile dst=PT102 (new);  Tile RAT: T10->PT102
+    i3: tile src=PT102 (bypass);  no tile dst (store)
+
+  After rename:
+    Tile RAT: T10->PT102, T20->PT101
+```
+
+**Tile Metadata RAT** (unchanged from v2): A 256 x 32 b SRAM holding (shape.x, shape.y, format) per physical tile. Updated by retire-time writes; read together with the tile's first strip during operand fetch.
+
+### 10.4 Ready Table + CDB / TCB
+
+The BCC scalar pipeline introduces the **Ready Table** (128-bit bitmap) as the primary scalar wakeup mechanism. The CDB and TCB remain for result broadcast and tile completion.
+
+**Ready Table** (described in full in Section 7.3):
+
+```
+Ready Table: 128-bit bitmap
+  bit[i] = 1: ptag i has a valid value (ready)
+  bit[i] = 0: ptag i is waiting for writeback
+
+set(ptag):    mask |= (1 << ptag)     # On CDB writeback
+clear(ptag):  mask &= ~(1 << ptag)    # On ptag allocation at D2
+read(i):       return (mask >> i) & 1  # Combinational, O(1)
+```
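+
+The three operations above map directly onto integer bit manipulation. A minimal Python model, offered as a sketch: the class name, the `can_issue` helper, and the all-ones reset value are assumptions made for illustration.
+
+```python
+class ReadyTable:
+    """128-bit readiness bitmap, one bit per scalar ptag."""
+    NUM_PTAGS = 128
+    ALL_ONES = (1 << NUM_PTAGS) - 1
+
+    def __init__(self):
+        self.mask = self.ALL_ONES          # assumed reset: every ptag ready
+
+    def set(self, ptag):                   # on CDB writeback
+        self.mask |= 1 << ptag
+
+    def clear(self, ptag):                 # on destination ptag allocation
+        self.mask &= ~(1 << ptag) & self.ALL_ONES
+
+    def read(self, ptag):                  # single bit test
+        return (self.mask >> ptag) & 1
+
+    def can_issue(self, src_ptags):
+        """An IQ entry may be picked once every source ptag is ready."""
+        return all(self.read(p) for p in src_ptags)
+```
+
+With P80 cleared at allocation, an entry reading P80 and P7 is blocked until `set(80)` is called at writeback.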
+
+**CDB / TCB** (unchanged from v2 except for Ready Table integration):
+
+| Parameter | Value |
+|-----------|-------|
+| CDB ports | **6** (4 ALU + 1 MUL/LSU + 1 TILE.GET) |
+| Broadcast width | 7-bit ptag + 64-bit data per port |
+| Snoop points | Ready Table update (not per-RS-entry comparison) |
+
+When the CDB broadcasts a result (ptag, data), two things happen:
+
+1. **Ready Table update**: `ready.set(ptag)` sets the ptag's bit in the bitmap.
+2. **Physical RF write**: the data is captured at the destination ptag.
+
+The **Tile Completion Bus (TCB)** is unchanged from v2 (4 ports, 8-bit tile tag), and so is its port allocation.
+
+### 10.5.1 VTG Dependency and VTG-Ready Bits (v2.2)
+
+> **(Change Point #2 -- hardware-revised)**
+
+VTG vector micro-instructions have a two-level dependency model. VTG writeback is a **full-tile read-modify-write** (16 cy minimum); the destination tile's VTG ready bit is set only once that writeback has completed.
+
+| Level | Token | Purpose |
+|-------|-------|---------|
+| **Physical tile tag** | `ptag` (8 b) | Coarse readiness: entire tile is ready |
+| **VTG ready bit** | per-VTG sub-location bit in VTG Metadata Table | Fine readiness: specific VTG inside the tile is ready |
+
+The **VTG Ready Table** is a **256-bit bitmap** (one bit per physical tile PT0..PT255), similar in structure to the scalar Ready Table. It tracks tile-level writeback readiness: at VTG dispatch the destination `ptag` bit is cleared, and at VTG writeback it is set again.
+
+Inside the GVIQ, each entry's `src_ready` field tracks per-VTG readiness independently from tile-level readiness:
+
+```text
+vtg_ready = src_ready[0] & src_ready[1] & src_ready[2] & src_ready[3]
+           & loop_counters_ready
+```
+
+This two-level model allows a tile to contain some VTGs that are ready and others that are not — essential for warp-rotated scheduling where different VTGs are at different loop iterations.
+
+**VTG write policies:**
+
+| Policy | Trigger | Action |
+|--------|---------|--------|
+| In-place VTG write | Unique tile ownership, no older readers | Group Write Adapter reads old tile, merges VTG sub-range, writes merged tile. W0 occupied for 16 cy (RMW). |
+| Copy-on-write tile group | Shared tile or speculative update | Allocate fresh PT; Group Write Adapter reads old tile, merges all unchanged VTGs, writes to fresh PT. W0 occupied for 16 cy. |
+| Read-modify-write VTG | Merging predication | Fetch old destination VTG via Group Read Adapter; merge inactive lanes |
+| Fresh group define | Load or producer overwrites enough VTGs | Allocate fresh tile; mark all VTGs defined |
+
+**Scalar ordering without `VWAIT`:** No software-visible `VWAIT` instruction exists in v2.2. Ordering between VTG micro-ops and scalar ops is enforced by normal `ptag` / `src_ready` dependency tracking through the GVIQ.
+
+### 10.5 Physical Register Freeing (Reference Counting)
+
+> **(v1 -> v2 BCC: SMAP replaces Scalar RAT for the P-reg freeing path; Tile RAT unchanged. The refcount mechanism itself is identical to v1.)**
+
+```
+  Per physical scalar register (128 entries):
+    orphan (1 bit) | refcount (4 bits)
+
+  Per physical tile register (256 entries):
+    orphan (1 bit) | refcount (3 bits)
+
+  State machine (same for both):
+    MAPPED:   SMAP/Tile RAT points to this register; refcount tracks in-flight readers
+    ORPHAN:   SMAP/Tile RAT no longer points here (remapped); refcount may be > 0
+    FREE:     orphan AND refcount == 0 -> returned to free list
+```
+
+**Lifecycle events:**
+
+| Event | orphan | refcount | Action |
+|-------|--------|----------|--------|
+| Allocated as destination at D2 | 0 | 0 | Added to SMAP mapping |
+| Instruction reads this register (dispatched to IQ) | -- | +1 | Reader registered |
+| Reader completes execution (at I1/I2) | -- | -1 | Reader done |
+| SMAP remaps atag to new ptag | 1 | -- | Old mapping becomes orphan |
+| refcount reaches 0 while orphan=1 | 1 | 0 | **Free**: return to free list |
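+
+The lifecycle table can be read as a small state machine. A Python sketch under stated assumptions: the class and method names are hypothetical, and one instance would model either the 128-entry scalar file or the 256-entry tile file.
+
+```python
+class PhysRegTracker:
+    """orphan / refcount bookkeeping for one physical register file."""
+
+    def __init__(self, num_regs):
+        self.orphan = [False] * num_regs
+        self.refcount = [0] * num_regs
+        self.free_list = []
+
+    def allocate(self, ptag):       # allocated as destination at D2
+        self.orphan[ptag] = False
+        self.refcount[ptag] = 0
+
+    def add_reader(self, ptag):     # a reader was dispatched to an IQ
+        self.refcount[ptag] += 1
+
+    def reader_done(self, ptag):    # the reader no longer needs the value
+        self.refcount[ptag] -= 1
+        self._try_free(ptag)
+
+    def remap(self, old_ptag):      # the map now points at a newer ptag
+        self.orphan[old_ptag] = True
+        self._try_free(old_ptag)
+
+    def _try_free(self, ptag):
+        # FREE requires both conditions: orphaned and no in-flight readers.
+        if self.orphan[ptag] and self.refcount[ptag] == 0:
+            self.free_list.append(ptag)
+```
+
+The order of `remap` and `reader_done` does not matter: whichever happens last returns the register to the free list.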
+
+**Branch misprediction and ref-counts:** When a mispredict occurs, all instructions younger than the branch are flushed. MapQ reverse replay restores SMAP to CMAP state, reclaiming all speculatively allocated ptags. Physical registers allocated as destinations by flushed instructions are returned directly to their respective free lists. The Ready Table is reset to ALL_ONES (conservative).
+
+### 10.6 Branch Recovery (BCC Scalar Pipeline)
+
+On a branch mispredict, the BCC scalar pipeline recovers via **MapQ reverse replay** + **branch-tag CAM-clear** on all physical IQs.
+
+```
+  ┌──────────────────────────────────────────────────────────────┐
+  │  Branch Misprediction Recovery (BCC Scalar Pipeline)         │
+  │                                                              │
+  │  Cycle 0: Branch resolves as MISPREDICTED at EX1             │
+  │    -> flush_rid = branch.rid (from IQ entry)                 │
+  │    -> flush_btag = branch.branch_tag (3-bit)                 │
+  │                                                              │
+  │  Cycle 1: Recovery actions (all in parallel):                │
+  │                                                              │
+  │    (a) MapQ reverse replay:                                  │
+  │        for each MapQ entry (youngest to oldest):             │
+  │          if entry.rid > flush_rid:                           │
+  │            SMAP[entry.atag] = entry.old_ptag                 │
+  │            refcount[entry.new_ptag]--                        │
+  │            refcount[entry.old_ptag]++                        │
+  │            entry.valid = 0                                   │
+  │          else: break                                         │
+  │        -> SMAP == CMAP (exact committed state)               │
+  │                                                              │
+  │    (b) Physical IQ CAM-clear:                                │
+  │        for each alu_iq / bru_iq / lsu_iq entry:              │
+  │          if entry.branch_tag >= flush_btag:                  │
+  │            entry.valid = 0                                   │
+  │                                                              │
+  │    (c) Ready Table: mask <= ALL_ONES                         │
+  │        (conservative; all ptags become temporarily untrusted)│
+  │                                                              │
+  │    (d) SSB flush: entries with btag >= flush_btag invalid    │
+  │                                                              │
+  │    (e) STQ flush: entries with btag >= flush_btag invalid    │
+  │                                                              │
+  │    (f) Tile RAT: unchanged from v2 (independent domain)      │
+  │                                                              │
+  │    (g) Free list: restore head from CMAP state               │
+  │                                                              │
+  │  Cycle 2: Redirect fetch PC to correct branch target         │
+  │                                                              │
+  │  Cycle 3+: New instructions begin entering F0                │
+  │                                                              │
+  │  Total penalty: 6-7 cy (MapQ replay parallel with others)    │
+  └──────────────────────────────────────────────────────────────┘
+```
+
+The MapQ replay is O(depth) = 12 iterations maximum. All recovery actions run in parallel within the single recovery cycle.
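+
+Step (a) of the recovery can be modelled as a stack unwind. This is a sketch under assumptions: the 6-bit RID is compared without wrap-around handling, the refcount updates are left out, and the dictionary layout of a MapQ entry is illustrative.
+
+```python
+def mapq_reverse_replay(mapq, smap, flush_rid):
+    """Undo every rename younger than flush_rid, youngest first.
+
+    mapq is ordered oldest -> youngest; each entry is a dict with the keys
+    atag, old_ptag, new_ptag, rid. Returns the reclaimed new ptags.
+    """
+    reclaimed = []
+    while mapq and mapq[-1]["rid"] > flush_rid:
+        entry = mapq.pop()
+        smap[entry["atag"]] = entry["old_ptag"]   # restore the older mapping
+        reclaimed.append(entry["new_ptag"])       # goes back to the free list
+    return reclaimed
+```
+
+Replaying youngest first matters when one atag was renamed more than once: the last write to `smap` must come from the oldest flushed entry so that the surviving mapping is the one that preceded the flush point.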
+
+## 11. Speculative Execution Recovery Without a Reorder Buffer
+
+> **Question:** The v1 design eliminates the Reorder Buffer (ROB) by leveraging the no-precise-exception envelope, using the Reservation Station + reference-counting + RAT-checkpoint trio for OoO execution. v2 adds branch-prediction-driven **speculative execution** to extend the OoO window past unresolved branches. Can we do this safely — i.e., guarantee that a misspeculated path **never** corrupts architectural state — **without** introducing a ROB?
+>
+> **Answer: Yes, with two additional structures (Speculative Store Buffer and Speculative Tile-Store Queue) plus a small Branch-Tag Ancestry Tracker. This section proves it and details the mechanisms.**
+
+### 11.1 What the ROB traditionally does
+
+Textbook OoO processors use a Reorder Buffer to provide three services bundled together:
+
+1. **Precise architectural state** — every instruction completes (writes to architecturally-visible state) **in program order**. On exception, the ROB lets the processor identify the exact instruction that faulted and discards everything younger.
+2. **Speculative recovery for memory-side effects** — stores remain in the ROB (or a coupled store queue) until they retire in order; mispredicted-path stores are simply not retired.
+3. **Resource freeing in program order** — physical registers / tiles are returned to the free list when the consuming arch-reg's old mapping retires.
+
+The Davinci-v1 design noted that **service (1) is unneeded** in the run-to-completion AI-kernel envelope, that **service (3) is replaced by reference counting**, and that **service (2) does not arise** because the processor doesn't speculate past an unresolved branch (it instead flushes the pipeline at every mispredict — only flushing **non-stored** state, since v1 stores commit to L1-D OoO once the producing branch has resolved).
+
+The v1 caveat — *"don't speculate past an unresolved branch"* — is restrictive: it limits the OoO window to the time between branches, which is typically only 5–10 instructions in tight scalar loops. v2 lifts this restriction by allowing instructions younger than an unresolved branch to dispatch, issue, and execute speculatively. **Doing this correctly requires service (2) to be re-introduced — but only service (2), not (1) or (3).**
+
+### 11.2 Categorizing speculative state
+
+When an instruction executes speculatively (i.e. depends on an unresolved branch), its effects fall into one of three classes:
+
+| Class | What it touches | Who recovers it on flush | Currently handled by |
+|-------|-----------------|---------------------------|----------------------|
+| **A. Renamed register / tile state** | Writes to a ptag (P0–P127), a physical tile (PT0–PT255), or a metadata RAT entry | Returns to free list once orphan + refcount=0 (no ROB needed) | **MapQ reverse replay + SMAP restore + refcount + free-list restore** (BCC scalar pipeline) |
+| **B. Pipeline state** | Occupies an RS slot; in flight in EX stages; consumes CDB/TCB cycles | Branch-tag CAM-clear invalidates the RS entry; in-flight EX is flushed at WB | Branch-tag stamping at D2 (§5.1, §6.2) |
+| **C. Externally-visible state** | Writes to L1-D / L2 cache, scatters to memory, MMIO accesses, fences/barriers, cross-core observable ordering | **Cannot be recovered** once it leaves the core | **NEW: SSB (§11.4) and STQ (§11.5) gate these to never *reach* memory until non-speculative** |
+
+Class A is fully handled by the rename-recovery path. MapQ reverse replay (the BCC replacement for the v1 RAT flash-restore) combined with refcounting is **logically equivalent** to a ROB's "retire on commit" for register/tile state, because:
+
+- A misspeculated destination is simply *never* the architectural mapping (the SMAP restore overwrites it).
+- A misspeculated source-register read either consumed a valid value (which produces no side effect, just a wasted compute) or stalled until the branch resolved (in which case the RS entry is flushed before the read).
+
+Class B is fully handled by the branch-tag CAM-clear: every RS entry tagged with the mispredicted branch (or with any *younger* branch in its dependency chain) is invalidated in one cycle, and the corresponding physical-register/tile destinations roll back to the free list.
+
+**Class C is the only class that needs a new mechanism**, and the new mechanism only needs to ensure that any class-C effect *of a speculative instruction* is **delayed** until the instruction is known to be on the correct path. This is the core insight: **we don't need a ROB to track instruction order globally; we only need to gate the externally-visible side effects by branch tag**.
+
+### 11.3 The Branch-Tag Speculation Tracker
+
+The processor maintains a **branch-tag tracker** — a small structure with three components:
+
+```
+  ┌───────────────────────────────────────────────────────────────┐
+  │  Speculation Tracker  (always 8 active branch tags max)       │
+  │                                                               │
+  │  (a) Tag-state vector: 8 × 2 b state                          │
+  │       0 = free                                                │
+  │       1 = speculative (branch not yet resolved)               │
+  │       2 = correct (branch resolved correctly; tag draining)   │
+  │       3 = wrong (branch mispredicted; tag flushing)           │
+  │                                                               │
+  │  (b) Ancestry bitmap: 8 × 8 b square matrix                   │
+  │       anc[i][j] = 1 iff branch i is an ancestor of branch j   │
+  │       (i.e. j was fetched while i was unresolved)             │
+  │       Allocated at D2 when each new branch is renamed.        │
+  │                                                               │
+  │  (c) Instruction-tag map: maintained in RS entries' btag      │
+  │       field (3 b each). Already part of every RS entry §7.2.  │
+  └───────────────────────────────────────────────────────────────┘
+```
+
+#### 11.3.1 Tracker operation
+
+- **Branch enters D2:** allocate a free tag `t`. Set `state[t] = speculative`. Snapshot ancestry into column `t`: for every tag `i`, `anc[i][t] = anc[i][parent_t] | (i == parent_t)`, where `parent_t` is the youngest still-speculative branch's tag (the column is all zeros if there is none). The branch's RS entry stamps `btag = t`. All instructions following until the next branch also stamp `btag = t`.
+- **Branch resolves correctly at EX1:** set `state[t] = correct`. The drain logic propagates this to SSB / STQ entries, which then clear their `btag` (replacing it with the next-older speculative tag, if any). When `btag = 0xFF` (no older speculative branches remain), the entry becomes `drain_rdy`.
+- **Branch mispredicts at EX1:** set `state[t] = wrong`. **Atomically** clear all RS / SSB / STQ entries with `btag` matching `t` *or* matching any descendant of `t` (computed from `anc[t][:]`). The SMAP restore (MapQ reverse replay, §10.6) completes in the same cycle.
+- **Tag freed:** when an entry transitions from speculative to non-speculative *and* fully drains (or is invalidated), the corresponding tag's allocation in the tracker is released. A tag is freed when no in-flight entry references it.
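+
+The allocate and mispredict steps can be sketched in Python. The model below follows the §11.3 definition (`anc[i][j]` is set iff branch `i` is an ancestor of branch `j`); the class and method names are hypothetical, and tag draining and freeing are left out.
+
+```python
+FREE, SPECULATIVE, CORRECT, WRONG = range(4)
+NUM_TAGS = 8
+
+class SpeculationTracker:
+    def __init__(self):
+        self.state = [FREE] * NUM_TAGS
+        # anc[i][j] is True iff branch i is an ancestor of branch j
+        self.anc = [[False] * NUM_TAGS for _ in range(NUM_TAGS)]
+
+    def allocate(self, parent=None):
+        """Branch enters D2; parent is the youngest unresolved tag, if any."""
+        t = self.state.index(FREE)          # ValueError if all 8 tags are busy
+        self.state[t] = SPECULATIVE
+        self.anc[t] = [False] * NUM_TAGS    # a new branch has no descendants
+        for i in range(NUM_TAGS):
+            # t inherits every ancestor of its parent, plus the parent itself
+            self.anc[i][t] = parent is not None and (
+                i == parent or self.anc[i][parent])
+        return t
+
+    def descendants(self, t):
+        return {j for j in range(NUM_TAGS) if self.anc[t][j]} | {t}
+
+    def mispredict(self, t):
+        """Mark t wrong and return the tag set to CAM-clear in RS/SSB/STQ."""
+        self.state[t] = WRONG
+        return self.descendants(t)
+```
+
+With three nested branches allocated in order, a mispredict of the middle one returns its own tag and the youngest tag, and leaves the oldest branch untouched.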
+
+#### 11.3.2 Tracker area
+
+| Component | Size | Gate count |
+|-----------|------|------------|
+| Tag-state vector | 8 × 2 b = 16 FF | ~200 gate |
+| Ancestry bitmap | 64 b register | ~700 gate |
+| State-update FSM (allocate, resolve, mispredict) | -- | ~3 K gate |
+| Drain-broadcast wiring (to SSB, STQ, vector RS, etc.) | -- | ~1 K gate |
+| **Total** | | **~5 K gate** (negligible — ~0.001 mm² @ 5 nm) |
+
+### 11.4 Speculative Store Buffer (SSB) — Scalar Memory Path
+
+The SSB is the **gate** between the LSU's store pipeline and the L1-D cache. Already introduced in §8.2.1; this subsection details the speculation-recovery mechanism it enables.
+
+#### 11.4.1 Allocation and population
+
+```
+  D2 (rename) of a scalar store instruction:
+    Allocate SSB slot k from the free pool (FIFO order)
+    Set SSB[k] = {valid=1, btag=current_btag, addr=⊥, data=⊥, drain_rdy=0}
+    Stamp the LSU-RS entry with ssb_idx = k
+
+  IS / EX (issue + execute):
+    LSU computes addr and reads data from physical RF (or captures from CDB)
+    SSB[k].addr ← addr
+    SSB[k].data ← data
+    SSB[k].size ← size
+    LSU-RS entry retires (released; no CDB broadcast for stores)
+```
+
+At this point the store's address and data are **fully resolved**, but the store has **not** committed to L1-D. The store's effect on memory is **isolated within the SSB**.
+
+#### 11.4.2 Drain to L1-D (when non-speculative)
+
+```
+  When tag t becomes correct:
+    For each SSB[k] with valid=1 && btag=t:
+      btag_new = next-older speculative tag in this entry's history (from §11.3 ancestry)
+      if btag_new == 0xFF (no older speculation): SSB[k].drain_rdy ← 1
+      else: SSB[k].btag ← btag_new (still speculative)
+
+  Drain pump (1 store per cycle):
+    Pick oldest SSB[k] with valid && drain_rdy
+    Issue write to L1-D pipeline (same path as v1 store commit)
+    On L1-D ack: SSB[k].valid ← 0, slot returned to free pool
+```
+
+#### 11.4.3 Forwarding to loads
+
+Loads can forward from the SSB on address match, with the **branch-ancestry constraint**:
+
+```
+  For a load (addr, btag_load):
+    For each SSB entry e with e.valid && addr_match(e.addr, addr) && size_compatible(e):
+      if e.btag == 0xFF:                      # store is non-speculative — always OK
+         forward
+      elif anc[e.btag][btag_load]:           # store is speculative on load's ancestry chain
+         forward
+      else:                                   # store is on a different speculation path
+         skip (don't forward; load goes to L1-D, where it sees the pre-mispredict view)
+```
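+
+The ancestry part of the forwarding rule can be isolated as a pure predicate. A hedged Python sketch: it additionally lets a store and a load that carry the same tag forward, and never forwards speculative data to a non-speculative load. Both behaviours are assumptions that the pseudocode above leaves implicit; address and size matching are assumed to have happened already.
+
+```python
+NON_SPEC = 0xFF  # btag value used in this document for "non-speculative"
+
+def may_forward(store_btag, load_btag, anc):
+    """Return True if the load may take its data from this SSB entry."""
+    if store_btag == NON_SPEC:
+        return True                  # non-speculative store: always safe
+    if load_btag == NON_SPEC:
+        return False                 # assumption: no spec data to a non-spec load
+    if store_btag == load_btag:
+        return True                  # assumption: same speculation region
+    return anc[store_btag][load_btag]  # store's branch is the load's ancestor
+```
+
+Any other combination falls through to L1-D, which holds only non-speculative data.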
+
+Address-ambiguous stores (older stores still computing addresses) cause the load to wait, as in v1.
+
+#### 11.4.4 Mispredict invalidation
+
+```
+  When tag t becomes wrong (mispredict at EX1):
+    descendants = {j : anc[t][j] = 1} ∪ {t}
+    
+    For each SSB[k] with valid && btag ∈ descendants:
+      SSB[k].valid ← 0   (entry invalidated; never reaches L1-D)
+    SSB free-list head pointer ← restored from checkpoint[t]
+```
+
+Critically, **invalidated stores are silently discarded** — no memory write was issued. Memory state is unaffected by the misspeculated path.
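+
+The drain pump of §11.4.2 and the invalidation above fit in a few lines of Python. This is a sketch, not the hardware: `SSBEntry`, `invalidate` and `drain_one` are hypothetical names, a plain list in allocation order stands in for the FIFO, and a dictionary stands in for L1-D.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class SSBEntry:
+    btag: int
+    addr: int
+    data: int
+    valid: bool = True
+    drain_rdy: bool = False
+
+def invalidate(ssb, doomed_tags):
+    """Mispredict: drop every store tagged with t or a descendant of t."""
+    for e in ssb:
+        if e.valid and e.btag in doomed_tags:
+            e.valid = False          # never reaches memory
+
+def drain_one(ssb, memory):
+    """Drain pump: commit the oldest drain-ready store, one per call."""
+    for e in ssb:                    # list order is allocation (FIFO) order
+        if e.valid and e.drain_rdy:
+            memory[e.addr] = e.data
+            e.valid = False          # slot returns to the free pool
+            return e
+    return None
+```
+
+Because `drain_one` only looks at entries that are both valid and drain-ready, an invalidated store can never be written, which is the isolation property this subsection relies on.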
+
+#### 11.4.5 Capacity sizing
+
+- **Min capacity:** at peak, all 8 active speculation tags can have stores in flight. Each tag's "store density" is bounded by the speculation window between branches (typically 5–10 instructions, of which 20–30% are stores → ~2 stores per tag per branch).
+- **Sized for:** 8 tags × ~3 stores/tag = 24 entries, matching the v2 SSB capacity.
+- **Stall behavior:** if the SSB fills, dispatch stalls at the next store. This is rare in practice (97th-percentile occupancy is ~12 entries on typical kernels) but the mechanism is correct under any occupancy — the front-end simply waits for an SSB slot to drain.
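+
+The sizing arithmetic can be checked directly. A small Python sketch using only the figures quoted above; the variable names are illustrative, and integer percentages are used to keep the arithmetic exact.
+
+```python
+ACTIVE_TAGS = 8
+window = (5, 10)          # instructions between branches (low, high)
+store_pct = (20, 30)      # share of those instructions that are stores
+
+low = window[0] * store_pct[0] // 100    # 1 store per tag
+high = window[1] * store_pct[1] // 100   # 3 stores per tag
+ssb_entries = ACTIVE_TAGS * high         # sized at the upper bound: 24
+```
+
+The quoted "~2 stores per tag" is the midpoint of this 1 to 3 range, while the 24-entry capacity corresponds to the upper bound.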
+
+### 11.5 Speculative Tile-Store Queue (STQ) — MTE Memory Path
+
+The STQ is the analogue of the SSB for **bulk tile stores** issued by the MTE unit (`TILE.ST`, `TILE.SCATTER`). Already introduced in §8.5.2; this subsection emphasizes the speculation-recovery semantics.
+
+Key differences from SSB:
+
+- **Data does not reside in the STQ.** The 4 KB tile payload stays in TRegFile-4K, referenced by `tile_phys_idx`. The STQ holds only the *intent* (address, source phys-tile, branch tag).
+- **Smaller capacity (8 entries).** Bulk tile stores are infrequent compared to scalar stores.
+- **Drain triggers a memory-bound transfer (8-cy TRegFile read epoch + ~64 cy memory write).** Unlike a single-cycle SSB drain, STQ drains take dozens of cycles and overlap with subsequent MTE operations.
+
+#### 11.5.1 Allocation, drain, invalidation
+
+The flow mirrors §11.4.1–§11.4.4 with these adaptations:
+
+```
+  Allocation: D2 of TILE.ST or TILE.SCATTER → STQ slot
+  Population: at MTE issue, fields {base_addr, tile_phys_idx, stride or scatter_idx_phys} fill in
+  Drain: when btag becomes 0xFF, drain_rdy ← 1; oldest drain_rdy entry begins
+         streaming the tile from TRegFile (8-cy read epoch) to memory (64-cy write)
+  Invalidation: on tag becoming wrong, matching STQ entries set valid ← 0
+                The source physical tile is freed via the normal Tile RAT refcount path
+                (the entry's allocation incremented refcount; invalidation decrements it)
+```
+
+#### 11.5.2 Why 8 entries
+
+- **Min capacity:** 1 STQ entry per active speculation tag = 8.
+- **Worst case:** A kernel with 1 TILE.ST per ~10 instructions issues ~1 STQ entry per ~50 ns at 1.5 GHz; the drain rate is ~1 entry per ~72 cy = ~50 ns. The STQ stays at ~1–2 entries average.
+- **Stall behavior:** STQ-full is rare; when it occurs, dispatch stalls at the next bulk tile store.
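+
+The drain-time figure follows from the two latencies given in §11.5. A short Python check, with illustrative variable names and the 1.5 GHz clock quoted above:
+
+```python
+CLOCK_GHZ = 1.5
+READ_EPOCH_CY = 8     # TRegFile read epoch
+MEM_WRITE_CY = 64     # approximate memory write time
+
+drain_cy = READ_EPOCH_CY + MEM_WRITE_CY   # 72 cycles per STQ entry
+drain_ns = drain_cy / CLOCK_GHZ           # 48.0 ns, i.e. roughly 50 ns
+```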
+
+### 11.6 What about other externally-visible operations?
+
+The full taxonomy of externally-visible side effects in the Davinci-v2 ISA is:
+
+| Operation | External effect | Speculation-safe via |
+|-----------|-----------------|----------------------|
+| Scalar store (SB/SH/SW/SD) | Write to L1-D / L2 / DRAM | **SSB** (§11.4) |
+| TILE.ST | 4 KB write to memory | **STQ** (§11.5) |
+| TILE.SCATTER | Indexed memory write | **STQ** (§11.5) |
+| Scalar load | Reads L1-D, fills physical register; *no external state change* | Recovered via RAT/refcount; load has no externally-visible effect on commit (other than caching, which is not architectural) |
+| TILE.LD, TILE.GATHER | Reads memory into TRegFile-4K; *no external state change* | Recovered via Tile RAT/refcount |
+| FENCE | Memory ordering barrier | Allocated at D2; **does not execute until btag = 0xFF**. On mispredict, flushed like any other RS entry. |
+| AMO atomics (future) | Read-modify-write to memory | Would need to allocate an SSB-like slot held until non-speculative; fits naturally in the SSB. |
+| Branch resolve | Updates predictor tables | Predictor update is **conditional on branch correctness**: predictor write fires only when the branch tag becomes "correct" (i.e., on the same drain trigger as SSB/STQ). |
+| TCB / CDB broadcast | Wakes dependent RS entries | Internal to core; not externally visible. Mispredicted broadcasts are absorbed by the RS branch-tag CAM-clear. |
+| Performance-counter increment | Updates a CSR | Counters are explicitly architectural; v2 reuses the FENCE-style "execute-only-when-non-speculative" gate (§11.6.1). |
+| MMIO load/store (future) | I/O side effect | Would need explicit speculation barrier; recommended to gate at the SSB level with an "MMIO" qualifier that forces in-order completion. |
+
+#### 11.6.1 FENCE and CSR semantics under speculation
+
+```
+  D2 (rename) of FENCE / CSRRW:
+    Allocate RS entry as usual; stamp btag
+  
+  IS (issue):
+    The instruction is held in RS until btag == 0xFF (i.e. no older
+    speculation). This is a small extension to the issue-ready
+    predicate: in addition to "all source operands ready", we add
+    "btag is non-speculative".
+  
+  EX:
+    Execute as normal. By construction, the instruction is on the
+    correct path.
+```
+
+This adds latency to FENCE / CSR ops (they wait for older branches to resolve) but is correct. In practice, FENCE is rare in tight kernels.
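+
+The extended issue-ready predicate can be sketched as follows. This is a minimal Python model, not the RTL; the `RsEntry` record and its `serializing` flag are hypothetical names introduced only for illustration, and `0xFF` is the non-speculative `btag` value used in the text above.
+
+```python
+from dataclasses import dataclass
+from typing import Tuple
+
+NON_SPECULATIVE_BTAG = 0xFF  # btag value meaning "no older speculation"
+
+
+@dataclass
+class RsEntry:
+    """Hypothetical RS entry view; field names are illustrative only."""
+    src_ready: Tuple[bool, ...]  # one ready bit per source operand
+    btag: int                    # branch tag stamped at D2
+    serializing: bool            # True for FENCE / CSRRW
+
+
+def issue_ready(entry: RsEntry) -> bool:
+    """Ordinary ops need ready operands; FENCE/CSR also need btag == 0xFF."""
+    operands_ready = all(entry.src_ready)
+    if entry.serializing:
+        return operands_ready and entry.btag == NON_SPECULATIVE_BTAG
+    return operands_ready
+```
+
+An ordinary ALU op with `btag = 3` may issue as soon as its operands are ready, while a FENCE carrying the same tag stays in the RS until the tag resolves to `0xFF`.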
+
+#### 11.6.2 Cache-line state changes (loads)
+
+Speculative loads can pull cache lines into L1-D / L2 that would not have been pulled on the correct path. This is a **microarchitectural** effect (cache-state pollution), not an architectural one: it does not violate program semantics. Modern OoO processors all share this property, and the well-known Spectre-class side-channel concerns apply equally to v1 (speculative loads were already permitted because they only fill physical registers, not memory). Protection against side-channel leakage is **out of scope** for this document; standard mitigations (cache partitioning, speculation barriers) are orthogonal to the no-ROB recovery scheme.
+
+### 11.7 What this scheme does NOT provide (and why that's OK)
+
+The branch-tag + SSB + STQ scheme guarantees: **no misspeculated instruction's externally-visible effect ever reaches memory or any architecturally-visible state, regardless of how deep the speculation goes (up to 8 levels) or how many instructions execute speculatively.**
+
+It does **not** provide:
+
+1. **Precise exceptions.** If a load page-faults (in a hypothetical paged Davinci variant), the processor cannot identify the program-order point of the fault — only the renamed-register view. This is the same v1 limitation. The kernel-execution envelope assumes faults don't happen mid-kernel; any fault is treated as a fatal error.
+2. **Precise hardware breakpoints / single-step.** Without ordered retirement, hitting a breakpoint at instruction *i* may have already executed a few instructions past *i*. Debug support is degraded but not broken: breakpoints fire at the granularity of the ROB-less retirement cluster (~10-instruction window).
+3. **Strict in-order memory ordering for I/O semantics.** Stores still drain to L1-D in store-buffer-FIFO order, so total-store-order (TSO) within a single thread is preserved; but if the core were extended to support SC (sequential consistency) across threads, additional ordering machinery would be needed.
+4. **Recovery from non-branch misspeculation.** Memory-disambiguation misspeculations (where a younger load reads from L1-D before an older store's address resolves, then the older store's address turns out to alias) are not handled — the LSU's address-disambiguation logic still requires older stores' addresses to be known before the load issues, as in v1. **Memory dependence prediction is not implemented.** Removing this restriction is an orthogonal extension that would require an additional tag for "load misspeculation" similar to the branch tag.
+
+These limitations are consistent with the v1 design envelope (run-to-completion AI kernel).
+
+### 11.8 Comparison with a ROB-based design
+
+| Aspect | ROB-based design | Davinci-v2 (no ROB) |
+|--------|------------------|---------------------|
+| Recovery for register / tile state | ROB walks back, undoes mappings in order | **MapQ reverse replay + SMAP <- CMAP + refcount free**; instruction-precise, more efficient |
+| Recovery for memory stores | Stores in ROB / coupled SQ; release on retire | **SSB / STQ** with branch-tag gating |
+| Recovery for fences / CSR | ROB serializes; instruction retires in order | Issue gated on `btag = 0xFF`; correct but adds 0–6 cy latency |
+| Speculative depth | bounded by ROB capacity (typically 64–256 entries) | bounded by **8 active branch tags** (≈ 8-deep nested branches) |
+| Storage overhead | ROB ≈ 256 entries × ~150 b = ~5 KB + retirement logic | Branch-tag tracker (5 K gate) + SSB/STQ (~2.5 KB) + MapQ (144 B) + Ready Table (16 B) ≈ **~3 KB total** |
+| Wakeup logic | RS does dependency tracking; ROB independent | **Ready Table (128-bit bitmap) replaces CDB comparators**: 0 comparators vs. 384 for scalar RS; O(1) ptag lookup |
+| Mispredict penalty | ROB walk-back + flush ≈ 5–10 cy | **MapQ replay + Ready Table reset + IQ CAM-clear + SSB/STQ flush**: all parallel in 1 cy + 6-cy refill = 7 cy |
+| Precise exceptions | yes (free) | block-granularity via BROB (§11.11) |
+| Single-thread TSO memory ordering | yes | yes (FIFO drain through SSB) |
+
+**The key insight:** in environments where precise exceptions are not required (the AI-kernel envelope), the ROB's three bundled services unbundle naturally. Service (1) is free if you don't need it. Service (3) is replaced by reference counting. Service (2) — **the only remaining service** — is implemented by the SSB + STQ + branch-tag tracker at a fraction of a ROB's cost.
+
+**v2.3 extends this:** by lifting exception handling to block granularity via BROB, the design achieves the ROB's service (1) for kernel-entry traps without a full flat ROB. The block boundary is the commit point; the faulting block is identified; younger blocks are squashed; MapQ reverse replay recovers register state to the faulting instruction.
+
+### 11.9 Cycle-by-cycle example: speculative store followed by mispredict (BCC Scalar Pipeline)
+
+> **Block-ROB note:** This example uses the pre-BROB model. For cycle-by-cycle examples with BROB and block-level commit, see §11.11.
+
+```
+  Cycle  Action
+  ───  ────────────────────────────────────────────────────────────────────────────
+   0     Branch B1 enters D1; allocated tag t=3, state[3] = speculative
+         B1 pushes MapQ entry {atag=B1_dst, old_ptag, new_ptag, rid=B1.rid}
+   1     SD X5, [X8]+0    — younger than B1 — D2 allocates SSB[7] tagged 3
+         MapQ push: {atag=X5, old=P5, new=P80, rid=next}
+   2     SD X6, [X8]+8    — younger than B1 — D2 allocates SSB[8] tagged 3
+   3     SD X7, [X8]+16   — younger than B1 — D2 allocates SSB[9] tagged 3
+   4-8   ... more instructions dispatched to physical IQs ...
+   9     B1 reaches EX1: mispredicted!  flush_rid = B1.rid; flush_btag = 3
+  10     MapQ reverse replay (all in parallel):
+           (1) SMAP restored to CMAP (entries with rid > flush_rid undone)
+           (2) Physical IQ CAM-clear: entries with checkpoint_id > flush_btag invalidated
+           (3) Ready Table: mask <= ALL_ONES (conservative reset)
+           (4) SSB[7..9] valid ← 0
+           (5) Free list head restored from CMAP state
+  11     Fetch redirected to correct branch target
+  12-16  Front-end refill (5 cy)
+  17     First correct-path instruction enters EX1 (total mispredict penalty = 7 cy)
+
+  Architectural state at cycle 17:
+    Memory: NEVER wrote SSB[7..9]. Cache lines unaffected. Correct.
+    SMAP == CMAP: MapQ replay restored all speculative renames.
+    P80, P81, P82, P83: freed (orphan + refcount=0 after replay).
+    Ready Table: all bits reset; instructions re-query on next cycle.
+```
+
+**Key BCC scalar pipeline differences:**
+
+- MapQ entries are pushed at D2 for each P-dst rename, carrying `{atag, old_ptag, new_ptag, rid}`.
+- On flush, MapQ reverse replay (max 12 iterations) restores SMAP to exact CMAP state.
+- Physical IQ entries are CAM-cleared by `checkpoint_id` (MapQ entry ID), not just `btag`.
+- Ready Table reset ensures no stale ready bits survive the flush.
+
+The mispredict is recovered in 7 total cycles. The MapQ replay runs in parallel with the IQ CAM-clear and Ready Table reset — all within the single recovery cycle.
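+
+The register-state part of this recovery can be sketched in a few lines of Python. This is a behavioural model under stated assumptions, not the hardware: the dictionary-based `smap`, the list-based `free_list`, and the function name are hypothetical; RID wrap-around is ignored; and returning `new_ptag` straight to the free list simplifies the orphan/refcount bookkeeping described above.
+
+```python
+from collections import deque
+
+
+def mapq_reverse_replay(smap, mapq, free_list, flush_rid):
+    """Undo every rename younger than flush_rid, youngest first.
+
+    Each MapQ entry carries {atag, old_ptag, new_ptag, rid} as in the text.
+    Returns the number of entries undone.
+    """
+    undone = 0
+    while mapq and mapq[-1]["rid"] > flush_rid:
+        entry = mapq.pop()                        # pop from the tail
+        smap[entry["atag"]] = entry["old_ptag"]   # restore the old mapping
+        free_list.append(entry["new_ptag"])       # simplified: free at once
+        undone += 1
+    return undone
+
+
+# Example: the branch has rid 10; two younger renames must be undone.
+smap = {"X4": 79, "X5": 80, "X6": 81}
+mapq = deque([
+    {"atag": "X4", "old_ptag": 4, "new_ptag": 79, "rid": 9},
+    {"atag": "X5", "old_ptag": 5, "new_ptag": 80, "rid": 11},
+    {"atag": "X6", "old_ptag": 6, "new_ptag": 81, "rid": 12},
+])
+free_list = []
+undone = mapq_reverse_replay(smap, mapq, free_list, flush_rid=10)
+```
+
+The loop is sequential here for readability; the design performs the equivalent work within the single recovery cycle.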
+
+### 11.10 Hardware cost summary
+
+| Block | BCC scalar pipeline change | Gate count |
+|-------|--------------------------|------------|
+| Branch-tag tracker (state vector + ancestry bitmap + FSM) | unchanged | ~5 K |
+| Speculative Store Buffer (24 entries, 182 b each) | unchanged | ~80 K |
+| Speculative Tile-Store Queue (8 entries, 110 b each) | unchanged | ~12 K |
+| Branch-tag stamping in IQ entries | unchanged (now across 3 IQs) | ~2 K |
+| MapQ (12-entry ring buffer, ~96 b/entry) | **replaces RAT checkpoint store** | ~1.5 K |
+| Ready Table (128-bit bitmap + control) | **new** | ~1 K |
+| CDB comparator reduction | **384 -> 0** (Ready Table replaces all) | ~-50 K (saves area) |
+| Tile Metadata RAT (256 × 32 b SRAM) | unchanged | ~10 K |
+| **Total v2 BCC speculation hardware** | | **~113 K gate** |
+
+**v2.3 Block-ROB new hardware (added on top of v2 BCC):**
+
+| Block | Change | Gate count |
+|-------|--------|------------|
+| BROB (128 entries x ~120 b) | **new** | ~150 K |
+| Block SSB (32 entries x ~200 b) | **new** | ~60 K |
+| Block STQ (16 entries x ~100 b) | **new** | ~20 K |
+| BID tagging in iROB / GVIQ / IQ / SSB / STQ | **new** | ~8 K |
+| BROB allocate FSM + complete check | **new** | ~20 K |
+| Exception delivery logic | **new** | ~10 K |
+| **Total v2.3 Block-ROB hardware** | | **~268 K gate** |
+| **Total v2.3 with Block-ROB** | | **~381 K gate** |
+
+The v2 BCC speculation hardware is **~3.5%** of the ~3.26 mm² total core area, the same as v2 with RAT checkpoints. The Ready Table (~1 K gate) and MapQ (~1.5 K gate) add negligible area. The key wins are the **CDB comparator elimination** (~50 K gate saved) and the **IQ split** (simpler, more scalable). The net gate count for the scalar wakeup/issue path is approximately equal to, or slightly lower than, that of v1.
+
+---
+
+## 11.11 Block-ROB -- Block-Granularity Precise Exception Support (new in v2.3)
+
+### 11.11.1 Motivation
+
+The pre-BROB Davinci-v2 model explicitly excludes precise exception support, treating all faults as fatal. Block-ROB relaxes this to **block-granularity precise exceptions**:
+
+- The faulting instruction block is identified.
+- All younger blocks are squashed (BID-order flush).
+- Register state is recovered via MapQ reverse replay from the faulting RID.
+- Memory side effects within squashed blocks are discarded (SSB/STQ invalidation).
+- After OS/kernel handler restores context, the faulting block is re-executed.
+
+The block boundary (BSTART to BSTOP) is the commit point. This matches the design principle from LinxCore: block structure enables natural ROB-bounded commit without a flat instruction-level ROB.
+
+### 11.11.2 Instruction Block Definition
+
+An **instruction block** is a contiguous sequence of decoded micro-operations bounded by:
+
+- **BSTART** (inclusive start): first uop in the block; triggers BROB entry allocation.
+- **BSTOP** (inclusive end): last uop in the block; gates retirement.
+
+Block boundaries are compiler-generated at natural control-flow join points. Block size: 4-64 uops (typical AI kernel: 16-32 uops).
+
+**Block types:**
+
+| Block Type | Scalar-only | Engine-backed | Notes |
+|------------|-------------|---------------|-------|
+| `STD` | Yes | No | Pure scalar execution |
+| `VTG` | Yes | VTG micro-instructions | GVIQ sub-schedule within block |
+| `VEC` | No | Full-tile VEC-4K-v2 | `T*` tile operations |
+| `CUBE` | No | outerCube MXU | CUBE.OPA, CUBE.DRAIN |
+| `MTE` | Yes | Memory Tile Engine | TILE.LD, TILE.ST |
+
+### 11.11.3 Block ID (BID)
+
+Each block receives a **64-bit BID** at BSTART:
+
+```
+BID[7:0]  -- BROB slot index (0..127)
+BID[63:8] -- Monotonically increasing sequence number
+```
+
+The 8-bit slot index directly maps to the BROB entry. Full-width BID enables flush by ordering: **keep `bid <= flush_bid`, kill `bid > flush_bid`**.
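+
+A small Python sketch of the BID layout and the flush comparison, assuming the field positions given above (`BID[7:0]` slot, `BID[63:8]` sequence). The helper names are hypothetical. Because the sequence number occupies the upper bits, a plain unsigned compare on the full BID orders blocks by age even when the slot index wraps.
+
+```python
+SLOT_BITS = 8
+SLOT_MASK = (1 << SLOT_BITS) - 1
+BID_MASK = (1 << 64) - 1
+
+
+def make_bid(seq_num: int, slot: int) -> int:
+    """Pack {seq_num, slot[7:0]} into a 64-bit BID."""
+    return ((seq_num << SLOT_BITS) | (slot & SLOT_MASK)) & BID_MASK
+
+
+def bid_slot(bid: int) -> int:
+    """BROB slot index, BID[7:0]."""
+    return bid & SLOT_MASK
+
+
+def bid_seq(bid: int) -> int:
+    """Monotonic sequence number, BID[63:8]."""
+    return bid >> SLOT_BITS
+
+
+def survives_flush(bid: int, flush_bid: int) -> bool:
+    """Keep bid <= flush_bid, kill bid > flush_bid."""
+    return bid <= flush_bid
+
+
+# The slot wraps from 127 back to 0, yet the younger block still compares larger.
+older = make_bid(seq_num=127, slot=127)
+younger = make_bid(seq_num=128, slot=0)
+```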
+
+### 11.11.4 BROB Structure
+
+| Parameter | Value |
+|-----------|-------|
+| `BROB_ENTRIES` | 128 |
+| `BROB_ALLOC_PER_CYCLE` | 1 |
+| `BROB_COMPLETE_PER_CYCLE` | 1 |
+| `BROB_RETIRE_PER_CYCLE` | 1 |
+| `BID_W` | 8 b (slot) + 56 b (sequence) |
+
+**Per-BROB-entry state:**
+
+```
+BROBEntry {
+  valid:           1 b     -- entry is allocated
+  state:           2 b     -- ALLOC | ISSUED | COMPLETE
+  bid:             64 b    -- full-width Block ID
+  block_type:      4 b     -- STD | VTG | VEC | CUBE | MTE
+  head_rid:        7 b     -- RID of first uop (BSTART's iROB slot)
+  tail_rid:        7 b     -- RID of last uop (BSTOP's iROB slot)
+  n_uops:          6 b     -- number of uops in block (1..64)
+  checkpoint_id:   4 b     -- RAT checkpoint active for this block
+  needs_scalar:    1 b     -- block has scalar uops (BSTOP must retire)
+  needs_engine:    1 b     -- block has engine ops (GVIQ/Vector/Cube RS)
+  engine_done:     1 b     -- engine completion signal received
+  scalar_done:     1 b     -- BSTOP retired from iROB
+  has_exception:   1 b     -- exception detected within this block
+  exception_cause: 16 b    -- trap / exception cause code
+  fault_rid:       7 b     -- RID of faulting uop (if has_exception)
+  n_stores:        5 b     -- number of scalar stores in this block
+  n_vtg_ops:       5 b     -- number of VTG micro-instructions
+  block_ssb_base:  5 b     -- index into Block SSB RAM for first store
+  block_stq_base:  4 b     -- index into Block STQ RAM for first tile store
+}
+```
+
+**State machine:**
+
+```
+FREE --[allocate]--> ALLOC --[dispatched]--> ISSUED --[complete]--> COMPLETE
+                                                                   |
+                                                            [retire: advance head]
+                                                                   |
+                                                                  FREE
+```
+
+**Completion rule:**
+
+```
+complete = scalar_done && (needs_engine ? engine_done : 1)
+```
+
+### 11.11.5 Instruction Block Lifecycle
+
+**BSTART at D2:**
+1. Allocate BROB entry `k` from free pool (tail pointer).
+2. Set `bid = {seq_num++, k[7:0]}`.
+3. Set `block_type` from BSTART metadata.
+4. Set `checkpoint_id` = current RAT checkpoint snapshot.
+5. Set `head_rid` = current iROB head.
+6. Set `needs_scalar = 1`, `needs_engine = 0`, `scalar_done = 0`, `engine_done = 0`, `has_exception = 0`.
+7. Stamp all uops in block with `bid` (stored alongside `branch_tag` in IQ/GVIQ/iROB entries).
+8. BSTART retires immediately (bypasses IQ, EX, WB).
+
+**Subsequent uops (D3):**
+1. Allocate iROB entry; stamp `bid`.
+2. Set `iROB[rid].brob_slot = k`.
+3. Increment `n_uops`.
+4. If `is_store`: allocate Block SSB slot, increment `n_stores`.
+5. If `is_vtg_op`: increment `n_vtg_ops`; set `needs_engine = 1`.
+6. Execute normally through BCC pipeline.
+
+**BSTOP at D2:**
+1. Set `tail_rid` = current iROB entry index.
+2. Set `needs_engine = (n_vtg_ops > 0) || (block_type == VEC) || (block_type == CUBE)`.
+3. BSTOP enters iROB but **retirement is gated** (see below).
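+
+The lifecycle steps above can be modelled roughly as follows. This is an illustrative Python sketch, not the allocator RTL: the function names are hypothetical, only a subset of the BROB fields is tracked, and counting just the uops between BSTART and BSTOP in `n_uops` is an assumption.
+
+```python
+from dataclasses import dataclass
+
+ENGINE_BLOCK_TYPES = {"VEC", "CUBE"}
+
+
+@dataclass
+class BrobEntry:
+    bid: int
+    block_type: str
+    head_rid: int
+    tail_rid: int = -1
+    n_uops: int = 0
+    n_stores: int = 0
+    n_vtg_ops: int = 0
+    needs_scalar: bool = True
+    needs_engine: bool = False
+    scalar_done: bool = False
+    engine_done: bool = False
+    has_exception: bool = False
+
+
+def on_bstart(seq_num, slot, block_type, rid):
+    """BSTART: allocate the entry and form bid = {seq_num, slot[7:0]}."""
+    return BrobEntry(bid=(seq_num << 8) | (slot & 0xFF),
+                     block_type=block_type, head_rid=rid)
+
+
+def on_uop(entry, is_store=False, is_vtg_op=False):
+    """Each subsequent uop updates the per-block counters."""
+    entry.n_uops += 1
+    if is_store:
+        entry.n_stores += 1
+    if is_vtg_op:
+        entry.n_vtg_ops += 1
+        entry.needs_engine = True
+
+
+def on_bstop(entry, rid):
+    """BSTOP: record the tail RID and finalise needs_engine."""
+    entry.tail_rid = rid
+    entry.needs_engine = (entry.n_vtg_ops > 0
+                          or entry.block_type in ENGINE_BLOCK_TYPES)
+
+
+def is_complete(entry):
+    """complete = scalar_done && (needs_engine ? engine_done : 1)."""
+    return entry.scalar_done and (entry.engine_done if entry.needs_engine else True)
+```
+
+A `STD` block with no VTG ops completes as soon as `scalar_done` is set, whereas a `VTG`, `VEC`, or `CUBE` block additionally waits for `engine_done`.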
+
+### 11.11.6 BSTOP Retire Gate and Block Completion
+
+The iROB commit logic is extended with a **BSTOP retire gate**:
+
+```
+BSTOP can retire when ALL of:
+  1. BROB[bid_slot].state == COMPLETE
+  2. !BROB[bid_slot].has_exception
+
+On BSTOP retire:
+  1. Set scalar_done = 1 in BROB[bid_slot]
+  2. If complete && !has_exception: advance BROB head to k+1
+  3. If complete && has_exception: trigger exception delivery
+```
+
+**Engine completion:** Engines (VEC-4K-v2, Cube, MTE LSU, GVIQ) signal `engine_done` to the BROB via the existing TCB (Tile Completion Bus) with `bid` in the response. On match: `BROB[bid_slot].engine_done = 1`.
+
+### 11.11.7 Block Retire
+
+Only the **oldest block** (BROB head) retires per cycle:
+
+```
+1. If head.has_exception:
+     Report exception (see §11.11.9)
+     Do NOT commit side effects
+     Squash all younger blocks (bid > head.bid)
+2. Else if head.state == COMPLETE:
+     Commit side effects:
+       a. Transfer Block SSB entries to SSB (drain_rdy = 1, btag = 0xFF)
+       b. Transfer Block STQ entries to STQ (drain_rdy = 1, btag = 0xFF)
+       c. Advance head pointer
+       d. Free BROB entry
+3. Else: stall (wait for completion)
+```
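+
+The three-way retire decision can be sketched as below. This is a simplified Python model under stated assumptions: the BROB is a plain list ordered oldest-first, Block SSB entries carry the full `bid` rather than the 8-bit slot, and the Block STQ (which would be handled the same way) is omitted. The function name is hypothetical.
+
+```python
+NON_SPEC_BTAG = 0xFF
+
+
+def retire_head(brob, block_ssb, ssb):
+    """Process the oldest block once; return 'exception', 'commit', 'stall' or 'empty'."""
+    if not brob:
+        return "empty"
+    head = brob[0]
+    if head["has_exception"]:
+        # Do not commit side effects; squash every younger block.
+        flush_bid = head["bid"]
+        brob[:] = [b for b in brob if b["bid"] <= flush_bid]
+        block_ssb[:] = [s for s in block_ssb if s["bid"] <= flush_bid]
+        return "exception"
+    if head["state"] == "COMPLETE":
+        # Hand this block's stores to the SSB as non-speculative and drainable.
+        for store in [s for s in block_ssb if s["bid"] == head["bid"]]:
+            ssb.append({**store, "btag": NON_SPEC_BTAG, "drain_rdy": True})
+        block_ssb[:] = [s for s in block_ssb if s["bid"] != head["bid"]]
+        brob.pop(0)  # advance head and free the entry
+        return "commit"
+    return "stall"
+```
+
+Calling `retire_head` once per cycle mirrors the one-block-per-cycle retire bandwidth (`BROB_RETIRE_PER_CYCLE = 1`).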
+
+### 11.11.8 Precise Exception Mechanism
+
+**Exception detection:**
+- **Scalar exception**: EX1 stage sets `iROB[rid].trap_valid = 1`.
+- **Engine exception**: TCB response arrives with `trap_valid=1`; BROB marks `has_exception=1`, `fault_rid=faulting_rid`.
+
+**Exception reporting flow:**
+
+```
+Step 1: Detection
+  Scalar: EX1 sets iROB[rid].trap_valid = 1
+  Engine: TCB arrives with trap_valid=1
+
+Step 2: Blocking
+  BROB does NOT retire the block
+  BSTOP retire is blocked (has_exception == TRUE)
+
+Step 3: Squash of Younger Blocks
+  flush_bid = BROB[head].bid
+  In parallel (1 cycle):
+    a. iROB: invalidate entries with bid > flush_bid
+    b. BROB: set valid = 0 for entries with bid > flush_bid
+    c. GVIQ: invalidate entries with bid > flush_bid
+    d. IQ: invalidate entries with bid > flush_bid
+    e. SSB: valid = 0 for entries with bid > flush_bid
+    f. STQ: valid = 0 for entries with bid > flush_bid
+    g. Block SSB: invalidate entries with bid > flush_bid
+    h. Block STQ: invalidate entries with bid > flush_bid
+
+Step 4: Register State Recovery
+  MapQ reverse replay from faulting RID backward:
+    for each MapQ entry from tail down to faulting RID:
+      undo SMAP write, restore orphan ptag, pop MapQ
+  Tile RAT: restore from BROB[head].checkpoint_id
+
+Step 5: Exception Delivery
+  EPC   = BSTART_PC (of faulting block)
+  Cause = BROB[head].exception_cause
+  Fault RID = BROB[head].fault_rid
+  OS/kernel handler restores context and re-executes the block.
+```
+
+**Within-block instruction precision:** MapQ already provides instruction-precise P-reg recovery. On exception, MapQ is replayed in reverse from `fault_rid` (captured at detection), not from the block boundary. The faulting uop is precisely identified and all younger uops in the same block are undone.
+
+### 11.11.9 Worked Example: Page Fault in Block
+
+```
+Block B: BSTART, u0 (ADD r1, r2, r3), u1 (TILE.LD), u2 (MUL r6, r4, r7), BSTOP
+
+u1 executes: TILE.LD triggers page fault.
+  LSU sets iROB[rid1].trap_valid = 1
+  LSU sets BROB[5].has_exception = 1
+  LSU sets BROB[5].fault_rid = rid1
+  LSU sets BROB[5].exception_cause = PAGE_FAULT
+
+BSTOP cannot retire (blocked on has_exception).
+Block B is at BROB head, blocked.
+
+Next cycle:
+  flush_bid = B.bid  (no younger blocks)
+  MapQ replay from rid1 backward:
+    undo SMAP writes from u1, u0
+    restore ptags for r4, r1
+  Tile RAT restore from checkpoint_id = 3
+  Deliver: EPC = BSTART_PC, Cause = PAGE_FAULT
+  OS handler restores context.
+  Block B is re-fetched and re-executed after handler returns.
+```
+
+### 11.11.10 Store Commit Within Blocks
+
+**Block SSB:** 32-entry structure shared across BROB entries. Each entry tracks a scalar store within a block.
+
+**Block SSB entry:**
+
+```
+valid:     1 b   -- entry is valid
+addr:     40 b   -- cache-line address (filled at EX1)
+data:    128 b   -- store data (filled at EX2)
+size:      3 b   -- 1/2/4/8 B
+bid:       8 b   -- which block this store belongs to
+ssb_idx:   5 b   -- mapped SSB slot index
+```
+
+**Load forwarding within block:** Loads forward from Block SSB entries in the same block without BID ordering checks (Block SSB only contains stores from this block, which are already program-ordered).
+
+**Store commit at block retire:** All Block SSB entries for the retiring block are transferred to the SSB with `btag=0xFF` and `drain_rdy=1`. They drain to L1-D in program order via the existing SSB drain pump.
+
+**Block STQ:** Analogous to Block SSB for tile stores (TILE.ST, TILE.SCATTER). Tile data stays in TRegFile-4K; Block STQ holds the intent (address, source phys-tile, bid).
+
+### 11.11.11 Integration with Existing Infrastructure
+
+**MapQ:** Fully reused. Each renamed destination pushes a MapQ entry with `{arch_reg, old_ptag, new_ptag, RID, checkpoint_id}`. On exception, MapQ reverse replay from `fault_rid` recovers P-reg state. Unchanged.
+
+**Branch-tag tracker:** Fully reused. Each block receives a branch tag at BSTART. Branch-tag CAM-clear is extended to flush by `bid > flush_bid`; the tracker is otherwise unchanged.
+
+**RAT checkpoints:** Extended with `checkpoint_id` per BROB entry. On exception: Tile RAT restored from `BROB[head].checkpoint_id`. Scalar RAT recovered via MapQ reverse replay (already instruction-precise).
+
+**SSB/STQ:** Extended with `bid` field (8 b per entry). At D2: `SSB[idx].bid = current_bid`. At block retire: `SSB[idx].btag = 0xFF`, `drain_rdy = 1`. At flush: entries with `bid > flush_bid` are invalidated.
+
+**VTG / GVIQ:** GVIQ entries are stamped with `bid`. GVIQ issue is gated by `block_complete = (BROB[bid_slot].engine_done || !BROB[bid_slot].needs_engine)`. The GVIQ rotation scheduler is unchanged.
+
+### 11.11.12 Flush Protocol Summary
+
+```
+flush_bid = BROB[head].bid
+
+In parallel (1 cycle):
+  a) iROB: invalidate entries with bid > flush_bid
+  b) BROB: valid = 0 for entries with bid > flush_bid; tail advances
+  c) GVIQ: invalidate entries with bid > flush_bid
+  d) IQ: invalidate entries with bid > flush_bid
+  e) SSB: valid = 0 for entries with bid > flush_bid
+  f) STQ: valid = 0 for entries with bid > flush_bid
+  g) Block SSB: invalidate entries with bid > flush_bid
+  h) Block STQ: invalidate entries with bid > flush_bid
+  i) MapQ: pop entries from flush_rid+1 backward (undo SMAP writes)
+  j) Tile RAT: restore from BROB[flush_bid_slot].checkpoint_id
+  k) Scalar RAT: flash-restore from checkpoint (unchanged)
+  l) Branch-tag tracker: free tags for flushed blocks
+```
+
+## 12. Memory Subsystem
+
+> **(v1 → v2: §12.1 / §12.3 / §12.4 / §12.5 are copied in full from v1 §11. The v2 delta is concentrated in the §12.2 store path, which upgrades v1's 16-entry store buffer to a 24-entry SSB and adds the STQ.)**
+
+The memory subsystem is structurally identical to v1 §11. The two changes are integration points for the SSB and STQ.
+
+### 12.1 Cache Hierarchy (v1 §11.1, unchanged)
+
+```
+  ┌────────────┐    ┌────────────┐
+  │  L1-I      │    │  L1-D      │
+  │  64 KB     │    │  64 KB     │
+  │  4-way     │    │  4-way     │
+  │  2-cy lat  │    │  4-cy lat  │
+  └─────┬──────┘    └─────┬──────┘
+        │                 │
+        └────────┬────────┘
+                 ▼
+        ┌────────────────┐
+        │  L2 (Unified)  │
+        │  512 KB        │
+        │  8-way         │
+        │  12-cy lat     │
+        └───────┬────────┘
+                │
+                ▼
+        External Bus / NoC
+```
+
+| Cache | Size | Associativity | Line size | Latency | Ports | MSHRs |
+|-------|------|---------------|-----------|---------|-------|-------|
+| L1-I | **64 KB** | 4-way | 64 B | **2** cycles | 1 read (fetch) | 4 |
+| L1-D | **64 KB** | 4-way | 64 B | **4** cycles | 1 read + 1 write (LSU) | 8 |
+| L2 | **512 KB** | 8-way | 64 B | **12** cycles | 1 read + 1 write | 16 |
+
+### 12.2 Store Path (v2 delta)
+
+```
+  Scalar store:      LSU-RS → SSB (24 entries) → L1-D (only on tag-clear)
+  Bulk tile store:   MTE-RS → STQ (8 entries)  → MTE memory pipeline → L2 (only on tag-clear)
+```
+
+The L1-D's existing 8 MSHRs and 4-cy store pipeline are unchanged. The SSB sits in front of L1-D as a CAM-addressable forwarding buffer, the same role that v1's 16-entry store buffer played, so the L1-D interface is unchanged. v2 widens the buffer to 24 entries and adds branch-tag gating (full design in §11.4).
+
+| Property | v1 store buffer | v2 SSB |
+|----------|-----------------|--------|
+| Entries | 16 | **24** |
+| Forwarding | yes | yes (now btag-aware §11.4.3) |
+| Branch-tag gating | — | **yes (§11.4)** |
+| Drain to L1-D | OoO upon resolve | only when btag = `0xFF` |
+
+### 12.3 TLBs (v1 §11.2, unchanged)
+
+| TLB | Entries | Associativity | Page sizes | Miss penalty |
+|-----|---------|---------------|------------|-------------|
+| I-TLB | **64** | Fully assoc | 4 KB, 2 MB | L2 TLB lookup |
+| D-TLB | **64** | Fully assoc | 4 KB, 2 MB | L2 TLB lookup |
+| L2 TLB (unified) | **512** | 8-way | 4 KB, 2 MB, 1 GB | Page table walk |
+
+### 12.4 MTE Memory Path (v1 §11.4, unchanged)
+
+The MTE unit has a **high-bandwidth path** to the L2 cache (and external memory) for tile data transfers, separate from the scalar LSU path through L1-D.
+
+```
+  MTE ──▶ L2 Cache (512 KB) ──▶ External Memory
+           64 B/cy sustained bandwidth
+           1 cache line per cycle
+           1 tile (4 KB) = 64 cache lines = 64 cycles from L2
+```
+
+| Parameter | Value |
+|-----------|-------|
+| MTE → L2 bandwidth | **64 B/cycle** (1 cache line/cycle) |
+| Tile load from L2 (hit) | **64 cycles** per tile (4 KB / 64 B) |
+| Tile load from external memory | **200–400 cycles** per tile (DRAM dependent) |
+| Outstanding MTE requests | **32** (deep buffer for memory-level parallelism) |
+| Prefetch support | MTE RS can issue TILE.LD early, buffering data in TRegFile |
+
+The MTE unit exploits the large TRegFile-4K (256 tiles, 1 MB) as a **software-managed scratchpad**. Programmers (or the compiler) schedule TILE.LD instructions well ahead of CUBE.OPA to hide memory latency. The 32-entry outstanding request buffer allows many tile loads to be in flight simultaneously, maximizing bandwidth utilization.
+
+**v2 delta:** `TILE.ST` and `TILE.SCATTER` traffic on this path is gated by the 8-entry STQ (§11.5).
+
+### 12.5.1 VTG Vector Load/Store (v2.2)
+
+> **(Change Point #2 -- hardware-revised)**
+
+VTG vector memory operations load or store 256 B or 512 B VTG payloads under predicate control. VTG memory ops share the LSU pipeline with MTE. VTG loads perform a full-tile RMW on writeback (16 cy minimum, same as ALU ops).
+
+**Vector Load:**
+
+```text
+VLD.F32  T8.g0, [Xbase + Xoff], T8.p0
+```
+
+Flow:
+1. Read loop/thread counters from GVIQ entry prefix (`iter0..iter3`)
+2. Compute effective address from scalar operands, immediate, and loop counters
+3. LSU fetches active lanes from memory (inactive lanes skipped)
+4. LSU builds 256 B or 512 B VTG payload
+5. LSU submits VTG payload to Group Write Adapter
+6. Group Write Adapter: full-tile RMW (16 cy minimum): read old tile, merge VTG payload into sub-range, write merged tile back
+7. Update VTG metadata for `T8.g0` (`valid=1`, `defined=1`, `dirty=1`, set after writeback complete)
+
+**Inactive-lane fault suppression:** VTG loads/stores MUST NOT fault for predicate-inactive lanes. The LSU checks `active_lanes` and the predicate VTG before performing address calculation for each lane. Faulting addresses in inactive lanes are suppressed and do not generate exceptions.
+
+**Vector Store:**
+
+```text
+VST.F32  T8.g2, [Xbase + Xoff], T8.p0
+```
+
+Inactive lanes MUST NOT write memory and MUST NOT fault for invalid inactive-lane addresses.
+
+**Strided and Gather:**
+
+| Instruction | Syntax | Operation |
+|-------------|--------|-----------|
+| `VLDSTRIDE` | `VLDSTRIDE.type Td, Xbase, Xstride, Xcount, Tp` | `Td[i] = mem[Xbase + i*Xstride]` |
+| `VSTSTRIDE` | `VSTSTRIDE.type Ts, Xbase, Xstride, Xcount, Tp` | `mem[Xbase + i*Xstride] = Ts[i]` |
+| `PGATHER` | `PGATHER.type Tpd, [Xbase + Ts*esize], Tp` | Predicate gather: `Tpd[i] = mem[Xbase + Ts[i]*esize]` |
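+
+The per-lane semantics in the table, combined with the inactive-lane fault-suppression rule, can be sketched in Python as follows. This is a behavioural illustration only: memory is modelled as a dictionary, an unmapped address raises a hypothetical `PageFault`, and leaving inactive destination lanes at their old value is an assumption (the text does not state merge versus zero behaviour).
+
+```python
+class PageFault(Exception):
+    """Hypothetical fault raised for an unmapped address."""
+
+
+def load_elem(mem, addr):
+    if addr not in mem:
+        raise PageFault(hex(addr))
+    return mem[addr]
+
+
+def vldstride(mem, base, stride, count, pred, old):
+    """Td[i] = mem[Xbase + i*Xstride] for predicate-active lanes only."""
+    out = list(old)
+    for i in range(count):
+        if pred[i]:  # inactive lanes are never accessed, so they cannot fault
+            out[i] = load_elem(mem, base + i * stride)
+    return out
+
+
+def pgather(mem, base, idx, esize, pred, old):
+    """Tpd[i] = mem[Xbase + Ts[i]*esize] for predicate-active lanes only."""
+    out = list(old)
+    for i, active in enumerate(pred):
+        if active:
+            out[i] = load_elem(mem, base + idx[i] * esize)
+    return out
+
+
+mem = {0x100: 1, 0x110: 3}  # 0x108 is deliberately unmapped
+strided = vldstride(mem, 0x100, 8, 3, [True, False, True], [0, 0, 0])
+gathered = pgather(mem, 0x100, [0, 99, 2], 8, [True, False, True], [9, 9, 9])
+```
+
+Lane 1 targets an unmapped address in both calls, but because its predicate bit is clear the access is skipped and no fault is raised. Enabling that lane would raise `PageFault`.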
+
+**Ordering within a VTG micro block:** Vector loads/stores inside the same micro block use conservative ordering (load after store with unknown alias requires a block boundary or fence). No `VWAIT` — ordering is handled by the existing scalar memory ordering model (§12.5).
+
+### 12.5 Memory Ordering (v1 §11.5, unchanged; v2 adds SSB notes)
+
+Within a single thread:
+
+- **Scalar loads and stores** maintain **program order** through the LSU's address disambiguation (store-to-load forwarding, load queue snooping). In v2 the forwarding source is the SSB (24 entries) instead of v1's 16-entry store buffer; the semantics are identical.
+- **TILE.LD/ST** operations are **unordered** with respect to each other by default. Software uses `FENCE` instructions when ordering between tile operations and scalar operations is required.
+- **CUBE.OPA** reads from TRegFile-4K are ordered with respect to preceding TILE.LD operations by the **Tile RAT ready bits** (the cube RS will not issue until the source physical tiles are marked "ready" by completed TILE.LD operations).
+
+The SSB's branch-tag gating is **orthogonal** to memory ordering: stores still drain in alloc-age order, just only when their branch tag is non-speculative (§11.4.2).
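+
+A minimal Python sketch of this drain rule, assuming the SSB is a FIFO in allocation order and that a still-speculative entry at the head blocks younger entries (head-of-line blocking is inferred from "alloc-age order", not stated explicitly). The function name and the `max_per_cycle` parameter are hypothetical.
+
+```python
+from collections import deque
+
+NON_SPEC_BTAG = 0xFF
+
+
+def drain_ready_stores(ssb, max_per_cycle=1):
+    """Pop stores from the head while the head is non-speculative."""
+    drained = []
+    while (ssb and len(drained) < max_per_cycle
+           and ssb[0]["btag"] == NON_SPEC_BTAG):
+        drained.append(ssb.popleft())
+    return drained
+
+
+ssb = deque([
+    {"id": 0, "btag": NON_SPEC_BTAG},
+    {"id": 1, "btag": 3},              # still speculative
+    {"id": 2, "btag": NON_SPEC_BTAG},  # waits behind entry 1
+])
+first = drain_ready_stores(ssb, max_per_cycle=4)
+```
+
+Only store 0 drains; store 2 is already non-speculative but stays queued until store 1's branch tag resolves, which preserves the allocation order seen by L1-D.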
+
+---
+
+## 13. Mixed-Domain Instruction Scheduling
+
+> **(v1 → v2: subsections 13.A / 13.B / 13.C / 13.D are copied in full from v1 §12.1 / §12.2 / §12.3 / §12.4. The v2 delta is §13.1 / §13.2 / §13.3.)**
+
+All four domains share the same front-end, dispatch to domain-specific RSs, and synchronize through Tile RAT ready bits / TCB / CDB.
+
+### 13.A Unified Front-End, Distributed Back-End (v1 §12.1, unchanged)
+
+All four instruction domains share the same front-end pipeline (fetch, decode, rename). At dispatch, instructions are routed to domain-specific reservation stations. This allows the core to exploit instruction-level parallelism across domains:
+
+```
+  Single instruction stream (architectural tile regs T0–T31):
+    ADD   X5, X2, X3        → Scalar RS → ALU
+    TILE.LD T10, [X5]       → Tile RAT: T10→PT200;  MTE RS (ptdst=PT200, depends on X5 via CDB)
+    TILE.LD T20, [X6]       → Tile RAT: T20→PT201;  MTE RS (ptdst=PT201, independent)
+    VADD  T30, T10, T20     → Tile RAT: T30→PT202;  Vector RS (ptsrc=PT200,PT201; depends via TCB)
+    CUBE.OPA z0, T10, T20, r1  → Cube RS (ptsrc=PT200,PT201; depends via TCB ready bits)
+    TILE.GET X7, T30, X8    → MTE RS (ptsrc=PT202, depends via TCB; pdst=P60 → CDB scalar result)
+    TILE.PUT T10, X9, X10   → Tile RAT: T10→PT203; MTE RS (ptsrc=PT200_old, ptdst=PT203; RMW)
+    ADD   X11, X7, X9       → Scalar RS → ALU (depends on X7 via CDB from TILE.GET)
+```
+
+### 13.B Cross-Domain Dependencies (v1 §12.2, unchanged; for the speculation entries added in v2, see §13.3)
+
+Dependencies between domains are tracked through shared mechanisms:
+
+| Dependency | Mechanism |
+|------------|-----------|
+| **Scalar → MTE** (address operands) | MTE RS entry holds scalar P-reg tag for base address; wakeup via CDB when scalar ALU produces address |
+| **Scalar → Vector** (scalar operand in vector reduction) | Vector RS entry holds scalar P-reg tag for scalar inputs; wakeup via CDB |
+| **MTE → Vector** (tile data readiness) | Tile RAT: TILE.LD completes → sets ready bit for physical tile; Vector RS wakes via TCB |
+| **MTE → Cube** (tile data readiness) | Tile RAT: TILE.LD completes → sets ready bit for physical tile; Cube RS wakes via TCB |
+| **Vector → Cube/MTE** (vector result tile) | Tile RAT: vector write completes → sets ready bit; downstream RS entries wake via TCB |
+| **Cube → MTE** (drain result tile) | Tile RAT: CUBE.DRAIN completes → sets ready bit for physical tile; MTE RS wakes via TCB |
+| **Tile → Scalar** (TILE.GET element extract) | TILE.GET reads physical tile, extracts element, broadcasts scalar result on CDB |
+| **Scalar → Tile** (TILE.PUT element insert) | TILE.PUT reads scalar GPR via CDB wakeup, reads old physical tile, writes new physical tile; TCB broadcast |
+| **Vector → Vector** (reduction result) | VEC reduction ops produce column/row-vector tile result (TCB completion) |
+
+### 13.C Tile RAT Wakeup & Tile Completion Bus (TCB) (v1 §12.3, unchanged)
+
+The Tile RAT maintains a **ready bit** per physical tile register (256 bits total). This replaces a scoreboard: rename ensures every tile destination gets a unique physical tile, so there are no WAW/WAR hazards. The ready bit simply tracks whether the producing operation has finished writing the physical tile.
+
+```
+  ┌─────────────────────────────────────────────────────────────────┐
+  │  Tile RAT Ready Bits + Tile Completion Bus (TCB)                │
+  │                                                                 │
+  │  Tile RAT: 32 entries (arch T0–T31) → phys PT0–PT255           │
+  │  Ready array: 256 bits (one per physical tile)                  │
+  │  TCB: 4 broadcast ports (8-bit tag each, no data payload)      │
+  │                                                                 │
+  │  TILE.LD T10 renamed:    Tile RAT T10→PT200; ready[PT200] ← 0 │
+  │  TILE.LD PT200 completed: ready[PT200] ← 1; TCB broadcast PT200│
+  │                                                                 │
+  │  VADD T30,T10,T20 renamed: T30→PT202, reads PT200,PT201       │
+  │    RS entry: ptsrc1=PT200, ptsrc2=PT201, ptdst=PT202           │
+  │    TCB snoop: waits for ready[PT200] && ready[PT201]           │
+  │  VADD PT202 completed:   ready[PT202] ← 1; TCB broadcast PT202│
+  │                                                                 │
+  │  CUBE.OPA reads T10→PT200: checks ready[PT200]                │
+  │    if 0 → stall in Cube RS (waits for TCB wakeup)              │
+  │    if 1 → issue                                                 │
+  │                                                                 │
+  │  CUBE.DRAIN writes T12→PT205: ready[PT205] ← 0 at rename      │
+  │  CUBE.DRAIN completed:  ready[PT205] ← 1; TCB broadcast PT205 │
+  │                                                                 │
+  │  TILE.ST reads T12→PT205: checks ready[PT205]                 │
+  └─────────────────────────────────────────────────────────────────┘
+
+  TCB wakeup logic (per tile-domain RS entry):
+    For each RS entry with N tile sources (up to 3 in v2):
+      if (ptsrc_k == TCB_tag && !trdy_k):  trdy_k ← 1
+    Ready to issue when all trdy bits set (and scalar rdy if applicable)
+```
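
A minimal Python sketch of the ready-bit and TCB wakeup rule above. The class and method names (`TileRSEntry`, `TileReadyModel`, `dispatch`, `complete`) are hypothetical, and two behaviours are assumptions rather than statements from this spec: every tile starts ready at reset, and an RS entry samples the ready array once at dispatch.

```python
from dataclasses import dataclass

NUM_PHYS_TILES = 256  # one ready bit per physical tile (PT0..PT255)


@dataclass
class TileRSEntry:
    """Hypothetical tile-domain RS entry: source ptags plus one trdy bit each."""
    ptsrc: list
    ptdst: int
    trdy: list

    def snoop(self, tcb_tag):
        # Set trdy_k for every not-yet-ready source that matches the broadcast tag.
        for k, src in enumerate(self.ptsrc):
            if src == tcb_tag and not self.trdy[k]:
                self.trdy[k] = True

    def can_issue(self):
        return all(self.trdy)


class TileReadyModel:
    def __init__(self):
        self.ready = [True] * NUM_PHYS_TILES  # assumption: all tiles ready at reset
        self.rs = []

    def rename_dest(self, ptag):
        self.ready[ptag] = False  # cleared when the producer is renamed

    def dispatch(self, ptsrc, ptdst):
        # Sample source readiness before clearing the destination's ready bit.
        entry = TileRSEntry(list(ptsrc), ptdst, [self.ready[p] for p in ptsrc])
        self.rename_dest(ptdst)
        self.rs.append(entry)
        return entry

    def complete(self, ptag):
        self.ready[ptag] = True
        for entry in self.rs:  # TCB broadcast carries the tag only, no data
            entry.snoop(ptag)
```

With two `TILE.LD` destinations renamed to PT200 and PT201, a `VADD` dispatched as `dispatch([200, 201], 202)` becomes issuable only after both `complete(200)` and `complete(201)` have been called.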
+
+### 13.D Concurrent Execution Example (v1 §12.4, unchanged)
+
+A typical transformer inference kernel mixes all four domains:
+
+```
+  Cycle  │ Scalar ALU │ LSU        │ Vector            │ MTE             │ Cube MXU
+  ───────┼────────────┼────────────┼───────────────────┼─────────────────┼──────────────
+  0–7    │ addr calc  │ scalar LD  │ —                 │ TILE.LD T0-T3   │ —
+  8–15   │ loop ctrl  │ scalar LD  │ —                 │ TILE.LD T4-T7   │ —
+  16–23  │ addr calc  │ —          │ VADD read epoch   │ TILE.LD T8-T11  │ CUBE.OPA z0,...
+  24–31  │ addr calc  │ —          │ VADD write epoch  │ TILE.LD T12-T15 │ (OPA continues)
+  32–47  │ addr calc  │ scalar ST  │ VMUL (16cy)       │ TILE.LD (next)  │ (OPA continues)
+  48–63  │ loop ctrl  │ —          │ VCVT (16cy)       │ TILE.ST T16     │ CUBE.DRAIN z0
+  64+    │ next iter  │ —          │ —                 │ TILE.LD (next)  │ CUBE.OPA z1,...
+```
+
+Key observations:
+- Scalar ALU computes addresses and loop control concurrently with cube execution.
+- MTE loads next tiles while cube processes current tiles (double-buffering at software level).
+- Vector unit handles element-wise operations (activation functions, normalization) in parallel.
+- All domains proceed independently, limited only by true data dependencies.
+
+---
+
+### 13.1 New scheduling considerations under speculation (v2 addition)
+
+| Scenario | v1 behaviour | v2 behaviour |
+|----------|--------------|--------------|
+| Speculative TILE.LD | Couldn't be issued past branch | Can issue speculatively; on mispredict, allocated physical tile rolls back via Tile RAT/refcount |
+| Speculative VEC op | Couldn't be issued past branch | Can issue speculatively; staging registers, Acc, etc. are renamed-tile-only and rollback via Tile RAT |
+| Speculative TILE.ST | Couldn't be issued past branch | Issues speculatively to STQ; STQ-full → dispatch stall |
+| Speculative CUBE.OPA | Couldn't be issued past branch | Can issue speculatively; cube unit's accumulator is in physical-tile rename space |
+| FENCE under speculation | Held until branch resolves | Held in RS until `btag = 0xFF` (issue-gated) |
+
+### 13.2 Speculative tile-domain ops
+
+A subtle point: vector / cube / MTE-load instructions younger than an unresolved branch can execute speculatively because:
+
+- Their inputs (source physical tiles) are versioned via the Tile RAT, so even if a wrong-path instruction reads the wrong physical tile, it only writes its own (also renamed) destination physical tile.
+- Their destination is a fresh physical tile that gets freed via refcount + free-list-restore on misspeculation.
+- They consume TRegFile-4K port cycles (epoch slots) but don't change architectural memory state.
+
+The only resources wasted on misspeculation are **TRegFile-4K port bandwidth** and **microcode beats** spent on the wrong path. As a first-order estimate, in a typical kernel where vector ops are 20–30% of dispatch volume, ~95% branch prediction accuracy puts wrong-path vector work at roughly 1–1.5% of dispatch volume (20–30% × 5%), which is well within budget.
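
The estimate above is a simple product, sketched here in Python. The function name is hypothetical, and the model assumes that wrong-path work has the same vector share as the overall instruction stream and that the mispredict rate applies uniformly to dispatch volume.

```python
def wasted_vector_fraction(vector_share, prediction_accuracy):
    """First-order share of dispatch volume that is wrong-path vector work."""
    return vector_share * (1.0 - prediction_accuracy)


low = wasted_vector_fraction(0.20, 0.95)
high = wasted_vector_fraction(0.30, 0.95)
print(f"wrong-path vector work: {low:.1%} to {high:.1%} of dispatch volume")
```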
+
+### 13.3 Cross-domain dependency table (v2 adds two rows)
+
+| Dependency | Mechanism |
+|------------|-----------|
+| Scalar → MTE (address operands) | MTE RS holds scalar P-reg tag; CDB wakeup |
+| Scalar → Vector (scalar operand SX/SY) | Vector RS holds scalar P-reg tag; CDB wakeup; OR captured at issue-time GPR read |
+| MTE → Vector / Cube (tile data readiness) | Tile RAT ready bit + TCB wakeup |
+| Vector → Cube/MTE | Tile RAT ready bit + TCB |
+| Cube → MTE | Tile RAT ready bit + TCB |
+| Tile → Scalar (TILE.GET) | CDB scalar broadcast |
+| Scalar → Tile (TILE.PUT) | TCB tile broadcast |
+| **Branch → Speculative Memory** | **SSB / STQ branch-tag gating (§11.4, §11.5)** |
+| **Branch → Speculative Register / Tile** | **RAT checkpoint flash-restore + refcount free-list-head restore (§11.3)** |
+
+---
+
+## 14. Performance Targets
+
+### 14.1 Clock & throughput
+
+| Metric | Target |
+|--------|--------|
+| Clock frequency | ≥ **1.5 GHz** (5 nm) |
+| Scalar IPC peak / sustained | 4.0 / 2.5–3.0 |
+| **Vector throughput (FP32 elementwise, 1 tile/8 cy)** | **0.77 TFLOPS** |
+| **Vector throughput (FP4 elementwise)** | **6.14 TFLOPS** |
+| **Vector throughput (FP32 wide row-reduce, recommended baseline)** | **~8.4 GFLOPS effective** (13 beats/8 lanes/iteration) |
+| **TINV throughput (128×128 FP32)** | **~1 inverse / 33 K cy ≈ 22 µs ≈ 45 K inverses / s @ 1.5 GHz, single-tile-resident** |
+| **TMRGSORT throughput (1024 FP32 sort)** | **~6.8 M sorts / s @ 1.5 GHz, single-instruction** |
+| Cube FP16 / FP8 / MXFP4 | 12.3 / 24.6 / 98.3 TFLOPS / TOPS |
+| MTE tile bandwidth | 4 KB/cy aggregate read + 4 KB/cy aggregate write |
+| Memory bandwidth (L2) | 96 GB/s |
+| **Mispredict penalty** | **6–7 cy** (vs. v1's 6 cy; the +1 cy is the SSB/STQ tag-CAM propagation) |
+| **Speculative depth** | **up to 8 unresolved branches** (matches RAT checkpoint count) |
+
+### 14.2 Workload performance summary
+
+For pure-cube kernels (transformer GEMM, CNN), v2 performance equals v1 (cube unit unchanged). Improvements vs. v1 appear in:
+
+| Workload | v1 | v2 | Speedup |
+|----------|-----|----|---------|
+| Softmax (batch 8, dim 4096) | 24K cy (vector + scalar) | 18K cy (masked reductions, no `TTRANS` predecessor) | **1.33×** |
+| Layer norm (batch 8, dim 4096) | 22K cy | 16K cy | **1.38×** |
+| Attention with mask (batch 8, seq 1024) | 80K cy | 56K cy (per-element mask native) | **1.43×** |
+| GEMM 128×128 inverse (Kalman update) | software emulation (~4M cy CPU-equivalent on vector) | **TINV 33 K cy** | **~120×** |
+| 1024-element top-k | software ~50K cy | TMRGSORT 220 cy + scalar cleanup ~250 cy | **~100×** |
+| Speculative scalar-heavy code (e.g. graph algorithm) | limited to in-branch parallelism | full speculation past 8 unresolved branches | **~2–3× sustained IPC improvement** |
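
The speedup column and the per-second throughput figures follow directly from the cycle counts. The Python sketch below recomputes them from the table's numbers; the helper names are hypothetical and the 1.5 GHz clock is the §14.1 target, not a measured value.

```python
CLOCK_HZ = 1.5e9  # clock target from §14.1


def ops_per_second(cycles_per_op, clock_hz=CLOCK_HZ):
    """Back-to-back operations per second at the given clock."""
    return clock_hz / cycles_per_op


def speedup(v1_cycles, v2_cycles):
    return v1_cycles / v2_cycles


# (v1 cycles, v2 cycles) taken from the workload table above
rows = {
    "softmax": (24_000, 18_000),
    "layer norm": (22_000, 16_000),
    "masked attention": (80_000, 56_000),
    "128x128 inverse": (4_000_000, 33_000),
    "top-k (1024)": (50_000, 220 + 250),  # TMRGSORT plus scalar cleanup
}
for name, (v1, v2) in rows.items():
    print(f"{name:18s} {speedup(v1, v2):8.2f}x")
print(f"TMRGSORT: {ops_per_second(220) / 1e6:.1f} M sorts/s")
```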
+
+### 14.3 IPC breakdown (transformer decode)
+
+```
+  Instruction mix (typical transformer layer, M=8, K=4096, N=4096, FP16):
+    Scalar:   ~15%
+    MTE:      ~25%
+    Cube:     ~55%
+    Vector:   ~5%
+  
+  v2 advantage:
+    Speculation lets 30–40% of scalar ops past unresolved branches issue early
+    → effective scalar IPC rises from ~2.5 (v1) to ~3.2 (v2)
+    → end-to-end kernel time drops ~3–5% (cube remains the bottleneck)
+```
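
A rough Amdahl-style check of the end-to-end figure, as a Python sketch. It assumes that the scalar instruction share (~15%) equals the scalar share of kernel time and that scalar work is not hidden behind cube execution; both are simplifications, and the function name is hypothetical.

```python
def kernel_time_reduction(scalar_share, ipc_old, ipc_new):
    """Fraction of kernel time saved when only the scalar share speeds up."""
    new_time = (1.0 - scalar_share) + scalar_share * (ipc_old / ipc_new)
    return 1.0 - new_time


saving = kernel_time_reduction(scalar_share=0.15, ipc_old=2.5, ipc_new=3.2)
print(f"estimated kernel time reduction: {saving:.1%}")
```

Under these assumptions the saving comes out at about 3.3%, at the lower end of the ~3–5% range quoted above.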
+
+---
+
+## 15. Area & Power
+
+### 15.1 Area summary (v2 deltas vs. v1 in **bold**)
+
+| Component | v1 area | v2 area | Δ |
+|-----------|---------|---------|---|
+| TRegFile-4K (1 MB SRAM + 32 b metadata SRAM) | ~1.20 mm² | ~1.20 mm² + ~0.005 mm² metadata | +0.005 mm² |
+| outerCube MXU | ~0.80 mm² | ~0.80 mm² | 0 |
+| **Vector unit** | ~0.20 mm² (v1) | **~0.30 mm² (VEC-4K-v2 SRAM-staging baseline)** | +0.10 mm² (with new ISA: TINV/TROWRANGE_MUL/TMRGSORT) |
+| L1-I / L1-D / L2 | ~0.66 mm² | ~0.66 mm² | 0 |
+| Scalar physical RF + Scalar RAT + free list | ~0.22 mm² | ~0.22 mm² | 0 |
+| Tile RAT + Tile free list + tile refcount | ~0.05 mm² | ~0.05 mm² + 0.005 mm² metadata RAT | +0.005 mm² |
+| Tile Completion Bus (TCB) + tile RS CAMs | ~0.05 mm² | ~0.06 mm² (24-entry vector RS) | +0.01 mm² |
+| MTE transpose buffer | 0.005 mm² (4 KB) | 0.001 mm² (512 B) | -0.004 mm² (smaller buffer) |
+| RS + dispatch + checkpoint control | ~0.15 mm² | ~0.15 mm² | 0 |
+| **Speculative Store Buffer (SSB, 24 entries)** | — | **+0.02 mm²** | +0.02 mm² |
+| **Speculative Tile-Store Queue (STQ, 8 entries)** | — | **+0.003 mm²** | +0.003 mm² |
+| **Branch-tag tracker** | — | **+0.001 mm²** | +0.001 mm² |
+| **Total core (estimated)** | **~3.26 mm²** | **~3.41 mm²** | **+0.15 mm² (+4.6%)** |
+
+### 15.2 Net impact
+
+The v2 core is approximately **5% larger** than v1 in exchange for:
+
+- A re-architected vector unit with per-element masking, 3-source/2-dest, restored FP4/FP8, and three new high-impact instructions (TINV / TROWRANGE_MUL / TMRGSORT).
+- Per-port `is_transpose` on TRegFile-4K (eliminating most `TILE.TRANSPOSE` predecessors).
+- Branch-prediction-driven speculative execution with full architectural-state recovery (no ROB).
+
+Vector unit area actually shrinks vs. v1's vector unit when accounting for [`vector4k_v2.md`](vector4k_v2.md) §10's analysis (~27% smaller for VEC-v2 SRAM-staging baseline vs. v1) — but the v2 unit also adds the new instructions (TINV / TROWRANGE_MUL / TMRGSORT) that bring net area roughly to parity with v1's vector unit, while delivering ~100× higher performance on those kernels.
+
+### 15.3 Power management
+
+**Same techniques as v1 §14.2**, with two additions:
+
+- **Branch-tag tracker clock-gates** when no branches are in flight (very common in straight-line code regions).
+- **SSB / STQ entries clock-gate** their flip-flop fields when invalid.
+
+---
+
+## 16. External Interfaces
+
+> **(v1 → v2: content unchanged; the following is reproduced in full from v1 §15.)**
+
+### 16.1 Core-to-NoC Interface (v1 §15.1)
+
+| Parameter | Value |
+|-----------|-------|
+| Bus width | **256 bits** (32 B) |
+| Protocol | AXI4 (or similar point-to-point) |
+| Outstanding requests | **32** (read) + **16** (write) |
+| Burst length | Up to 4 beats (128 B, 2 cache lines) |
+| Clock domain | Core clock (synchronous) or async bridge |
+
+### 16.2 Cache Coherence (v1 §15.2)
+
+The Davinci core is designed primarily for single-core or non-coherent multi-core configurations (AI accelerator context). When coherence is needed:
+
+| Parameter | Value |
+|-----------|-------|
+| Protocol | MOESI or directory-based |
+| Snoop filter | L2 tag duplicate |
+| Coherence granularity | 64 B (cache line) |
+
+For tile data (TRegFile-4K), coherence is managed at the software level. Tile data bypasses the coherence protocol, flowing through the MTE's dedicated memory path.
+
+### 16.3 Debug & Trace Interface (v1 §15.3)
+
+| Feature | Description |
+|---------|-------------|
+| Debug halt | External debug request halts core at next instruction boundary |
+| PC trace | Compressed branch trace (taken/not-taken stream) |
+| Performance counters | 8 programmable counters: IPC, branch mispredict rate, cache miss rate, cube utilization, MTE stalls, RS occupancy |
+| Breakpoint registers | 4 instruction address breakpoints + 2 data address watchpoints |
+
+---
+
+## Appendix A: Glossary (v2 additions in **bold**)
+
+| Term | Definition |
+|------|-----------|
+| RAT | Register Alias Table |
+| **Tile Metadata RAT** | 256 × 32 b SRAM holding per-physical-tile (shape.x, shape.y, format) |
+| TCB | Tile Completion Bus — 4-port broadcast for tile wakeup |
+| CDB | Common Data Bus — broadcast network for scalar results |
+| RS | Reservation Station |
+| MTE | Memory Tile Engine |
+| MXU | Matrix Unit (outerCube) |
+| TRegFile-4K | Tile Register File with 4 KB physical tiles, 8R+8W ports, **per-port `is_transpose` (v2)** |
+| OPA | Outer Product Accumulate |
+| **VEC-4K-v2** | Re-architected vector unit ([`vector4k_v2.md`](vector4k_v2.md)) with staging registers, microcode beat machine, and 3-operand support |
+| **SA, SB, SC** | Vector unit's value-tile and mask staging registers (4 KB each, 1R1W SRAM in production baseline) |
+| **SX, SY** | Scalar staging slots in VEC-4K-v2 (64 b each, GPR/IMM/TILE/ACC sourced) |
+| **TINV** | Tile matrix inverse instruction (up to 128×128 FP32) |
+| **TROWRANGE_MUL** | Column-wise product over a dynamic row sub-range |
+| **TMRGSORT** | Reconfigurable bitonic sort over any `N = 2^p` up to 8192 |
+| **TSETMETA** | Rename-only instruction that updates a tile's metadata word |
+| **is_transpose** | Per-read-port flag on TRegFile-4K that selects row-mode vs. col-mode chunk-grid delivery |
+| **tilelet_xpose** | Per-beat microcode bit in VEC-4K-v2 selecting per-tilelet chunk-grid transpose at staging-side |
+| **branch_tag** | 3-bit tag attached to every µop younger than an unresolved branch; 8 active tags max |
+| **SSB** | Speculative Store Buffer — 24-entry buffer that gates scalar stores by branch tag |
+| **STQ** | Speculative Tile-Store Queue — 8-entry buffer that gates MTE bulk stores by branch tag |
+| **Speculation Tracker** | 5-K-gate structure tracking 8 active branch tags + 8×8 ancestry bitmap |
+| ROB | Reorder Buffer — *not present* in Davinci-v2; functionally replaced by RAT checkpoint + refcount + SSB + STQ |
+| MSHR | Miss Status Holding Register |
+| BTB | Branch Target Buffer |
+| TAGE | TAgged GEometric history length predictor |
+| RAS | Return Address Stack |
+| IPC | Instructions Per Cycle |
+| MLP | Memory-Level Parallelism |
+
+## Appendix B: Reference Documents
+
+| Document | Content |
+|----------|---------|
+| [`Davinci_supersclar.md`](Davinci_supersclar.md) | Davinci v1 — direct predecessor; v2 inherits all unchanged subsystems |
+| [`outerCube.md`](outerCube.md) | outerCube MXU architecture, dual-mode operation, ISA, pipeline, performance analysis |
+| [`tregfile4k.md`](tregfile4k.md) | TRegFile-4K design (256×4KB tiles, 8R+8W ports, **§7 per-port `is_transpose` enhancement**) |
+| [`vector4k_v2.md`](vector4k_v2.md) | **VEC-4K-v2 vector unit specification** — staging registers, per-beat microcode, masked / 3-source / 2-dest, TINV/TROWRANGE_MUL/TMRGSORT |
+| [`vector4k.md`](vector4k.md) | VEC-4K v1 vector unit (referenced by v2 for unchanged subsystems) |
+| [`Simplified_Superscalar Design Concepts-2.md`](Simplified_Superscalar%20Design%20Concepts-2.md) | OoO execution theory background: no-ROB design, RAT checkpointing, refcount freeing |
+| [pto-isa vector docs](https://github.com/hw-native-sys/pto-isa/tree/main/docs/isa) | Authoritative PTO ISA |
+
+## Appendix C: Document History
+
+| Version | Date | Notes |
+|---------|------|-------|
+| **v2.1** | 2026-04-30 | **Native 3-source ternary FMA family (`VFMA`, `VFNMA`, `VLERP`) added — see §2.2.6a.** Operand `C` is promoted to a **dual role** (mask **or** value tile) selected by a new 1-bit issue-time `c_role ∈ {MASK, VALUE}` flag in the instruction word's `funct6` field (§2.2.2, §2.2.3). With `c_role = VALUE`, `C` is fetched as a full 4 KB value tile through a **3rd VEC-side TRegFile read port (R1)** — TRegFile-4K has 8 read ports, so this is purely a binding allocation, no new SRAM or bank-conflict pressure. With three value tiles fetched in parallel within one 8 cy epoch, `VFMA` runs at the **same throughput as a binary `VADD` / `VMUL` (1 tile / 8 cy)** — a 2× speed-up over the emulated `VMUL` + `VADD` two-instruction sequence — and produces the IEEE-754 single-rounding FMA result, halving the rounding error vs. emulation (critical for FP16 / BF16 / FP8 narrow-format normalisation kernels). **Justification (from [`FMA指令场景说明.md`](FMA指令场景说明.md))**: the canonical `y = γ·x̂ + β` LayerNorm / RMSNorm affine, Welford incremental update (`μ_new = δ·inv_n + μ_old`, `M2_new = δ·δ_2 + M2_old`), Welford state merge, activation polynomials (`gelu`, `swiglu`), and trigonometric polynomials (`sin`, `cos`) all need a third operand that is **not** the previous accumulator — v2.0's `VFMA_ACC D = A·B + Acc` does not apply. **Hardware delta vs. v2.0**: ~6 K gate (~0.2 % of VEC-4K-v2 area) — ~5 K for adding a 512 B/cy value-mode read path on `SC` alongside the existing 1-bit-mask path, ~1 K for control-path widening (Tile RAT / RS / dispatch carry the `c_role` bit). The stage-(B) per-lane FMA core, microcode beat machinery, and 8-port TRegFile already supported `A·B + Z` and the 3rd binding allocation. RS entry width unchanged in concept (the `c_role` bit slots into the existing flags). 
**Pipeline timing**: §8.3.7 latency table updated with `VFMA / VFNMA` rows (16 cy total = 8 fetch + 8 retire, 1/8 throughput) and `VLERP` (24 cy total, 1/16 throughput due to dual retire); mixed-`is_transpose` rows added (16 cy fetch for one-mismatched, 24 cy for all-distinct degenerate). **Documentation updates**: §2.2.2 operand model gains the `c_role` row + 3rd-port rationale callout; §2.2.3 encoding diagram shows the new `c_role` bit; **new §2.2.6a with full ISA semantics, kernel motivation, hardware-cost breakdown, and pipeline-timing table**; §2.2.8 instruction list gains **Category O — Native 3-source Ternary FMA family**; §8.3.7 latency table updated. **Backward compatibility preserved**: v1 / v2.0 binaries emit `c_role = MASK` exclusively, `R1` stays idle and clock-gated, no behaviour change. Companion update in [`vector4k_v2.md`](vector4k_v2.md) v0.18 (§3.1, §3.3c, §6.2, §7.6, §10). |
+| **v2.0** | 2026-04-30 | **Initial Davinci-v2 specification.** Three major changes vs. v1: **(1) TRegFile-4K with per-port `is_transpose` flag** ([`tregfile4k.md`](tregfile4k.md) §7), enabling row-mode or col-mode delivery at full 512 B/cy; consumed by v2 vector unit and (optionally) by cube and MTE. **(2) Vector unit re-architected to VEC-4K-v2** ([`vector4k_v2.md`](vector4k_v2.md)): explicit SRAM-based staging registers (`SA`, `SB`, `SC`) decoupling TRegFile fetch from compute, per-beat microcode dispatch, 3-source / 2-dest tile operands with per-element bitmask predication, restored FP4 and FP8 formats, three new PTO instructions (`TINV` matrix inverse up to 128×128 FP32, `TROWRANGE_MUL` row-range product, `TMRGSORT` bitonic sort over any `N = 2^p` up to 8192), and **tile-register metadata** (32 b: `shape.x`, `shape.y`, `format`) carried via a new Tile Metadata RAT. **(3) Branch-prediction-driven speculative execution** with a ROB-less recovery scheme (§11): a 5-K-gate Branch-Tag Speculation Tracker (8 tags + 8×8 ancestry bitmap), a 24-entry **Speculative Store Buffer** (SSB) that gates scalar stores by branch tag, and an 8-entry **Speculative Tile-Store Queue** (STQ) that gates MTE bulk stores by branch tag. The scheme proves that all three classes of speculative state (renamed registers/tiles, in-flight pipeline state, externally-visible memory effects) can be safely recovered without a Reorder Buffer: classes A and B reuse the v1 RAT-checkpoint + refcount + branch-tag-CAM machinery; class C is gated by SSB / STQ until the producing branch tag becomes non-speculative. Mispredict penalty: 6–7 cy (vs. v1's 6 cy). Total v2 speculation hardware: ~110 K gate (~0.025 mm²), about 3.5% of the v1 core area. v2 core area: ~3.41 mm², a ~5% increase over v1's ~3.26 mm². 
Performance gains: 1.3–1.4× on masked-vector kernels (softmax, layer norm, masked attention), ~100× on `TINV`-bound (Kalman, NeRF pose) and `TMRGSORT`-bound (top-k, beam-search) kernels, and ~2–3× sustained scalar IPC improvement on speculative-heavy code paths. Cube unit, scalar unit, memory subsystem (caches), and external interfaces remain unchanged from v1. |
diff --git a/designs/outerCube/Davinci_vtg_vector_micro_instructions_v1.md b/designs/outerCube/Davinci_vtg_vector_micro_instructions_v1.md
new file mode 100644
index 00000000..7add9967
--- /dev/null
+++ b/designs/outerCube/Davinci_vtg_vector_micro_instructions_v1.md
@@ -0,0 +1,509 @@
+# Davinci-v2 VTG Vector Micro-Instructions — Hardware-Consistent Design v1.1
+
+> **Document ID**: DSP-002
+> **Version**: v1.1
+> **Date**: 2026-05-02
+> **Status**: **Draft (resolving hardware inconsistencies found in v1.0)**
+> **Target**: `pyCircuit/designs/outerCube/Davinci_superscalar_v2.md`
+> **Change Point**: #2 — Add VTG (Vector Thread Group) vector micro-instructions with SIMD-group execution; v1.1 revises the integration model to be consistent with VEC-4K-v2 and TRegFile-4K-v2 hardware
+> **Hardware cross-check**: This document was revised against `vector4k_v2.md` §3–§9, `tregfile4k_v2.md` §3–§7, and `Davinci_superscalar_v2.md` §6–§9.
+
+---
+
+## Change Log v1.0 → v1.1
+
+v1.0 had **3 fatal and 5 high-severity hardware inconsistencies** identified by cross-check:
+
+| # | Issue | Fix in v1.1 |
+|---|-------|-------------|
+| F1 | TRegFile port conflict: VTG Group Read Adapter claimed independent R0/R4 ports, but VEC-4K-v2 is hard-bound to R0/R4. | VTG **reuses** VEC-4K-v2's port bindings; VTG operands are sub-ranges of the same tiles VEC-4K-v2 stages. No independent ports. |
+| F2 | Micro-instruction format mismatch: `MicroOpEntry` {opcode, elem_type} does not match VEC-4K-v2's 64-bit beat-word format {src_*, s_*, xp_*, alu_op, acc_op, ...}. | Micro-instruction buffer stores **pre-decoded VEC beat-word sequences** rather than V*-level instructions. VTG microassembler generates beat words from V* operands. |
+| F3 | Beat-level control undefined: VTG microcode is V*-level (38 opcodes) but VEC-4K-v2 ALU is driven beat-by-beat. | VTG micro-instructions are expanded into **per-beat word sequences** by the VTG microassembler. Each VTG op = 1–N beat words. |
+| H1 | TRegFile epoch timing: "TRegFile read at I1" implied full tile immediately; ignores 8-cycle epoch. | Revised lifecycle with **prologue model**: VTG submits TRegFile read request; full tile delivered over 8-cycle epoch; sub-range selection happens after prologue. |
+| H2 | VTG latency (9 cycles) ignored prologue (8–15 cy) and writeback RMW (16 cy). | Revised to **T_fetch + 1 + T_writeback** = 8–15 cy prologue (depending on epoch alignment) + 1 cy compute + 16 cy RMW writeback = **25–32 cy** for a single-beat op (§5.1). |
+| H3 | Group Write Adapter partial-write to TRegFile: TRegFile has no partial-write mechanism. | Group Write Adapter performs **full-tile read-modify-write**: read old tile (8 cy), merge VTG sub-range, write merged tile (8 cy) = **16 cy minimum**. |
+| H4 | Port arbitration: "arbitrate via existing issue pipeline" — no mechanism described. | Added **VEC-domain arbitration matrix** covering Vector RS + GVIQ + MTE RS. VEC ALU is 1-wide, single client at a time. |
+| H5 | Staging register sizes: VTG described Group Read Adapter outputting 256/512 B to SA/SB; but SA/SB/SC are 4 KB each. | VTG operates **behind** VEC staging: SA/SB/SC filled by VEC-4K-v2 prologue (unchanged); VTG sub-range selector reads **from** SA/SB at the ALU input mux, not from TRegFile. |
+
+---
+
+## 1. Motivation (unchanged from v1.0)
+
+### 1.1 Current Davinci-v2 Vector Execution Model
+
+Davinci-v2 currently executes vector instructions as **full-tile operations** on VEC-4K-v2. For AI kernels with strip-mined inner loops, the current model requires the compiler to generate repeated tile ops with different effective addresses. VTG vector micro-instructions enable the same micro-kernel to run across multiple 256 B or 512 B slices inside a tile using a pre-allocated micro-instruction buffer and a warp-like rotation scheduler.
+
+### 1.2 What This Change Adds (v1.1 corrected model)
+
+This change introduces **VTG (Vector Thread Group) vector micro-instructions**:
+
+- One 4 KB tile is partitioned into 16×256 B or 8×512 B VTGs
+- Each VTG carries loop/thread counter state in the GVIQ entry prefix
+- A **micro-instruction buffer** in the vector ALU holds pre-decoded VEC beat-word sequences, shared by all VTGs via `block_id`
+- VTG **operates behind VEC-4K-v2's staging registers** (SA/SB/SC are filled by the VEC prologue); VTG sub-range selection happens at the ALU input mux
+- VTG uses the **same TRegFile ports** (R0/R4 for reads, W0 for writeback) as VEC-4K-v2, with VEC-domain arbitration
+- The existing VEC-4K-v2 ALU datapath (128-lane SIMD groups, 512 B/cycle throughput) is reused unchanged
+
+---
+
+## 2. Concepts
+
+### 2.1 SIMD Group
+
+The **SIMD group** is VEC-4K-v2's internal 128-lane execution unit. The hardware is identical to the full-tile VEC-4K-v2 path; no changes to the SIMD group are required. In `G512` mode, one VTG fills one SIMD group beat (512 B). In `G256` mode, one VTG fills half a SIMD group beat.
+
+### 2.2 Vector Thread Group (VTG)
+
+A **VTG** is a warp-like scheduling context: 256 B (`G256`) or 512 B (`G512`) inside a 4 KB tile. It has its own `group_id`, loop counters, and active-lane state in the GVIQ entry. VTG operands are tile-relative (`T4.g2` = tile T4, group g2).
+
+**Critical (v1.1):** A VTG is not an independent hardware entity with its own TRegFile ports. VTGs **share** the VEC-4K-v2 datapath and TRegFile ports. Multiple VTGs are in flight simultaneously, but they compete for the same VEC ALU through the GVIQ scheduler. There is no parallel VTG execution — VTG is a **scheduling abstraction**, not a parallel execution unit.
+
+### 2.3 Micro-Instruction Buffer
+
+The **micro-instruction buffer** is a pre-allocated buffer in the vector ALU that holds **pre-decoded VEC beat-word sequences** for VTG blocks. It is shared by all VTGs in the same tile group.
+
+**Critical (v1.1):** The buffer does NOT store V*-level instructions (`VADD.type`, etc.). It stores **VEC beat-word sequences** — each entry is the pre-decoded beat-word sequence that drives VEC-4K-v2's ALU cycle-by-cycle. The VTG microassembler generates these beat-word sequences from V* operands at decode time.
+
+Buffer organization:
+```
+BufferEntry {
+  valid:      1 b
+  block_id:   12 b  [tag]
+  pc_limit:   8 b   [last beat word index]
+  beat_words: array[64] of VECBeatWord  // pre-decoded VEC beat words
+}
+```
+
+### 2.4 VTG Microassembler
+
+The **VTG microassembler** generates VEC beat-word sequences from V* operands at decode time. It consults the VTG Metadata Table and the VTG's element type, predicate mode, and loop counters to produce the correct beat-word sequence for each VTG micro-op.
+
+---
+
+## 3. Hardware Integration Model
+
+### 3.1 VTG Sits Behind VEC-4K-v2 Staging
+
+The fundamental architectural decision (v1.1) is that **VTG operands are sub-ranges of tiles that VEC-4K-v2 has already staged**:
+
+```
+TRegFile-4K
+  R0, R4 ──► VEC prologue (8-cycle epoch) ──► SA (4 KB) ──┐
+                                                          ├── VTG sub-range selector ──► ALU input mux ──► VEC ALU
+                                              SB (4 KB) ──┘
+  W0 ◄────── VTG Group Write Adapter (full-tile RMW)
+```
+
+Key consequences:
+1. **No new TRegFile ports**: VTG reuses R0/R4 (for tile reads) and W0 (for writeback)
+2. **No new staging registers**: VTG reads from SA/SB/SC (4 KB each), which are filled by the VEC prologue
+3. **VEC prologue is shared**: Both VTG and full-tile VEC-4K-v2 use the same prologue to fill SA/SB/SC
+4. **Sub-range selection at ALU input mux**: The VTG sub-range selector reads 256/512 B sub-ranges from SA/SB/SC and presents them to the ALU
+
+### 3.2 TRegFile Epoch Sharing
+
+VEC-4K-v2's operand-fetch prologue (`vector4k_v2.md` §6.1–§6.3) occupies the TRegFile ports for the prologue duration:
+
+| N_val | `is_transpose` mix | Prologue T_fetch |
+|--------|----------------|-----------------|
+| 1 | any | **8–15 cy** |
+| 2 | uniform | **8–15 cy** |
+| 2 | mixed | **16–23 cy** (R2 penalty) |
+
+VTG, operating behind the prologue, is subject to the same latency:
+- VTG submits a tile read request at issue
+- TRegFile delivers 512 B/cycle over 8 cycles (1 epoch)
+- Sub-range selection from SA/SB begins after the relevant strips have arrived
+- VTG compute begins after sub-range selection
+
+### 3.3 VEC-Domain Arbitration Matrix
+
+The VEC-4K-v2 ALU is **1-wide**: only one client can use it per cycle. The three clients are:
+
+| Client | Issue Queue | Throughput | Priority |
+|--------|------------|------------|----------|
+| Full-tile VEC-4K-v2 | Vector RS (24 entries) | 1 tile op / 8–15 cy | Highest (coarser grain) |
+| VTG micro-op | GVIQ (32 entries) | 1 VTG op / 8–16 cy | Medium |
+| MTE | MTE RS (16 entries) | Varies | Lowest |
+
+The VEC-domain arbiter grants the VEC ALU to one client per cycle based on readiness and priority. Full-tile VEC ops have higher priority because they hold the prologue for longer and are coarser-grain.
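
A minimal fixed-priority arbiter in Python, following the priority column of the table above. The client keys (`vector_rs`, `gviq`, `mte_rs`) and the function name are hypothetical. The sketch grants strictly by priority among ready clients; any starvation-avoidance or aging policy is not described in this section and is not modelled.

```python
# Highest to lowest priority, per the arbitration table
PRIORITY = ("vector_rs", "gviq", "mte_rs")


def arbitrate(ready):
    """Return the highest-priority ready client, or None if none is ready."""
    for client in PRIORITY:
        if ready.get(client, False):
            return client
    return None
```

For example, `arbitrate({"gviq": True, "mte_rs": True})` grants the VEC ALU to the GVIQ for that cycle.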
+
+---
+
+## 4. VTG Micro-Instruction Buffer (Hardware-Correct)
+
+### 4.1 Buffer Organization
+
+The buffer stores **pre-decoded VEC beat-word sequences** (not V*-level instructions):
+
+```
+BufferEntry {
+  valid:      1 b
+  block_id:   12 b  [tag]
+  pc_limit:   8 b   [last beat_word_index]
+
+  beat_words: array[64] of VECBeatWord
+}
+
+VECBeatWord {          // Same as VEC-4K-v2's SOP beat-word format
+  src_A:      3 b    // SA / SB / ACC / SX / SY / ZERO
+  src_B:      3 b
+  src_C:      3 b    // SC (mask/value)
+  s_A:        3 b    // strip index 0..7
+  s_B:        3 b
+  s_C:        3 b
+  xp_A:       1 b    // transpose for operand A
+  xp_B:       1 b
+  xp_C:       1 b
+  alu_op:     5 b    // ADD/SUB/MUL/FMA/etc.
+  acc_op:     3 b    // NONE/INIT/ACCUM/MERGE/READOUT
+  acc_slot:   1 b    // LO/HI
+  wr_en_D0:   1 b
+  wr_strip_D0: 3 b  // which strip writes to D0
+  wr_en_D1:   1 b
+  wr_strip_D1: 3 b
+}
+```
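
The field list above can be packed into a single integer, sketched here in Python. The LSB-first bit order and the `pack`/`unpack` helpers are assumptions for illustration; the actual VEC-4K-v2 bit layout is defined in `vector4k_v2.md`. The fields listed here total 38 bits, so the placement of any remaining bits of the beat word is left unspecified.

```python
# (field name, width in bits), in the order listed in VECBeatWord
FIELDS = [
    ("src_A", 3), ("src_B", 3), ("src_C", 3),
    ("s_A", 3), ("s_B", 3), ("s_C", 3),
    ("xp_A", 1), ("xp_B", 1), ("xp_C", 1),
    ("alu_op", 5), ("acc_op", 3), ("acc_slot", 1),
    ("wr_en_D0", 1), ("wr_strip_D0", 3),
    ("wr_en_D1", 1), ("wr_strip_D1", 3),
]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values):
    """Pack a {field: value} dict into an int; missing fields default to 0."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Inverse of pack(): split an int back into a {field: value} dict."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out
```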
+
+### 4.2 VTG Microassembler
+
+At decode time, the VTG microassembler generates the beat-word sequence for each VTG micro-op:
+
+```
+VADD.F32 Td.gN, Ts0.gM, Ts1.gP, Tp:
+  // beat 0: SA ← Ts0 sub-range (group M), SB ← Ts1 sub-range (group P)
+  beat_words[0] = {
+    src_A: SA, s_A: group_strip_of(M), xp_A: 0,
+    src_B: SB, s_B: group_strip_of(P), xp_B: 0,
+    src_C: SC, s_C: pred_strip_of(Tp), xp_C: 0,
+    alu_op: ADD, acc_op: NONE, wr_en_D0: 1, wr_strip_D0: group_strip_of(N)
+  }
+  // For G256 (256 B = half of a 512 B strip): group_strip_of(gN) = gN / 2 (0..7); the byte offset selects the half
+  // For G512 (512 B = one full strip): group_strip_of(gN) = gN (0..7)
+```
+
+The microassembler consults:
+- **VTG Metadata Table**: `group_mode`, `elem_type`, `pred_granule`, `active_bytes`
+- **Tile Metadata RAT**: `shape.x`, `shape.y`, `format` (from `Davinci_superscalar_v2.md` §6.1)
+- **GVIQ entry prefix**: `iter0..iter3`, `active_lanes`
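
A Python sketch of the group-to-strip mapping the microassembler needs. It assumes a strip is 512 B (eight strips per 4 KB tile, matching the 3-bit strip index) and that groups are laid out contiguously by `group_id`. Returning the byte offset inside the strip alongside the strip index is an illustrative choice, not the documented interface of `group_strip_of`.

```python
TILE_BYTES = 4096
STRIP_BYTES = 512  # one strip = one 512 B beat (strip index 0..7)
GROUP_BYTES = {"G256": 256, "G512": 512}


def group_strip_of(group_mode, group_id):
    """Map a VTG to (strip index, byte offset inside that strip)."""
    size = GROUP_BYTES[group_mode]
    groups_per_tile = TILE_BYTES // size
    if not 0 <= group_id < groups_per_tile:
        raise ValueError(f"group_id {group_id} out of range for {group_mode}")
    byte_offset = group_id * size
    return byte_offset // STRIP_BYTES, byte_offset % STRIP_BYTES
```

In `G256` mode, group 5 maps to strip 2 at offset 256 (the upper half of the strip); in `G512` mode, the group index and strip index coincide.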
+
+### 4.3 Buffer Allocation and Access
+
+Buffer allocation: at decode, the Vector Micro Block Builder assigns `block_id` and runs the microassembler to generate the beat-word sequence, writing each `VECBeatWord` into `buffer[block_id % depth][way].beat_words[beat_index]`.
+
+Buffer access at issue:
+```
+at P1/I1:
+    winner = gviq.pick_oldest_ready()
+    beat_word = buffer.lookup(winner.block_id, winner.pc_index)
+    // beat_word drives VEC ALU for this cycle
+    winner.pc_index++
+    if winner.pc_index > winner.pc_limit:
+        winner.valid = 0  // retire
+```
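
The issue-time pseudocode above can be exercised with a small Python model. `GviqEntry`, its `age` and `ready` fields, and the dict-of-lists buffer are hypothetical stand-ins; the real buffer is indexed by `block_id % depth` and way, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class GviqEntry:
    block_id: int
    pc_limit: int        # index of the last beat word in the block
    age: int             # smaller means older (hypothetical field)
    ready: bool = True
    pc_index: int = 0
    valid: bool = True


def pick_oldest_ready(gviq):
    candidates = [e for e in gviq if e.valid and e.ready]
    return min(candidates, key=lambda e: e.age) if candidates else None


def issue_one_beat(gviq, buffer):
    """One P1/I1 step: pick, fetch one beat word, advance, retire when done."""
    winner = pick_oldest_ready(gviq)
    if winner is None:
        return None
    beat_word = buffer[winner.block_id][winner.pc_index]
    winner.pc_index += 1
    if winner.pc_index > winner.pc_limit:
        winner.valid = False  # retire: last beat word has been issued
    return beat_word
```

Calling `issue_one_beat` once per cycle drains the oldest ready entry beat by beat before a younger entry is picked, which matches the single-client behaviour of the VEC ALU.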
+
+---
+
+## 5. Execution Pipeline (Hardware-Correct)
+
+### 5.1 Revised VTG Micro-Op Lifecycle
+
+```
+Cycle N+6:  D1/D2   — Decode + Tile RAT rename; microassembler generates beat-word sequence
+Cycle N+11: S1/S2   — GVIQ entry write; micro-instruction buffer populated
+Cycle N+12: P1       — GVIQ pick: select oldest-ready VTG micro-op
+Cycle N+13: I1       — TRegFile read request submitted (pending register)
+Cycle N+13..N+20:    — VEC prologue: SA/SB/SC fill over 8-cycle epoch (512 B/cy × 8)
+Cycle N+21: I2       — Issue confirm; SA/SB/SC staging populated; prologue done
+Cycle N+22: E1       — VTG sub-range selector: read 256/512 B from SA/SB at ALU input mux
+Cycle N+22..N+22+K: — VEC ALU: beat_word drives compute for K beats
+Cycle N+22+K+1: W1   — Group Write Adapter: read old tile (8 cy), merge VTG sub-range, write merged tile (8 cy)
+Cycle N+22+K+17:     — Writeback complete; VTG ready bit set
+```
+
+**Total VTG latency for a single-beat VTG op (K=1):**
+- Best case (well-aligned epoch): `8 (prologue) + 1 (compute) + 16 (RMW writeback) = 25 cy`
+- Worst case (misaligned epoch): `15 (prologue) + 1 + 16 = 32 cy`
+
+This replaces the v1.0 claim of 9 cycles.
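+
+As a rough generalization (an assumption, since only the `K=1` case is stated above), if the compute phase advances at one beat per cycle, the same breakdown extends to a `K`-beat micro-op as follows:
+
+```
+T_vtg(K) = T_prologue + K + T_rmw        // cycles
+         = (8..15)    + K + 16
+
+K = 1:  25..32 cy   // matches the best/worst case above
+K = 4:  28..35 cy   // hypothetical 4-beat strided op
+```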
+
+### 5.2 TRegFile Writeback: Full-Tile Read-Modify-Write
+
+The Group Write Adapter performs a **full-tile read-modify-write** for every VTG writeback:
+
+```
+Group Write Adapter (writeback):
+    // Step 1: Read the full current tile
+    TRegFile.submit_read_request(dst_ptag)          // occupies W0 for 8 cycles
+    wait 8 cycles
+    old_tile = TRegFile.read_data                   // 4 KB
+
+    // Step 2: Merge VTG result into the correct sub-range
+    if group_mode == G256:
+        start = group_id * 256
+        end   = start + 256
+    else:  # G512
+        start = group_id * 512
+        end   = start + 512
+    new_tile = old_tile
+    new_tile[start:end] = vtg_result               // merge sub-range
+
+    // Step 3: Write merged tile back
+    TRegFile.submit_write_request(dst_ptag, new_tile)  // occupies W0 for 8 cycles
+    wait 8 cycles
+    TRegFile.write_complete()
+
+    // Total: 16 cycles for the RMW cycle
+```
+
+**Implication:** VTG writeback ties up W0 for **16 cycles** (8 read + 8 write), which is the same write latency as a full-tile VEC op. VTG does not have a separate write port in v1.1 — it shares W0 with VEC-4K-v2.
+
+### 5.3 VTG Sub-Range Selection at ALU Input Mux
+
+After the prologue completes, SA/SB/SC contain the full 4 KB tile. The VTG sub-range selector reads the appropriate 256/512 B slice:
+
+```
+VTG Sub-Range Selector:
+    input: SA_full[4096 B], group_id, group_mode
+    if group_mode == G256:
+        vtg_A[256 B] = SA_full[group_id * 256 : (group_id+1) * 256]
+    else:  # G512
+        vtg_A[512 B] = SA_full[group_id * 512 : (group_id+1) * 512]
+    output: vtg_A → ALU operand mux (SA input)
+```
+
+This sub-range selection happens **in parallel with** the VEC ALU input muxing — it is a simple byte-range mux, not a separate pipeline stage. It does not add latency to the critical path.
+
+---
+
+## 6. VTG Micro-Instruction Families
+
+### 6.1 Instruction Syntax (unchanged from v1.0)
+
+```
+VINST.type  Td.gN, Ts0.gM, Ts1.gP, Tp.gQ
+```
+
+### 6.2 ALU Instructions
+
+| Instruction | Operation | VEC beat words |
+|------------|-----------|---------------|
+| `VADD` | `Td[i] = Tp[i] ? (Ts0[i] + Ts1[i]) : merge(Td[i])` | 1 beat: `alu_op=ADD, src_A=SA, src_B=SB, mask=SC` |
+| `VSUB` | `Td[i] = Tp[i] ? (Ts0[i] - Ts1[i]) : merge(Td[i])` | 1 beat: `alu_op=SUB` |
+| `VMUL` | Multiplication | 1 beat: `alu_op=MUL` |
+| `VMIN` | `min(Ts0, Ts1)` | 1 beat: `alu_op=MIN` |
+| `VMAX` | `max(Ts0, Ts1)` | 1 beat: `alu_op=MAX` |
+| `VABS` | `abs(Ts0)` | 1 beat: `alu_op=PASS_A` + post-processing |
+| `VNEG` | `-Ts0[i]` | 1 beat: `alu_op=PASS_A` + negate |
+
+### 6.3 Scalar-Broadcast ALU
+
+| Instruction | Operation | Notes |
+|------------|-----------|-------|
+| `VADDS` | `Td[i] = Ts[i] + scalar` | Scalar from SX/SY broadcast |
+| `VMULS` | `Td[i] = Ts[i] × scalar` | Scalar broadcast via SX/SY |
+
+The scalar operand comes from the scalar register file (via SX/SY staging) and is broadcast to all lanes by VEC's existing broadcast mechanism.
+
+### 6.4 Compare and Select
+
+| Instruction | Operation | VEC beat words |
+|------------|-----------|---------------|
+| `VCMP.{LT/...}` | Predicate result | 1 beat: `alu_op=CMP`, `wr_en_D0=0` |
+| `VSEL` | `Td = Tp ? Ts0 : Ts1` | 1 beat: `alu_op=SELECT, src_A=Ts0, src_B=Ts1` |
+| `VMERGE` | Merging-mode fill | 1 beat: `alu_op=PASS_A` (old dest + pred gate) |
+
+### 6.5 Memory Instructions
+
+| Instruction | Operation | VEC beat words |
+|------------|-----------|---------------|
+| `VLD` | Load 256/512 B into VTG | VTG microassembler expands to VEC-style strip-fill sequence |
+| `VST` | Store 256/512 B from VTG | VTG microassembler expands to VEC-style strip-drain sequence |
+| `VLDSTRIDE` | Strided load | Multiple beats with stride address calculation |
+| `VSTSTRIDE` | Strided store | Multiple beats |
+| `PGATHER` | Predicate gather | Multiple beats with gather address |
+
+**Inactive-lane fault suppression** is handled by the LSU checking the active-lane mask before address generation.
+
+### 6.6 Predicate Instructions
+
+| Instruction | Operation |
+|-------------|-----------|
+| `PLT` | `Tpd[i] = (i < iter0)` — loop counter predicate via SX/SY broadcast |
+| `PAND` | Predicate AND |
+| `POR` | Predicate OR |
+| `PXOR` | Predicate XOR |
+| `PNOT` | Predicate NOT |
+
+---
+
+## 7. GVIQ — Grouped Vector Issue Queue
+
+### 7.1 GVIQ Entry (unchanged from v1.0)
+
+```
+GVIQEntry {
+  valid:           1 b
+  block_id:       12 b
+  pc_index:        8 b
+  tile_group:      5 b   // architectural tile T0..T31
+  phys_tile:       8 b   // physical tile PT0..PT255
+  group_id:        4 b   // 0..15 (G256) or 0..7 (G512)
+  group_mode:      1 b
+  thread_id:       8 b
+  iter0..iter3:   4×16 b
+  active_lanes:   16 b
+  active_group_mask: 16 b
+  src0_ptag:       8 b
+  src1_ptag:       8 b
+  src2_ptag:       8 b
+  pred_ptag:       8 b
+  dst_ptag:        8 b
+  has_dst:         1 b
+  src_ready:       4 b
+  vtg_ready:       1 b
+  branch_tag:      3 b
+}
+```
+
+### 7.2 VTG Wakeup
+
+VTG readiness is tracked by the **VTG Ready Table** (256-bit bitmap, one bit per physical tile). When a VTG micro-op writes back:
+1. Group Write Adapter performs full-tile RMW (16 cy)
+2. On writeback completion, `VTG_Ready_Table[dst_ptag] = 1`
+3. GVIQ entries waiting on this tile set `src_ready`
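+
+A minimal sketch of this wakeup sequence, in the same pseudocode style as the rest of this document. The mapping of the four `src_ready` bits to `src0`/`src1`/`src2`/`pred` is an assumption made for illustration; the entry format in §7.1 does not spell it out:
+
+```
+on rmw_writeback_complete(dst_ptag):           // end of the 16 cy RMW
+    VTG_Ready_Table[dst_ptag] = 1
+    for entry in gviq where entry.valid:
+        if entry.src0_ptag == dst_ptag: entry.src_ready[0] = 1   // assumed bit order
+        if entry.src1_ptag == dst_ptag: entry.src_ready[1] = 1
+        if entry.src2_ptag == dst_ptag: entry.src_ready[2] = 1
+        if entry.pred_ptag == dst_ptag: entry.src_ready[3] = 1
+    // an entry becomes eligible under GVIQ-2 once all of its needed src_ready bits are set
+```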
+
+### 7.3 Issue Rules
+
+| Rule | Description |
+|------|-------------|
+| GVIQ-1 | `pc_index <= pc_limit` for the given `block_id` |
+| GVIQ-2 | All source VTG `src_ready` bits set |
+| GVIQ-3 | Active loop counter (`iter*`) non-zero |
+| GVIQ-4 | GVIQ is 1-wide: one VTG micro-op per cycle |
+| GVIQ-5 | VEC-4K-v2 ALU is single-ported: VTG competes with Vector RS for ALU access |
+| GVIQ-6 | VTG competes with Vector RS for TRegFile ports (R0, R4, W0) |
+| GVIQ-7 | Paired `G256` issue (optional): two VTGs with matching beat_word share one VEC beat cycle |
+
+---
+
+## 8. VTG Metadata Table
+
+### 8.1 Metadata Structure
+
+The VTG Metadata Table overlays the existing **Tile Metadata RAT** (`Davinci_superscalar_v2.md` §6.1). Each physical tile's metadata entry is extended with VTG-specific fields:
+
+```
+TileMetadataEntry (extended, 36 + 99 = 135 b per physical tile):
+  // From Tile Metadata RAT:
+  shape.x:    14 b   // columns C
+  shape.y:    14 b   // rows R
+  format:      4 b   // FP32/FP16/FP8/FP4
+  flags:       4 b   // arg_tile, scalar_tile, prefetch_hint
+
+  // VTG additions (overlaid or extending):
+  group_mode:  1 b   // G256=0, G512=1
+  pred_granule: 2 b  // 8/16/32-bit lane grouping
+  // The following are per-VTG (16 entries per tile):
+  vtg_meta[16]: {
+    valid:       1 b
+    defined:     1 b
+    dirty:       1 b
+    kind:        3 b   // VEC | PRED | WIDE_LO | WIDE_HI | SCRATCH | UNDEF
+  }
+```
+
+**Note (v1.1):** `elem_type` is NOT duplicated — it is the same 4-bit `format` field from the Tile Metadata RAT. `active_bytes` is computed from `shape.x × shape.y × E` and the VTG's position in the tile.
+
+### 8.2 VTG Byte Mapping (unchanged)
+
+`G256` (16 VTGs, 256 B each):
+
+| VTG | Byte range |
+|-----|-----------|
+| `g0` | `[0, 255]` |
+| `g1` | `[256, 511]` |
+| ... | ... |
+| `g15` | `[3840, 4095]` |
+
+`G512` (8 VTGs, 512 B each):
+
+| VTG | Byte range |
+|-----|-----------|
+| `g0` | `[0, 511]` |
+| `g1` | `[512, 1023]` |
+| ... | ... |
+| `g7` | `[3584, 4095]` |
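+
+Both tables follow the same closed form. As a compact restatement (the names `k` and `S` are introduced here only for illustration):
+
+```
+VTG g<k>, group size S = 256 (G256) or 512 (G512), 0 <= k < 4096 / S:
+    byte range = [k * S, (k + 1) * S - 1]
+
+G256, k = 15:  [3840, 4095]
+G512, k = 7:   [3584, 4095]
+```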
+
+---
+
+## 9. Integration with Davinci_superscalar_v2.md
+
+### 9.1 What Sections Need Updates
+
+| Section | Update Required |
+|---------|----------------|
+| §1 Key Parameters | Add VTG parameters; note VTG reuses VEC-4K-v2's R0/R4 ports |
+| §2.2.6 VTG Micro-Instr | Update with hardware-correct lifecycle and beat-word format |
+| §3 Block Diagram | VTG sub-range selector shown between SA/SB and ALU input mux |
+| §7.4 GVIQ | Update with prologue model and VEC-domain arbitration |
+| §8.3.10 VTG | Update with staging model, prologue timing, RMW writeback |
+| §9.2.5 VTG | Update metadata to overlay Tile Metadata RAT (not separate table) |
+| §10.5.1 VTG dependency | Update with VTG Ready Table and RMW writeback latency |
+| §12.5.1 VTG memory | Update with prologue timing and RMW writeback |
+
+### 9.2 Key Corrections to Apply
+
+1. **VEC staging reuse**: VTG operates **behind** SA/SB/SC, not as a separate path. Show VTG sub-range selector between staging and ALU.
+2. **Prologue model**: VTG latency starts with `T_fetch` (8–15 cy), not cycle I1.
+3. **RMW writeback**: Group Write Adapter = full-tile read (8 cy) + merge + full-tile write (8 cy) = **16 cy minimum**.
+4. **Metadata overlay**: VTG metadata fields (`group_mode`, `pred_granule`, VTG validity) extend the Tile Metadata RAT entry, not a separate table.
+5. **elem_type = format**: Remove duplicate `elem_type` field; use `format` from Tile Metadata RAT.
+
+---
+
+## 10. Key Parameters (v1.1)
+
+| Parameter | Value | Notes |
+|-----------|-------|-------|
+| VTG modes | `G256` (16×256 B VTGs/tile) · `G512` (8×512 B VTGs/tile) | |
+| GVIQ depth | 32 entries, 1-wide issue | |
+| Micro-instruction buffer | 16 entries, 2-way; stores **VEC beat-word sequences** | Each entry = up to 64 × ~50 b = ~3.2 Kb |
+| VTG beat words per micro-op | 1–N (1 for elementwise, N for strip-strided) | |
+| VTG latency (best case) | **25 cy** (8 prologue + 1 compute + 16 RMW) | |
+| VTG latency (worst case) | **32 cy** (15 prologue + 1 + 16) | |
+| TRegFile ports used | R0 + R4 (reads, shared with VEC-4K-v2 prologue) + W0 (writeback, shared) | |
+| VTG sub-range selection | At ALU input mux, after prologue completes | No extra pipeline stage |
+| VTG Ready Table | 256-bit bitmap | Same as scalar Ready Table |
+| Metadata | Tile Metadata RAT extended with VTG fields | No separate VTG Metadata Table |
+
+---
+
+## 11. Open Questions (remaining after v1.1)
+
+| ID | Question | Priority |
+|----|----------|----------|
+| OQ-A | Does VTG need a dedicated write port (W6/W7) to avoid blocking VEC-4K-v2's W0 during VTG RMW? | High |
+| OQ-B | What is the arbitration priority between Vector RS and GVIQ for VEC ALU access? | High |
+| OQ-C | How does VTG interact with VEC's accumulator (256×32 b ping-pong)? Can VTG produce accumulator results? | Medium |
+| OQ-D | For VTG memory ops (VLD/VST), does VTG share the LSU pipeline with MTE, or does it have its own LSU path? | Medium |
+| OQ-E | Should paired G256 issue be v1 or deferred? | Medium |
+| OQ-F | What is the exact beat-word encoding for each VTG opcode? Requires enumerating all 38 V* × all format × all predicate_mode combinations. | High |
+| OQ-G | Does VTG support the full VEC beat-word set (including `acc_op`, `shuffle`, `CAS` for TMRGSORT)? | Medium |
+
+---
+
+## Appendix A: VTG Lifecycle Comparison (v1.0 vs v1.1)
+
+| Dimension | v1.0 (incorrect) | v1.1 (hardware-correct) |
+|-----------|---------------------|--------------------------|
+| TRegFile ports | Independent R0/R4 | Shared with VEC-4K-v2 |
+| TRegFile reads | "Immediate" at I1 | 8-cycle epoch prologue |
+| TRegFile writes | Direct `write_vtg(dst_ptag, group_id, result)` | Full-tile RMW (16 cy minimum) |
+| Staging | VTG output to SA/SB (new 256/512 B) | VTG reads from SA/SB (4 KB) at ALU mux |
+| Latency claimed | 9 cycles | 25–32 cycles minimum |
+| Micro-instruction format | `MicroOpEntry {opcode, elem_type, pred_mode}` | Pre-decoded `VECBeatWord` sequence |
+| Beat-level control | Not specified | Each VTG micro-op = 1–N beat words from microassembler |
+| Metadata | Separate VTG Metadata Table (16 entries/tile) | Overlaid on Tile Metadata RAT |
+| Port arbitration | "Arbitrate via existing pipeline" | VEC-domain arbiter: Vector RS > GVIQ > MTE RS |
diff --git "a/designs/outerCube/FMA\346\214\207\344\273\244\345\234\272\346\231\257\350\257\264\346\230\216.md" "b/designs/outerCube/FMA\346\214\207\344\273\244\345\234\272\346\231\257\350\257\264\346\230\216.md"
new file mode 100644
index 00000000..c1b2b78c
--- /dev/null
+++ "b/designs/outerCube/FMA\346\214\207\344\273\244\345\234\272\346\231\257\350\257\264\346\230\216.md"
@@ -0,0 +1,94 @@
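+# Problem Statement
+The PTO ISA currently includes 1-source instructions (such as TABS) and 2-source instructions (such as TADD and TMUL).
+The vector4k_v2 document describes a TFMA-style instruction, `TFMA_ACC`: `D = A*B + Acc`, where `Acc` comes from a staging register rather than being read from a tile register (treg), so it is effectively a 2-source instruction.
+
+
+A true 3-source instruction is therefore missing, for example FMA:
+$$dst_{i,j} = src0_{i,j}*src1_{i,j} + src2_{i,j}$$
+
+
+FMA currently appears in two kinds of scenarios:
+1. reduce_sum-style scenarios, which vector4k_v2 already describes.
+2. Scenarios that improve numerical precision and throughput, such as the FMA optimization points in LayerNorm.
+
+Compared with separate multiply and add instructions, FMA has two decisive advantages:
+1. **Doubled throughput**: fusing two independent instructions into one greatly improves arithmetic logic unit (ALU) utilization.
+2. **Better precision**: the intermediate product $A \times B$ is kept at full, unrounded precision, and rounding happens only once, after $C$ is added, which avoids the precision loss of rounding twice.
+
+In real networks and operators, FMA shows up in normalization layers (LayerNorm/RMSNorm), activation functions (GELU, SwiGLU, and so on), and trigonometric functions (sin, cos).
+
+
+---
+# Requirements
+Supporting a 3-source FMA instruction in the vector4k architecture would require the following changes:
+1. ISA extension: add a TFMA instruction: $$dst_{i,j} = src0_{i,j}*src1_{i,j} + src2_{i,j}$$
+2. To saturate TFMA compute, the TRegFile must provide 3 × 512 B/cycle of read bandwidth; today the TRegFile exposes only 2 read ports to vector4k.
+
+Should vector4k's TRegFile read bandwidth be increased for 3-source instructions such as FMA?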
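+
+A rough estimate makes the gap concrete. This is a sketch that assumes each of the two existing read ports delivers 512 B/cycle, which the requirement above implies but does not state outright:
+$$BW_{req} = 3 \times 512 = 1536\ \text{B/cycle},\quad BW_{avail} = 2 \times 512 = 1024\ \text{B/cycle},\quad \frac{BW_{avail}}{BW_{req}} = \frac{2}{3}$$
+Under that assumption, a stream of back-to-back 3-source TFMA operations could sustain at most about two thirds of peak ALU throughput unless read bandwidth is increased.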
+
+
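+# Concrete Scenario
+LayerNorm is used as an example to illustrate where FMA instructions are used:
+
+# FMA in LayerNorm
+
+
+---
+
+## 1. Welford Local State Update
+
+While iterating over each element $x_i$ of the input vector, a single thread updates its local mean ($\mu$) and sum of squared deviations ($M_2$) on the fly.
+
+### Optimization 1: Mean Update
+The original formula is:
+$$\mu_{new} = \mu_{old} + \frac{x_i - \mu_{old}}{n}$$
+
+**Engineering transformation:** Hardware division has very high latency on both GPUs and CPUs. The usual approach is to precompute the reciprocal of the count, $\text{inv\_n} = 1.0 / n$, define $\delta = x_i - \mu_{old}$, and turn the division into a multiplication:
+$$\mu_{new} = \delta \times \text{inv\_n} + \mu_{old}$$
+This matches the FMA form exactly: $\mathrm{fmaf}(\delta, \text{inv\_n}, \mu_{old})$.
+
+### Optimization 2: Sum of Squared Deviations Update (M2 Update)
+With $\delta_2 = x_i - \mu_{new}$, the original formula is:
+$$M2_{new} = M2_{old} + \delta \times \delta_2$$
+
+**Engineering transformation:** This is already a multiply-add, so it maps directly to the FMA instruction $\mathrm{fmaf}(\delta, \delta_2, M2_{old})$. Because this step accumulates very small second-order terms, using FMA can significantly reduce the risk of catastrophic cancellation.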
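+
+As a quick sanity check of the two updates above, here is one worked step with made-up inputs (the values $x_1 = 1$ and $x_2 = 3$ are chosen only for illustration). After the first element the state is $\mu_{old} = 1$, $M2_{old} = 0$; processing $x_2 = 3$ with $n = 2$ gives:
+$$\delta = 3 - 1 = 2,\quad \text{inv\_n} = 0.5,\quad \mu_{new} = 2 \times 0.5 + 1 = 2$$
+$$\delta_2 = 3 - 2 = 1,\quad M2_{new} = 0 + 2 \times 1 = 2$$
+Both results agree with the direct computation: the mean of $\{1, 3\}$ is 2 and the sum of squared deviations is $(1-2)^2 + (3-2)^2 = 2$. Each element therefore costs two subtractions and two FMAs.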
+
+---
+
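+## 2. Welford State Merge (Reduction)
+
+For warp/block-level reduction on a GPU, or when merging across threads on a CPU, two independent Welford states (state A and state B) must be merged. Let the merged count be $n = n_A + n_B$ and the difference of means be $\delta = \mu_B - \mu_A$.
+
+As before, $inv\_n = 1.0 / n$ is precomputed to avoid division.
+
+### Optimization 3: Merged Mean
+The original computation is:
+$$\mu = \mu_A + \delta \times \frac{n_B}{n}$$
+
+**Engineering transformation:** Factor out the multiplier $\text{factor} = n_B \times inv\_n$.
+The formula becomes:
+$$\mu = \delta \times \text{factor} + \mu_A$$
+This maps to FMA: $\mathrm{fmaf}(\delta, \text{factor}, \mu_A)$.
+
+### Optimization 4: Merged Sum of Squared Deviations (Merged M2)
+The original computation is:
+$$M2 = M2_A + M2_B + \delta^2 \times \frac{n_A \times n_B}{n}$$
+
+**Engineering transformation:** Factor out the combined factor $\text{factor\_m2} = (n_A \times n_B) \times inv\_n$.
+The formula becomes:
+$$M2 = M2_A + \delta \times (\delta \times \text{factor\_m2}) + M2_B$$
+This can be viewed as one core FMA that accumulates onto $M2_B$, $\mathrm{fmaf}(\delta \times \text{factor\_m2}, \delta, M2_B)$, followed by a plain addition with $M2_A$.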
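+
+To check the merge formulas with illustrative numbers (not taken from any real workload), merge state A built from $\{1, 3\}$, so $n_A = 2$, $\mu_A = 2$, $M2_A = 2$, with state B built from $\{5\}$, so $n_B = 1$, $\mu_B = 5$, $M2_B = 0$:
+$$n = 3,\quad \delta = 5 - 2 = 3,\quad inv\_n = \tfrac{1}{3}$$
+$$\text{factor} = 1 \times \tfrac{1}{3} = \tfrac{1}{3},\quad \mu = 3 \times \tfrac{1}{3} + 2 = 3$$
+$$\text{factor\_m2} = (2 \times 1) \times \tfrac{1}{3} = \tfrac{2}{3},\quad M2 = 2 + 3 \times \left(3 \times \tfrac{2}{3}\right) + 0 = 8$$
+This matches the direct result for $\{1, 3, 5\}$: the mean is 3 and the sum of squared deviations is $(-2)^2 + 0^2 + 2^2 = 8$.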
+
+---
+
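+## 3. Affine Transform (Scale and Shift)
+
+This is the final step of LayerNorm, and also the most compute-dense and memory-intensive one. The normalized result $\hat{x}$ is multiplied by the learnable scale parameter $\gamma$, and the shift parameter $\beta$ is added.
+
+### Optimization 5: Final Write-Back (Final Affine Transform)
+The original formula is:
+$$y_i = \gamma_i \times \hat{x}_i + \beta_i$$
+
+**Engineering transformation:** This is the most classic and purest FMA scenario in deep learning. Whether in CUDA or in C++ with AVX/NEON, this step **must** be compiled to a single FMA instruction.
+This maps to: $\mathrm{fmaf}(\gamma_i, \hat{x}_i, \beta_i)$.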
+
diff --git a/designs/outerCube/PTOISA/MGATHER.md b/designs/outerCube/PTOISA/MGATHER.md
new file mode 100644
index 00000000..e092860b
--- /dev/null
+++ b/designs/outerCube/PTOISA/MGATHER.md
@@ -0,0 +1,100 @@
+# MGATHER
+
+
+## Tile Operation Diagram
+
+![MGATHER tile operation](../figures/isa/MGATHER.svg)
+
+## Introduction
+
+Gather-load elements from global memory into a tile using per-element indices.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{mem}[\mathrm{idx}_{i,j}] $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+-> !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDst, typename GlobalData, typename TileInd, typename... WaitEvents>
+PTO_INST RecordEvent MGATHER(TileDst &dst, GlobalData &src, TileInd &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Supported data types**:
+    - `dst`/`src` element type must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
+    - `indexes` element type must be `int32_t` or `uint32_t`.
+- **Tile and memory types**:
+    - `dst` must be a vector tile (`TileType::Vec`).
+    - `indexes` must be a vector tile (`TileType::Vec`).
+    - `dst` and `indexes` must use row-major layout.
+    - `src` must be a `GlobalTensor` in GM memory.
+    - `src` must use `ND` layout.
+- **Shape constraints**:
+    - `dst.Rows == indexes.Rows`.
+    - `indexes` must be shaped as `[N, 1]` for row-indexed gather or `[N, M]` for element-indexed gather.
+    - `dst` row width must be 32-byte aligned, that is, `dst.Cols * sizeof(DType)` must be a multiple of 32.
+    - `src` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
+- **Index interpretation**:
+    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `src.data()`.
+    - The CPU simulator does not enforce bounds checks on `indexes`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+```
+
+### PTO Assembly Form
+
+```text
+%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/MGATHER_zh.md b/designs/outerCube/PTOISA/MGATHER_zh.md
new file mode 100644
index 00000000..1b0ba2a3
--- /dev/null
+++ b/designs/outerCube/PTOISA/MGATHER_zh.md
@@ -0,0 +1,100 @@
+# MGATHER
+
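+## Tile Operation Diagram
+
+![MGATHER tile operation](../figures/isa/MGATHER.svg)
+
+## Introduction
+
+Gather-load elements from global memory into a tile using per-element indices.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{mem}[\mathrm{idx}_{i,j}] $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form: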
+
+```text
+%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+-> !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDst, typename GlobalData, typename TileInd, typename... WaitEvents>
+PTO_INST RecordEvent MGATHER(TileDst &dst, GlobalData &src, TileInd &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Supported data types**:
+    - `dst`/`src` element type must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
+    - `indexes` element type must be `int32_t` or `uint32_t`.
+- **Tile and memory types**:
+    - `dst` must be a vector tile (`TileType::Vec`).
+    - `indexes` must be a vector tile (`TileType::Vec`).
+    - `dst` and `indexes` must use row-major layout.
+    - `src` must be a `GlobalTensor` in GM memory.
+    - `src` must use `ND` layout.
+- **Shape constraints**:
+    - `dst.Rows == indexes.Rows`.
+    - `indexes` must be shaped as `[N, 1]` for row-indexed gather or `[N, M]` for element-indexed gather.
+    - `dst` row width must be 32-byte aligned, that is, `dst.Cols * sizeof(DType)` must be a multiple of 32.
+    - `src` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
+- **Index interpretation**:
+    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `src.data()`.
+    - The CPU simulator does not enforce bounds checks on `indexes`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, !pto.tile<...>)
+```
+
+### PTO Assembly Form
+
+```text
+%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/MSCATTER.md b/designs/outerCube/PTOISA/MSCATTER.md
new file mode 100644
index 00000000..f7b34223
--- /dev/null
+++ b/designs/outerCube/PTOISA/MSCATTER.md
@@ -0,0 +1,105 @@
+# MSCATTER
+
+
+## Tile Operation Diagram
+
+![MSCATTER tile operation](../figures/isa/MSCATTER.svg)
+
+## Introduction
+
+Scatter-store elements from a tile into global memory using per-element indices.
+
+## Math Interpretation
+
+For each element `(i, j)` in the source valid region:
+
+$$ \mathrm{mem}[\mathrm{idx}_{i,j}] = \mathrm{src}_{i,j} $$
+
+If multiple elements map to the same destination location, the final value is implementation-defined (CPU simulator: last writer wins in row-major iteration order).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename GlobalData, typename TileSrc, typename TileInd, typename... WaitEvents>
+PTO_INST RecordEvent MSCATTER(GlobalData &dst, TileSrc &src, TileInd &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Supported data types**:
+    - `src`/`dst` element type must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
+    - `indexes` element type must be `int32_t` or `uint32_t`.
+- **Tile and memory types**:
+    - `src` must be a vector tile (`TileType::Vec`).
+    - `indexes` must be a vector tile (`TileType::Vec`).
+    - `src` and `indexes` must use row-major layout.
+    - `dst` must be a `GlobalTensor` in GM memory.
+    - `dst` must use `ND` layout.
+- **Atomic operation constraints**:
+    - Non-atomic scatter is supported for all supported element types.
+    - `Add` atomic mode requires `int32_t`, `uint32_t`, `float`, or `half`.
+    - `Max`/`Min` atomic mode requires `int32_t` or `float`.
+- **Shape constraints**:
+    - `src.Rows == indexes.Rows`.
+    - `indexes` must be shaped as `[N, 1]` for row-indexed scatter or `[N, M]` for element-indexed scatter.
+    - `src` row width must be 32-byte aligned, that is, `src.Cols * sizeof(DType)` must be a multiple of 32.
+    - `dst` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
+- **Index interpretation**:
+    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `dst.data()`.
+    - The CPU simulator does not enforce bounds checks on `indexes`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### PTO Assembly Form
+
+```text
+mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
diff --git a/designs/outerCube/PTOISA/MSCATTER_zh.md b/designs/outerCube/PTOISA/MSCATTER_zh.md
new file mode 100644
index 00000000..c0d9efdf
--- /dev/null
+++ b/designs/outerCube/PTOISA/MSCATTER_zh.md
@@ -0,0 +1,105 @@
+# MSCATTER
+
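+## Tile Operation Diagram
+
+![MSCATTER tile operation](../figures/isa/MSCATTER.svg)
+
+## Introduction
+
+Scatter-store elements from a tile into global memory using per-element indices.
+
+## Math Interpretation
+
+For each element `(i, j)` in the source valid region:
+
+$$ \mathrm{mem}[\mathrm{idx}_{i,j}] = \mathrm{src}_{i,j} $$
+
+If multiple elements map to the same destination location, the final value is implementation-defined (CPU simulator: last writer wins in row-major iteration order).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form: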
+
+```text
+mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename GlobalData, typename TileSrc, typename TileInd, typename... WaitEvents>
+PTO_INST RecordEvent MSCATTER(GlobalData &dst, TileSrc &src, TileInd &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Supported data types**:
+    - `src`/`dst` element type must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - On AICore targets, `float8_e4m3_t` and `float8_e5m2_t` are also supported.
+    - `indexes` element type must be `int32_t` or `uint32_t`.
+- **Tile and memory types**:
+    - `src` must be a vector tile (`TileType::Vec`).
+    - `indexes` must be a vector tile (`TileType::Vec`).
+    - `src` and `indexes` must use row-major layout.
+    - `dst` must be a `GlobalTensor` in GM memory.
+    - `dst` must use `ND` layout.
+- **Atomic operation constraints**:
+    - Non-atomic scatter is supported for all supported element types.
+    - `Add` atomic mode requires `int32_t`, `uint32_t`, `float`, or `half`.
+    - `Max`/`Min` atomic mode requires `int32_t` or `float`.
+- **Shape constraints**:
+    - `src.Rows == indexes.Rows`.
+    - `indexes` must be shaped as `[N, 1]` for row-indexed scatter or `[N, M]` for element-indexed scatter.
+    - `src` row width must be 32-byte aligned, that is, `src.Cols * sizeof(DType)` must be a multiple of 32.
+    - `dst` static shape must satisfy `Shape<1, 1, 1, TableRows, RowWidth>`.
+- **Index interpretation**:
+    - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into `dst.data()`.
+    - The CPU simulator does not enforce bounds checks on `indexes`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### PTO Assembly Form
+
+```text
+mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
diff --git a/designs/outerCube/PTOISA/PTOISA.md b/designs/outerCube/PTOISA/PTOISA.md
new file mode 100644
index 00000000..75fd1802
--- /dev/null
+++ b/designs/outerCube/PTOISA/PTOISA.md
@@ -0,0 +1,152 @@
+# PTO ISA Overview
+
+This page is the source-synchronized ISA index generated from `docs/isa/manifest.yaml`.
+
+## Docs Contents
+
+| Area | Page | Description |
+|---|---|---|
+| Overview | [`docs/README.md`](README.md) | PTO ISA guide entry point and navigation. |
+| Overview | [`docs/PTOISA.md`](PTOISA.md) | This page (overview + full instruction index). |
+| ISA reference | [`docs/isa/README.md`](isa/README.md) | Per-instruction reference directory index. |
+| ISA reference | [`docs/isa/conventions.md`](isa/conventions.md) | Shared notation, operands, events, and modifiers. |
+| Assembly (PTO-AS) | [`docs/assembly/PTO-AS.md`](assembly/PTO-AS.md) | PTO-AS syntax reference. |
+| Source of truth | [`include/pto/common/pto_instr.hpp`](reference/pto-intrinsics-header.md) | C++ intrinsic API (authoritative). |
+| PTO Auto Mode | [`docs/auto_mode/README.md`](auto_mode/README.md) | PTO auto mode guide entry point. |
+
+## Instruction Index (All PTO Instructions)
+
+| Category | Instruction | Description |
+|---|---|---|
+| Synchronization | [`TSYNC`](isa/TSYNC.md) | Synchronize PTO execution (wait on events or insert a per-op pipeline barrier). |
+| Manual / Resource Binding | [`TASSIGN`](isa/TASSIGN.md) | Bind a Tile object to an implementation-defined on-chip address (manual placement). |
+| Manual / Resource Binding | [`TSETFMATRIX`](isa/TSETFMATRIX.md) | Set FMATRIX register(s) for IMG2COL-like ops. |
+| Manual / Resource Binding | [`TSET_IMG2COL_RPT`](isa/TSET_IMG2COL_RPT.md) | Set IMG2COL repeat metadata from an IMG2COL configuration tile. |
+| Manual / Resource Binding | [`TSET_IMG2COL_PADDING`](isa/TSET_IMG2COL_PADDING.md) | Set IMG2COL padding metadata from an IMG2COL configuration tile. |
+| Elementwise (Tile-Tile) | [`TADD`](isa/TADD.md) | Elementwise add of two tiles. |
+| Elementwise (Tile-Tile) | [`TABS`](isa/TABS.md) | Elementwise absolute value of a tile. |
+| Elementwise (Tile-Tile) | [`TAND`](isa/TAND.md) | Elementwise bitwise AND of two tiles. |
+| Elementwise (Tile-Tile) | [`TOR`](isa/TOR.md) | Elementwise bitwise OR of two tiles. |
+| Elementwise (Tile-Tile) | [`TSUB`](isa/TSUB.md) | Elementwise subtract of two tiles. |
+| Elementwise (Tile-Tile) | [`TMUL`](isa/TMUL.md) | Elementwise multiply of two tiles. |
+| Elementwise (Tile-Tile) | [`TMIN`](isa/TMIN.md) | Elementwise minimum of two tiles. |
+| Elementwise (Tile-Tile) | [`TMAX`](isa/TMAX.md) | Elementwise maximum of two tiles. |
+| Elementwise (Tile-Tile) | [`TCMP`](isa/TCMP.md) | Compare two tiles and write a packed predicate mask. |
+| Elementwise (Tile-Tile) | [`TDIV`](isa/TDIV.md) | Elementwise division of two tiles. |
+| Elementwise (Tile-Tile) | [`TSHL`](isa/TSHL.md) | Elementwise shift-left of two tiles. |
+| Elementwise (Tile-Tile) | [`TSHR`](isa/TSHR.md) | Elementwise shift-right of two tiles. |
+| Elementwise (Tile-Tile) | [`TXOR`](isa/TXOR.md) | Elementwise bitwise XOR of two tiles. |
+| Elementwise (Tile-Tile) | [`TLOG`](isa/TLOG.md) | Elementwise natural logarithm of a tile. |
+| Elementwise (Tile-Tile) | [`TRECIP`](isa/TRECIP.md) | Elementwise reciprocal of a tile. |
+| Elementwise (Tile-Tile) | [`TPRELU`](isa/TPRELU.md) | Elementwise PReLU (parametric ReLU) with a per-element slope tile. |
+| Elementwise (Tile-Tile) | [`TADDC`](isa/TADDC.md) | Elementwise ternary add: `src0 + src1 + src2`. |
+| Elementwise (Tile-Tile) | [`TSUBC`](isa/TSUBC.md) | Elementwise ternary op: `src0 - src1 + src2`. |
+| Elementwise (Tile-Tile) | [`TCVT`](isa/TCVT.md) | Elementwise type conversion with a specified rounding mode. |
+| Elementwise (Tile-Tile) | [`TSEL`](isa/TSEL.md) | Select between two tiles using a mask tile (per-element selection). |
+| Elementwise (Tile-Tile) | [`TRSQRT`](isa/TRSQRT.md) | Elementwise reciprocal square root. |
+| Elementwise (Tile-Tile) | [`TSQRT`](isa/TSQRT.md) | Elementwise square root. |
+| Elementwise (Tile-Tile) | [`TEXP`](isa/TEXP.md) | Elementwise exponential. |
+| Elementwise (Tile-Tile) | [`TNOT`](isa/TNOT.md) | Elementwise bitwise NOT of a tile. |
+| Elementwise (Tile-Tile) | [`TRELU`](isa/TRELU.md) | Elementwise ReLU of a tile. |
+| Elementwise (Tile-Tile) | [`TNEG`](isa/TNEG.md) | Elementwise negation of a tile. |
+| Elementwise (Tile-Tile) | [`TREM`](isa/TREM.md) | Elementwise remainder of two tiles. |
+| Elementwise (Tile-Tile) | [`TFMOD`](isa/TFMOD.md) | Elementwise fmod of two tiles. |
+| Tile-Scalar / Tile-Immediate | [`TEXPANDS`](isa/TEXPANDS.md) | Broadcast a scalar into a destination tile. |
+| Tile-Scalar / Tile-Immediate | [`TCMPS`](isa/TCMPS.md) | Compare a tile against a scalar and write per-element comparison results. |
+| Tile-Scalar / Tile-Immediate | [`TSELS`](isa/TSELS.md) | Select between a source tile and a scalar using a mask tile (per-element selection over the source tile). |
+| Tile-Scalar / Tile-Immediate | [`TMINS`](isa/TMINS.md) | Elementwise minimum of a tile and a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TADDS`](isa/TADDS.md) | Elementwise add a scalar to a tile. |
+| Tile-Scalar / Tile-Immediate | [`TSUBS`](isa/TSUBS.md) | Elementwise subtract a scalar from a tile. |
+| Tile-Scalar / Tile-Immediate | [`TDIVS`](isa/TDIVS.md) | Elementwise division with a scalar (tile/scalar or scalar/tile). |
+| Tile-Scalar / Tile-Immediate | [`TMULS`](isa/TMULS.md) | Elementwise multiply a tile by a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TFMODS`](isa/TFMODS.md) | Elementwise remainder with a scalar: `fmod(src, scalar)`. |
+| Tile-Scalar / Tile-Immediate | [`TREMS`](isa/TREMS.md) | Elementwise remainder with a scalar: `remainder(src, scalar)`. |
+| Tile-Scalar / Tile-Immediate | [`TMAXS`](isa/TMAXS.md) | Elementwise max of a tile and a scalar: `max(src, scalar)`. |
+| Tile-Scalar / Tile-Immediate | [`TANDS`](isa/TANDS.md) | Elementwise bitwise AND of a tile and a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TORS`](isa/TORS.md) | Elementwise bitwise OR of a tile and a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TSHLS`](isa/TSHLS.md) | Elementwise shift-left a tile by a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TSHRS`](isa/TSHRS.md) | Elementwise shift-right a tile by a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TXORS`](isa/TXORS.md) | Elementwise bitwise XOR of a tile and a scalar. |
+| Tile-Scalar / Tile-Immediate | [`TLRELU`](isa/TLRELU.md) | Leaky ReLU with a scalar slope. |
+| Tile-Scalar / Tile-Immediate | [`TADDSC`](isa/TADDSC.md) | Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`. |
+| Tile-Scalar / Tile-Immediate | [`TSUBSC`](isa/TSUBSC.md) | Elementwise fused op: `src0 - scalar + src1`. |
+| Axis Reduce / Expand | [`TROWSUM`](isa/TROWSUM.md) | Reduce each row by summing across columns. |
+| Axis Reduce / Expand | [`TROWPROD`](isa/TROWPROD.md) | Reduce each row by multiplying across columns. |
+| Axis Reduce / Expand | [`TCOLSUM`](isa/TCOLSUM.md) | Reduce each column by summing across rows. |
+| Axis Reduce / Expand | [`TCOLPROD`](isa/TCOLPROD.md) | Reduce each column by multiplying across rows. |
+| Axis Reduce / Expand | [`TCOLMAX`](isa/TCOLMAX.md) | Reduce each column by taking the maximum across rows. |
+| Axis Reduce / Expand | [`TROWMAX`](isa/TROWMAX.md) | Reduce each row by taking the maximum across columns. |
+| Axis Reduce / Expand | [`TROWMIN`](isa/TROWMIN.md) | Reduce each row by taking the minimum across columns. |
+| Axis Reduce / Expand | [`TROWARGMAX`](isa/TROWARGMAX.md) | Get the column index of the maximum element for each row. |
+| Axis Reduce / Expand | [`TROWARGMIN`](isa/TROWARGMIN.md) | Get the column index of the minimum element for each row. |
+| Axis Reduce / Expand | [`TCOLARGMAX`](isa/TCOLARGMAX.md) | Get the row index of the maximum element for each column. |
+| Axis Reduce / Expand | [`TCOLARGMIN`](isa/TCOLARGMIN.md) | Get the row index of the minimum element for each column. |
+| Axis Reduce / Expand | [`TROWEXPAND`](isa/TROWEXPAND.md) | Broadcast the first element of each source row across the destination row. |
+| Axis Reduce / Expand | [`TROWEXPANDDIV`](isa/TROWEXPANDDIV.md) | Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`. |
+| Axis Reduce / Expand | [`TROWEXPANDMUL`](isa/TROWEXPANDMUL.md) | Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`. |
+| Axis Reduce / Expand | [`TROWEXPANDSUB`](isa/TROWEXPANDSUB.md) | Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`. |
+| Axis Reduce / Expand | [`TROWEXPANDADD`](isa/TROWEXPANDADD.md) | Row-wise broadcast add: add a per-row scalar vector. |
+| Axis Reduce / Expand | [`TROWEXPANDMAX`](isa/TROWEXPANDMAX.md) | Row-wise broadcast max with a per-row scalar vector. |
+| Axis Reduce / Expand | [`TROWEXPANDMIN`](isa/TROWEXPANDMIN.md) | Row-wise broadcast min with a per-row scalar vector. |
+| Axis Reduce / Expand | [`TROWEXPANDEXPDIF`](isa/TROWEXPANDEXPDIF.md) | Row-wise exp-diff: compute exp(src0 - src1) with per-row scalars. |
+| Axis Reduce / Expand | [`TCOLMIN`](isa/TCOLMIN.md) | Reduce each column by taking the minimum across rows. |
+| Axis Reduce / Expand | [`TCOLEXPAND`](isa/TCOLEXPAND.md) | Broadcast the first element of each source column across the destination column. |
+| Axis Reduce / Expand | [`TCOLEXPANDDIV`](isa/TCOLEXPANDDIV.md) | Column-wise broadcast divide: divide each column by a per-column scalar vector. |
+| Axis Reduce / Expand | [`TCOLEXPANDMUL`](isa/TCOLEXPANDMUL.md) | Column-wise broadcast multiply: multiply each column by a per-column scalar vector. |
+| Axis Reduce / Expand | [`TCOLEXPANDADD`](isa/TCOLEXPANDADD.md) | Column-wise broadcast add with per-column scalar vector. |
+| Axis Reduce / Expand | [`TCOLEXPANDMAX`](isa/TCOLEXPANDMAX.md) | Column-wise broadcast max with per-column scalar vector. |
+| Axis Reduce / Expand | [`TCOLEXPANDMIN`](isa/TCOLEXPANDMIN.md) | Column-wise broadcast min with per-column scalar vector. |
+| Axis Reduce / Expand | [`TCOLEXPANDSUB`](isa/TCOLEXPANDSUB.md) | Column-wise broadcast subtract: subtract a per-column scalar vector from each column. |
+| Axis Reduce / Expand | [`TCOLEXPANDEXPDIF`](isa/TCOLEXPANDEXPDIF.md) | Column-wise exp-diff: compute exp(src0 - src1) with per-column scalars. |
+| Memory (GM <-> Tile) | [`TLOAD`](isa/TLOAD.md) | Load data from a GlobalTensor (GM) into a Tile. |
+| Memory (GM <-> Tile) | [`TPREFETCH`](isa/TPREFETCH.md) | Prefetch data from global memory into a tile-local cache/buffer (hint). |
+| Memory (GM <-> Tile) | [`TSTORE`](isa/TSTORE.md) | Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters. |
+| Memory (GM <-> Tile) | [`TSTORE_FP`](isa/TSTORE_FP.md) | Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters. |
+| Memory (GM <-> Tile) | [`MGATHER`](isa/MGATHER.md) | Gather-load elements from global memory into a tile using per-element indices. |
+| Memory (GM <-> Tile) | [`MSCATTER`](isa/MSCATTER.md) | Scatter-store elements from a tile into global memory using per-element indices. |
+| Matrix Multiply | [`TGEMV_MX`](isa/TGEMV_MX.md) | GEMV with additional scaling tiles for mixed-precision / quantized matrix-vector compute. |
+| Matrix Multiply | [`TMATMUL_MX`](isa/TMATMUL_MX.md) | Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets. |
+| Matrix Multiply | [`TMATMUL`](isa/TMATMUL.md) | Matrix multiply (GEMM) producing an accumulator/output tile. |
+| Matrix Multiply | [`TMATMUL_ACC`](isa/TMATMUL_ACC.md) | Matrix multiply with accumulator input (fused accumulate). |
+| Matrix Multiply | [`TMATMUL_BIAS`](isa/TMATMUL_BIAS.md) | Matrix multiply with bias add. |
+| Matrix Multiply | [`TGEMV`](isa/TGEMV.md) | General Matrix-Vector multiplication producing an accumulator/output tile. |
+| Matrix Multiply | [`TGEMV_ACC`](isa/TGEMV_ACC.md) | GEMV with explicit accumulator input/output tiles. |
+| Matrix Multiply | [`TGEMV_BIAS`](isa/TGEMV_BIAS.md) | GEMV with bias add. |
+| Data Movement / Layout | [`TEXTRACT`](isa/TEXTRACT.md) | Extract a sub-tile from a source tile. |
+| Data Movement / Layout | [`TEXTRACT_FP`](isa/TEXTRACT_FP.md) | Extract with fp/scaling tile (vector-quantization parameters). |
+| Data Movement / Layout | [`TIMG2COL`](isa/TIMG2COL.md) | Image-to-column transform for convolution-like workloads. |
+| Data Movement / Layout | [`TINSERT`](isa/TINSERT.md) | Insert a sub-tile into a destination tile at an (indexRow, indexCol) offset. |
+| Data Movement / Layout | [`TINSERT_FP`](isa/TINSERT_FP.md) | Insert with fp/scaling tile (vector-quantization parameters). |
+| Data Movement / Layout | [`TFILLPAD`](isa/TFILLPAD.md) | Copy+pad a tile outside the valid region with a compile-time pad value. |
+| Data Movement / Layout | [`TFILLPAD_INPLACE`](isa/TFILLPAD_INPLACE.md) | In-place fill/pad variant. |
+| Data Movement / Layout | [`TFILLPAD_EXPAND`](isa/TFILLPAD_EXPAND.md) | Fill/pad while allowing dst to be larger than src. |
+| Data Movement / Layout | [`TMOV`](isa/TMOV.md) | Move/copy between tiles, optionally applying implementation-defined conversion modes. |
+| Data Movement / Layout | [`TMOV_FP`](isa/TMOV_FP.md) | Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters. |
+| Data Movement / Layout | [`TRESHAPE`](isa/TRESHAPE.md) | Reinterpret a tile as another tile type/shape while preserving the underlying bytes. |
+| Data Movement / Layout | [`TTRANS`](isa/TTRANS.md) | Transpose with an implementation-defined temporary tile. |
+| Data Movement / Layout | [`TSUBVIEW`](isa/TSUBVIEW.md) | Reinterpret a tile as a subtile of another tile. |
+| Data Movement / Layout | [`TGET_SCALE_ADDR`](isa/TGET_SCALE_ADDR.md) | Bind the on-chip address of the output tile to a scaled factor of the input tile's address. |
+| Complex | [`TPRINT`](isa/TPRINT.md) | Debug/print elements from a tile (implementation-defined). |
+| Complex | [`TMRGSORT`](isa/TMRGSORT.md) | Merge sort for multiple sorted lists (implementation-defined element format and layout). |
+| Complex | [`TSORT32`](isa/TSORT32.md) | Sort 32-element blocks of `src` with accompanying `idx` entries and output sorted value-index pairs. |
+| Complex | [`TGATHER`](isa/TGATHER.md) | Gather/select elements using either an index tile or a compile-time mask pattern. |
+| Complex | [`TCI`](isa/TCI.md) | Generate a contiguous integer sequence into a destination tile. |
+| Complex | [`TTRI`](isa/TTRI.md) | Generate a triangular (lower/upper) mask tile. |
+| Complex | [`TPARTADD`](isa/TPARTADD.md) | Partial elementwise add with implementation-defined handling of mismatched valid regions. |
+| Complex | [`TPARTMUL`](isa/TPARTMUL.md) | Partial elementwise multiply with implementation-defined handling of mismatched valid regions. |
+| Complex | [`TPARTMAX`](isa/TPARTMAX.md) | Partial elementwise max with implementation-defined handling of mismatched valid regions. |
+| Complex | [`TPARTMIN`](isa/TPARTMIN.md) | Partial elementwise min with implementation-defined handling of mismatched valid regions. |
+| Complex | [`TGATHERB`](isa/TGATHERB.md) | Gather elements using byte offsets. |
+| Complex | [`TSCATTER`](isa/TSCATTER.md) | Scatter rows of a source tile into a destination tile using per-element row indices. |
+| Complex | [`TQUANT`](isa/TQUANT.md) | Quantize a tile (e.g. FP32 to FP8) producing exponent/scaling/max outputs. |
+| Communication | [`TPUT`](isa/comm/TPUT.md) | Remote write: transfer local data to remote NPU memory (GM → UB → GM). |
+| Communication | [`TGET`](isa/comm/TGET.md) | Remote read: read remote NPU data to local memory (GM → UB → GM). |
+| Communication | [`TPUT_ASYNC`](isa/comm/TPUT_ASYNC.md) | Asynchronous remote write (local GM → DMA engine → remote GM). |
+| Communication | [`TGET_ASYNC`](isa/comm/TGET_ASYNC.md) | Asynchronous remote read (remote GM → DMA engine → local GM). |
+| Communication | [`TNOTIFY`](isa/comm/TNOTIFY.md) | Send flag notification to remote NPU. |
+| Communication | [`TWAIT`](isa/comm/TWAIT.md) | Blocking wait until signal(s) meet comparison condition. |
+| Communication | [`TTEST`](isa/comm/TTEST.md) | Non-blocking test if signal(s) meet comparison condition. |
+| Communication | [`TGATHER`](isa/comm/TGATHER.md) | Gather data from all ranks and concatenate along DIM_3. |
+| Communication | [`TSCATTER`](isa/comm/TSCATTER.md) | Scatter data to all ranks by splitting along DIM_3. |
+| Communication | [`TREDUCE`](isa/comm/TREDUCE.md) | Gather and reduce data from all ranks element-wise to local. |
+| Communication | [`TBROADCAST`](isa/comm/TBROADCAST.md) | Broadcast data from current NPU to all ranks. |
diff --git a/designs/outerCube/PTOISA/README.md b/designs/outerCube/PTOISA/README.md
new file mode 100644
index 00000000..a637245d
--- /dev/null
+++ b/designs/outerCube/PTOISA/README.md
@@ -0,0 +1,153 @@
+<p align="center">
+  <img src="../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
+</p>
+
+# PTO ISA Reference
+
+This directory contains the per-instruction reference for the PTO Tile Lib ISA.
+
+- Source of truth (C++ intrinsics): `include/pto/common/pto_instr.hpp`
+- [Common conventions (operands, events, modifiers)](conventions.md)
+
+## Synchronization
+- [TSYNC](TSYNC.md) - Synchronize PTO execution (wait on events or insert a per-op pipeline barrier).
+
+## Manual / Resource Binding
+- [TASSIGN](TASSIGN.md) - Bind a Tile object to an implementation-defined on-chip address (manual placement).
+- [TSETFMATRIX](TSETFMATRIX.md) - Set FMATRIX register(s) for IMG2COL-like ops.
+- [TSET_IMG2COL_RPT](TSET_IMG2COL_RPT.md) - Set IMG2COL repeat metadata from an IMG2COL configuration tile.
+- [TSET_IMG2COL_PADDING](TSET_IMG2COL_PADDING.md) - Set IMG2COL padding metadata from an IMG2COL configuration tile.
+
+## Elementwise (Tile-Tile)
+- [TADD](TADD.md) - Elementwise add of two tiles.
+- [TABS](TABS.md) - Elementwise absolute value of a tile.
+- [TAND](TAND.md) - Elementwise bitwise AND of two tiles.
+- [TOR](TOR.md) - Elementwise bitwise OR of two tiles.
+- [TSUB](TSUB.md) - Elementwise subtract of two tiles.
+- [TMUL](TMUL.md) - Elementwise multiply of two tiles.
+- [TMIN](TMIN.md) - Elementwise minimum of two tiles.
+- [TMAX](TMAX.md) - Elementwise maximum of two tiles.
+- [TCMP](TCMP.md) - Compare two tiles and write a packed predicate mask.
+- [TDIV](TDIV.md) - Elementwise division of two tiles.
+- [TSHL](TSHL.md) - Elementwise shift-left of two tiles.
+- [TSHR](TSHR.md) - Elementwise shift-right of two tiles.
+- [TXOR](TXOR.md) - Elementwise bitwise XOR of two tiles.
+- [TLOG](TLOG.md) - Elementwise natural logarithm of a tile.
+- [TRECIP](TRECIP.md) - Elementwise reciprocal of a tile.
+- [TPRELU](TPRELU.md) - Elementwise PReLU (parametric ReLU) with a per-element slope tile.
+- [TADDC](TADDC.md) - Elementwise ternary add: `src0 + src1 + src2`.
+- [TSUBC](TSUBC.md) - Elementwise ternary op: `src0 - src1 + src2`.
+- [TCVT](TCVT.md) - Elementwise type conversion with a specified rounding mode.
+- [TSEL](TSEL.md) - Select between two tiles using a mask tile (per-element selection).
+- [TRSQRT](TRSQRT.md) - Elementwise reciprocal square root.
+- [TSQRT](TSQRT.md) - Elementwise square root.
+- [TEXP](TEXP.md) - Elementwise exponential.
+- [TNOT](TNOT.md) - Elementwise bitwise NOT of a tile.
+- [TRELU](TRELU.md) - Elementwise ReLU of a tile.
+- [TNEG](TNEG.md) - Elementwise negation of a tile.
+- [TREM](TREM.md) - Elementwise remainder of two tiles.
+- [TFMOD](TFMOD.md) - Elementwise fmod of two tiles.
+
+## Tile-Scalar / Tile-Immediate
+- [TEXPANDS](TEXPANDS.md) - Broadcast a scalar into a destination tile.
+- [TCMPS](TCMPS.md) - Compare a tile against a scalar and write per-element comparison results.
+- [TSELS](TSELS.md) - Select between a source tile and a scalar using a mask tile (per-element selection over the source tile).
+- [TMINS](TMINS.md) - Elementwise minimum of a tile and a scalar.
+- [TADDS](TADDS.md) - Elementwise add a scalar to a tile.
+- [TSUBS](TSUBS.md) - Elementwise subtract a scalar from a tile.
+- [TDIVS](TDIVS.md) - Elementwise division with a scalar (tile/scalar or scalar/tile).
+- [TMULS](TMULS.md) - Elementwise multiply a tile by a scalar.
+- [TFMODS](TFMODS.md) - Elementwise remainder with a scalar: `fmod(src, scalar)`.
+- [TREMS](TREMS.md) - Elementwise remainder with a scalar: `remainder(src, scalar)`.
+- [TMAXS](TMAXS.md) - Elementwise max of a tile and a scalar: `max(src, scalar)`.
+- [TANDS](TANDS.md) - Elementwise bitwise AND of a tile and a scalar.
+- [TORS](TORS.md) - Elementwise bitwise OR of a tile and a scalar.
+- [TSHLS](TSHLS.md) - Elementwise shift-left a tile by a scalar.
+- [TSHRS](TSHRS.md) - Elementwise shift-right a tile by a scalar.
+- [TXORS](TXORS.md) - Elementwise bitwise XOR of a tile and a scalar.
+- [TLRELU](TLRELU.md) - Leaky ReLU with a scalar slope.
+- [TADDSC](TADDSC.md) - Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`.
+- [TSUBSC](TSUBSC.md) - Elementwise fused op: `src0 - scalar + src1`.
+
+## Axis Reduce / Expand
+- [TROWSUM](TROWSUM.md) - Reduce each row by summing across columns.
+- [TROWPROD](TROWPROD.md) - Reduce each row by multiplying across columns.
+- [TCOLSUM](TCOLSUM.md) - Reduce each column by summing across rows.
+- [TCOLPROD](TCOLPROD.md) - Reduce each column by multiplying across rows.
+- [TCOLMAX](TCOLMAX.md) - Reduce each column by taking the maximum across rows.
+- [TROWMAX](TROWMAX.md) - Reduce each row by taking the maximum across columns.
+- [TROWMIN](TROWMIN.md) - Reduce each row by taking the minimum across columns.
+- [TROWARGMAX](TROWARGMAX.md) - Get the column index of the maximum element for each row.
+- [TROWARGMIN](TROWARGMIN.md) - Get the column index of the minimum element for each row.
+- [TCOLARGMAX](TCOLARGMAX.md) - Get the row index of the maximum element for each column.
+- [TCOLARGMIN](TCOLARGMIN.md) - Get the row index of the minimum element for each column.
+- [TROWEXPAND](TROWEXPAND.md) - Broadcast the first element of each source row across the destination row.
+- [TROWEXPANDDIV](TROWEXPANDDIV.md) - Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`.
+- [TROWEXPANDMUL](TROWEXPANDMUL.md) - Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`.
+- [TROWEXPANDSUB](TROWEXPANDSUB.md) - Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`.
+- [TROWEXPANDADD](TROWEXPANDADD.md) - Row-wise broadcast add: add a per-row scalar vector.
+- [TROWEXPANDMAX](TROWEXPANDMAX.md) - Row-wise broadcast max with a per-row scalar vector.
+- [TROWEXPANDMIN](TROWEXPANDMIN.md) - Row-wise broadcast min with a per-row scalar vector.
+- [TROWEXPANDEXPDIF](TROWEXPANDEXPDIF.md) - Row-wise exp-diff: compute exp(src0 - src1) with per-row scalars.
+- [TCOLMIN](TCOLMIN.md) - Reduce each column by taking the minimum across rows.
+- [TCOLEXPAND](TCOLEXPAND.md) - Broadcast the first element of each source column across the destination column.
+- [TCOLEXPANDDIV](TCOLEXPANDDIV.md) - Column-wise broadcast divide: divide each column by a per-column scalar vector.
+- [TCOLEXPANDMUL](TCOLEXPANDMUL.md) - Column-wise broadcast multiply: multiply each column by a per-column scalar vector.
+- [TCOLEXPANDADD](TCOLEXPANDADD.md) - Column-wise broadcast add with per-column scalar vector.
+- [TCOLEXPANDMAX](TCOLEXPANDMAX.md) - Column-wise broadcast max with per-column scalar vector.
+- [TCOLEXPANDMIN](TCOLEXPANDMIN.md) - Column-wise broadcast min with per-column scalar vector.
+- [TCOLEXPANDSUB](TCOLEXPANDSUB.md) - Column-wise broadcast subtract: subtract a per-column scalar vector from each column.
+- [TCOLEXPANDEXPDIF](TCOLEXPANDEXPDIF.md) - Column-wise exp-diff: compute exp(src0 - src1) with per-column scalars.
+
+## Memory (GM <-> Tile)
+- [TLOAD](TLOAD.md) - Load data from a GlobalTensor (GM) into a Tile.
+- [TPREFETCH](TPREFETCH.md) - Prefetch data from global memory into a tile-local cache/buffer (hint).
+- [TSTORE](TSTORE.md) - Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters.
+- [TSTORE_FP](TSTORE_FP.md) - Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters.
+- [MGATHER](MGATHER.md) - Gather-load elements from global memory into a tile using per-element indices.
+- [MSCATTER](MSCATTER.md) - Scatter-store elements from a tile into global memory using per-element indices.
+
+## Matrix Multiply
+- [TGEMV_MX](TGEMV_MX.md) - GEMV with additional scaling tiles for mixed-precision / quantized matrix-vector compute.
+- [TMATMUL_MX](TMATMUL_MX.md) - Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
+- [TMATMUL](TMATMUL.md) - Matrix multiply (GEMM) producing an accumulator/output tile.
+- [TMATMUL_ACC](TMATMUL_ACC.md) - Matrix multiply with accumulator input (fused accumulate).
+- [TMATMUL_BIAS](TMATMUL_BIAS.md) - Matrix multiply with bias add.
+- [TGEMV](TGEMV.md) - General Matrix-Vector multiplication producing an accumulator/output tile.
+- [TGEMV_ACC](TGEMV_ACC.md) - GEMV with explicit accumulator input/output tiles.
+- [TGEMV_BIAS](TGEMV_BIAS.md) - GEMV with bias add.
+
+## Data Movement / Layout
+- [TEXTRACT](TEXTRACT.md) - Extract a sub-tile from a source tile.
+- [TEXTRACT_FP](TEXTRACT_FP.md) - Extract with fp/scaling tile (vector-quantization parameters).
+- [TIMG2COL](TIMG2COL.md) - Image-to-column transform for convolution-like workloads.
+- [TINSERT](TINSERT.md) - Insert a sub-tile into a destination tile at an (indexRow, indexCol) offset.
+- [TINSERT_FP](TINSERT_FP.md) - Insert with fp/scaling tile (vector-quantization parameters).
+- [TFILLPAD](TFILLPAD.md) - Copy+pad a tile outside the valid region with a compile-time pad value.
+- [TFILLPAD_INPLACE](TFILLPAD_INPLACE.md) - In-place fill/pad variant.
+- [TFILLPAD_EXPAND](TFILLPAD_EXPAND.md) - Fill/pad while allowing dst to be larger than src.
+- [TMOV](TMOV.md) - Move/copy between tiles, optionally applying implementation-defined conversion modes.
+- [TMOV_FP](TMOV_FP.md) - Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters.
+- [TRESHAPE](TRESHAPE.md) - Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
+- [TTRANS](TTRANS.md) - Transpose with an implementation-defined temporary tile.
+- [TSUBVIEW](TSUBVIEW.md) - Reinterpret a tile as a subtile of another tile.
+- [TGET_SCALE_ADDR](TGET_SCALE_ADDR.md) - Bind the on-chip address of the output tile to a scaled factor of the input tile's address.
+
+## Complex
+- [TPRINT](TPRINT.md) - Debug/print elements from a tile (implementation-defined).
+- [TMRGSORT](TMRGSORT.md) - Merge sort for multiple sorted lists (implementation-defined element format and layout).
+- [TSORT32](TSORT32.md) - Sort each 32-element block of `src` together with the corresponding indices from `idx`, and write the sorted value-index pairs into `dst`.
+- [TGATHER](TGATHER.md) - Gather/select elements using either an index tile or a compile-time mask pattern.
+- [TCI](TCI.md) - Generate a contiguous integer sequence into a destination tile.
+- [TTRI](TTRI.md) - Generate a triangular (lower/upper) mask tile.
+- [TPARTADD](TPARTADD.md) - Partial elementwise add with implementation-defined handling of mismatched valid regions.
+- [TPARTMUL](TPARTMUL.md) - Partial elementwise multiply with implementation-defined handling of mismatched valid regions.
+- [TPARTMAX](TPARTMAX.md) - Partial elementwise max with implementation-defined handling of mismatched valid regions.
+- [TPARTMIN](TPARTMIN.md) - Partial elementwise min with implementation-defined handling of mismatched valid regions.
+- [TGATHERB](TGATHERB.md) - Gather elements using byte offsets.
+- [TSCATTER](TSCATTER.md) - Scatter rows of a source tile into a destination tile using per-element row indices.
+- [TQUANT](TQUANT.md) - Quantize a tile (e.g. FP32 to FP8) producing exponent/scaling/max outputs.
+
+## Communication
+
+See [comm/README.md](comm/README.md) for the full per-instruction communication ISA reference (point-to-point, async, synchronization, and collective operations).
diff --git a/designs/outerCube/PTOISA/README_zh.md b/designs/outerCube/PTOISA/README_zh.md
new file mode 100644
index 00000000..00d087e3
--- /dev/null
+++ b/designs/outerCube/PTOISA/README_zh.md
@@ -0,0 +1,153 @@
+<p align="center">
+  <img src="../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
+</p>
+
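+# PTO ISA Reference
+
+This directory is the instruction reference for the PTO Tile Lib ISA (one page per instruction).
+
+- Source of truth (C++ intrinsics): `include/pto/common/pto_instr.hpp`
+- [Common conventions (operands, events, modifiers)](conventions_zh.md)
+
+## Synchronization
+- [TSYNC](TSYNC_zh.md) - Synchronize PTO execution (wait on events or insert a per-op pipeline barrier).
+
+## Manual / Resource Binding
+- [TASSIGN](TASSIGN_zh.md) - Bind a Tile object to an implementation-defined on-chip address (manual placement).
+- [TSETFMATRIX](TSETFMATRIX_zh.md) - Set the FMATRIX register(s) for IMG2COL-like ops.
+- [TSET_IMG2COL_RPT](TSET_IMG2COL_RPT_zh.md) - Set IMG2COL repeat-count metadata from an IMG2COL configuration tile.
+- [TSET_IMG2COL_PADDING](TSET_IMG2COL_PADDING_zh.md) - Set IMG2COL padding metadata from an IMG2COL configuration tile.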
+
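+## Elementwise (Tile-Tile)
+- [TADD](TADD_zh.md) - Elementwise addition of two tiles.
+- [TABS](TABS_zh.md) - Elementwise absolute value of a tile.
+- [TAND](TAND_zh.md) - Elementwise bitwise AND of two tiles.
+- [TOR](TOR_zh.md) - Elementwise bitwise OR of two tiles.
+- [TSUB](TSUB_zh.md) - Elementwise subtraction of two tiles.
+- [TMUL](TMUL_zh.md) - Elementwise multiplication of two tiles.
+- [TMIN](TMIN_zh.md) - Elementwise minimum of two tiles.
+- [TMAX](TMAX_zh.md) - Elementwise maximum of two tiles.
+- [TCMP](TCMP_zh.md) - Compare two tiles and write a packed predicate mask.
+- [TDIV](TDIV_zh.md) - Elementwise division of two tiles.
+- [TSHL](TSHL_zh.md) - Elementwise left shift of two tiles.
+- [TSHR](TSHR_zh.md) - Elementwise right shift of two tiles.
+- [TXOR](TXOR_zh.md) - Elementwise bitwise XOR of two tiles.
+- [TLOG](TLOG_zh.md) - Elementwise natural logarithm of a tile.
+- [TRECIP](TRECIP_zh.md) - Elementwise reciprocal of a tile.
+- [TPRELU](TPRELU_zh.md) - Elementwise parametric ReLU (PReLU) with a per-element slope tile.
+- [TADDC](TADDC_zh.md) - Ternary elementwise add: `src0 + src1 + src2`.
+- [TSUBC](TSUBC_zh.md) - Ternary elementwise op: `src0 - src1 + src2`.
+- [TCVT](TCVT_zh.md) - Elementwise type conversion with a specified rounding mode.
+- [TSEL](TSEL_zh.md) - Select between two tiles using a mask tile (per-element selection).
+- [TRSQRT](TRSQRT_zh.md) - Elementwise reciprocal square root.
+- [TSQRT](TSQRT_zh.md) - Elementwise square root.
+- [TEXP](TEXP_zh.md) - Elementwise exponential.
+- [TNOT](TNOT_zh.md) - Elementwise bitwise NOT of a tile.
+- [TRELU](TRELU_zh.md) - Elementwise ReLU of a tile.
+- [TNEG](TNEG_zh.md) - Elementwise negation of a tile.
+- [TREM](TREM_zh.md) - Elementwise remainder of two tiles; the remainder takes the sign of the divisor.
+- [TFMOD](TFMOD_zh.md) - Elementwise remainder of two tiles; the remainder takes the sign of the dividend.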
+
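+## Tile-Scalar / Tile-Immediate
+- [TEXPANDS](TEXPANDS_zh.md) - Broadcast a scalar into the destination tile.
+- [TCMPS](TCMPS_zh.md) - Compare a tile against a scalar and write per-element comparison results.
+- [TSELS](TSELS_zh.md) - Select between a source tile and a scalar using a mask tile (per-element selection over the source tile).
+- [TMINS](TMINS_zh.md) - Elementwise minimum of a tile and a scalar.
+- [TADDS](TADDS_zh.md) - Elementwise addition of a tile and a scalar.
+- [TSUBS](TSUBS_zh.md) - Subtract a scalar from each element of a tile.
+- [TDIVS](TDIVS_zh.md) - Elementwise division with a scalar (tile/scalar or scalar/tile).
+- [TMULS](TMULS_zh.md) - Elementwise multiplication of a tile and a scalar.
+- [TFMODS](TFMODS_zh.md) - Elementwise remainder with a scalar: `fmod(src, scalar)`.
+- [TREMS](TREMS_zh.md) - Elementwise remainder with a scalar: `remainder(src, scalar)`.
+- [TMAXS](TMAXS_zh.md) - Elementwise maximum of a tile and a scalar: `max(src, scalar)`.
+- [TANDS](TANDS_zh.md) - Elementwise bitwise AND of a tile and a scalar.
+- [TORS](TORS_zh.md) - Elementwise bitwise OR of a tile and a scalar.
+- [TSHLS](TSHLS_zh.md) - Elementwise left shift of a tile by a scalar.
+- [TSHRS](TSHRS_zh.md) - Elementwise right shift of a tile by a scalar.
+- [TXORS](TXORS_zh.md) - Elementwise bitwise XOR of a tile and a scalar.
+- [TLRELU](TLRELU_zh.md) - Leaky ReLU with a scalar slope.
+- [TADDSC](TADDSC_zh.md) - Fused elementwise add with a scalar and a second tile: `src0 + scalar + src1`.
+- [TSUBSC](TSUBSC_zh.md) - Fused elementwise op: `src0 - scalar + src1`.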
+
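+## Axis Reduce / Expand
+- [TROWSUM](TROWSUM_zh.md) - Reduce each row by summing across columns.
+- [TROWPROD](TROWPROD_zh.md) - Reduce each row by taking the product across columns.
+- [TCOLSUM](TCOLSUM_zh.md) - Reduce each column by summing across rows.
+- [TCOLPROD](TCOLPROD_zh.md) - Reduce each column by taking the product across rows.
+- [TCOLMAX](TCOLMAX_zh.md) - Reduce each column by taking the maximum across rows.
+- [TROWMAX](TROWMAX_zh.md) - Reduce each row by taking the maximum across columns.
+- [TROWMIN](TROWMIN_zh.md) - Reduce each row by taking the minimum across columns.
+- [TROWARGMAX](TROWARGMAX_zh.md) - Get the column index of the maximum value in each row.
+- [TROWARGMIN](TROWARGMIN_zh.md) - Get the column index of the minimum value in each row.
+- [TCOLARGMAX](TCOLARGMAX_zh.md) - Get the row index of the maximum value in each column.
+- [TCOLARGMIN](TCOLARGMIN_zh.md) - Get the row index of the minimum value in each column.
+- [TROWEXPAND](TROWEXPAND_zh.md) - Broadcast the first element of each source row across the destination row.
+- [TROWEXPANDDIV](TROWEXPANDDIV_zh.md) - Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`.
+- [TROWEXPANDMUL](TROWEXPANDMUL_zh.md) - Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`.
+- [TROWEXPANDSUB](TROWEXPANDSUB_zh.md) - Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`.
+- [TROWEXPANDADD](TROWEXPANDADD_zh.md) - Row-wise broadcast add: add a per-row scalar vector.
+- [TROWEXPANDMAX](TROWEXPANDMAX_zh.md) - Row-wise broadcast max: take the maximum with a per-row scalar vector.
+- [TROWEXPANDMIN](TROWEXPANDMIN_zh.md) - Row-wise broadcast min: take the minimum with a per-row scalar vector.
+- [TROWEXPANDEXPDIF](TROWEXPANDEXPDIF_zh.md) - Row-wise exp-diff: compute exp(src0 - src1), where src1 holds per-row scalars.
+- [TCOLMIN](TCOLMIN_zh.md) - Reduce each column by taking the minimum across rows.
+- [TCOLEXPAND](TCOLEXPAND_zh.md) - Broadcast the first element of each source column across the destination column.
+- [TCOLEXPANDDIV](TCOLEXPANDDIV_zh.md) - Column-wise broadcast divide: divide each column by a per-column scalar vector.
+- [TCOLEXPANDMUL](TCOLEXPANDMUL_zh.md) - Column-wise broadcast multiply: multiply each column by a per-column scalar vector.
+- [TCOLEXPANDADD](TCOLEXPANDADD_zh.md) - Column-wise broadcast add: add a per-column scalar vector to each column.
+- [TCOLEXPANDMAX](TCOLEXPANDMAX_zh.md) - Column-wise broadcast max: take the maximum with a per-column scalar vector.
+- [TCOLEXPANDMIN](TCOLEXPANDMIN_zh.md) - Column-wise broadcast min: take the minimum with a per-column scalar vector.
+- [TCOLEXPANDSUB](TCOLEXPANDSUB_zh.md) - Column-wise broadcast subtract: subtract a per-column scalar vector from each column.
+- [TCOLEXPANDEXPDIF](TCOLEXPANDEXPDIF_zh.md) - Column-wise exp-diff: compute exp(src0 - src1), where src1 holds per-column scalars.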
+
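+## Memory (GM <-> Tile)
+- [TLOAD](TLOAD_zh.md) - Load data from a GlobalTensor (GM) into a Tile.
+- [TPREFETCH](TPREFETCH_zh.md) - Prefetch data from global memory into a tile-local cache/buffer (hint).
+- [TSTORE](TSTORE_zh.md) - Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters.
+- [TSTORE_FP](TSTORE_FP_zh.md) - Store an accumulator tile into global memory, using a scaling (`fp`) tile as vector quantization parameters.
+- [MGATHER](MGATHER_zh.md) - Gather-load elements from global memory into a tile using per-element indices.
+- [MSCATTER](MSCATTER_zh.md) - Scatter-store elements from a tile into global memory using per-element indices.
+
+## Matrix Multiply
+- [TGEMV_MX](TGEMV_MX_zh.md) - GEMV variant with scaling tiles, supporting mixed-precision / quantized matrix-vector compute.
+- [TMATMUL_MX](TMATMUL_MX_zh.md) - Matrix multiply (GEMM) with additional scaling tiles, for mixed-precision / quantized matmul on supported targets.
+- [TMATMUL](TMATMUL_zh.md) - Matrix multiply (GEMM) producing an accumulator/output tile.
+- [TMATMUL_ACC](TMATMUL_ACC_zh.md) - Matrix multiply with accumulator input (fused accumulate).
+- [TMATMUL_BIAS](TMATMUL_BIAS_zh.md) - Matrix multiply with bias add.
+- [TGEMV](TGEMV_zh.md) - General matrix-vector multiplication producing an accumulator/output tile.
+- [TGEMV_ACC](TGEMV_ACC_zh.md) - GEMV with explicit accumulator input/output tiles.
+- [TGEMV_BIAS](TGEMV_BIAS_zh.md) - GEMV with bias add.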
+
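+## Data Movement / Layout
+- [TEXTRACT](TEXTRACT_zh.md) - Extract a sub-tile from a source tile.
+- [TEXTRACT_FP](TEXTRACT_FP_zh.md) - Extract with an fp/scaling tile (vector quantization parameters).
+- [TIMG2COL](TIMG2COL_zh.md) - Image-to-column transform for convolution-like workloads.
+- [TINSERT](TINSERT_zh.md) - Insert a sub-tile into a destination tile at an (indexRow, indexCol) offset.
+- [TINSERT_FP](TINSERT_FP_zh.md) - Insert with an fp/scaling tile (vector quantization parameters).
+- [TFILLPAD](TFILLPAD_zh.md) - Copy a tile and pad outside the valid region with a compile-time pad value.
+- [TFILLPAD_INPLACE](TFILLPAD_INPLACE_zh.md) - In-place fill/pad variant.
+- [TFILLPAD_EXPAND](TFILLPAD_EXPAND_zh.md) - Fill/pad while allowing the destination to be larger than the source.
+- [TMOV](TMOV_zh.md) - Move/copy between tiles, optionally applying implementation-defined conversion modes.
+- [TMOV_FP](TMOV_FP_zh.md) - Move/convert an accumulator tile into a destination tile, using a scaling (`fp`) tile as vector quantization parameters.
+- [TRESHAPE](TRESHAPE_zh.md) - Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
+- [TTRANS](TTRANS_zh.md) - Transpose using an implementation-defined temporary tile.
+- [TSUBVIEW](TSUBVIEW_zh.md) - Express a tile as a subview of another tile.
+- [TGET_SCALE_ADDR](TGET_SCALE_ADDR_zh.md) - Bind the on-chip address of the output tile to a scaled version of the input tile's address.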
+
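+## Complex Instructions
+- [TPRINT](TPRINT_zh.md) - Debug/print the elements of a Tile (implementation-defined).
+- [TMRGSORT](TMRGSORT_zh.md) - Merge sort for multiple sorted lists (implementation-defined element format and layout).
+- [TSORT32](TSORT32_zh.md) - Sort each 32-element block of `src` together with the corresponding indices in `idx`, and write the sorted value-index pairs to `dst`.
+- [TGATHER](TGATHER_zh.md) - Gather/select elements using an index Tile or a compile-time mask pattern.
+- [TCI](TCI_zh.md) - Generate a contiguous integer sequence into the destination Tile.
+- [TTRI](TTRI_zh.md) - Generate a triangular (lower/upper) mask Tile.
+- [TPARTADD](TPARTADD_zh.md) - Partial elementwise add, with implementation-defined handling of mismatched valid regions.
+- [TPARTMUL](TPARTMUL_zh.md) - Partial elementwise multiply, with implementation-defined handling of mismatched valid regions.
+- [TPARTMAX](TPARTMAX_zh.md) - Partial elementwise max, with implementation-defined handling of mismatched valid regions.
+- [TPARTMIN](TPARTMIN_zh.md) - Partial elementwise min, with implementation-defined handling of mismatched valid regions.
+- [TGATHERB](TGATHERB_zh.md) - Gather elements using byte offsets.
+- [TSCATTER](TSCATTER_zh.md) - Scatter rows of the source Tile into the destination Tile using per-element row indices.
+- [TQUANT](TQUANT_zh.md) - Quantize a Tile (e.g. FP32 to FP8), producing exponent/scale/max outputs.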
+
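+## Communication
+
+For the complete communication ISA instruction reference (point-to-point, asynchronous, and synchronization primitives, plus collective communication), see [comm/README_zh.md](comm/README_zh.md).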
diff --git a/designs/outerCube/PTOISA/TABS.md b/designs/outerCube/PTOISA/TABS.md
new file mode 100644
index 00000000..3931b44e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TABS.md
@@ -0,0 +1,133 @@
+# TABS
+
+
+## Tile Operation Diagram
+
+![TABS tile operation](../figures/isa/TABS.svg)
+
+## Introduction
+
+Elementwise absolute value of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \left|\mathrm{src}_{i,j}\right| $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TABS(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (CPU sim)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float`.
+    - The implementation iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (Costmodel)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `uint8_t`, `half`, `float`.
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TABS(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TABS(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TABS_zh.md b/designs/outerCube/PTOISA/TABS_zh.md
new file mode 100644
index 00000000..3b7f23ce
--- /dev/null
+++ b/designs/outerCube/PTOISA/TABS_zh.md
@@ -0,0 +1,106 @@
+# TABS
+
+## Tile Operation Diagram
+
+![TABS tile operation](../figures/isa/TABS.svg)
+
+## Introduction
+
+Elementwise absolute value of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \left|\mathrm{src}_{i,j}\right| $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TABS(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (CPU sim)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float`.
+    - The implementation iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (Costmodel)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `uint8_t`, `half`, `float`.
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TABS(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TABS(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TADD.md b/designs/outerCube/PTOISA/TADD.md
new file mode 100644
index 00000000..d6b4a5ad
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADD.md
@@ -0,0 +1,129 @@
+# TADD
+
+
+## Tile Operation Diagram
+
+![TADD tile operation](../figures/isa/TADD.svg)
+
+## Introduction
+
+Elementwise add of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tadd %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TADD(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TADD(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tadd %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TADDC.md b/designs/outerCube/PTOISA/TADDC.md
new file mode 100644
index 00000000..f91ad263
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDC.md
@@ -0,0 +1,103 @@
+# TADDC
+
+
+## Tile Operation Diagram
+
+![TADDC tile operation](../figures/isa/TADDC.svg)
+
+## Introduction
+
+Elementwise ternary add: `src0 + src1 + src2`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TADDC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c, out;
+  TADDC(out, a, b, c);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TADDC_zh.md b/designs/outerCube/PTOISA/TADDC_zh.md
new file mode 100644
index 00000000..1a9fec65
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDC_zh.md
@@ -0,0 +1,76 @@
+# TADDC
+
+## Tile Operation Diagram
+
+![TADDC tile operation](../figures/isa/TADDC.svg)
+
+## Introduction
+
+Elementwise ternary add: `src0 + src1 + src2`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = taddc %src0, %src1, %src2 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.taddc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.taddc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TADDC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c, out;
+  TADDC(out, a, b, c);
+}
+```
diff --git a/designs/outerCube/PTOISA/TADDS.md b/designs/outerCube/PTOISA/TADDS.md
new file mode 100644
index 00000000..c30212cc
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDS.md
@@ -0,0 +1,134 @@
+# TADDS
+
+
+## Tile Operation Diagram
+
+![TADDS tile operation](../figures/isa/TADDS.svg)
+
+## Introduction
+
+Elementwise add a scalar to a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} + \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tadds %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TADDS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TADDS(dst, src, 1.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TADDS(dst, src, 1.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tadds %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TADDSC.md b/designs/outerCube/PTOISA/TADDSC.md
new file mode 100644
index 00000000..697dcef1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDSC.md
@@ -0,0 +1,116 @@
+# TADDSC
+
+
+## Tile Operation Diagram
+
+![TADDSC tile operation](../figures/isa/TADDSC.svg)
+
+## Introduction
+
+Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{scalar} + \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TADDSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
+                            WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, out;
+  TADDSC(out, a, 2.0f, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TADDSC_zh.md b/designs/outerCube/PTOISA/TADDSC_zh.md
new file mode 100644
index 00000000..b3920061
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDSC_zh.md
@@ -0,0 +1,89 @@
+# TADDSC
+
+## Tile Operation Diagram
+
+![TADDSC tile operation](../figures/isa/TADDSC.svg)
+
+## Introduction
+
+Elementwise fused add with scalar and a second tile: `src0 + scalar + src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{scalar} + \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = taddsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.taddsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.taddsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TADDSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
+                            WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, out;
+  TADDSC(out, a, 2.0f, b);
+}
+```
diff --git a/designs/outerCube/PTOISA/TADDS_zh.md b/designs/outerCube/PTOISA/TADDS_zh.md
new file mode 100644
index 00000000..1ab2a003
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADDS_zh.md
@@ -0,0 +1,107 @@
+# TADDS
+
+## Tile Operation Diagram
+
+![TADDS tile operation](../figures/isa/TADDS.svg)
+
+## Introduction
+
+Elementwise add a scalar to a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} + \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tadds %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tadds %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tadds ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TADDS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TADDS(dst, src, 1.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TADDS(dst, src, 1.0f);
+}
+```
diff --git a/designs/outerCube/PTOISA/TADD_zh.md b/designs/outerCube/PTOISA/TADD_zh.md
new file mode 100644
index 00000000..8041e0fe
--- /dev/null
+++ b/designs/outerCube/PTOISA/TADD_zh.md
@@ -0,0 +1,102 @@
+# TADD
+
+## Tile Operation Diagram
+
+![TADD tile operation](../figures/isa/TADD.svg)
+
+## Introduction
+
+Elementwise add of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tadd %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TADD(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TADD(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TALIAS.md b/designs/outerCube/PTOISA/TALIAS.md
new file mode 100644
index 00000000..a827762d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TALIAS.md
@@ -0,0 +1,40 @@
+# TALIAS
+
+## Tile Operation Diagram
+
+![TALIAS tile operation](../figures/isa/TALIAS.svg)
+
+## Introduction
+
+Create an alias tile view that shares the original tile storage.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.talias ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TALIAS_zh.md b/designs/outerCube/PTOISA/TALIAS_zh.md
new file mode 100644
index 00000000..d182462c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TALIAS_zh.md
@@ -0,0 +1,41 @@
+# TALIAS
+
+## Tile Operation Diagram
+
+![TALIAS tile operation](../figures/isa/TALIAS.svg)
+
+## Introduction
+
+Create an alias tile view that shares the original tile storage.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.talias ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.talias ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TAND.md b/designs/outerCube/PTOISA/TAND.md
new file mode 100644
index 00000000..cb660442
--- /dev/null
+++ b/designs/outerCube/PTOISA/TAND.md
@@ -0,0 +1,103 @@
+# TAND
+
+
+## Tile Operation Diagram
+
+![TAND tile operation](../figures/isa/TAND.svg)
+
+## Introduction
+
+Elementwise bitwise AND of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;\&\; \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tand %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TAND(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are 1-byte or 2-byte integral types.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A5)**:
+    - Supported element types are 1-byte, 2-byte, or 4-byte integral types.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;  // 4-byte element type: accepted on A5; A2A3 allows only 1- or 2-byte integral types
+  TileT a, b, out;
+  TAND(out, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tand %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
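The `TAND` math interpretation and valid-region rule can be pictured with a small host-side reference model. This is a sketch under assumptions: `RefTile` and `ref_tand` are hypothetical names, not the PTO `Tile` class or intrinsic, and leaving elements outside the valid region untouched is a modeling choice that the page does not specify.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical row-major tile model with a runtime valid region.
template <typename T, std::size_t Rows, std::size_t Cols>
struct RefTile {
  std::array<T, Rows * Cols> data{};
  std::size_t valid_rows = Rows;
  std::size_t valid_cols = Cols;

  T &at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
  const T &at(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

// dst(i, j) = src0(i, j) & src1(i, j), iterating over dst's valid region.
template <typename T, std::size_t Rows, std::size_t Cols>
void ref_tand(RefTile<T, Rows, Cols> &dst, const RefTile<T, Rows, Cols> &src0,
              const RefTile<T, Rows, Cols> &src1) {
  // Mirrors the documented runtime check: source valid regions match dst.
  assert(src0.valid_rows == dst.valid_rows && src0.valid_cols == dst.valid_cols);
  assert(src1.valid_rows == dst.valid_rows && src1.valid_cols == dst.valid_cols);
  for (std::size_t i = 0; i < dst.valid_rows; ++i) {
    for (std::size_t j = 0; j < dst.valid_cols; ++j) {
      dst.at(i, j) = static_cast<T>(src0.at(i, j) & src1.at(i, j));
    }
  }
}

// 2x2 tiles with a 1x2 valid region. Returns true when the valid row holds
// the AND result and the row outside the valid region keeps its old value.
inline bool tand_demo() {
  RefTile<std::uint16_t, 2, 2> a, b, out;
  a.data = {0x0ff0, 0xffff, 0x1234, 0x1234};
  b.data = {0x00ff, 0x0f0f, 0xffff, 0xffff};
  out.data = {0xaaaa, 0xaaaa, 0xaaaa, 0xaaaa};
  a.valid_rows = b.valid_rows = out.valid_rows = 1;
  ref_tand(out, a, b);
  return out.at(0, 0) == 0x00f0 && out.at(0, 1) == 0x0f0f &&
         out.at(1, 0) == 0xaaaa && out.at(1, 1) == 0xaaaa;
}
```

A model like this can serve as a golden reference when comparing device output element by element, provided the comparison is restricted to the valid region.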
diff --git a/designs/outerCube/PTOISA/TANDS.md b/designs/outerCube/PTOISA/TANDS.md
new file mode 100644
index 00000000..e829518b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TANDS.md
@@ -0,0 +1,106 @@
+# TANDS
+
+
+## Tile Operation Diagram
+
+![TANDS tile operation](../figures/isa/TANDS.svg)
+
+## Introduction
+
+Elementwise bitwise AND of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;\&\; \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tands %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TANDS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Intended for integral element types.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Implementation checks (A5)**:
+    - Intended for integral element types supported by `TEXPANDS` and `TAND`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TANDS(dst, src, 0xffu);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tands %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
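For `TANDS`, the same idea applies with one tile operand replaced by a broadcast scalar. The function below is a minimal host-side sketch, assuming a flat row-major buffer; `ref_tands` and its parameters are hypothetical, and zero-filling the elements outside the valid region is an assumption of the sketch rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// dst[i][j] = src[i][j] & scalar for every (i, j) in the valid region.
// The result is a separate buffer, in line with the documented restriction
// that source and destination must not share memory in manual mode.
inline std::vector<std::uint16_t> ref_tands(const std::vector<std::uint16_t> &src,
                                            std::size_t cols,
                                            std::size_t valid_rows,
                                            std::size_t valid_cols,
                                            std::uint16_t scalar) {
  assert(cols > 0 && valid_cols <= cols && valid_rows * cols <= src.size());
  std::vector<std::uint16_t> dst(src.size(), 0);
  for (std::size_t i = 0; i < valid_rows; ++i) {
    for (std::size_t j = 0; j < valid_cols; ++j) {
      dst[i * cols + j] = static_cast<std::uint16_t>(src[i * cols + j] & scalar);
    }
  }
  return dst;
}
```

With `scalar = 0xff`, as in the C++ example on the page, the operation keeps the low byte of each 16-bit element and clears the high byte.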
diff --git a/designs/outerCube/PTOISA/TANDS_zh.md b/designs/outerCube/PTOISA/TANDS_zh.md
new file mode 100644
index 00000000..0968ae50
--- /dev/null
+++ b/designs/outerCube/PTOISA/TANDS_zh.md
@@ -0,0 +1,106 @@
+# TANDS
+
+## 指令示意图
+
+![TANDS tile operation](../figures/isa/TANDS.svg)
+
+## 简介
+
+Tile 与标量的逐元素按位与。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;\&\; \mathrm{scalar} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tands %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TANDS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 适用于整数元素类型。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
+- **实现检查 (A5)**:
+    - 适用于 `TEXPANDS` 和 `TAND` 支持的整数元素类型。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 在手动模式下，不支持将源 Tile 和目标 Tile 设置为相同的内存。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TANDS(dst, src, 0xffu);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tands %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tands %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tands ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TAND_zh.md b/designs/outerCube/PTOISA/TAND_zh.md
new file mode 100644
index 00000000..e91e08bd
--- /dev/null
+++ b/designs/outerCube/PTOISA/TAND_zh.md
@@ -0,0 +1,103 @@
+# TAND
+
+## 指令示意图
+
+![TAND tile operation](../figures/isa/TAND.svg)
+
+## 简介
+
+两个 Tile 的逐元素按位与。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;\&\; \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tand %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TAND(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 1 字节或 2 字节整数类型。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 1 字节、2 字节或 4 字节整数类型。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT a, b, out;
+  TAND(out, a, b);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tand %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tand %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tand ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TASSIGN.md b/designs/outerCube/PTOISA/TASSIGN.md
new file mode 100644
index 00000000..92ad1c2c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TASSIGN.md
@@ -0,0 +1,219 @@
+# TASSIGN
+
+
+## Tile Operation Diagram
+
+![TASSIGN tile operation](../figures/isa/TASSIGN.svg)
+
+## Introduction
+
+Bind a Tile object to an implementation-defined on-chip address (manual placement).
+
+## Math Interpretation
+
+Not applicable.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+`TASSIGN` is typically introduced by bufferization/lowering when mapping SSA tiles to physical storage.
+
+Synchronous form:
+
+```text
+tassign %tile, %addr : !pto.tile<...>, index
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tassign %tile, %addr : !pto.tile<...>, dtype
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tassign %tile, %addr : !pto.tile<...>, dtype
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+### Form 1: Runtime address
+
+```cpp
+template <typename T, typename AddrType>
+PTO_INST void TASSIGN(T& obj, AddrType addr);
+```
+
+Binds `obj` to the on-chip address `addr`. No compile-time bounds checking is
+performed (the address value is not available at compile time).
+
+### Form 2: Compile-time address (with static bounds check)
+
+```cpp
+template <std::size_t Addr, typename T>
+PTO_INST void TASSIGN(T& obj);
+```
+
+Binds `obj` to the on-chip address `Addr`. Because `Addr` is a non-type
+template parameter, the compiler performs the following **compile-time** checks
+via `static_assert`:
+
+| Check | Condition | Assertion ID | Error message |
+|-------|-----------|--------------|---------------|
+| Memory space exists | `capacity > 0` | SA-0351 | Memory space is not available on this architecture. |
+| Tile fits in memory | `tile_size <= capacity` | SA-0352 | Tile storage size exceeds memory space capacity. |
+| Address in bounds | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size exceeds memory space capacity (out of bounds). |
+| Address aligned | `Addr % alignment == 0` | SA-0354 | addr is not properly aligned for the target memory space. |
+
+See `docs/coding/debug.md` (fix recipe `FIX-A12`) for suggested remedies.
+
+The memory space, capacity, and alignment are determined automatically from the
+Tile's `TileType` (i.e. `Loc` template parameter):
+
+| TileType | Memory | Capacity (A2A3) | Capacity (A5) | Capacity (Kirin9030) | Capacity (KirinX90) | Alignment |
+|----------|--------|-----------------|---------------|----------------------|---------------------|-----------|
+| Vec | UB | 192 KB | 256 KB | 128 KB | 128 KB | 32 B |
+| Mat | L1 | 512 KB | 512 KB | 512 KB | 1024 KB | 32 B |
+| Left | L0A | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
+| Right | L0B | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
+| Acc | L0C | 128 KB | 256 KB | 64 KB | 128 KB | 32 B |
+| Bias | Bias | 1 KB | 4 KB | 1 KB | 1 KB | 32 B |
+| Scaling | FBuffer | 2 KB | 4 KB | 7 KB | 6 KB | 32 B |
+| ScaleLeft | L0A | N/A | 4 KB | N/A | N/A | 32 B |
+| ScaleRight | L0B | N/A | 4 KB | N/A | N/A | 32 B |
+
+Capacities can be overridden at build time via `-D` flags (e.g.
+`-DPTO_UBUF_SIZE_BYTES=262144`). See `include/pto/common/buffer_limits.hpp`.
+
+**Note:** This overload is only available for `Tile` and `ConvTile` types. For
+`GlobalTensor`, use `TASSIGN(obj, pointer)` (Form 1).
+
+## Constraints
+
+- **Implementation checks**:
+    - If `obj` is a Tile:
+        - In manual mode (when `__PTO_AUTO__` is not defined), `addr` must be an integral type and is reinterpreted as the tile's storage address.
+        - In auto mode (when `__PTO_AUTO__` is defined), `TASSIGN(tile, addr)` is a no-op.
+    - If `obj` is a `GlobalTensor`:
+        - `addr` must be a pointer type.
+        - The pointed-to element type must match `GlobalTensor::DType`.
+
+## Examples
+
+### Runtime address (no compile-time check)
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_runtime() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TADD(c, a, b);
+}
+```
+
+### Compile-time address (with static bounds check)
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_checked() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+
+  TASSIGN<0x0000>(a);   // OK: 0x0000 + 1024 <= 192KB
+  TASSIGN<0x0400>(b);   // OK: 0x0400 + 1024 <= 192KB
+  TASSIGN<0x0800>(c);   // OK: 0x0800 + 1024 <= 192KB
+  TADD(c, a, b);
+}
+```
+
+The following triggers a compile error:
+
+```cpp
+void example_oob() {
+  // Tile<Vec, float, 256, 256> occupies 256*256*4 = 256KB
+  using BigTile = Tile<TileType::Vec, float, 256, 256>;
+  BigTile t;
+
+  // static_assert fires: tile_size (256KB) > UB capacity (192KB on A2A3)
+  TASSIGN<0x0>(t);
+}
+```
+
+```cpp
+void example_oob_addr() {
+  using TileT = Tile<TileType::Vec, float, 128, 128>;  // 64KB
+  TileT t;
+
+  // static_assert fires: 0x20000 (128KB) + 64KB = 192KB would fit exactly,
+  // but 0x20001 + 64KB > 192KB (and 0x20001 is not 32 B aligned)
+  TASSIGN<0x20001>(t);
+}
+```
+
+### Ping-pong L0 buffer allocation
+
+```cpp
+void example_pingpong() {
+  using L0ATile = TileLeft<half, 64, 128>;   // L0A tile
+  using L0BTile = TileRight<half, 128, 64>;  // L0B tile
+
+  L0ATile a0, a1;
+  L0BTile b0, b1;
+
+  TASSIGN<0x0000>(a0);   // L0A ping
+  TASSIGN<0x8000>(a1);   // L0A pong
+  TASSIGN<0x0000>(b0);   // L0B ping  (separate physical memory from L0A)
+  TASSIGN<0x8000>(b1);   // L0B pong
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tassign %tile, %addr : !pto.tile<...>, dtype
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tassign %tile, %addr : !pto.tile<...>, dtype
+```
+
+### PTO Assembly Form
+
+```text
+tassign %tile, %addr : !pto.tile<...>, index
+# AS Level 2 (DPS)
+pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
+```
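The four compile-time checks listed for Form 2 of `TASSIGN` (SA-0351 to SA-0354) can be approximated by a standalone `constexpr` function. This is a sketch, not the library's implementation: `MemSpace`, `AssignCheck`, `check_assign`, and `kUbA2A3` are hypothetical names, the capacity and alignment are copied from the A2A3 `Vec`/UB row of the table, and the order in which failures are reported is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one on-chip memory space.
struct MemSpace {
  std::size_t capacity;   // bytes
  std::size_t alignment;  // bytes
};

enum class AssignCheck { Ok, NoSuchSpace, TileTooLarge, OutOfBounds, Misaligned };

// Evaluates the documented conditions in table order and reports the first failure.
constexpr AssignCheck check_assign(MemSpace m, std::size_t addr, std::size_t tile_size) {
  if (m.capacity == 0) return AssignCheck::NoSuchSpace;                // SA-0351
  if (tile_size > m.capacity) return AssignCheck::TileTooLarge;        // SA-0352
  if (addr + tile_size > m.capacity) return AssignCheck::OutOfBounds;  // SA-0353
  if (addr % m.alignment != 0) return AssignCheck::Misaligned;         // SA-0354
  return AssignCheck::Ok;
}

// A2A3 unified buffer: 192 KB capacity, 32 B alignment (from the table).
constexpr MemSpace kUbA2A3{192 * 1024, 32};

// 16x16 float tile (1024 B) at 0x0400: accepted.
static_assert(check_assign(kUbA2A3, 0x0400, 16 * 16 * 4) == AssignCheck::Ok, "fits");
// 256x256 float tile (256 KB) exceeds the 192 KB capacity.
static_assert(check_assign(kUbA2A3, 0x0, 256 * 256 * 4) == AssignCheck::TileTooLarge, "too large");
// 128x128 float tile (64 KB) at 128 KB ends exactly at 192 KB: accepted.
static_assert(check_assign(kUbA2A3, 0x20000, 128 * 128 * 4) == AssignCheck::Ok, "exact fit");
// One byte further is out of bounds (it is also misaligned; bounds is reported first here).
static_assert(check_assign(kUbA2A3, 0x20001, 128 * 128 * 4) == AssignCheck::OutOfBounds, "oob");
// An address that is in bounds but not a multiple of 32 B.
static_assert(check_assign(kUbA2A3, 0x10, 16 * 16 * 4) == AssignCheck::Misaligned, "misaligned");
```

Because every input is a constant expression, a bad address is rejected at compile time, which is the property Form 2 provides and Form 1 (runtime address) cannot.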
diff --git a/designs/outerCube/PTOISA/TASSIGN_zh.md b/designs/outerCube/PTOISA/TASSIGN_zh.md
new file mode 100644
index 00000000..cdf0ce03
--- /dev/null
+++ b/designs/outerCube/PTOISA/TASSIGN_zh.md
@@ -0,0 +1,192 @@
+# TASSIGN
+
+## 指令示意图
+
+![TASSIGN tile operation](../figures/isa/TASSIGN.svg)
+
+## 简介
+
+将 Tile 对象绑定到实现定义的片上地址（手动放置）。
+
+## 数学语义
+
+Not applicable.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+`TASSIGN` is typically introduced by bufferization/lowering when mapping SSA tiles to physical storage.
+
+同步形式：
+
+```text
+tassign %tile, %addr : !pto.tile<...>, index
+```
+
+### AS Level 1（SSA）
+
+```text
+pto.tassign %tile, %addr : !pto.tile<...>, dtype
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tassign ins(%tile, %addr : !pto.tile_buf<...>, dtype)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`.
+
+### Form 1: Runtime address
+
+```cpp
+template <typename T, typename AddrType>
+PTO_INST void TASSIGN(T& obj, AddrType addr);
+```
+
+Binds `obj` to the on-chip address `addr`. No compile-time bounds checking is
+performed (the address value is not available at compile time).
+
+### Form 2: Compile-time address (with static bounds check)
+
+```cpp
+template <std::size_t Addr, typename T>
+PTO_INST void TASSIGN(T& obj);
+```
+
+Binds `obj` to the on-chip address `Addr`. Because `Addr` is a non-type
+template parameter, the compiler performs the following **compile-time** checks
+via `static_assert`:
+
+| Check | Condition | Assertion ID | Error message |
+|-------|-----------|--------------|---------------|
+| Memory space exists | `capacity > 0` | SA-0351 | Memory space is not available on this architecture. |
+| Tile fits in memory | `tile_size <= capacity` | SA-0352 | Tile storage size exceeds memory space capacity. |
+| Address in bounds | `Addr + tile_size <= capacity` | SA-0353 | addr + tile_size exceeds memory space capacity (out of bounds). |
+| Address aligned | `Addr % alignment == 0` | SA-0354 | addr is not properly aligned for the target memory space. |
+
+See `docs/coding/debug.md` (fix recipe `FIX-A12`) for suggested remedies.
+
+The memory space, capacity, and alignment are determined automatically from the
+Tile's `TileType` (i.e. `Loc` template parameter):
+
+| TileType | Memory | Capacity (A2A3) | Capacity (A5) | Capacity (Kirin9030) | Capacity (KirinX90) | Alignment |
+|----------|--------|-----------------|---------------|----------------------|---------------------|-----------|
+| Vec | UB | 192 KB | 256 KB | 128 KB | 128 KB | 32 B |
+| Mat | L1 | 512 KB | 512 KB | 512 KB | 1024 KB | 32 B |
+| Left | L0A | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
+| Right | L0B | 64 KB | 64 KB | 32 KB | 64 KB | 32 B |
+| Acc | L0C | 128 KB | 256 KB | 64 KB | 128 KB | 32 B |
+| Bias | Bias | 1 KB | 4 KB | 1 KB | 1 KB | 32 B |
+| Scaling | FBuffer | 2 KB | 4 KB | 7 KB | 6 KB | 32 B |
+| ScaleLeft | L0A | N/A | 4 KB | N/A | N/A | 32 B |
+| ScaleRight | L0B | N/A | 4 KB | N/A | N/A | 32 B |
+
+Capacities can be overridden at build time via `-D` flags (e.g.
+`-DPTO_UBUF_SIZE_BYTES=262144`). See `include/pto/common/buffer_limits.hpp`.
+
+**Note:** This overload is only available for `Tile` and `ConvTile` types. For
+`GlobalTensor`, use `TASSIGN(obj, pointer)` (Form 1).
+
+## 约束
+
+- **实现检查**:
+    - If `obj` is a Tile:
+        - In manual mode (when `__PTO_AUTO__` is not defined), `addr` must be an integral type and is reinterpreted as the tile's storage address.
+        - In auto mode (when `__PTO_AUTO__` is defined), `TASSIGN(tile, addr)` is a no-op.
+    - If `obj` is a `GlobalTensor`:
+        - `addr` must be a pointer type.
+        - The pointed-to element type must match `GlobalTensor::DType`.
+
+## 示例
+
+### Runtime address (no compile-time check)
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_runtime() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TADD(c, a, b);
+}
+```
+
+### Compile-time address (with static bounds check)
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_checked() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+
+  TASSIGN<0x0000>(a);   // OK: 0x0000 + 1024 <= 192KB
+  TASSIGN<0x0400>(b);   // OK: 0x0400 + 1024 <= 192KB
+  TASSIGN<0x0800>(c);   // OK: 0x0800 + 1024 <= 192KB
+  TADD(c, a, b);
+}
+```
+
+The following triggers a compile error:
+
+```cpp
+void example_oob() {
+  // Tile<Vec, float, 256, 256> occupies 256*256*4 = 256KB
+  using BigTile = Tile<TileType::Vec, float, 256, 256>;
+  BigTile t;
+
+  // static_assert fires: tile_size (256KB) > UB capacity (192KB on A2A3)
+  TASSIGN<0x0>(t);
+}
+```
+
+```cpp
+void example_oob_addr() {
+  using TileT = Tile<TileType::Vec, float, 128, 128>;  // 64KB
+  TileT t;
+
+  // static_assert fires: 0x20000 (128KB) + 64KB = 192KB would fit exactly,
+  // but 0x20001 + 64KB > 192KB (and 0x20001 is not 32 B aligned)
+  TASSIGN<0x20001>(t);
+}
+```
+
+### Ping-pong L0 buffer allocation
+
+```cpp
+void example_pingpong() {
+  using L0ATile = TileLeft<half, 64, 128>;   // L0A tile
+  using L0BTile = TileRight<half, 128, 64>;  // L0B tile
+
+  L0ATile a0, a1;
+  L0BTile b0, b1;
+
+  TASSIGN<0x0000>(a0);   // L0A ping
+  TASSIGN<0x8000>(a1);   // L0A pong
+  TASSIGN<0x0000>(b0);   // L0B ping  (separate physical memory from L0A)
+  TASSIGN<0x8000>(b1);   // L0B pong
+}
+```
diff --git a/designs/outerCube/PTOISA/TAXPY.md b/designs/outerCube/PTOISA/TAXPY.md
new file mode 100644
index 00000000..ffe6e549
--- /dev/null
+++ b/designs/outerCube/PTOISA/TAXPY.md
@@ -0,0 +1,40 @@
+# TAXPY
+
+## Tile Operation Diagram
+
+![TAXPY tile operation](../figures/isa/TAXPY.svg)
+
+## Introduction
+
+AXPY-style fused update: multiply a tile by a scalar and accumulate into the destination tile.
+
+## Math Interpretation
+
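+For each element `(i, j)` in the destination valid region, where `src` is the source tile and `scalar` is the scalar multiplier:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{dst}_{i,j} + \mathrm{scalar} \cdot \mathrm{src}_{i,j} $$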
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.taxpy ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
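The introduction of `TAXPY.md` describes an AXPY-style update, that is, `dst += scalar * src`. A minimal host-side sketch of that reading follows. `ref_taxpy` and `taxpy_demo` are hypothetical names, the operand names are assumptions because the page elides the operand list, and rounding or fused multiply-add behavior on the device is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst[i][j] += alpha * src[i][j] over the destination valid region.
// Buffers are flat and row-major with `cols` elements per row.
inline void ref_taxpy(std::vector<float> &dst, const std::vector<float> &src,
                      float alpha, std::size_t cols, std::size_t valid_rows,
                      std::size_t valid_cols) {
  assert(cols > 0 && valid_cols <= cols);
  assert(valid_rows * cols <= dst.size() && dst.size() == src.size());
  for (std::size_t i = 0; i < valid_rows; ++i) {
    for (std::size_t j = 0; j < valid_cols; ++j) {
      dst[i * cols + j] += alpha * src[i * cols + j];
    }
  }
}

// 2x2 tiles with a 2x1 valid region: only column 0 is updated.
inline std::vector<float> taxpy_demo() {
  std::vector<float> dst{1.0f, 2.0f, 3.0f, 4.0f};
  const std::vector<float> src{10.0f, 20.0f, 30.0f, 40.0f};
  ref_taxpy(dst, src, 0.5f, /*cols=*/2, /*valid_rows=*/2, /*valid_cols=*/1);
  return dst;
}
```

Unlike the two-source elementwise ops, the destination is both read and written, so its previous contents matter inside the valid region.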
diff --git a/designs/outerCube/PTOISA/TAXPY_zh.md b/designs/outerCube/PTOISA/TAXPY_zh.md
new file mode 100644
index 00000000..df5907e8
--- /dev/null
+++ b/designs/outerCube/PTOISA/TAXPY_zh.md
@@ -0,0 +1,41 @@
+# TAXPY
+
+## 指令示意图
+
+![TAXPY tile operation](../figures/isa/TAXPY.svg)
+
+## 简介
+
+AXPY 风格融合更新：将 Tile 乘以标量并累加到目标 Tile。
+
+## 数学语义
+
+语义随指令而变化。 Unless stated otherwise, behavior is defined over the destination valid region.
+
+## 汇编语法
+
+PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.taxpy ...
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`.
+
+## 约束
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## 示例
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TCI.md b/designs/outerCube/PTOISA/TCI.md
new file mode 100644
index 00000000..ee4ab171
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCI.md
@@ -0,0 +1,133 @@
+# TCI
+
+
+## Tile Operation Diagram
+
+![TCI tile operation](../figures/isa/TCI.svg)
+
+## Introduction
+
+Generate a contiguous integer sequence into a destination tile.
+
+## Math Interpretation
+
+For a linearized index `k` over the valid elements:
+
+- Ascending:
+
+  $$ \mathrm{dst}_{k} = S + k $$
+
+- Descending:
+
+  $$ \mathrm{dst}_{k} = S - k $$
+
+The linearization order depends on the tile layout (implementation-defined).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tci %S {descending = false} : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename T, int descending, typename... WaitEvents>
+PTO_INST RecordEvent TCI(TileData &dst, T start, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3/A5)**:
+    - `TileData::DType` must be exactly the same type as the scalar template parameter `T`.
+    - The `dst` and scalar element types must be identical and must be one of: `int32_t`, `uint32_t`, `int16_t`, `uint16_t`.
+    - `TileData::Cols != 1` (this is the condition enforced by the implementation).
+- **Valid region**:
+    - The implementation uses `dst.GetValidCol()` as the sequence length and does not consult `dst.GetValidRow()`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
+  TileT dst;
+  TCI<TileT, int32_t, /*descending=*/0>(dst, /*S=*/0);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
+  TileT dst;
+  TASSIGN(dst, 0x1000);
+  TCI<TileT, int32_t, /*descending=*/1>(dst, /*S=*/100);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tci %S {descending = false} : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
+```
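The `TCI` sequence definition (`S + k` ascending, `S - k` descending) can be checked against a small host-side model. This sketch assumes a single row whose length is the valid column count, following the note that the implementation uses `dst.GetValidCol()` as the sequence length; `ref_tci` is a hypothetical name, and the layout-dependent linearization order is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generates dst[k] = start + k (ascending) or start - k (descending)
// for k in [0, valid_cols). `descending` mirrors the intrinsic's int
// template parameter: 0 selects ascending, non-zero selects descending.
template <typename T, int descending>
std::vector<T> ref_tci(T start, std::size_t valid_cols) {
  std::vector<T> dst(valid_cols);
  for (std::size_t k = 0; k < valid_cols; ++k) {
    const T step = static_cast<T>(k);
    dst[k] = (descending != 0) ? static_cast<T>(start - step)
                               : static_cast<T>(start + step);
  }
  return dst;
}
```

For instance, `ref_tci<std::int32_t, 1>(100, 3)` corresponds to the manual-mode example on the page, shortened to three elements.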
diff --git a/designs/outerCube/PTOISA/TCI_zh.md b/designs/outerCube/PTOISA/TCI_zh.md
new file mode 100644
index 00000000..ba405600
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCI_zh.md
@@ -0,0 +1,106 @@
+# TCI
+
+## 指令示意图
+
+![TCI tile operation](../figures/isa/TCI.svg)
+
+## 简介
+
+生成连续整数序列到目标 Tile 中。
+
+## 数学语义
+
+For a linearized index `k` over the valid elements:
+
+- Ascending:
+
+  $$ \mathrm{dst}_{k} = S + k $$
+
+- Descending:
+
+  $$ \mathrm{dst}_{k} = S - k $$
+
+The linearization order depends on the tile layout (implementation-defined).
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tci %S {descending = false} : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tci %scalar {descending = false} : dtype -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename T, int descending, typename... WaitEvents>
+PTO_INST RecordEvent TCI(TileData &dst, T start, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3/A5)**:
+    - `TileData::DType` must be exactly the same type as the scalar template parameter `T`.
+    - `dst/scalar` element types must be identical, and must be one of: `int32_t`, `uint32_t`, `int16_t`, `uint16_t`.
+    - `TileData::Cols != 1` (this is the condition enforced by the implementation).
+- **有效区域**:
+    - The implementation uses `dst.GetValidCol()` as the sequence length and does not consult `dst.GetValidRow()`.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
+  TileT dst;
+  TCI<TileT, int32_t, /*descending=*/0>(dst, /*S=*/0);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, int32_t, 1, 16>;
+  TileT dst;
+  TASSIGN(dst, 0x1000);
+  TCI<TileT, int32_t, /*descending=*/1>(dst, /*S=*/100);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCMP.md b/designs/outerCube/PTOISA/TCMP.md
new file mode 100644
index 00000000..8e6db2f0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCMP.md
@@ -0,0 +1,142 @@
+# TCMP
+
+
+## Tile Operation Diagram
+
+![TCMP tile operation](../figures/isa/TCMP.svg)
+
+## Introduction
+
+Compare two tiles and write a packed predicate mask.
+
+## Math Interpretation
+
+Conceptually, for each element `(i, j)` in the valid region, define a predicate:
+
+$$ p_{i,j} = \left(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\right) $$
+
+The predicate mask is stored in `dst` using an implementation-defined packed layout.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1, CmpMode cmpMode, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Input type must be one of: `int32_t`, `half`, `float`.
+    - Output type must be `uint8_t`.
+    - `src0/src1/dst` tile location must be `TileType::Vec`.
+    - Static valid bounds: `TileDataSrc::ValidRow <= TileDataSrc::Rows` and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Note: the shape and valid region of `src1` are not checked by explicit runtime assertions in this implementation.
+    - For `TileDataSrc::DType == int32_t`, the implementation uses the `EQ` compare path regardless of `cmpMode`.
+- **Implementation checks (A5)**:
+    - Input type must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Output type must be `uint32_t`.
+    - Implemented (see `include/pto/npu/a5/TCmp.hpp`).
+    - The A5 implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain and writes a packed predicate mask into `dst` (target-defined packing).
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src0, src1;
+  MaskT mask(16, 2);
+  TCMP(mask, src0, src1, CmpMode::GT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src0, src1;
+  MaskT mask(16, 2);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(mask, 0x3000);
+  TCMP(mask, src0, src1, CmpMode::GT);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCMPS.md b/designs/outerCube/PTOISA/TCMPS.md
new file mode 100644
index 00000000..d3d199f6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCMPS.md
@@ -0,0 +1,140 @@
+# TCMPS
+
+
+## Tile Operation Diagram
+
+![TCMPS tile operation](../figures/isa/TCMPS.svg)
+
+## Introduction
+
+Compare a tile against a scalar and write per-element comparison results.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \left(\mathrm{src}_{i,j}\ \mathrm{cmpMode}\ \mathrm{scalar}\right) $$
+
+The encoding/type of `dst` is implementation-defined (often a mask-like tile).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename T, typename... WaitEvents>
+PTO_INST RecordEvent TCMPS(TileDataDst& dst, TileDataSrc0& src0, T src1, CmpMode cmpMode, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0` and `dst` must have the same valid row/col.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Comparison modes**:
+    - Supports `CmpMode::EQ`, `CmpMode::NE`, `CmpMode::LT`, `CmpMode::GT`, `CmpMode::LE`, `CmpMode::GE`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src;
+  DstT dst(16, 2);
+  TCMPS(dst, src, 0.0f, CmpMode::GT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src;
+  DstT dst(16, 2);
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCMPS(dst, src, 0.0f, CmpMode::GT);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCMPS_zh.md b/designs/outerCube/PTOISA/TCMPS_zh.md
new file mode 100644
index 00000000..a697b3e0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCMPS_zh.md
@@ -0,0 +1,113 @@
+# TCMPS
+
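+## Tile Operation Diagram
+
+![TCMPS tile operation](../figures/isa/TCMPS.svg)
+
+## Introduction
+
+Compare a tile against a scalar and write per-element comparison results.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region: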
+
+$$ \mathrm{dst}_{i,j} = \left(\mathrm{src}_{i,j}\ \mathrm{cmpMode}\ \mathrm{scalar}\right) $$
+
+The encoding/type of `dst` is implementation-defined (often a mask-like tile).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcmps %src, %scalar {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcmps %src, %scalar {cmpMode = #pto<cmp xx>} : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcmps ins(%src, %scalar{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename T, typename... WaitEvents>
+PTO_INST RecordEvent TCMPS(TileDataDst& dst, TileDataSrc0& src0, T src1, CmpMode cmpMode, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `float`, `half`, `uint16_t`, `int16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0` and `dst` must have the same valid row/col.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Comparison modes**:
+    - Supports `CmpMode::EQ`, `CmpMode::NE`, `CmpMode::LT`, `CmpMode::GT`, `CmpMode::LE`, `CmpMode::GE`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src;
+  DstT dst(16, 2);
+  TCMPS(dst, src, 0.0f, CmpMode::GT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src;
+  DstT dst(16, 2);
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCMPS(dst, src, 0.0f, CmpMode::GT);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCMP_zh.md b/designs/outerCube/PTOISA/TCMP_zh.md
new file mode 100644
index 00000000..1d7b93dd
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCMP_zh.md
@@ -0,0 +1,115 @@
+# TCMP
+
+## Tile Operation Diagram
+
+![TCMP tile operation](../figures/isa/TCMP.svg)
+
+## Introduction
+
+Compare two tiles and write a packed predicate mask.
+
+## Math Interpretation
+
+Conceptually, for each element `(i, j)` in the valid region, define a predicate:
+
+$$ p_{i,j} = \left(\mathrm{src0}_{i,j}\ \mathrm{cmpMode}\ \mathrm{src1}_{i,j}\right) $$
+
+The predicate mask is stored in `dst` using an implementation-defined packed layout.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcmp %src0, %src1 {cmpMode = #pto.cmp<EQ>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcmp %src0, %src1{cmpMode = #pto<cmp xx>}: (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcmp ins(%src0, %src1{cmpMode = #pto<cmp xx>}: !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/type.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TCMP(TileDataDst &dst, TileDataSrc &src0, TileDataSrc &src1, CmpMode cmpMode, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Input type must be one of: `int32_t`, `half`, `float`.
+    - Output type must be `uint8_t`.
+    - `src0/src1/dst` tile location must be `TileType::Vec`.
+    - Static valid bounds: `TileDataSrc::ValidRow <= TileDataSrc::Rows` and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Note: the shape and valid region of `src1` are not checked by explicit runtime assertions in this implementation.
+    - For `TileDataSrc::DType == int32_t`, the implementation uses the `EQ` compare path regardless of `cmpMode`.
+- **Implementation checks (A5)**:
+    - Input type must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Output type must be `uint32_t`.
+    - Implemented (see `include/pto/npu/a5/TCmp.hpp`).
+    - The A5 implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain and writes a packed predicate mask into `dst` (target-defined packing).
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src0, src1;
+  MaskT mask(16, 2);
+  TCMP(mask, src0, src1, CmpMode::GT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src0, src1;
+  MaskT mask(16, 2);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(mask, 0x3000);
+  TCMP(mask, src0, src1, CmpMode::GT);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCOLARGMAX.md b/designs/outerCube/PTOISA/TCOLARGMAX.md
new file mode 100644
index 00000000..a69a8a3c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLARGMAX.md
@@ -0,0 +1,178 @@
+# TCOLARGMAX
+
+
+## Tile Operation Diagram
+
+![TCOLARGMAX tile operation](../figures/isa/TCOLARGMAX.svg)
+
+## Introduction
+
+Get the row index of the maximum element for each column.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
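+
+As a small worked instance (values chosen purely for illustration, with no ties), take $R = 3$ and $C = 2$:
+
+$$ \mathrm{src} = \begin{bmatrix} 1 & 5 \\ 4 & 2 \\ 3 & 9 \end{bmatrix} \quad\Rightarrow\quad \mathrm{dst} = \begin{bmatrix} 1 & 2 \end{bmatrix} $$
+
+Column 0 peaks at row 1 (value 4) and column 1 peaks at row 2 (value 9), so `dst` holds a single row of zero-based row indices.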
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `half`, `float`, `uint16_t`, `uint32_t`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - `tmp` data type must be consistent with `src` data type.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+- A5:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+  - The `tmp` tile is not used; it is accepted only for API compatibility.
+
+### About temporary tile `tmp` for A2A3
+
+* `tmp` is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmax index, and current max elements).
+* `tmp` tile's data type must be the same as `src`'s data type.
+* `tmp` tile is organized into three regions within a single row:
+  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
+  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current maximum elements for comparison.
+  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmax index result (before final conversion to `dst`).
+* `tmpGapEles` is determined as follows:
+  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
+  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
+* When `src` is small, the `tmp` tile can simply be given the same size as `src`; otherwise, compute the required stride from the `validCol` of `src` using the following formula:
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
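+
+As a worked instance of the formulas above, assume `float` elements with `elementPerRepeat = elemPerRpt = 64` and `elementPerBlock = elemPerBlock = 8`. These values correspond to a 256-byte repeat and a 32-byte block; both sizes, and the equivalence of the two naming styles, are assumptions not stated in this document. With `validCol = 255`:
+
+$$ \mathrm{repeats} = \left\lceil \frac{255}{64} \right\rceil = 4 $$
+
+$$ \mathrm{stride} = \left\lceil \frac{4 \cdot 2}{8} \right\rceil \cdot 8 + \left\lceil \frac{4}{8} \right\rceil \cdot 8 = 8 + 8 = 16 $$
+
+Since $255 \ge 64$, the first rule gives `tmpGapEles = 64`. Under these assumptions, the computed stride of 16 elements fits within a 32-column `tmp` tile.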
+
+### About temporary tile `tmp` for A5
+
+* The `tmp` tile is **not used** in the A5 implementation, which uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
+* `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TCOLARGMAX(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TASSIGN(src, 0x0);
+  TASSIGN(dst, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TCOLARGMAX(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
\ No newline at end of file
diff --git a/designs/outerCube/PTOISA/TCOLARGMAX_zh.md b/designs/outerCube/PTOISA/TCOLARGMAX_zh.md
new file mode 100644
index 00000000..91268287
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLARGMAX_zh.md
@@ -0,0 +1,178 @@
+# TCOLARGMAX
+
+## Tile Operation Diagram
+
+![TCOLARGMAX tile operation](../figures/isa/TCOLARGMAX.svg)
+
+## Introduction
+
+Get the row index of the maximum element for each column.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `half`, `float`, `uint16_t`, `uint32_t`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - `tmp` data type must be consistent with `src` data type.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+- A5:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+  - The `tmp` temporary tile is not used; it is kept only for compatibility.
+
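+### About temporary tile `tmp` for A2A3
+
+* `tmp` is **always used** in the A2A3 implementation as scratch space for intermediate results (current row index, argmax index, and current maximum elements).
+* The data type of the `tmp` tile must be the same as that of `src`.
+* The `tmp` tile is organized into three regions within a single row:
+  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
+  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current maximum elements, used for comparison.
+  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmax index result (converted at the end and then written to `dst`).
+* `tmpGapEles` is determined as follows:
+  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
+  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
+* When `src` is small, the `tmp` tile can simply be given the same size as `src`; alternatively, compute the stride required for the `tmp` tile from the `validCol` of `src` using the following formula: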
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
+
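+### About temporary tile `tmp` for A5
+
+* The `tmp` temporary tile is **not used** in the A5 implementation. A5 uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
+* `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.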
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TCOLARGMAX(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TASSIGN(src, 0x0);
+  TASSIGN(dst, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TCOLARGMAX(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
\ No newline at end of file
diff --git a/designs/outerCube/PTOISA/TCOLARGMIN.md b/designs/outerCube/PTOISA/TCOLARGMIN.md
new file mode 100644
index 00000000..63df789f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLARGMIN.md
@@ -0,0 +1,172 @@
+# TCOLARGMIN
+
+
+## Tile Operation Diagram
+
+![TCOLARGMIN tile operation](../figures/isa/TCOLARGMIN.svg)
+
+## Introduction
+
+Get the row index of the minimum element for each column.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
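+
+As a small worked instance (values chosen purely for illustration, with no ties), take $R = 3$ and $C = 2$:
+
+$$ \mathrm{src} = \begin{bmatrix} 7 & 2 \\ 3 & 8 \\ 6 & 4 \end{bmatrix} \quad\Rightarrow\quad \mathrm{dst} = \begin{bmatrix} 1 & 0 \end{bmatrix} $$
+
+Column 0 bottoms out at row 1 (value 3) and column 1 at row 0 (value 2), so `dst` holds a single row of zero-based row indices.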
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `half`, `float`, `uint16_t`, `uint32_t`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - `tmp` data type must be consistent with `src` data type.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+- A5:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+  - The `tmp` tile is not used; it is accepted only for API compatibility.
+
+### About temporary tile `tmp` for A2A3
+
+* `tmp` is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmin index, and current min elements).
+* `tmp` tile's data type must be the same as `src`'s data type.
+* `tmp` tile is organized into three regions within a single row:
+  - Region 0 (`[0, tmpGapEles)`): current row index counter (incremented per row).
+  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): current minimum elements for comparison.
+  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): argmin index result (before final conversion to `dst`).
+* `tmpGapEles` is determined as follows:
+  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
+  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
+* When `src` is small, the `tmp` tile can simply be given the same size as `src`; otherwise, compute the required stride from the `validCol` of `src` using the following formula:
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
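+
+As a worked instance of the formulas above for a narrower element type, assume `half` elements with `elementPerRepeat = elemPerRpt = 128` and `elementPerBlock = elemPerBlock = 16`. These values correspond to a 256-byte repeat and a 32-byte block; both sizes, and the equivalence of the two naming styles, are assumptions not stated in this document. With `validCol = 100`:
+
+$$ \mathrm{repeats} = \left\lceil \frac{100}{128} \right\rceil = 1 $$
+
+$$ \mathrm{stride} = \left\lceil \frac{1 \cdot 2}{16} \right\rceil \cdot 16 + \left\lceil \frac{1}{16} \right\rceil \cdot 16 = 16 + 16 = 32 $$
+
+$$ \mathrm{tmpGapEles} = \left\lceil \frac{100}{16} \right\rceil \cdot 16 = 7 \cdot 16 = 112 $$
+
+The last line uses the second `tmpGapEles` rule, because $100 < 128$.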
+
+### About temporary tile `tmp` for A5
+
+* The `tmp` tile is **not used** in the A5 implementation, which uses vector-register-based computation (`__VEC_SCOPE__`) and does not require scratch tile storage.
+* `tmp` is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TCOLARGMIN(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TASSIGN(src, 0x0);
+  TASSIGN(dst, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TCOLARGMIN(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
\ No newline at end of file
diff --git a/designs/outerCube/PTOISA/TCOLARGMIN_zh.md b/designs/outerCube/PTOISA/TCOLARGMIN_zh.md
new file mode 100644
index 00000000..33fe2a24
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLARGMIN_zh.md
@@ -0,0 +1,172 @@
+# TCOLARGMIN
+
+## Tile Operation Diagram
+
+![TCOLARGMIN tile operation](../figures/isa/TCOLARGMIN.svg)
+
+## Introduction
+
+Return, for each column, the row index of the minimum value.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
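+
+To make the definition concrete, here is a minimal scalar reference sketch (an illustration, not the NPU implementation). `col_argmin_ref` is a hypothetical helper name, and the tie-breaking rule (the lowest row index wins) is an assumption, since this page does not specify it.
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// Hypothetical host-side reference model; not part of the PTO headers.
+// src is a row-major R x C matrix; the result holds one row index per column.
+std::vector<std::uint32_t> col_argmin_ref(const std::vector<std::vector<float>>& src) {
+  const std::size_t rows = src.size();
+  const std::size_t cols = rows ? src[0].size() : 0;
+  std::vector<std::uint32_t> dst(cols, 0);  // start with row 0 as the candidate
+  for (std::size_t j = 0; j < cols; ++j) {
+    for (std::size_t i = 1; i < rows; ++i) {
+      // Strict '<' keeps the earliest row on ties (assumed tie-breaking rule).
+      if (src[i][j] < src[dst[j]][j]) {
+        dst[j] = static_cast<std::uint32_t>(i);
+      }
+    }
+  }
+  return dst;
+}
+```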
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `half`, `float`, `uint16_t`, `uint32_t`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - The `tmp` data type must match the `src` data type.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid-region checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+- A5:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Source data types: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Compile-time check: `src.ValidCol` must be `1` or `-1` (dynamic).
+  - Runtime valid-region checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+    - `dstValidRow == 1`.
+    - `srcValidCol == dstValidCol`.
+  - The `tmp` scratch tile is unused and is accepted only for compatibility.
+
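+### Notes on the `tmp` scratch tile (A2A3)
+
+* On A2A3, `tmp` is **always used** as scratch storage for intermediate results (the current row index, the argmin index, and the current minimum element).
+* The data type of the `tmp` tile must match that of `src`.
+* Within a single row, the `tmp` tile is divided into three regions:
+  - Region 0 (`[0, tmpGapEles)`): the current row-index counter (incremented for each row).
+  - Region 1 (`[tmpGapEles, 2 * tmpGapEles)`): the current minimum element, used for comparison.
+  - Region 2 (`[2 * tmpGapEles, 3 * tmpGapEles)`): the argmin index result (converted at the end and written to `dst`).
+* `tmpGapEles` is determined as follows:
+  - When `srcValidCol >= elemPerRpt`: `tmpGapEles = elemPerRpt`.
+  - When `srcValidCol < elemPerRpt`: `tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock`.
+* When `src` is small, the `tmp` tile can simply be given the same size as `src`; alternatively, the stride required for the `tmp` tile can be computed from the `validCol` of `src` with the following formula: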
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
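+
+The sizing rules above can be sketched as small helpers. This is a hedged illustration only: the function names are hypothetical, `elementPerRepeat` / `elementPerBlock` in the formula are assumed to be the same quantities as `elemPerRpt` / `elemPerBlock`, and both are passed in as parameters because their values depend on the target and data type.
+
+```cpp
+#include <cstdint>
+
+// Hypothetical helpers; not part of the PTO headers.
+constexpr std::uint32_t ceil_div(std::uint32_t a, std::uint32_t b) {
+  return (a + b - 1) / b;
+}
+
+// Rounds a up to the next multiple of b.
+constexpr std::uint32_t round_up(std::uint32_t a, std::uint32_t b) {
+  return ceil_div(a, b) * b;
+}
+
+// Width of each of the three regions inside the tmp row.
+constexpr std::uint32_t tmp_gap_eles(std::uint32_t validCol, std::uint32_t elemPerRpt,
+                                     std::uint32_t elemPerBlock) {
+  return validCol >= elemPerRpt ? elemPerRpt : round_up(validCol, elemPerBlock);
+}
+
+// stride = ceil(repeats * 2 / elemPerBlock) * elemPerBlock
+//        + ceil(repeats / elemPerBlock) * elemPerBlock,
+// with repeats = ceil(validCol / elemPerRpt).
+constexpr std::uint32_t tmp_stride(std::uint32_t validCol, std::uint32_t elemPerRpt,
+                                   std::uint32_t elemPerBlock) {
+  return round_up(ceil_div(validCol, elemPerRpt) * 2, elemPerBlock) +
+         round_up(ceil_div(validCol, elemPerRpt), elemPerBlock);
+}
+```
+
+With placeholder values `elemPerRpt = 64` and `elemPerBlock = 8` (assumed for illustration, not confirmed hardware parameters), `tmp_stride(255, 64, 8)` evaluates to 16 and `tmp_gap_eles(20, 64, 8)` to 24.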
+
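+### Notes on the `tmp` scratch tile (A5)
+
+* On A5, the `tmp` scratch tile is **not used**. A5 uses a vector-register-based computation (`__VEC_SCOPE__`) and needs no scratch tile storage.
+* `tmp` is kept in the C++ intrinsic signature only for API compatibility with A2A3.
+
+## Examples
+
+### Auto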
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TCOLARGMIN(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
+  using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
+  SrcT src(16, 255);
+  DstT dst(1, 255);
+  TmpT tmp(1, 32);
+  TASSIGN(src, 0x0);
+  TASSIGN(dst, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TCOLARGMIN(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
\ No newline at end of file
diff --git a/designs/outerCube/PTOISA/TCOLEXPAND.md b/designs/outerCube/PTOISA/TCOLEXPAND.md
new file mode 100644
index 00000000..6ccf4b10
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPAND.md
@@ -0,0 +1,103 @@
+# TCOLEXPAND
+
+## Tile Operation Diagram
+
+![TCOLEXPAND tile operation](../figures/isa/TCOLEXPAND.svg)
+
+## Introduction
+
+Broadcast the first element of each source column across the destination column.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{0,j} $$
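+
+As a minimal sketch of these semantics (a host-side illustration, not the PTO implementation), the reference model below copies row 0 of `src` into every row of the result. The `Matrix` alias and `col_expand_ref` are hypothetical names introduced only for this example.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side reference model; not part of the PTO headers.
+using Matrix = std::vector<std::vector<float>>;
+
+// Builds an R x C result with dst[i][j] = src[0][j].
+// Precondition: src has at least one row holding at least C elements.
+Matrix col_expand_ref(const Matrix& src, std::size_t R, std::size_t C) {
+  Matrix dst(R, std::vector<float>(C));
+  for (std::size_t i = 0; i < R; ++i) {
+    for (std::size_t j = 0; j < C; ++j) {
+      dst[i][j] = src[0][j];  // row 0 of src is broadcast down each column
+    }
+  }
+  return dst;
+}
+```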
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TCOLEXPAND(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDADD.md b/designs/outerCube/PTOISA/TCOLEXPANDADD.md
new file mode 100644
index 00000000..49b22407
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDADD.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDADD
+
+## Tile Operation Diagram
+
+![TCOLEXPANDADD tile operation](../figures/isa/TCOLEXPANDADD.svg)
+
+## Introduction
+
+Column-wise broadcast add: add a per-column scalar taken from `src1` to each element of `src0`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_j
+$$
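+
+The formula can be checked against a small host-side reference model. This is a sketch under stated assumptions, not the PTO implementation: `Matrix`, `col_expand_binary_ref`, and `col_expand_add_ref` are hypothetical names, and `s` stands for the `C` per-column scalars read from `src1`.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side reference model; not part of the PTO headers.
+using Matrix = std::vector<std::vector<float>>;
+
+// Applies dst[i][j] = op(src0[i][j], s[j]) for every row i and column j.
+// Precondition: every row of src0 holds at least s.size() elements.
+template <typename Op>
+Matrix col_expand_binary_ref(const Matrix& src0, const std::vector<float>& s, Op op) {
+  Matrix dst(src0.size(), std::vector<float>(s.size()));
+  for (std::size_t i = 0; i < src0.size(); ++i) {
+    for (std::size_t j = 0; j < s.size(); ++j) {
+      dst[i][j] = op(src0[i][j], s[j]);
+    }
+  }
+  return dst;
+}
+
+// Column-wise broadcast add: dst[i][j] = src0[i][j] + s[j].
+Matrix col_expand_add_ref(const Matrix& src0, const std::vector<float>& s) {
+  return col_expand_binary_ref(src0, s, [](float a, float b) { return a + b; });
+}
+```
+
+Passing a different `op` (divide, multiply, max, min) gives the same kind of model for the other column-broadcast binary forms.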
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDADD_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDADD_zh.md
new file mode 100644
index 00000000..a4983a95
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDADD_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDADD
+
+## Tile Operation Diagram
+
+![TCOLEXPANDADD tile operation](../figures/isa/TCOLEXPANDADD.svg)
+
+## Introduction
+
+Column-wise broadcast add: add a per-column scalar vector to every column.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_j
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
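+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.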
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDDIV.md b/designs/outerCube/PTOISA/TCOLEXPANDDIV.md
new file mode 100644
index 00000000..6cc63a92
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDDIV.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDDIV
+
+## Tile Operation Diagram
+
+![TCOLEXPANDDIV tile operation](../figures/isa/TCOLEXPANDDIV.svg)
+
+## Introduction
+
+Column-wise broadcast divide: divide each element of `src0` by a per-column scalar vector `src1`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} / s_j
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDDIV_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDDIV_zh.md
new file mode 100644
index 00000000..8816edbd
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDDIV_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDDIV
+
+## Tile Operation Diagram
+
+![TCOLEXPANDDIV tile operation](../figures/isa/TCOLEXPANDDIV.svg)
+
+## Introduction
+
+Column-wise broadcast divide: divide every column by a per-column scalar vector.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} / s_j
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
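+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.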
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF.md b/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF.md
new file mode 100644
index 00000000..ea7d54b0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDEXPDIF
+
+## Tile Operation Diagram
+
+![TCOLEXPANDEXPDIF tile operation](../figures/isa/TCOLEXPANDEXPDIF.svg)
+
+## Introduction
+
+Column-wise exp-diff: compute `exp(src0 - src1)` using a per-column scalar vector `src1`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_j)
+$$
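+
+A short host-side reference sketch of this formula follows. It is an illustration rather than the PTO implementation; `col_expand_expdif_ref` is a hypothetical name, and `s` stands for the `C` per-column scalars read from `src1`.
+
+```cpp
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side reference model; not part of the PTO headers.
+// Computes dst[i][j] = exp(src0[i][j] - s[j]).
+// Precondition: every row of src0 holds at least s.size() elements.
+std::vector<std::vector<float>> col_expand_expdif_ref(
+    const std::vector<std::vector<float>>& src0, const std::vector<float>& s) {
+  std::vector<std::vector<float>> dst(src0.size(), std::vector<float>(s.size()));
+  for (std::size_t i = 0; i < src0.size(); ++i) {
+    for (std::size_t j = 0; j < s.size(); ++j) {
+      dst[i][j] = std::exp(src0[i][j] - s[j]);
+    }
+  }
+  return dst;
+}
+```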
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF_zh.md
new file mode 100644
index 00000000..b2fdc7ed
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDEXPDIF_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDEXPDIF
+
+## Tile Operation Diagram
+
+![TCOLEXPANDEXPDIF tile operation](../figures/isa/TCOLEXPANDEXPDIF.svg)
+
+## Introduction
+
+Column-wise exp-diff: compute `exp(src0 - src1)`, where `src1` provides one scalar per column.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_j)
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
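+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.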
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMAX.md b/designs/outerCube/PTOISA/TCOLEXPANDMAX.md
new file mode 100644
index 00000000..85337c45
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMAX.md
@@ -0,0 +1,90 @@
+# TCOLEXPANDMAX
+
+## Tile Operation Diagram
+
+![TCOLEXPANDMAX tile operation](../figures/isa/TCOLEXPANDMAX.svg)
+
+## Introduction
+
+Column-wise broadcast max: take `max(src0, src1)` where `src1` provides one scalar per column.
+
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_j)
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMAX_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDMAX_zh.md
new file mode 100644
index 00000000..ddd493b4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMAX_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDMAX
+
+## Tile Operation Diagram
+
+![TCOLEXPANDMAX tile operation](../figures/isa/TCOLEXPANDMAX.svg)
+
+## Introduction
+
+Column-wise broadcast max: take the maximum against a per-column scalar vector.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_j)
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
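+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.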
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMIN.md b/designs/outerCube/PTOISA/TCOLEXPANDMIN.md
new file mode 100644
index 00000000..0cd2eec5
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMIN.md
@@ -0,0 +1,90 @@
+# TCOLEXPANDMIN
+
+## Tile Operation Diagram
+
+![TCOLEXPANDMIN tile operation](../figures/isa/TCOLEXPANDMIN.svg)
+
+## Introduction
+
+Column-wise broadcast min: take `min(src0, src1)` where `src1` provides one scalar per column.
+
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_j)
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMIN_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDMIN_zh.md
new file mode 100644
index 00000000..11c7bae7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMIN_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDMIN
+
+## Tile Operation Diagram
+
+![TCOLEXPANDMIN tile operation](../figures/isa/TCOLEXPANDMIN.svg)
+
+## Introduction
+
+Column-wise broadcast min: take the minimum against a per-column scalar vector.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_j)
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
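+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.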
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMUL.md b/designs/outerCube/PTOISA/TCOLEXPANDMUL.md
new file mode 100644
index 00000000..c1bcc7d6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMUL.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDMUL
+
+## Tile Operation Diagram
+
+![TCOLEXPANDMUL tile operation](../figures/isa/TCOLEXPANDMUL.svg)
+
+## Introduction
+
+Column-wise broadcast multiply: multiply each element of `src0` by a per-column scalar vector `src1`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot s_j
+$$
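+
+For a concrete feel of the broadcast, here is a small worked case with `R = 2` and `C = 3` (the numbers are illustrative assumptions, not values from the specification). Each column `j` of `src0` is scaled by its own `s_j`:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad
+s = (2,\ 10,\ 0.5)
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 2 & 20 & 1.5 \\ 8 & 50 & 3 \end{bmatrix}
+$$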
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDMUL_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDMUL_zh.md
new file mode 100644
index 00000000..4bd2711b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDMUL_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDMUL
+
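+## Tile Operation Diagram
+
+![TCOLEXPANDMUL tile operation](../figures/isa/TCOLEXPANDMUL.svg)
+
+## Introduction
+
+Column-wise broadcast multiply: multiply each element of `src0` by a per-column scalar vector `src1`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`: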
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot s_j
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
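+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.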
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDSUB.md b/designs/outerCube/PTOISA/TCOLEXPANDSUB.md
new file mode 100644
index 00000000..6330e1a1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDSUB.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDSUB
+
+## Tile Operation Diagram
+
+![TCOLEXPANDSUB tile operation](../figures/isa/TCOLEXPANDSUB.svg)
+
+## Introduction
+
+Column-wise broadcast subtract: subtract a per-column scalar vector `src1` from `src0`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - s_j
+$$
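+
+A short worked case may help fix the operand order (`src0` minus the per-column scalar, not the reverse). The numbers below are illustrative assumptions with `R = 2` and `C = 3`:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad
+s = (1,\ 1,\ -2)
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 0 & 1 & 5 \\ 3 & 4 & 8 \end{bmatrix}
+$$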
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPANDSUB_zh.md b/designs/outerCube/PTOISA/TCOLEXPANDSUB_zh.md
new file mode 100644
index 00000000..5a28c2d1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPANDSUB_zh.md
@@ -0,0 +1,89 @@
+# TCOLEXPANDSUB
+
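+## Tile Operation Diagram
+
+![TCOLEXPANDSUB tile operation](../figures/isa/TCOLEXPANDSUB.svg)
+
+## Introduction
+
+Column-wise broadcast subtract: subtract a per-column scalar vector `src1` from `src0`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_j` be the per-column scalar taken from `src1` (one value per column).
+
+For `0 <= i < R` and `0 <= j < C`: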
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - s_j
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
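+## Constraints
+
+- `TileDataDst::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- `src1` is expected to provide **one scalar per column** (i.e., its valid shape must cover `C` values).
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TColExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.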
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCOLEXPAND_zh.md b/designs/outerCube/PTOISA/TCOLEXPAND_zh.md
new file mode 100644
index 00000000..6d966964
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLEXPAND_zh.md
@@ -0,0 +1,76 @@
+# TCOLEXPAND
+
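+## Tile Operation Diagram
+
+![TCOLEXPAND tile operation](../figures/isa/TCOLEXPAND.svg)
+
+## Introduction
+
+Broadcast the first element of each source column across the corresponding destination column.
+
+## Math Interpretation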
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{0,j} $$
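+
+Only row 0 of `src` contributes. As an illustration (the values are assumptions chosen for the example), with `R = 3`, `C = 3`, and a first source row of `(7, 8, 9)`, every destination row is a copy of that row:
+
+$$
+\mathrm{src}_{0,\cdot} = (7,\ 8,\ 9)
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 7 & 8 & 9 \\ 7 & 8 & 9 \\ 7 & 8 & 9 \end{bmatrix}
+$$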
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TCOLEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TCOLEXPAND(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCOLMAX.md b/designs/outerCube/PTOISA/TCOLMAX.md
new file mode 100644
index 00000000..fdfed987
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLMAX.md
@@ -0,0 +1,135 @@
+# TCOLMAX
+
+## Tile Operation Diagram
+
+![TCOLMAX tile operation](../figures/isa/TCOLMAX.svg)
+
+## Introduction
+
+Reduce each column by taking the maximum across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \max_{0 \le i < R} \mathrm{src}_{i,j} $$
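+
+To make the reduction direction concrete, consider an illustrative `2 x 3` source (the values are assumptions for the example). Each column collapses to a single entry in row 0 of `dst`:
+
+$$
+\mathrm{src} = \begin{bmatrix} 1 & 8 & 3 \\ 4 & 2 & 9 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst}_{0,\cdot} = (4,\ 8,\ 9)
+$$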
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLMAX(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- Data types:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
+- DType consistency: `dst.DType == src.DType`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLMAX(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLMAX(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCOLMAX_zh.md b/designs/outerCube/PTOISA/TCOLMAX_zh.md
new file mode 100644
index 00000000..a958f1c4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLMAX_zh.md
@@ -0,0 +1,108 @@
+# TCOLMAX
+
+## Tile Operation Diagram
+
+![TCOLMAX tile operation](../figures/isa/TCOLMAX.svg)
+
+## Introduction
+
+Reduce each column by taking the maximum across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \max_{0 \le i < R} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolmax ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLMAX(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- Data types:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
+- DType consistency: `dst.DType == src.DType`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLMAX(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLMAX(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCOLMIN.md b/designs/outerCube/PTOISA/TCOLMIN.md
new file mode 100644
index 00000000..1cfdcd35
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLMIN.md
@@ -0,0 +1,135 @@
+# TCOLMIN
+
+## Tile Operation Diagram
+
+![TCOLMIN tile operation](../figures/isa/TCOLMIN.svg)
+
+## Introduction
+
+Reduce each column by taking the minimum across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} $$
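+
+A brief worked case, using an illustrative `2 x 3` source whose values are assumptions for the example, shows the per-column minimum landing in row 0 of `dst`:
+
+$$
+\mathrm{src} = \begin{bmatrix} 1 & 8 & 3 \\ 4 & 2 & 9 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst}_{0,\cdot} = (1,\ 2,\ 3)
+$$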
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLMIN(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- Data types:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
+- DType consistency: `dst.DType == src.DType`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLMIN(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLMIN(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCOLMIN_zh.md b/designs/outerCube/PTOISA/TCOLMIN_zh.md
new file mode 100644
index 00000000..9a6fcadf
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLMIN_zh.md
@@ -0,0 +1,108 @@
+# TCOLMIN
+
+## Tile Operation Diagram
+
+![TCOLMIN tile operation](../figures/isa/TCOLMIN.svg)
+
+## Introduction
+
+Reduce each column by taking the minimum across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolmin ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLMIN(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- Data types:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `bfloat16_t`.
+- DType consistency: `dst.DType == src.DType`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLMIN(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLMIN(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCOLPROD.md b/designs/outerCube/PTOISA/TCOLPROD.md
new file mode 100644
index 00000000..d0d050b6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLPROD.md
@@ -0,0 +1,135 @@
+# TCOLPROD
+
+## Tile Operation Diagram
+
+![TCOLPROD tile operation](../figures/isa/TCOLPROD.svg)
+
+## Introduction
+
+Reduce each column by multiplying across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \prod_{i=0}^{R-1} \mathrm{src}_{i,j} $$
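+
+As a quick check of the formula on an illustrative `2 x 3` source (the values are assumptions for the example), each column's entries are multiplied together:
+
+$$
+\mathrm{src} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst}_{0,\cdot} = (1 \cdot 4,\ 2 \cdot 5,\ 3 \cdot 6) = (4,\ 10,\ 18)
+$$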
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLPROD(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- DType consistency: `dst.DType == src.DType`.
+- Supported `src.DType`:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `bfloat16`, `int16_t`, `int32_t`, `uint16_t`, `uint32_t`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLPROD(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLPROD(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCOLPROD_zh.md b/designs/outerCube/PTOISA/TCOLPROD_zh.md
new file mode 100644
index 00000000..36ef0001
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLPROD_zh.md
@@ -0,0 +1,108 @@
+# TCOLPROD
+
+## Tile Operation Diagram
+
+![TCOLPROD tile operation](../figures/isa/TCOLPROD.svg)
+
+## Introduction
+
+Reduce each column by multiplying across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \prod_{i=0}^{R-1} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLPROD(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: both tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- DType consistency: `dst.DType == src.DType`.
+- Supported `src.DType`:
+    - A2A3: `half`, `float`, `int16_t`, `int32_t`.
+    - A5: `half`, `float`, `bfloat16`, `int16_t`, `int32_t`, `uint16_t`, `uint32_t`.
+- Runtime valid checks:
+    - `src.GetValidCol() == dst.GetValidCol()`.
+    - If `src.GetValidRow() == 0` or `src.GetValidCol() == 0`, the implementation returns early.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TCOLPROD(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCOLPROD(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCOLSUM.md b/designs/outerCube/PTOISA/TCOLSUM.md
new file mode 100644
index 00000000..ef4ec250
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLSUM.md
@@ -0,0 +1,149 @@
+# TCOLSUM
+
+## Tile Operation Diagram
+
+![TCOLSUM tile operation](../figures/isa/TCOLSUM.svg)
+
+## Introduction
+
+Reduce each column by summing across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \sum_{i=0}^{R-1} \mathrm{src}_{i,j} $$
+
+`isBinary` selects the implementation path (binary-tree accumulation vs. sequential accumulation).
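+
+The two accumulation orders can be sketched for a single column with `R = 4`, writing `a_i` for `src_{i,j}`. The exact pairing a backend uses is not stated here, so the binary-tree grouping below is an assumption for illustration:
+
+$$
+\begin{aligned}
+\text{sequential:}\quad & ((a_0 + a_1) + a_2) + a_3 \\
+\text{binary tree:}\quad & (a_0 + a_1) + (a_2 + a_3)
+\end{aligned}
+$$
+
+For the column `(1, 2, 3, 4)` both orders give `10`. In floating point the two orders can round differently, because floating-point addition is not associative.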
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
+```
+
+Lowering may introduce internal scratch tiles; the C++ overload that takes `isBinary` also requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
+%dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, bool isBinary, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst`, `src`, `tmp` must be `TileType::Vec`.
+- Tile layout: all tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- DType consistency:
+    - A2A3: `src.DType` must be one of `half`, `float`, `int16_t`, `int32_t`, and `dst.DType == tmp.DType == src.DType`.
+    - A5: `dst.DType == src.DType` is required by `TColReduceCheck`; the exact supported `src.DType` set is target-defined (see `include/pto/npu/a5/TColReduceOps.hpp`).
+- Runtime valid checks:
+    - A2A3: `src.GetValidCol() == dst.GetValidCol()`; returns early if `src.GetValidRow() == 0` or `src.GetValidCol() == 0`.
+    - A5: `srcValidRow` and `srcValidCol` must be non-zero; `srcValidCol == dstValidCol` is asserted by `TColReduceCheck`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TCOLSUM_zh.md b/designs/outerCube/PTOISA/TCOLSUM_zh.md
new file mode 100644
index 00000000..b15c5ad6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCOLSUM_zh.md
@@ -0,0 +1,122 @@
+# TCOLSUM
+
+## Tile Operation Diagram
+
+![TCOLSUM tile operation](../figures/isa/TCOLSUM.svg)
+
+## Introduction
+
+Reduce each column by summing across rows.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= j < C`:
+
+$$ \mathrm{dst}_{0,j} = \sum_{i=0}^{R-1} \mathrm{src}_{i,j} $$
+
+`isBinary` selects the implementation path (binary-tree accumulation vs. sequential accumulation).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcolsum %src {isBinary = false} : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic overload that takes `isBinary` also requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
+%dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tcolsum %src : !pto.tile<...> -> !pto.tile<...>
+%dst = pto.tcolsum %src, %tmp {isBinary = false} : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tcolsum ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tcolsum ins(%src, %tmp {isBinary = false} : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TCOLSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, bool isBinary, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile location: `dst`, `src`, `tmp` must be `TileType::Vec`.
+- Tile layout: all tiles must be ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+- DType consistency:
+    - A2A3: `src.DType` must be one of `half`, `float`, `int16_t`, `int32_t`, and `dst.DType == tmp.DType == src.DType`.
+    - A5: `dst.DType == src.DType` is required by `TColReduceCheck`; the exact supported `src.DType` set is target-defined (see `include/pto/npu/a5/TColReduceOps.hpp`).
+- Runtime valid checks:
+    - A2A3: `src.GetValidCol() == dst.GetValidCol()`; returns early if `src.GetValidRow() == 0` or `src.GetValidCol() == 0`.
+    - A5: `srcValidRow` and `srcValidCol` must be non-zero; `srcValidCol == dstValidCol` is asserted by `TColReduceCheck`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TCOLSUM(dst, src, tmp, /*isBinary=*/false);
+}
+```
diff --git a/designs/outerCube/PTOISA/TCONCAT.md b/designs/outerCube/PTOISA/TCONCAT.md
new file mode 100644
index 00000000..e24be1e1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCONCAT.md
@@ -0,0 +1,40 @@
+# TCONCAT
+
+## Tile Operation Diagram
+
+![TCONCAT tile operation](../figures/isa/TCONCAT.svg)
+
+## Introduction
+
+Concatenate two source tiles along the column dimension into a destination tile.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tconcat ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tconcat ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TCONCAT_zh.md b/designs/outerCube/PTOISA/TCONCAT_zh.md
new file mode 100644
index 00000000..3a050b7e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCONCAT_zh.md
@@ -0,0 +1,41 @@
+# TCONCAT
+
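+## Tile Operation Diagram
+
+![TCONCAT tile operation](../figures/isa/TCONCAT.svg)
+
+## Introduction
+
+Concatenate two source tiles along the column dimension into a destination tile.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.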
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tconcat ...
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tconcat ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TCVT.md b/designs/outerCube/PTOISA/TCVT.md
new file mode 100644
index 00000000..72bdccf0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCVT.md
@@ -0,0 +1,124 @@
+# TCVT
+
+
+## Tile Operation Diagram
+
+![TCVT tile operation](../figures/isa/TCVT.svg)
+
+## Introduction
+
+Elementwise type conversion with a specified rounding mode.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{cast}_{\mathrm{rmode}}\!\left(\mathrm{src}_{i,j}\right) $$
+
+where `rmode` is a rounding policy (see `pto::RoundMode`).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS, typename... WaitEvents>
+PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, SaturationMode satMode, WaitEvents &... events);
+
+template <typename TileDataD, typename TileDataS, typename... WaitEvents>
+PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, WaitEvents &... events);
+```
+
+## Constraints
+
+- `dst` and `src` must be compatible in shape/valid region as required by the implementation.
+- The conversion `(src element type) -> (dst element type)` must be supported by the target for the given `RoundMode`.
+- **Implementation notes (A2A3/A5)**:
+    - One form accepts an explicit `SaturationMode`, and the specified saturation behavior is forwarded directly to the implementation.
+    - The other form omits `SaturationMode`; in that case, the implementation chooses a target-defined default saturation behavior for the specific type pair.
+    - On CPU, only the form without explicit `SaturationMode` is currently implemented.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, half, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TCVT(dst, src, RoundMode::CAST_RINT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, half, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCVT(dst, src, RoundMode::CAST_RINT);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TCVT_zh.md b/designs/outerCube/PTOISA/TCVT_zh.md
new file mode 100644
index 00000000..fddd187a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TCVT_zh.md
@@ -0,0 +1,124 @@
+# TCVT
+
+## Tile Operation Diagram
+
+![TCVT tile operation](../figures/isa/TCVT.svg)
+
+## Introduction
+
+Elementwise type conversion with a specified rounding mode.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{cast}_{\mathrm{rmode}}\!\left(\mathrm{src}_{i,j}\right) $$
+
+where `rmode` is a rounding policy (see `pto::RoundMode`).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS, typename... WaitEvents>
+PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, SaturationMode satMode, WaitEvents &... events);
+
+template <typename TileDataD, typename TileDataS, typename... WaitEvents>
+PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, WaitEvents &... events);
+```
+
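+## Constraints
+
+- `dst` and `src` must be compatible in shape/valid region as required by the implementation.
+- The conversion `(src element type) -> (dst element type)` must be supported by the target for the given `RoundMode`.
+- **Implementation notes (A2A3/A5)**:
+    - One form accepts an explicit `SaturationMode`, and the specified saturation behavior is forwarded directly to the implementation.
+    - The other form omits `SaturationMode`; in that case, the implementation chooses a target-defined default saturation behavior for the specific type pair.
+    - On CPU, only the form without explicit `SaturationMode` is currently implemented.
+
+## Examples
+
+### Auto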
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, half, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TCVT(dst, src, RoundMode::CAST_RINT);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, half, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TCVT(dst, src, RoundMode::CAST_RINT);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tcvt %src{rmode = #pto<round_mode xx>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tcvt ins(%src{rmode = #pto<round_mode xx>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TDEQUANT.md b/designs/outerCube/PTOISA/TDEQUANT.md
new file mode 100644
index 00000000..8204ce03
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDEQUANT.md
@@ -0,0 +1,40 @@
+# TDEQUANT
+
+## Tile Operation Diagram
+
+![TDEQUANT tile operation](../figures/isa/TDEQUANT.svg)
+
+## Introduction
+
+Dequantize an integer tile into a floating-point tile using scale and offset tiles.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tdequant ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tdequant ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TDEQUANT_zh.md b/designs/outerCube/PTOISA/TDEQUANT_zh.md
new file mode 100644
index 00000000..9f07b6c4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDEQUANT_zh.md
@@ -0,0 +1,41 @@
+# TDEQUANT
+
+## Tile Operation Diagram
+
+![TDEQUANT tile operation](../figures/isa/TDEQUANT.svg)
+
+## Introduction
+
+Dequantize an integer tile into a floating-point tile using scale and offset tiles.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tdequant ...
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tdequant ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TDIV.md b/designs/outerCube/PTOISA/TDIV.md
new file mode 100644
index 00000000..7c10a875
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDIV.md
@@ -0,0 +1,137 @@
+# TDIV
+
+
+## Tile Operation Diagram
+
+![TDIV tile operation](../figures/isa/TDIV.svg)
+
+## Introduction
+
+Elementwise division of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tdiv %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Division-by-zero**:
+    - Behavior is target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TDIV(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TDIV(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tdiv %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TDIVS.md b/designs/outerCube/PTOISA/TDIVS.md
new file mode 100644
index 00000000..23905d8d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDIVS.md
@@ -0,0 +1,155 @@
+# TDIVS
+
+
+## Tile Operation Diagram
+
+![TDIVS tile operation](../figures/isa/TDIVS.svg)
+
+## Introduction
+
+Elementwise division with a scalar (tile/scalar or scalar/tile).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+- Tile/scalar:
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
+
+- Scalar/tile:
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Tile/scalar form:
+
+```text
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+```
+
+Scalar/tile form:
+
+```text
+%dst = tdivs %scalar, %src : f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)** (both overloads):
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)** (both overloads):
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Division-by-zero**:
+    - Behavior is target-defined; on A5 the tile/scalar form maps to multiply-by-reciprocal and uses `1/0 -> +inf` for `scalar == 0`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TDIVS(dst, src, 2.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TDIVS(dst, 2.0f, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TDIVS_zh.md b/designs/outerCube/PTOISA/TDIVS_zh.md
new file mode 100644
index 00000000..060f5c64
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDIVS_zh.md
@@ -0,0 +1,128 @@
+# TDIVS
+
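+## Tile Operation Diagram
+
+![TDIVS tile operation](../figures/isa/TDIVS.svg)
+
+## Introduction
+
+Elementwise division with a scalar (tile/scalar or scalar/tile).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region: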
+
+- Tile/scalar:
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{src}_{i,j}}{\mathrm{scalar}} $$
+
+- Scalar/tile:
+
+  $$ \mathrm{dst}_{i,j} = \frac{\mathrm{scalar}}{\mathrm{src}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Tile/scalar form:
+
+```text
+%dst = tdivs %src, %scalar : !pto.tile<...>, f32
+```
+
+Scalar/tile form:
+
+```text
+%dst = tdivs %scalar, %src : f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tdivs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst = pto.tdivs %scalar, %src : (dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tdivs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+pto.tdivs ins(%scalar, %src : dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TDIVS(TileDataDst &dst, typename TileDataDst::DType scalar, TileDataSrc &src0, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)** (both overloads):
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)** (both overloads):
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Division-by-zero**:
+    - Behavior is target-defined; on A5 the tile/scalar form maps to multiply-by-reciprocal and uses `1/0 -> +inf` for `scalar == 0`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TDIVS(dst, src, 2.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TDIVS(dst, 2.0f, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TDIV_zh.md b/designs/outerCube/PTOISA/TDIV_zh.md
new file mode 100644
index 00000000..c9d13a40
--- /dev/null
+++ b/designs/outerCube/PTOISA/TDIV_zh.md
@@ -0,0 +1,110 @@
+# TDIV
+
+## Tile Operation Diagram
+
+![TDIV tile operation](../figures/isa/TDIV.svg)
+
+## Introduction
+
+Elementwise division of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tdiv %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tdiv %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tdiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Division-by-zero**:
+    - Behavior is target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TDIV(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TDIV(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TEXP.md b/designs/outerCube/PTOISA/TEXP.md
new file mode 100644
index 00000000..f4abc35a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXP.md
@@ -0,0 +1,128 @@
+# TEXP
+
+
+## Tile Operation Diagram
+
+![TEXP tile operation](../figures/isa/TEXP.svg)
+
+## Introduction
+
+Elementwise exponential.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = texp %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TEXP(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TEXP(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = texp %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TEXPANDS.md b/designs/outerCube/PTOISA/TEXPANDS.md
new file mode 100644
index 00000000..2fa9271d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXPANDS.md
@@ -0,0 +1,146 @@
+# TEXPANDS
+
+## Tile Operation Diagram
+
+![TEXPANDS tile operation](../figures/isa/TEXPANDS.svg)
+
+## Introduction
+
+Broadcast a scalar into a destination tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = texpands %scalar : f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+  - For `TileType::Vec`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Row-major and col-major vector tiles are both supported.
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+  - For `TileType::Mat`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Static valid bounds: `TileData::Rows * TileData::Cols * sizeof(T) / 32` must be in `[1, 32767]`.
+- **Implementation checks (A5)**:
+  - For `TileType::Vec`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Row-major and col-major vector tiles are both supported.
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+  - For `TileType::Mat`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - For `TileData::layout == pto::Layout::NC1HWC0 || TileData::layout == pto::Layout::FRACTAL_Z`:
+      - `shape0 * shape1 * shape2 * shape3` must be in `[1, 32767]`.
+    - For `TileData::layout == pto::Layout::NDC1HWC0 || TileData::layout == pto::Layout::FRACTAL_Z_3D`:
+      - `shape0 * shape1 * shape2 * shape3 * shape4` must be in `[1, 32767]`.
+- **Valid region**:
+  - For `TileType::Vec`:
+    - The op fills `dst` over `dst.GetValidRow()` / `dst.GetValidCol()`.
+  - For `TileType::Mat`:
+    - For tile operands, the op fills `dst` over `TileData::Rows` / `TileData::Cols`.
+    - For conv tiles, the op fills `dst` over the conv-tile shape.
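+
+As a quick arithmetic check of the `TileType::Mat` bound listed for A2A3 (a sketch, assuming a hypothetical `16 x 16` `float` tile with `sizeof(T) = 4`):
+
+$$ \frac{\mathrm{Rows} \cdot \mathrm{Cols} \cdot \mathrm{sizeof}(T)}{32} = \frac{16 \cdot 16 \cdot 4}{32} = 32 \in [1, 32767] $$
+
+Under the same formula, the upper bound corresponds to a tile of at most `32767 * 32 = 1048544` bytes.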
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto()
+{
+    using TileT = Tile<TileType::Vec, float, 16, 16>;
+    TileT dst;
+    TEXPANDS(dst, 0.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual()
+{
+    using TileT = Tile<TileType::Vec, float, 16, 16>;
+    TileT dst;
+    TASSIGN(dst, 0x1000);
+    TEXPANDS(dst, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = texpands %scalar : f32, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TEXPANDS_zh.md b/designs/outerCube/PTOISA/TEXPANDS_zh.md
new file mode 100644
index 00000000..63753d2f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXPANDS_zh.md
@@ -0,0 +1,134 @@
+# TEXPANDS
+
+## Tile Operation Diagram
+
+![TEXPANDS tile operation](../figures/isa/TEXPANDS.svg)
+
+## Introduction
+
+Broadcast a scalar into a destination tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = texpands %scalar : f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TEXPANDS(TileData &dst, typename TileData::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+  - For `TileType::Vec`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Row-major and col-major vector tiles are both supported.
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+  - For `TileType::Mat`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Static valid bounds: `TileData::Rows * TileData::Cols * sizeof(T) / 32` must be in `[1, 32767]`.
+- **Implementation checks (A5)**:
+  - For `TileType::Vec`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - Row-major and col-major vector tiles are both supported.
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+  - For `TileType::Mat`:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `bfloat16_t`, `float`.
+    - For `TileData::layout == pto::Layout::NC1HWC0 || TileData::layout == pto::Layout::FRACTAL_Z`:
+      - `shape0 * shape1 * shape2 * shape3` must be in `[1, 32767]`.
+    - For `TileData::layout == pto::Layout::NDC1HWC0 || TileData::layout == pto::Layout::FRACTAL_Z_3D`:
+      - `shape0 * shape1 * shape2 * shape3 * shape4` must be in `[1, 32767]`.
+- **Valid region**:
+  - For `TileType::Vec`:
+    - The op fills `dst` over `dst.GetValidRow()` / `dst.GetValidCol()`.
+  - For `TileType::Mat`:
+    - For ordinary tiles, the op fills `dst` over `TileData::Rows` / `TileData::Cols`.
+    - For conv tiles, the op fills `dst` over the conv-tile shape.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto()
+{
+    using TileT = Tile<TileType::Vec, float, 16, 16>;
+    TileT dst;
+    TEXPANDS(dst, 0.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual()
+{
+    using TileT = Tile<TileType::Vec, float, 16, 16>;
+    TileT dst;
+    TASSIGN(dst, 0x1000);
+    TEXPANDS(dst, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.texpands %scalar : dtype -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = texpands %scalar : f32, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.texpands ins(%scalar : dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TEXP_zh.md b/designs/outerCube/PTOISA/TEXP_zh.md
new file mode 100644
index 00000000..d871a692
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXP_zh.md
@@ -0,0 +1,101 @@
+# TEXP
+
+## Tile Operation Diagram
+
+![TEXP tile operation](../figures/isa/TEXP.svg)
+
+## Introduction
+
+Elementwise exponential.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \exp(\mathrm{src}_{i,j}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = texp %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.texp %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.texp ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TEXP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TEXP(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TEXP(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TEXTRACT.md b/designs/outerCube/PTOISA/TEXTRACT.md
new file mode 100644
index 00000000..957e2e68
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXTRACT.md
@@ -0,0 +1,138 @@
+# TEXTRACT
+
+
+## Tile Operation Diagram
+
+![TEXTRACT tile operation](../figures/isa/TEXTRACT.svg)
+
+## Introduction
+
+Extract a sub-tile from a source tile.
+
+## Math Interpretation
+
+Conceptually copies a window starting at `(indexRow, indexCol)` from `src` into `dst`. Exact mapping depends on layouts.
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
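+
+As a worked instance (the shapes are illustrative assumptions, not values taken from the specification), take a `32 x 32` source, a `16 x 16` destination with `R = C = 16`, and `indexRow = 16`, `indexCol = 0`. The mapping above then reads:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{16+i,\; j}, \qquad 0 \le i < 16,\; 0 \le j < 16 $$
+
+so, conceptually, `dst` holds the lower-left quadrant of `src`, and the window stays inside the source because `16 + 16 <= 32` in rows and `0 + 16 <= 32` in columns.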
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow = 0, uint16_t indexCol = 0, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `DstTileData::DType` must equal `SrcTileData::DType` and must be one of: `int8_t`, `half`, `bfloat16_t`, `float`.
+    - Source fractal must satisfy: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)`. In GEMV scenarios, the source fractal satisfies `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)` for Left.
+    - Runtime bounds checks:
+        - `indexRow + DstTileData::Rows <= SrcTileData::Rows`
+        - `indexCol + DstTileData::Cols <= SrcTileData::Cols`
+    - Destination must be `TileType::Left` or `TileType::Right` with a target-supported fractal configuration.
+- **Implementation checks (A5)**:
+    - `DstTileData::DType` must equal `SrcTileData::DType` and must be one of: `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`, `float8_e8m0_t`.
+    - Source fractal must satisfy: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)` for Left/Right (in GEMV scenarios, `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)` for Left); `(SFractal == RowMajor && isRowMajor)` for ScaleLeft; `(SFractal == ColMajor && !isRowMajor)` for ScaleRight.
+    - Destination supports `Mat -> Left/Right/Scale`, `Acc -> Mat` (including relu/scalar-quant/vector-quant forms), and also supports specific `Vec -> Mat` extraction paths.
+    - The vector-quantized form additionally requires an `FpTileData` scaling operand, matching the `TEXTRACT_FP(...)` interface shown above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TEXTRACT_FP.md b/designs/outerCube/PTOISA/TEXTRACT_FP.md
new file mode 100644
index 00000000..757e4a5b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXTRACT_FP.md
@@ -0,0 +1,90 @@
+# TEXTRACT_FP
+
+
+## Tile Operation Diagram
+
+![TEXTRACT_FP tile operation](../figures/isa/TEXTRACT_FP.svg)
+
+## Introduction
+
+Extract a sub-tile from a source tile, while also providing an `fp` (scaling) tile used for vector quantization parameters (target/implementation-defined).
+
+## See also
+
+- TEXTRACT base instruction: `docs/isa/TEXTRACT.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TEXTRACT_FP_zh.md b/designs/outerCube/PTOISA/TEXTRACT_FP_zh.md
new file mode 100644
index 00000000..9f18d61c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXTRACT_FP_zh.md
@@ -0,0 +1,59 @@
+# TEXTRACT_FP
+
+## Tile Operation Diagram
+
+![TEXTRACT_FP tile operation](../figures/isa/TEXTRACT_FP.svg)
+
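+## Introduction
+
+Extraction with an `fp` (scaling) tile that carries vector quantization parameters.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.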
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.textract_fp %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.textract_fp ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TEXTRACT_zh.md b/designs/outerCube/PTOISA/TEXTRACT_zh.md
new file mode 100644
index 00000000..827a6046
--- /dev/null
+++ b/designs/outerCube/PTOISA/TEXTRACT_zh.md
@@ -0,0 +1,138 @@
+# TEXTRACT
+
+## Tile Operation Diagram
+
+![TEXTRACT tile operation](../figures/isa/TEXTRACT.svg)
+
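+## Introduction
+
+Extract a sub-tile from a source tile.
+
+## Math Interpretation
+
+Conceptually copies a window starting at `(indexRow, indexCol)` from `src` into `dst`. Exact mapping depends on layouts.
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`: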
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{indexRow}+i,\; \mathrm{indexCol}+j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow = 0, uint16_t indexCol = 0, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `DstTileData::DType` must equal `SrcTileData::DType` and must be one of: `int8_t`, `half`, `bfloat16_t`, `float`.
+    - Source fractal must satisfy: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)`. In GEMV scenarios, when the destination is Left, the source fractal satisfies `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`.
+    - Runtime bounds checks:
+        - `indexRow + DstTileData::Rows <= SrcTileData::Rows`
+        - `indexCol + DstTileData::Cols <= SrcTileData::Cols`
+    - Destination must be `TileType::Left` or `TileType::Right` with a target-supported fractal configuration.
+- **Implementation checks (A5)**:
+    - `DstTileData::DType` must equal `SrcTileData::DType` and must be one of: `int8_t`, `hifloat8_t`, `float8_e5m2_t`, `float8_e4m3_t`, `half`, `bfloat16_t`, `float`, `float4_e2m1x2_t`, `float4_e1m2x2_t`, `float8_e8m0_t`.
+    - Source fractal must satisfy: `(SFractal == ColMajor && isRowMajor)` or `(SFractal == RowMajor && !isRowMajor)` for Left/Right (in GEMV scenarios, when the destination is Left, `(SrcTileData::Rows == 1 && SrcTileData::isRowMajor)`); `(SFractal == RowMajor && isRowMajor)` for ScaleLeft; `(SFractal == ColMajor && !isRowMajor)` for ScaleRight.
+    - Destination supports `Mat -> Left/Right/Scale`, `Acc -> Mat` (including relu/scalar-quant/vector-quant forms), and also supports `Vec -> Mat` extraction paths for specific tile locations.
+    - The vector-quantized form additionally requires an `FpTileData` scaling operand, matching the `TEXTRACT_FP(...)` interface shown above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TEXTRACT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.textract %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = textract %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.textract ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TFILLPAD.md b/designs/outerCube/PTOISA/TFILLPAD.md
new file mode 100644
index 00000000..b670bb02
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD.md
@@ -0,0 +1,133 @@
+# TFILLPAD
+
+
+## Tile Operation Diagram
+
+![TFILLPAD tile operation](../figures/isa/TFILLPAD.svg)
+
+## Introduction
+
+Copy a source tile into a destination tile and fill the remaining (padded) elements with a compile-time pad value
+selected by `TileDataDst::PadVal` (e.g., `PadValue::Min`/`PadValue::Max`).
+
+This is commonly used to materialize deterministic values outside the runtime valid region so that subsequent ops can
+operate on a full static tile shape.
+
+## Math Interpretation
+
+Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each destination element `(i, j)`:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src}_{i,j} & \text{if } i < VR \text{ and } j < VC \\
+\mathrm{pad}       & \text{otherwise}
+\end{cases}
+$$
+
+`pad` is determined by `TileDataDst::PadVal` and the element type (e.g., `+inf/-inf` for floating types when available,
+otherwise `std::numeric_limits<T>::max()/min()`).
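+
+For illustration only (the sizes are assumed, not taken from the specification), a `4 x 4` destination with `VR = 2` and `VC = 3` would be filled as:
+
+$$
+\mathrm{dst} =
+\begin{pmatrix}
+\mathrm{src}_{0,0} & \mathrm{src}_{0,1} & \mathrm{src}_{0,2} & \mathrm{pad} \\
+\mathrm{src}_{1,0} & \mathrm{src}_{1,1} & \mathrm{src}_{1,2} & \mathrm{pad} \\
+\mathrm{pad} & \mathrm{pad} & \mathrm{pad} & \mathrm{pad} \\
+\mathrm{pad} & \mathrm{pad} & \mathrm{pad} & \mathrm{pad}
+\end{pmatrix}
+$$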
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form (conceptual):
+
+```text
+%dst = tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Implemented in the backend headers pulled in by `include/pto/common/pto_instr_impl.hpp`:
+
+```cpp
+template <typename TileData, PadValue PadVal = PadValue::Zero, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD(TileData &dst, TileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::PadVal != PadValue::Null`.
+- `sizeof(TileDataDst::DType) == sizeof(TileDataSrc::DType)` and element size must be `1`, `2`, or `4` bytes.
+- `TFILLPAD`: `TileDataDst::Rows/Cols` must match `TileDataSrc::Rows/Cols`.
+- `TFILLPAD_EXPAND`: `TileDataDst::Rows >= TileDataSrc::Rows` and `TileDataDst::Cols >= TileDataSrc::Cols`.
+- `TFILLPAD(TileData &dst, TileData &src)`: if the tile type is `Mat`, only the layout `(!TileData::isRowMajor && TileData::Slayout::RowMajor)` is supported, and `PadVal` must be `PadValue::Zero`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example1() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::NoneBox, TileConfig::fractalABSize, PadValue::Min>;
+
+  SrcT src;
+  DstT dst;
+  TFILLPAD(dst, src);
+}
+
+void example2() {
+  using TileMatData = Tile<TileType::Mat, float, 16, 256, BLayout::ColMajor, 1, 224, SLayout::RowMajor, 512>;
+
+  TileMatData matTile;
+  TFILLPAD(matTile, matTile);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TFILLPAD_EXPAND.md b/designs/outerCube/PTOISA/TFILLPAD_EXPAND.md
new file mode 100644
index 00000000..e8d59a45
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD_EXPAND.md
@@ -0,0 +1,89 @@
+# TFILLPAD_EXPAND
+
+
+## Tile Operation Diagram
+
+![TFILLPAD_EXPAND tile operation](../figures/isa/TFILLPAD_EXPAND.svg)
+
+## Introduction
+
+Expanding variant of `TFILLPAD` that allows `dst` to be larger than `src`; the exact behavior is implementation-defined.
+
+## See also
+
+- TFILLPAD overview and constraints: `docs/isa/TFILLPAD.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TFILLPAD_EXPAND_zh.md b/designs/outerCube/PTOISA/TFILLPAD_EXPAND_zh.md
new file mode 100644
index 00000000..1711c218
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD_EXPAND_zh.md
@@ -0,0 +1,58 @@
+# TFILLPAD_EXPAND
+
+## Tile Operation Diagram
+
+![TFILLPAD_EXPAND tile operation](../figures/isa/TFILLPAD_EXPAND.svg)
+
+## Introduction
+
+Fill/pad variant that allows the destination tile to be larger than the source tile.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_expand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD_EXPAND(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TFILLPAD_INPLACE.md b/designs/outerCube/PTOISA/TFILLPAD_INPLACE.md
new file mode 100644
index 00000000..29134773
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD_INPLACE.md
@@ -0,0 +1,89 @@
+# TFILLPAD_INPLACE
+
+
+## Tile Operation Diagram
+
+![TFILLPAD_INPLACE tile operation](../figures/isa/TFILLPAD_INPLACE.svg)
+
+## Introduction
+
+In-place fill/pad variant of TFILLPAD (implementation-defined).
+
+## See also
+
+- TFILLPAD overview and constraints: `docs/isa/TFILLPAD.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TFILLPAD_INPLACE_zh.md b/designs/outerCube/PTOISA/TFILLPAD_INPLACE_zh.md
new file mode 100644
index 00000000..9900b46f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD_INPLACE_zh.md
@@ -0,0 +1,58 @@
+# TFILLPAD_INPLACE
+
+## Tile Operation Diagram
+
+![TFILLPAD_INPLACE tile operation](../figures/isa/TFILLPAD_INPLACE.svg)
+
+## Introduction
+
+In-place fill/pad variant of TFILLPAD.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad_inplace %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD_INPLACE(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TFILLPAD_zh.md b/designs/outerCube/PTOISA/TFILLPAD_zh.md
new file mode 100644
index 00000000..d6975b6f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFILLPAD_zh.md
@@ -0,0 +1,108 @@
+# TFILLPAD
+
+## Tile Operation Diagram
+
+![TFILLPAD tile operation](../figures/isa/TFILLPAD.svg)
+
+## Introduction
+
+Copy a tile and fill the elements outside the valid region with a compile-time pad value.
+
+More precisely, copy a source tile into a destination tile and fill the remaining (padded) elements with the pad value
+selected by `TileDataDst::PadVal` (e.g., `PadValue::Min`/`PadValue::Max`).
+
+This is commonly used to materialize deterministic values outside the runtime valid region so that subsequent ops can
+operate on a full static tile shape.
+
+## Math Interpretation
+
+Let `VR = src.GetValidRow()` and `VC = src.GetValidCol()`. For each destination element `(i, j)`:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src}_{i,j} & \text{if } i < VR \text{ and } j < VC \\
+\mathrm{pad}       & \text{otherwise}
+\end{cases}
+$$
+
+`pad` is determined by `TileDataDst::PadVal` and the element type (e.g., `+inf/-inf` for floating types when available,
+otherwise `std::numeric_limits<T>::max()/min()`).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form (conceptual):
+
+```text
+%dst = tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfillpad %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Implemented in the backend headers pulled in by `include/pto/common/pto_instr_impl.hpp`:
+
+```cpp
+template <typename TileData, PadValue PadVal = PadValue::Zero, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD(TileData &dst, TileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TFILLPAD(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::PadVal != PadValue::Null`.
+- `sizeof(TileDataDst::DType) == sizeof(TileDataSrc::DType)` and element size must be `1`, `2`, or `4` bytes.
+- `TFILLPAD`: `TileDataDst::Rows/Cols` must match `TileDataSrc::Rows/Cols`.
+- `TFILLPAD_EXPAND`: `TileDataDst::Rows >= TileDataSrc::Rows` and `TileDataDst::Cols >= TileDataSrc::Cols`.
+- `TFILLPAD(TileData &dst, TileData &src)`: if `TileData::TileType` is `Mat`, the layout must satisfy `(!TileData::isRowMajor && TileData::Slayout::RowMajor)` and the only supported `PadVal` is `PadValue::Zero`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example1() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::NoneBox, TileConfig::fractalABSize, PadValue::Min>;
+
+  SrcT src;
+  DstT dst;
+  TFILLPAD(dst, src);
+}
+
+void example2() {
+  using TileMatData = Tile<TileType::Mat, float, 16, 256, BLayout::ColMajor, 1, 224, SLayout::RowMajor, 512>;
+
+  TileMatData matTile;
+  TFILLPAD(matTile, matTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/TFMOD.md b/designs/outerCube/PTOISA/TFMOD.md
new file mode 100644
index 00000000..bf647996
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFMOD.md
@@ -0,0 +1,104 @@
+# TFMOD
+
+
+## Tile Operation Diagram
+
+![TFMOD tile operation](../figures/isa/TFMOD.svg)
+
+## Introduction
+
+Elementwise remainder of two tiles; the remainder has the same sign as the dividend.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TFMOD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT out, a, b;
+  TFMOD(out, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tfmod %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TFMODS.md b/designs/outerCube/PTOISA/TFMODS.md
new file mode 100644
index 00000000..41034bb0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFMODS.md
@@ -0,0 +1,107 @@
+# TFMODS
+
+
+## Tile Operation Diagram
+
+![TFMODS tile operation](../figures/isa/TFMODS.svg)
+
+## Introduction
+
+Elementwise remainder with a scalar: `fmod(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src}_{i,j}, \mathrm{scalar})$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types are `float` and `float32_t`.
+    - `dst` and `src` must be vector tiles.
+    - `dst` and `src` must be row-major.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`, with both the valid row and column counts greater than `0`.
+- **Implementation checks (A5)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types are 2-byte or 4-byte types supported by the target implementation (including `half` and `float`).
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
+- **Division-by-zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TFMODS(out, x, 3.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tfmods %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TFMODS_zh.md b/designs/outerCube/PTOISA/TFMODS_zh.md
new file mode 100644
index 00000000..4726099f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFMODS_zh.md
@@ -0,0 +1,107 @@
+# TFMODS
+
+## Tile Operation Diagram
+
+![TFMODS tile operation](../figures/isa/TFMODS.svg)
+
+## Introduction
+
+Elementwise remainder with a scalar: `fmod(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src}_{i,j}, \mathrm{scalar})$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TFMODS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
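+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types are `float` and `float32_t`.
+    - `dst` and `src` must be vector tiles.
+    - `dst` and `src` must be row-major.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`, with both the valid row and column counts greater than `0`.
+- **Implementation checks (A5)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types are 2-byte or 4-byte types supported by the target implementation (including `half` and `float`).
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
+- **Division-by-zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples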
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TFMODS(out, x, 3.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tfmods %src, %scalar : !pto.tile<...>, f32
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tfmods %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tfmods ins(%src, %scalar : !pto.tile_buf<...>, f32) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TFMOD_zh.md b/designs/outerCube/PTOISA/TFMOD_zh.md
new file mode 100644
index 00000000..1dfe85fa
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFMOD_zh.md
@@ -0,0 +1,77 @@
+# TFMOD
+
+## Tile Operation Diagram
+
+![TFMOD tile operation](../figures/isa/TFMOD.svg)
+
+## Introduction
+
+Elementwise remainder of two tiles; the remainder has the same sign as the dividend.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{fmod}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j})$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfmod %src0, %src1 : !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfmod ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TFMOD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT out, a, b;
+  TFMOD(out, a, b);
+}
+```
diff --git a/designs/outerCube/PTOISA/TFREE.md b/designs/outerCube/PTOISA/TFREE.md
new file mode 100644
index 00000000..f13f9709
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFREE.md
@@ -0,0 +1,40 @@
+# TFREE
+
+## Tile Operation Diagram
+
+![TFREE tile operation](../figures/isa/TFREE.svg)
+
+## Introduction
+
+Release the currently held pipe or FIFO slot back to the producer.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfree ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TFREE_zh.md b/designs/outerCube/PTOISA/TFREE_zh.md
new file mode 100644
index 00000000..3e7be0d2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TFREE_zh.md
@@ -0,0 +1,41 @@
+# TFREE
+
+## Tile Operation Diagram
+
+![TFREE tile operation](../figures/isa/TFREE.svg)
+
+## Introduction
+
+Release the currently held pipe or FIFO slot back to the producer.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tfree ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TGATHER.md b/designs/outerCube/PTOISA/TGATHER.md
new file mode 100644
index 00000000..911dfba7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGATHER.md
@@ -0,0 +1,156 @@
+# TGATHER
+
+
+## Tile Operation Diagram
+
+![TGATHER tile operation](../figures/isa/TGATHER.svg)
+
+## Introduction
+
+Gather/select elements using either an index tile or a compile-time mask pattern.
+
+## Math Interpretation
+
+Index-based gather (conceptual):
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] $$
+
+Exact index interpretation and bounds behavior are implementation-defined.
+
+Mask-pattern gather is an implementation-defined selection/reduction controlled by `pto::MaskPattern`.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Index-based gather:
+
+```text
+%dst = tgather %src0, %indices : !pto.tile<...> -> !pto.tile<...>
+```
+
+Mask-pattern gather:
+
+```text
+%dst = tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Index-based gather: implementation checks (A2A3)**:
+    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
+    - `Src1TileData::DType` must be `int32_t` or `uint32_t`.
+    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
+    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
+- **Index-based gather: implementation checks (A5)**:
+    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
+    - `Src1TileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, or `uint32_t`.
+    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
+    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
+- **Mask-pattern gather: implementation checks (A2A3)**:
+    - Source element size must be `2` or `4` bytes.
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`,
+    `half`, `bfloat16_t`, or `float`.
+    - `dst` and `src` must both be `TileType::Vec` and row-major.
+    - `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous dst storage).
+- **Mask-pattern gather: implementation checks (A5)**:
+    - Source element size must be `1` or `2` or `4` bytes.
+    - `dst` and `src` must both be `TileType::Vec` and row-major.
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`,
+    `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, or `hifloat8_t`.
+    - Supported dtypes are restricted to a target-defined set (checked via `static_assert` in the implementation). In addition, `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous dst storage).
+- **Bounds / validity**:
+    - Index bounds are not validated by explicit runtime assertions; out-of-range indices are target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
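+  SrcT src0;
+  IdxT idx;
+  DstT dst, tmp;  // tmp: scratch tile for the declared index-based overload; reusing DstT for it is an assumption
+  TGATHER(dst, src0, idx, tmp);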
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TGATHER<DstT, SrcT, MaskPattern::P0101>(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TGATHERB.md b/designs/outerCube/PTOISA/TGATHERB.md
new file mode 100644
index 00000000..3ee1b4ef
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGATHERB.md
@@ -0,0 +1,141 @@
+# TGATHERB
+
+
+## Tile Operation Diagram
+
+![TGATHERB tile operation](../figures/isa/TGATHERB.svg)
+
+## Introduction
+
+Gather elements using byte offsets.
+
+## Math Interpretation
+
+For each element in the valid region:
+
+$$ \mathrm{dst}_{i,j} = *\left(\mathrm{srcBase} + \mathrm{offset}_{i,j}\right) $$
+
+Exact bounds behavior is implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataOffset, typename... WaitEvents>
+PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Destination layout must be row-major (`TileDataDst::isRowMajor`).
+    - Destination element size must be `1`, `2`, or `4` bytes (enforced via `static_assert` in the helper).
+    - `TileDataSrc::DType`/`TileDataDst::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- **Implementation checks (A5)**:
+    - Destination element size must be `1`, `2`, or `4` bytes.
+    - `TileDataSrc::DType`/`TileDataDst::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- **Offset interpretation**:
+    - Offsets are interpreted as `uint32_t` values (byte offsets) by the implementation.
+    - Offset bounds are not validated by explicit runtime assertions; out-of-range offsets are target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
+  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  SrcT src;
+  OffT off;
+  DstT dst;
+  TGATHERB(dst, src, off);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
+  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  SrcT src;
+  OffT off;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(off, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TGATHERB(dst, src, off);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGATHERB_zh.md b/designs/outerCube/PTOISA/TGATHERB_zh.md
new file mode 100644
index 00000000..64652ef2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGATHERB_zh.md
@@ -0,0 +1,114 @@
+# TGATHERB
+
+## Tile Operation Diagram
+
+![TGATHERB tile operation](../figures/isa/TGATHERB.svg)
+
+## Introduction
+
+Gather elements using byte offsets.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = *\left(\mathrm{srcBase} + \mathrm{offset}_{i,j}\right) $$
+
+Exact bounds behavior is implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tgatherb %src, %offsets : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tgatherb %src, %offsets : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataOffset, typename... WaitEvents>
+PTO_INST RecordEvent TGATHERB(TileDataDst &dst, TileDataSrc &src, TileDataOffset &offset, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Destination layout must be row-major (`TileDataDst::isRowMajor`).
+    - Destination element size must be `1`, `2`, or `4` bytes (enforced via `static_assert` in the helper).
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- **Implementation checks (A5)**:
+    - Destination element size must be `1`, `2`, or `4` bytes.
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- **Offset interpretation**:
+    - Offsets are interpreted as `uint32_t` values (byte offsets) by the implementation.
+    - Offset bounds are not validated by explicit runtime assertions; out-of-range offsets are target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
+  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  SrcT src;
+  OffT off;
+  DstT dst;
+  TGATHERB(dst, src, off);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  using OffT = Tile<TileType::Vec, uint32_t, 1, 256>;
+  using DstT = Tile<TileType::Vec, uint8_t, 1, 256>;
+  SrcT src;
+  OffT off;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(off, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TGATHERB(dst, src, off);
+}
+```
diff --git a/designs/outerCube/PTOISA/TGATHER_zh.md b/designs/outerCube/PTOISA/TGATHER_zh.md
new file mode 100644
index 00000000..731801d2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGATHER_zh.md
@@ -0,0 +1,154 @@
+# TGATHER
+
+## Tile Operation Diagram
+
+![TGATHER tile operation](../figures/isa/TGATHER.svg)
+
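+## Introduction
+
+Gather or select elements using either an index tile or a compile-time mask pattern.
+
+## Math Interpretation
+
+Index-based gather (conceptual definition):
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] $$
+
+The exact index interpretation and bounds behavior are implementation-defined.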
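+
+One possible reading of the conceptual definition is sketched below as a scalar reference model. It assumes that indices address `src0` as a flat row-major array, which is only one of the interpretations an implementation may choose; `gather_ref` is a hypothetical helper, not a PTO API.
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// Hypothetical scalar model of the index-based gather (not a PTO API).
+// Assumption: indices address src0 as a flat row-major array.
+template <typename T>
+std::vector<T> gather_ref(const std::vector<T> &src0,
+                          const std::vector<std::int32_t> &indices) {
+  std::vector<T> dst(indices.size());
+  for (std::size_t e = 0; e < indices.size(); ++e) {
+    // .at() throws on an out-of-range index; the instruction itself
+    // leaves that case target-defined.
+    dst[e] = src0.at(static_cast<std::size_t>(indices[e]));
+  }
+  return dst;
+}
+```
+
+The destination has the shape of the index tile, so the same source element may appear several times in the result.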
+
+Mask-pattern gather is an implementation-defined select/reduce operation controlled by `pto::MaskPattern`.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Index-based gather:
+
+```text
+%dst = tgather %src0, %indices : !pto.tile<...> -> !pto.tile<...>
+```
+
+Mask-pattern gather:
+
+```text
+%dst = tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>}: !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Index-based gather: implementation checks (A2A3)**:
+    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
+    - `Src1TileData::DType` must be `int32_t` or `uint32_t`.
+    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
+    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
+- **Index-based gather: implementation checks (A5)**:
+    - `DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, or `float`.
+    - `Src1TileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, or `uint32_t`.
+    - `DstTileData::DType` must be the same type as `Src0TileData::DType`.
+    - `src1.GetValidCol() == Src1TileData::Cols` and `dst.GetValidCol() == DstTileData::Cols`.
+- **Mask-pattern gather: implementation checks (A2A3)**:
+    - Source element size must be `2` or `4` bytes.
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+    - `dst` and `src` must both be `TileType::Vec` and row-major.
+    - `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous destination storage).
+- **Mask-pattern gather: implementation checks (A5)**:
+    - Source element size must be `1`, `2`, or `4` bytes.
+    - `dst` and `src` must both be `TileType::Vec` and row-major.
+    - `SrcTileData::DType`/`DstTileData::DType` must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, or `hifloat8_t`.
+    - Supported data types are restricted to a target-defined set (enforced via `static_assert` in the implementation), with `sizeof(dst element) == sizeof(src element)` and `dst.GetValidCol() == DstTileData::Cols` (contiguous destination storage).
+- **Bounds / validity**:
+    - Index bounds are not validated by explicit runtime assertions; behavior for out-of-range indices is target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src0;
+  IdxT idx;
+  DstT dst;
+  TGATHER(dst, src0, idx);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 1, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TGATHER<DstT, SrcT, MaskPattern::P0101>(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction takes tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TGEMV.md b/designs/outerCube/PTOISA/TGEMV.md
new file mode 100644
index 00000000..1a8253ee
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV.md
@@ -0,0 +1,282 @@
+# TGEMV
+
+## Tile Operation Diagram
+
+![TGEMV tile operation](../figures/isa/TGEMV.svg)
+
+## Introduction
+
+General Matrix-Vector multiplication (GEMV) producing an accumulator/output tile.
+
+## Math Interpretation
+
+Let:
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+### 1. TGEMV (Tile-based GEMV)
+
+For `0 <= j < N` (output elements in the effective matmul domain):
+
+$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+### 2. TGEMV_ACC (Tile-based GEMV with Accumulation)
+
+For `0 <= j < N` (accumulates into existing tile):
+
+$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+### 3. TGEMV_BIAS (Tile-based GEMV with Bias)
+
+For `0 <= j < N` (adds bias term to matrix product):
+
+$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
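+
+The three formulas above can be cross-checked against a scalar reference model. The sketch below is illustrative only: the function names are hypothetical, everything is computed in `float`, and no target-specific promotion or rounding is modeled.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical scalar reference for the three GEMV forms (not PTO APIs).
+// a holds the K values of the single row A[0][k]; b is K x N, row-major.
+using Vec = std::vector<float>;
+
+// C[0][j] = sum_k A[0][k] * B[k][j]
+Vec gemv_ref(const Vec &a, const Vec &b, std::size_t n) {
+  Vec c(n, 0.0f);
+  for (std::size_t k = 0; k < a.size(); ++k)
+    for (std::size_t j = 0; j < n; ++j)
+      c[j] += a[k] * b[k * n + j];
+  return c;
+}
+
+// C[0][j] = C_in[0][j] + sum_k A[0][k] * B[k][j]
+Vec gemv_acc_ref(const Vec &cIn, const Vec &a, const Vec &b) {
+  Vec c = gemv_ref(a, b, cIn.size());
+  for (std::size_t j = 0; j < c.size(); ++j) c[j] += cIn[j];
+  return c;
+}
+
+// C[0][j] = Bias[0][j] + sum_k A[0][k] * B[k][j]
+Vec gemv_bias_ref(const Vec &a, const Vec &b, const Vec &bias) {
+  return gemv_acc_ref(bias, a, b);
+}
+```
+
+In this model the bias form is the accumulating form with the bias row as the initial accumulator; whether hardware treats the two identically is target-defined.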
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
+
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
+
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents&... events);
+```
+
+## Constraints
+
+### Common shape and location constraints
+
+These constraints apply to `TGEMV`, `TGEMV_ACC`, and `TGEMV_BIAS` unless otherwise noted.
+
+- Static shape constraints:
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile locations:
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- Runtime valid-size constraints:
+    - `m` must be `1`
+    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
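+
+The shape rules above can be restated as small compile-time and runtime predicates. The following is a sketch under stated assumptions: `Shape` is a hypothetical stand-in that models only `Rows` and `Cols`, and neither helper is part of the PTO headers.
+
+```cpp
+#include <cstddef>
+
+// Hypothetical stand-in for a tile type: only the static shape is modeled.
+template <std::size_t R, std::size_t C>
+struct Shape {
+  static constexpr std::size_t Rows = R;
+  static constexpr std::size_t Cols = C;
+};
+
+// Mirrors the static shape constraints listed above.
+template <typename Res, typename Left, typename Right>
+constexpr bool gemv_shapes_ok() {
+  return Left::Rows == Res::Rows && Left::Cols == Right::Rows &&
+         Right::Cols == Res::Cols;
+}
+
+// Mirrors the runtime valid-size constraints: m == 1, k and n in [1, 4095].
+constexpr bool gemv_valid_sizes_ok(std::size_t m, std::size_t k, std::size_t n) {
+  return m == 1 && k >= 1 && k <= 4095 && n >= 1 && n <= 4095;
+}
+
+static_assert(gemv_shapes_ok<Shape<1, 16>, Shape<1, 16>, Shape<16, 16>>(),
+              "1x16 * 16x16 -> 1x16 is a consistent GEMV shape");
+static_assert(!gemv_shapes_ok<Shape<1, 16>, Shape<1, 8>, Shape<16, 16>>(),
+              "Left::Cols and Right::Rows disagree on K");
+```
+
+The static part can be checked with `static_assert`, while the valid-size part depends on values that are only known at run time.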
+
+### TGEMV / TGEMV_ACC datatype constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
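+
+The A2A3 triples listed above can also be written as a compile-time predicate. This is only a sketch: `half_tag` and `bf16_tag` are placeholder types standing in for the toolchain's `half` and `bfloat16_t`, and `a2a3_triple_ok` is a hypothetical name, not a PTO trait.
+
+```cpp
+#include <cstdint>
+#include <type_traits>
+
+// Placeholder tags: the real half/bfloat16_t types come from the toolchain.
+struct half_tag {};
+struct bf16_tag {};
+
+// True only for the (CType, AType, BType) triples listed for A2A3 above.
+template <typename C, typename A, typename B>
+constexpr bool a2a3_triple_ok() {
+  return (std::is_same<C, std::int32_t>::value &&
+          std::is_same<A, std::int8_t>::value &&
+          std::is_same<B, std::int8_t>::value) ||
+         (std::is_same<C, float>::value && std::is_same<A, B>::value &&
+          (std::is_same<A, half_tag>::value ||
+           std::is_same<A, float>::value ||
+           std::is_same<A, bf16_tag>::value));
+}
+
+static_assert(a2a3_triple_ok<float, half_tag, half_tag>(),
+              "(float, half, half) is listed");
+static_assert(!a2a3_triple_ok<float, half_tag, bf16_tag>(),
+              "mixed half/bfloat16 inputs are not listed");
+```
+
+The A5 rules are looser (selected fp8 pairs are target-defined), so they are not modeled here.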
+
+### TGEMV_BIAS additional constraints
+
+- Bias tile datatype must exactly match `TileRes::DType`.
+- Bias tile must be configured as a single row.
+- Bias tile location must be `TileType::Bias`.
+- **Additional A5 note**:
+    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
+
+## Examples
+
+### Auto
+
+#### 1. TGEMV
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c;
+  TGEMV(c, a, b);
+}
+```
+
+#### 2. TGEMV_ACC
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+#### 3. TGEMV_BIAS
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // must match TileRes::DType
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+### Manual
+
+#### 1. TGEMV
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TGEMV(c, a, b);
+}
+```
+
+#### 2. TGEMV_ACC
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+#### 3. TGEMV_BIAS
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // must match TileRes::DType
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TGEMV_ACC.md b/designs/outerCube/PTOISA/TGEMV_ACC.md
new file mode 100644
index 00000000..7bad9382
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_ACC.md
@@ -0,0 +1,166 @@
+# TGEMV_ACC
+
+## Tile Operation Diagram
+
+![TGEMV_ACC tile operation](../figures/isa/TGEMV_ACC.svg)
+
+## Introduction
+
+Tile-based GEMV with explicit accumulator input tile (`cInMatrix`) and output tile (`cOutMatrix`).
+
+## See also
+
+- Base GEMV instruction: `docs/isa/TGEMV.md`.
+- Bias variant: `docs/isa/TGEMV_BIAS.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Let:
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= j < N` (accumulates into the existing output tile):
+
+$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
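+
+One use of an explicit accumulator input is to split `K` into chunks and chain the partial products. The sketch below illustrates that idea with a hypothetical scalar model (`gemv_acc_ref` and `gemv_split_k` are not PTO APIs). It uses small integer values so both summation orders agree exactly; with general floating-point data the results may differ in the last bits.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+using Vec = std::vector<float>;
+
+// Hypothetical scalar model of C_out = C_in + A * B (not the PTO intrinsic).
+// a holds K values; b is K x N, row-major; cIn holds N values.
+Vec gemv_acc_ref(const Vec &cIn, const Vec &a, const Vec &b) {
+  const std::size_t n = cIn.size();
+  Vec cOut = cIn;
+  for (std::size_t k = 0; k < a.size(); ++k)
+    for (std::size_t j = 0; j < n; ++j)
+      cOut[j] += a[k] * b[k * n + j];
+  return cOut;
+}
+
+// Splits K at kSplit and chains two accumulating steps.
+Vec gemv_split_k(const Vec &a, const Vec &b, std::size_t n, std::size_t kSplit) {
+  Vec a0(a.begin(), a.begin() + kSplit), a1(a.begin() + kSplit, a.end());
+  Vec b0(b.begin(), b.begin() + kSplit * n), b1(b.begin() + kSplit * n, b.end());
+  Vec c0 = gemv_acc_ref(Vec(n, 0.0f), a0, b0);  // first chunk starts from zero
+  return gemv_acc_ref(c0, a1, b1);              // second chunk accumulates
+}
+```
+
+Here the output of the first step plays the role of `cInMatrix` for the second step.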
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+## Constraints
+
+### Common shape and location constraints
+
+- Static shape constraints:
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile locations:
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- Runtime valid-size constraints:
+    - `m` must be `1`
+    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
+
+### Datatype constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGEMV_ACC_zh.md b/designs/outerCube/PTOISA/TGEMV_ACC_zh.md
new file mode 100644
index 00000000..7a79476d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_ACC_zh.md
@@ -0,0 +1,165 @@
+# TGEMV_ACC
+
+## Tile Operation Diagram
+
+![TGEMV_ACC tile operation](../figures/isa/TGEMV_ACC.svg)
+
+## Introduction
+
+GEMV with an explicit accumulator input tile (`cInMatrix`) and output tile (`cOutMatrix`).
+
+## See also
+
+- Base GEMV instruction: `docs/isa/TGEMV.md`.
+- Bias variant: `docs/isa/TGEMV_BIAS.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Let:
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= j < N` (accumulates into the existing output tile):
+
+$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+## Constraints
+
+### Common shape and location constraints
+
+- Static shape constraints:
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile locations:
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- Runtime valid-size constraints:
+    - `m` must be `1`
+    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
+
+### Datatype constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional (when the instruction takes tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGEMV_BIAS.md b/designs/outerCube/PTOISA/TGEMV_BIAS.md
new file mode 100644
index 00000000..08a40fcf
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_BIAS.md
@@ -0,0 +1,178 @@
+# TGEMV_BIAS
+
+## Tile Operation Diagram
+
+![TGEMV_BIAS tile operation](../figures/isa/TGEMV_BIAS.svg)
+
+## Introduction
+
+Tile-based GEMV with bias add.
+
+## See also
+
+- Base GEMV instruction: `docs/isa/TGEMV.md`.
+- Accumulation variant: `docs/isa/TGEMV_ACC.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Let:
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= j < N` (adds a bias term to the matrix product):
+
+$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
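+
+As a quick sanity check of the formula above, consider a hypothetical case with `K = 2` and `N = 2`. The values are illustrative only and ignore any target-specific rounding or datatype promotion:
+
+$$ \mathrm{A} = \begin{bmatrix} 1 & 2 \end{bmatrix},\quad \mathrm{B} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\quad \mathrm{Bias} = \begin{bmatrix} 0.5 & -1 \end{bmatrix} $$
+
+$$ \mathrm{C}_{0,0} = 0.5 + (1 \cdot 1 + 2 \cdot 3) = 7.5,\qquad \mathrm{C}_{0,1} = -1 + (1 \cdot 2 + 2 \cdot 4) = 9 $$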
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## Constraints
+
+### Common shape and location constraints
+
+- Static shape constraints:
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile locations:
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- Runtime valid-size constraints:
+    - `m` must be `1`
+    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
+
+### Datatype constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+
+### Bias-specific constraints
+
+- Bias tile datatype must exactly match `TileRes::DType`.
+- Bias tile must be configured as a single row.
+- Bias tile location must be `TileType::Bias`.
+- **Additional A5 note**:
+    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // must match TileRes::DType
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // must match TileRes::DType
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGEMV_BIAS_zh.md b/designs/outerCube/PTOISA/TGEMV_BIAS_zh.md
new file mode 100644
index 00000000..5e78f56a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_BIAS_zh.md
@@ -0,0 +1,177 @@
+# TGEMV_BIAS
+
+## Tile Operation Diagram
+
+![TGEMV_BIAS tile operation](../figures/isa/TGEMV_BIAS.svg)
+
+## Introduction
+
+GEMV with bias add.
+
+## See also
+
+- Base GEMV instruction: `docs/isa/TGEMV.md`.
+- Accumulation variant: `docs/isa/TGEMV_ACC.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Let:
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= j < N` (adds a bias term to the matrix product):
+
+$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**Note:** Exact accumulator behavior and datatype promotion are target/implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## Constraints
+
+### Common shape and location constraints
+
+- Static shape constraints:
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile locations:
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- Runtime valid-size constraints:
+    - `m` must be `1`
+    - `k` and `n` (taken from `bMatrix.GetValidRow()` and `bMatrix.GetValidCol()`) must be in `[1, 4095]`
+
+### Datatype constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half`, `bfloat16_t`, `float`, and selected fp8 pairs (target-defined).
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+
+### Bias-specific constraints
+
+- Bias tile datatype must exactly match `TileRes::DType`.
+- Bias tile must be configured as a single row.
+- Bias tile location must be `TileType::Bias`.
+- **Additional A5 note**:
+    - No separate explicit `m/k/n` runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, half, 1, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, half, 1, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGEMV_MX.md b/designs/outerCube/PTOISA/TGEMV_MX.md
new file mode 100644
index 00000000..7d117ba7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_MX.md
@@ -0,0 +1,127 @@
+# TGEMV_MX
+
+## Tile Operation Diagram
+
+![TGEMV_MX tile operation](../figures/isa/TGEMV_MX.svg)
+
+## Introduction
+
+GEMV with scaling tiles for mixed-precision / quantized matrix-vector compute on supported targets.
+
+This instruction family extends `TGEMV` with additional scale operands (mx path). Accumulator and scale handling are target-dependent.
+
+## Math Interpretation
+
+Conceptually (base GEMV path):
+
+$$
+\mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j}
+$$
+
+For `TGEMV_MX`, scale tiles participate in implementation-defined mixed-precision reconstruction / scaling. The architectural contract is that output corresponds to the target-defined mx GEMV semantics.
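+
+As a rough illustration of the base path only, the sum above can be modelled in plain C++. This is a sketch under stated assumptions, not the PTO implementation: `gemv_base`, the `std::vector` storage, and the row-major layout of `B` are all hypothetical, and the scale operands are deliberately left out because their handling is target-defined.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical scalar reference for C[0][j] = sum_k A[0][k] * B[k][j].
+// `a` holds K values; `b` holds K*N values in row-major order (assumed layout).
+inline std::vector<float> gemv_base(const std::vector<float> &a,
+                                    const std::vector<float> &b,
+                                    std::size_t K, std::size_t N) {
+  std::vector<float> c(N, 0.0f);
+  for (std::size_t j = 0; j < N; ++j) {
+    for (std::size_t k = 0; k < K; ++k) {
+      c[j] += a[k] * b[k * N + j];
+    }
+  }
+  return c;
+}
+```
+
+For example, with `K = 2`, `N = 2`, `A = [1, 2]` and `B = [[3, 4], [5, 6]]`, the model yields `C = [13, 16]`.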
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+%acc = tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
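+The overloads above cover the plain, accumulating (`cOutMatrix`/`cInMatrix`), and bias (`biasData`) forms; each has a variant that selects an `AccPhase` through the leading template parameter.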
+
+## Constraints
+
+- Legality of data types, tile locations, fractal/layout combinations, and scaling formats is decided by backend-specific mx checks.
+- Scale tile compatibility and accumulator promotion are defined by the target backend.
+- For portability, validate the exact `(A, B, scaleA, scaleB, C)` type tuple and tile layout against the constraints of the target implementation.
+
+## Examples
+
+For practical usage patterns, see:
+
+- `docs/isa/TMATMUL_MX.md`
+- `docs/isa/TGEMV.md`
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TGEMV_MX_zh.md b/designs/outerCube/PTOISA/TGEMV_MX_zh.md
new file mode 100644
index 00000000..1b7e9d31
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_MX_zh.md
@@ -0,0 +1,98 @@
+# TGEMV_MX
+
+## 指令示意图
+
+![TGEMV_MX tile operation](../figures/isa/TGEMV_MX.svg)
+
+## 简介
+
+带缩放 Tile 的 GEMV 变体，支持混合精度/量化矩阵向量计算。
+
+## 数学语义
+
+Conceptually (base GEMV path):
+
+$$
+\mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j}
+$$
+
+For `TGEMV_MX`, scale tiles participate in implementation-defined mixed-precision reconstruction / scaling. The architectural contract is that output corresponds to the target-defined mx GEMV semantics.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+%acc = tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%acc = pto.tgemv.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tgemv.mx ins(%a, %a_scale, %b, %b_scale : (!pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)) outs(%acc : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+Additional overloads support accumulation/bias variants and `AccPhase` selection.
+
+## 约束
+
+- Uses backend-specific mx legality checks for data types, tile locations, fractal/layout combinations, and scaling formats.
+- Scale tile compatibility and accumulator promotion are implementation-defined by target backend.
+- For portability, validate the exact `(A, B, scaleA, scaleB, C)` type tuple and tile layout against target implementation constraints.
+
+## 示例
+
+For practical usage patterns, see:
+
+- `docs/isa/TMATMUL_MX.md`
+- `docs/isa/TGEMV.md`
diff --git a/designs/outerCube/PTOISA/TGEMV_zh.md b/designs/outerCube/PTOISA/TGEMV_zh.md
new file mode 100644
index 00000000..f17168fe
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGEMV_zh.md
@@ -0,0 +1,282 @@
+# TGEMV
+
+## 指令示意图
+
+![TGEMV tile operation](../figures/isa/TGEMV.svg)
+
+## 简介
+
+通用矩阵-向量乘法，生成累加器/输出 Tile。
+
+## 数学语义
+
+设：
+
+- `M = 1`
+- `K = bMatrix.GetValidRow()`
+- `N = bMatrix.GetValidCol()`
+
+### 1. TGEMV（基于 Tile 的 GEMV）
+
+对于 `0 <= j < N`（有效矩阵乘法域中的输出元素）：
+
+$$ \mathrm{C}_{0,j} = \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+### 2. TGEMV_ACC（带累加的基于 Tile 的 GEMV）
+
+对于 `0 <= j < N`（累加到现有 tile）：
+
+$$ \mathrm{C}_{0,j} \gets \mathrm{C}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+### 3. TGEMV_BIAS（带偏置的基于 Tile 的 GEMV）
+
+对于 `0 <= j < N`（将偏置项添加到矩阵乘积）：
+
+$$ \mathrm{C}_{0,j} = \mathrm{Bias}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} $$
+
+**注意：** 精确的累加器行为和数据类型提升由目标/实现定义。
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+
+%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+
+%acc = tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c = pto.tgemv.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tgemv.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
+
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents&... events);
+
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TGEMV_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents&... events);
+```
+
+## 约束
+
+### 通用形状与位置约束
+
+以下约束在未特别说明时同时适用于 `TGEMV`、`TGEMV_ACC` 和 `TGEMV_BIAS`。
+
+- 静态形状约束：
+    - `TileLeft::Rows == TileRes::Rows`
+    - `TileLeft::Cols == TileRight::Rows`
+    - `TileRight::Cols == TileRes::Cols`
+- Tile 位置约束：
+    - `TileLeft::Loc == Left`
+    - `TileRight::Loc == Right`
+    - `TileRes::Loc == Acc`
+- 运行时有效尺寸约束：
+    - `m` 必须为 `1`
+    - `k` 和 `n`（取自 `bMatrix.GetValidRow()` 与 `bMatrix.GetValidCol()`）必须位于 `[1, 4095]`
+
+### TGEMV / TGEMV_ACC 数据类型约束
+
+- **实现检查 (A2A3)**:
+    - 支持的 `(CType, AType, BType)` 三元组：
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+- **实现检查 (A5)**:
+    - 累加器类型必须是 `int32_t` 或 `float`。
+    - 如果为 `int32_t`：`AType == int8_t` 且 `BType == int8_t`。
+    - 如果为 `float`：支持 `half`、`bfloat16_t`、`float` 以及选定的 fp8 组合（目标定义）。
+    - 会强制执行以下分形/布局约束：
+        - Left：`Loc == Left`、`!isRowMajor`、`SFractal == RowMajor`
+        - Right：`Loc == Right`、`isRowMajor`、`SFractal == ColMajor`
+        - Acc：`Loc == Acc`、`!isRowMajor`、`SFractal == RowMajor`
+
+### TGEMV_BIAS 的附加约束
+
+- 偏置 tile 的数据类型必须与 `TileRes::DType` 完全一致。
+- 偏置 tile 必须配置为单行。
+- 偏置 tile 的位置必须为 `TileType::Bias`。
+- **A5 附加说明**：
+    - 除上述 GEMV 约定外，底层 A5 matmul 实现不会再单独补充一组显式的 `m/k/n` 运行时断言。
+
+## 示例
+
+### 自动（Auto）
+
+#### 1. TGEMV
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c;
+  TGEMV(c, a, b);
+}
+```
+
+#### 2. TGEMV_ACC
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+#### 3. TGEMV_BIAS
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, half, 1, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+### 手动（Manual）
+
+#### 1. TGEMV
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TGEMV(c, a, b);
+}
+```
+
+#### 2. TGEMV_ACC
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TGEMV_ACC(c1, c0, a, b);
+}
+```
+
+#### 3. TGEMV_BIAS
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 1, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, half, 1, 16>;
+  using C = TileAcc<float, 1, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TGEMV_BIAS(c, a, b, bias);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%acc = tgemv %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tgemv ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TGET_SCALE_ADDR.md b/designs/outerCube/PTOISA/TGET_SCALE_ADDR.md
new file mode 100644
index 00000000..458b05ba
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGET_SCALE_ADDR.md
@@ -0,0 +1,83 @@
+# TGET_SCALE_ADDR
+
+## Tile Operation Diagram
+
+![TGET_SCALE_ADDR tile operation](../figures/isa/TGET_SCALE_ADDR.svg)
+
+## Introduction
+
+Bind the on-chip address of the output tile to a scaled copy of the input tile's on-chip address.
+
+The scaling is a right shift by `SHIFT_MX_ADDR`, which is defined in `include/pto/npu/a5/utils.hpp`.
+
+## Math Interpretation
+
+Address(`dst`) = Address(`src`) >> `SHIFT_MX_ADDR`
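+
+The relation can be sketched in plain C++ as follows. This is only an illustration: `scaled_addr` is a hypothetical helper, and the shift value `4` used in the note below is an assumed placeholder, not the real `SHIFT_MX_ADDR` (take that from `include/pto/npu/a5/utils.hpp`).
+
+```cpp
+#include <cstdint>
+
+// Hypothetical model of Address(dst) = Address(src) >> shift.
+// `shift` stands in for SHIFT_MX_ADDR, whose actual value is target-defined.
+constexpr std::uint64_t scaled_addr(std::uint64_t src_addr, unsigned shift) {
+  return src_addr >> shift;
+}
+```
+
+Under the assumed shift of `4`, a source address of `0x1000` maps to `0x100`; the low bits of the source address are discarded by the shift.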
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### IR Level 1 (SSA)
+
+TODO
+
+### IR Level 2 (DPS)
+
+TODO
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, WaitEvents&... events);
+```
+
+## Constraints
+
+Enforced by `TGET_SCALE_ADDR_IMPL`:
+
+- **Both `src` and `dst` must be Tile instances**
+- **Currently works only in auto mode** (manual mode support is planned)
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+// Every non-type template parameter needs its own `int`.
+template <typename T, int ARows, int ACols, int BRows, int BCols>
+void example() {
+    using LeftTile = TileLeft<T, ARows, ACols>;
+    using RightTile = TileRight<T, BRows, BCols>;
+
+    using LeftScaleTile = TileLeftScale<T, ARows, ACols>;
+    using RightScaleTile = TileRightScale<T, BRows, BCols>;
+
+    LeftTile aTile;
+    RightTile bTile;
+    LeftScaleTile aScaleTile;
+    RightScaleTile bScaleTile;
+
+    TGET_SCALE_ADDR(aScaleTile, aTile);
+    TGET_SCALE_ADDR(bScaleTile, bTile);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+TODO
+
+### Manual Mode
+
+TODO
+
+### PTO Assembly Form
+
+TODO
diff --git a/designs/outerCube/PTOISA/TGET_SCALE_ADDR_zh.md b/designs/outerCube/PTOISA/TGET_SCALE_ADDR_zh.md
new file mode 100644
index 00000000..0dad8946
--- /dev/null
+++ b/designs/outerCube/PTOISA/TGET_SCALE_ADDR_zh.md
@@ -0,0 +1,81 @@
+# TGET_SCALE_ADDR
+
+## Tile Operation Diagram
+
+![TGET_SCALE_ADDR tile operation](../figures/isa/TGET_SCALE_ADDR.svg)
+
+## Introduction
+
+将输入Tile的片上地址数值按比例扩展，将其结果数值绑定为输出Tile的片上地址。
+
+这个扩展因子是由`include/pto/npu/a5/utils.hpp`中的右移值`SHIFT_MX_ADDR`定义的。
+
+## 数学语义
+
+Address(`dst`) = Address(`src`) >> `SHIFT_MX_ADDR`
+
+## 汇编语法
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### IR Level 1 (SSA)
+
+TODO
+
+### IR Level 2 (DPS)
+
+TODO
+
+## C++ 内建接口
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, WaitEvents&... events);
+```
+
+## 约束
+
+- **输入和输出都必须为Tile对象**
+- **目前只能用在auto模式下**（以后会将支持manual模式下的实现）
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+// Every non-type template parameter needs its own `int`.
+template <typename T, int ARows, int ACols, int BRows, int BCols>
+void example() {
+    using LeftTile = TileLeft<T, ARows, ACols>;
+    using RightTile = TileRight<T, BRows, BCols>;
+
+    using LeftScaleTile = TileLeftScale<T, ARows, ACols>;
+    using RightScaleTile = TileRightScale<T, BRows, BCols>;
+
+    LeftTile aTile;
+    RightTile bTile;
+    LeftScaleTile aScaleTile;
+    RightScaleTile bScaleTile;
+
+    TGET_SCALE_ADDR(aScaleTile, aTile);
+    TGET_SCALE_ADDR(bScaleTile, bTile);
+}
+```
+
+## asm form examples
+
+### Auto Mode
+
+TODO
+
+### Manual Mode
+
+TODO
+
+### PTO Assembly Form
+
+TODO
diff --git a/designs/outerCube/PTOISA/THISTOGRAM.md b/designs/outerCube/PTOISA/THISTOGRAM.md
new file mode 100644
index 00000000..62186da0
--- /dev/null
+++ b/designs/outerCube/PTOISA/THISTOGRAM.md
@@ -0,0 +1,40 @@
+# THISTOGRAM
+
+## Tile Operation Diagram
+
+![THISTOGRAM tile operation](../figures/isa/THISTOGRAM.svg)
+
+## Introduction
+
+Accumulate histogram bin counts from source values using an index tile.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.thistogram ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/THISTOGRAM_zh.md b/designs/outerCube/PTOISA/THISTOGRAM_zh.md
new file mode 100644
index 00000000..1f352be5
--- /dev/null
+++ b/designs/outerCube/PTOISA/THISTOGRAM_zh.md
@@ -0,0 +1,41 @@
+# THISTOGRAM
+
+## 指令示意图
+
+![THISTOGRAM tile operation](../figures/isa/THISTOGRAM.svg)
+
+## 简介
+
+使用索引 Tile 从源值中累计直方图 bin 计数。
+
+## 数学语义
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## 汇编语法
+
+PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.thistogram ...
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`.
+
+## 约束
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## 示例
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TIMG2COL.md b/designs/outerCube/PTOISA/TIMG2COL.md
new file mode 100644
index 00000000..5cba652d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TIMG2COL.md
@@ -0,0 +1,85 @@
+# TIMG2COL
+
+## Tile Operation Diagram
+
+![TIMG2COL tile operation](../figures/isa/TIMG2COL.svg)
+
+## Introduction
+
+Transform an input feature-map tile (e.g. NC1HWC0 layout) into an im2col-style matrix tile for convolution-like workloads. Parameters are provided via `Img2colTileConfig` and `(posM, posK)` offsets.
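+
+For readers unfamiliar with the transform, a generic im2col can be sketched in plain C++. This is a minimal illustration under loud assumptions: a single-channel row-major image, stride 1, no padding or dilation, and a hypothetical `im2col_ref` helper. It does not model the NC1HWC0 layout, `Img2colTileConfig`, or the `(posM, posK)` offsets.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical reference im2col: every output row is one flattened KH x KW window.
+// Output shape is (outH * outW) rows by (KH * KW) columns, stored row-major.
+inline std::vector<float> im2col_ref(const std::vector<float> &img, std::size_t H, std::size_t W,
+                                     std::size_t KH, std::size_t KW) {
+  const std::size_t outH = H - KH + 1;  // assumes KH <= H
+  const std::size_t outW = W - KW + 1;  // assumes KW <= W
+  std::vector<float> cols;
+  cols.reserve(outH * outW * KH * KW);
+  for (std::size_t oh = 0; oh < outH; ++oh)
+    for (std::size_t ow = 0; ow < outW; ++ow)
+      for (std::size_t kh = 0; kh < KH; ++kh)
+        for (std::size_t kw = 0; kw < KW; ++kw)
+          cols.push_back(img[(oh + kh) * W + (ow + kw)]);
+  return cols;
+}
+```
+
+With a 3x3 image holding `1..9` and a 2x2 window, the first output row is `[1, 2, 4, 5]` and the last is `[5, 6, 8, 9]`; a convolution then reduces to a matrix product over these rows.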
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
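+// TileData, ConvTileData, and WaitEvents are template parameters; the template header is not shown in this listing.
+PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM = 0, uint16_t posK = 0,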
+                              WaitEvents&... events);
+```
+
+## Constraints
+
+- This instruction is target/implementation-specific. See `include/pto/npu/*/TImg2col.hpp` for the supported tile types/layouts and config fields.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TIMG2COL_zh.md b/designs/outerCube/PTOISA/TIMG2COL_zh.md
new file mode 100644
index 00000000..c13dacf8
--- /dev/null
+++ b/designs/outerCube/PTOISA/TIMG2COL_zh.md
@@ -0,0 +1,58 @@
+# TIMG2COL
+
+## 指令示意图
+
+![TIMG2COL tile operation](../figures/isa/TIMG2COL.svg)
+
+## 简介
+
+用于类卷积工作负载的图像到列变换。
+
+## 数学语义
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.timg2col %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.timg2col ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+PTO_INST RecordEvent TIMG2COL(TileData &dst, ConvTileData &src, uint16_t posM = 0, uint16_t posK = 0,
+                              WaitEvents&... events);
+```
+
+## 约束
+
+- This instruction is target/implementation-specific. See `include/pto/npu/*/TImg2col.hpp` for the supported tile types/layouts and config fields.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TINSERT.md b/designs/outerCube/PTOISA/TINSERT.md
new file mode 100644
index 00000000..c5f02509
--- /dev/null
+++ b/designs/outerCube/PTOISA/TINSERT.md
@@ -0,0 +1,107 @@
+# TINSERT
+
+## Tile Operation Diagram
+
+![TINSERT tile operation](../figures/isa/TINSERT.svg)
+
+## Introduction
+
+Insert a source sub-tile into a destination tile at `(indexRow, indexCol)`. This is conceptually the inverse of `TEXTRACT` for many layouts.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually, for `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}
+$$
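+
+The mapping above, combined with the runtime bounds rule listed under Constraints, can be sketched in plain C++. This is an illustrative model only: the `Mat` struct and `insert_ref` helper are hypothetical, storage is assumed row-major, and the `reluMode` and pre-quant paths are not modelled.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical row-major matrix used only for this illustration.
+struct Mat {
+  std::size_t rows, cols;
+  std::vector<float> data;  // rows * cols elements
+};
+
+// Copies src into dst at (indexRow, indexCol): dst[indexRow+i][indexCol+j] = src[i][j].
+// Returns false and leaves dst untouched when the source would not fit.
+inline bool insert_ref(Mat &dst, const Mat &src, std::size_t indexRow, std::size_t indexCol) {
+  if (indexRow + src.rows > dst.rows || indexCol + src.cols > dst.cols) return false;
+  for (std::size_t i = 0; i < src.rows; ++i)
+    for (std::size_t j = 0; j < src.cols; ++j)
+      dst.data[(indexRow + i) * dst.cols + (indexCol + j)] = src.data[i * src.cols + j];
+  return true;
+}
+```
+
+Inserting a 2x2 source at `(1, 1)` into a 3x3 destination overwrites the bottom-right 2x2 corner; the same source at `(2, 2)` is rejected by the bounds check.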
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+#ifdef PTO_NPU_ARCH_A5
+template <TInsertMode mode, typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint32_t indexRow = 0, uint32_t indexCol = 0, WaitEvents &... events);
+#endif
+```
+
+## Constraints
+
+- **A2/A3**:
+    - The documented overloads map to `Acc -> Mat` insertion paths, including plain, `reluMode`, scalar pre-quant, and vector pre-quant (`TINSERT_FP`) forms.
+    - Runtime bounds must satisfy `indexRow + src.Rows <= dst.Rows` and `indexCol + src.Cols <= dst.Cols`.
+- **A5**:
+    - In addition to the `Acc -> Mat` insertion paths above, A5 also exposes `template <TInsertMode mode, ...> TINSERT(...)` for `Vec -> Mat` and `Vec -> Vec` insertion variants.
+    - `mode == TInsertMode::ND` requires a row-major source vector tile and inserts into a matrix tile in ND layout.
+    - `mode == TInsertMode::ND_VEC` requires both source and destination to be row-major vector tiles.
+    - NZ-family modes (`NZ`, `NZ_PLUS_1`, `SPLIT2_NZ_PLUS_1`, `SPLIT4_NZ_PLUS_1`) require an NZ-format source vector tile and a matrix destination tile.
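+
+The runtime bounds rule above can be illustrated with a small host-side reference model. This is a minimal sketch, not the PTO implementation: `RefTile` and `ref_tinsert` are hypothetical names, the tile is assumed to be a dense row-major `float` buffer, and layout conversion, `reluMode`, and pre-quantization are ignored.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side stand-in for a tile: a dense row-major float buffer.
+struct RefTile {
+  std::size_t rows;
+  std::size_t cols;
+  std::vector<float> data;
+
+  RefTile(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
+
+  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
+  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
+};
+
+// Copies src into dst at offset (indexRow, indexCol).
+// Returns false and leaves dst untouched when the bounds rule
+//   indexRow + src.rows <= dst.rows && indexCol + src.cols <= dst.cols
+// does not hold.
+bool ref_tinsert(RefTile &dst, const RefTile &src, std::size_t indexRow, std::size_t indexCol) {
+  if (indexRow + src.rows > dst.rows || indexCol + src.cols > dst.cols) {
+    return false;
+  }
+  for (std::size_t i = 0; i < src.rows; ++i) {
+    for (std::size_t j = 0; j < src.cols; ++j) {
+      dst.at(indexRow + i, indexCol + j) = src.at(i, j);
+    }
+  }
+  return true;
+}
+```
+
+Returning `false` on an out-of-range offset is a choice made for this sketch only; how a real backend reports a violated bound is not specified here.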
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TINSERT_FP.md b/designs/outerCube/PTOISA/TINSERT_FP.md
new file mode 100644
index 00000000..ce12ec57
--- /dev/null
+++ b/designs/outerCube/PTOISA/TINSERT_FP.md
@@ -0,0 +1,90 @@
+# TINSERT_FP
+
+
+## Tile Operation Diagram
+
+![TINSERT_FP tile operation](../figures/isa/TINSERT_FP.svg)
+
+## Introduction
+
+Vector-quantization variant of `TINSERT` that also takes an `fp` (scaling) tile.
+
+## See also
+
+- TINSERT base instruction: `docs/isa/TINSERT.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TINSERT_FP_zh.md b/designs/outerCube/PTOISA/TINSERT_FP_zh.md
new file mode 100644
index 00000000..5dc56ee3
--- /dev/null
+++ b/designs/outerCube/PTOISA/TINSERT_FP_zh.md
@@ -0,0 +1,59 @@
+# TINSERT_FP
+
+## 指令示意图
+
+![TINSERT_FP tile operation](../figures/isa/TINSERT_FP.svg)
+
+## 简介
+
+带 fp/缩放 Tile 的插入（向量量化参数）。
+
+## 数学语义
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.tinsert_fp %src, %fp, %idxrow, %idxcol : (!pto.tile<...>, !pto.tile<...>, dtype, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.tinsert_fp ins(%src, %fp, %idxrow, %idxcol : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+```
+
+## 约束
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TINSERT_zh.md b/designs/outerCube/PTOISA/TINSERT_zh.md
new file mode 100644
index 00000000..80448b6d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TINSERT_zh.md
@@ -0,0 +1,107 @@
+# TINSERT
+
+## 指令示意图
+
+![TINSERT tile operation](../figures/isa/TINSERT.svg)
+
+## 简介
+
+在 (indexRow, indexCol) 偏移处将子 Tile 插入到目标 Tile 中。
+
+## 数学语义
+
+设 `R = src.GetValidRow()` 和 `C = src.GetValidCol()`。概念上，对于 `0 <= i < R` 和 `0 <= j < C`：
+
+$$
+\mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, uint16_t indexRow, uint16_t indexCol, WaitEvents &... events);
+
+#ifdef PTO_NPU_ARCH_A5
+template <TInsertMode mode, typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, uint32_t indexRow = 0, uint32_t indexCol = 0, WaitEvents &... events);
+#endif
+```
+
+## 约束
+
+- **A2/A3**:
+    - 文档中列出的这些重载对应 `Acc -> Mat` 插入路径，包括普通形式、`reluMode` 形式、标量预量化形式以及向量预量化（`TINSERT_FP`）形式。
+    - 运行时边界必须满足 `indexRow + src.Rows <= dst.Rows` 且 `indexCol + src.Cols <= dst.Cols`。
+- **A5**:
+    - 除了上面的 `Acc -> Mat` 插入路径外，A5 还额外提供 `template <TInsertMode mode, ...> TINSERT(...)`，用于 `Vec -> Mat` 与 `Vec -> Vec` 插入变体。
+    - `mode == TInsertMode::ND` 要求源向量 tile 为行优先，并以 ND 布局插入到矩阵 tile。
+    - `mode == TInsertMode::ND_VEC` 要求源和目的都为行优先向量 tile。
+    - NZ 系列模式（`NZ`、`NZ_PLUS_1`、`SPLIT2_NZ_PLUS_1`、`SPLIT4_NZ_PLUS_1`）要求源向量 tile 为 NZ 格式，目的为矩阵 tile。
+
+## 示例
+
+参见 `docs/isa/` 和 `docs/coding/tutorials/` 中的相关示例。
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tinsert ins(%src[%r0, %r1] : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TLOAD.md b/designs/outerCube/PTOISA/TLOAD.md
new file mode 100644
index 00000000..3f36e1ee
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLOAD.md
@@ -0,0 +1,159 @@
+# TLOAD
+
+
+## Tile Operation Diagram
+
+![TLOAD tile operation](../figures/isa/TLOAD.svg)
+
+## Introduction
+
+Load data from a GlobalTensor (GM) into a Tile.
+
+## Math Interpretation
+
+The exact indexing depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually, in a 2D view with base offset `(r0, c0)`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j} $$
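+
+The 2D view above amounts to a windowed copy out of a strided buffer. The following is a minimal host-side sketch under stated assumptions: `ref_tload` and `srcStride` are hypothetical names, global memory is modeled as a row-major `float` vector, and the tile is returned as a dense `validRows x validCols` buffer. It does not model layout conversion, padding, or events.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Copies a validRows x validCols window that starts at (r0, c0) of a
+// row-major buffer with row stride srcStride into a dense tile buffer.
+std::vector<float> ref_tload(const std::vector<float> &gm, std::size_t srcStride,
+                             std::size_t r0, std::size_t c0,
+                             std::size_t validRows, std::size_t validCols) {
+  std::vector<float> tile(validRows * validCols, 0.0f);
+  for (std::size_t i = 0; i < validRows; ++i) {
+    for (std::size_t j = 0; j < validCols; ++j) {
+      // dst[i][j] = src[r0 + i][c0 + j]
+      tile[i * validCols + j] = gm[(r0 + i) * srcStride + (c0 + j)];
+    }
+  }
+  return tile;
+}
+```
+
+The caller is responsible for keeping the window inside `gm`; the sketch performs no bounds checking.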
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, typename... WaitEvents>
+PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
+    - Destination tile location must be `TileType::Vec` or `TileType::Mat`.
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - Runtime: all `src.GetShape(dim)` values and `dst.GetValidRow()/GetValidCol()` must be `> 0`.
+    - `TileType::Vec` loads only support matching layouts: ND->ND, DN->DN, NZ->NZ.
+    - `TileType::Mat` loads support: ND->ND, DN->DN, NZ->NZ, plus ND->NZ and DN->ZN.
+    - For ND->NZ or DN->ZN: `GlobalData::staticShape[0..2] == 1` and `TileData::SFractalSize == 512`.
+    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
+- **Implementation checks (A5)**:
+    - `sizeof(TileData::DType)` must be `1`, `2`, `4`, or `8` bytes, and must match `sizeof(GlobalData::DType)`.
+    - For `int64_t/uint64_t`, `TileData::PadVal` must be `PadValue::Null` or `PadValue::Zero`.
+    - `TileType::Vec` loads require one of the following layout pairs:
+        - ND with row-major + `SLayout::NoneBox` (ND->ND),
+        - DN with col-major + `SLayout::NoneBox` (DN->DN),
+        - NZ with `SLayout::RowMajor` (NZ->NZ).
+    - For row-major ND->ND with compile-time-known shapes, `TileData::ValidCol` must equal `GlobalData::staticShape[4]`, and `TileData::ValidRow` must equal the product of `GlobalData::staticShape[0..3]`.
+    - `TileType::Mat` loads are additionally constrained by `TLoadCubeCheck` (e.g., only specific ND/DN/NZ conversions and L1-size limits).
+    - `TileType::Mat` loads also handle MX-format loads: `MX_A_ZZ/MX_A_ND/MX_A_DN` to ZZ for scaleA, and `MX_B_NN/MX_B_ND/MX_B_DN` to NN for scaleB.
+        - For `MX_A_ZZ/MX_B_NN`: `GlobalData::staticShape[3] == 16` and `GlobalData::staticShape[4] == 2`.
+        - For `MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN`: `GlobalData::staticShape[0] == 1`, `GlobalData::staticShape[1] == 1`, and `GlobalData::staticShape[4] == 2`.
+        - For scaleA, `dst.GetValidCol() % 2 == 0`.
+        - For scaleB, `dst.GetValidRow() % 2 == 0`.
+
+- **Valid region**:
+    - The implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the transfer size.
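+
+The A2A3 layout rules listed above can be condensed into a single predicate. This is a sketch for illustration only: `RefLayout`, `RefLoc`, and `ref_tload_layout_ok_a2a3` are hypothetical names, not part of the PTO headers, and the extra conditions on `staticShape`, `SFractalSize`, and 64-bit element types are deliberately left out.
+
+```cpp
+// Hypothetical enums standing in for the layout and tile-location tags.
+enum class RefLayout { ND, DN, NZ, ZN };
+enum class RefLoc { Vec, Mat };
+
+// Matching pairs accepted for both Vec and Mat: ND->ND, DN->DN, NZ->NZ.
+constexpr bool ref_same_layout(RefLayout gm, RefLayout tile) {
+  return gm == tile && gm != RefLayout::ZN;
+}
+
+// Mat additionally accepts the two converting pairs ND->NZ and DN->ZN.
+constexpr bool ref_tload_layout_ok_a2a3(RefLoc loc, RefLayout gm, RefLayout tile) {
+  return ref_same_layout(gm, tile) ||
+         (loc == RefLoc::Mat &&
+          ((gm == RefLayout::ND && tile == RefLayout::NZ) ||
+           (gm == RefLayout::DN && tile == RefLayout::ZN)));
+}
+```
+
+Because the predicate is `constexpr`, it can also be used inside a `static_assert` when the layouts are known at compile time.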
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_auto(__gm__ T* in) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gin(in);
+  TileT t;
+  TLOAD(t, gin);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_manual(__gm__ T* in) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gin(in);
+  TileT t;
+  TASSIGN(t, 0x1000);
+  TLOAD(t, gin);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### PTO Assembly Form
+
+```text
+%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TLOAD_zh.md b/designs/outerCube/PTOISA/TLOAD_zh.md
new file mode 100644
index 00000000..47b25f99
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLOAD_zh.md
@@ -0,0 +1,132 @@
+# TLOAD
+
+## 指令示意图
+
+![TLOAD tile operation](../figures/isa/TLOAD.svg)
+
+## 简介
+
+从 GlobalTensor (GM) 加载数据到 Tile。
+
+## 数学语义
+
+Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
+!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, typename... WaitEvents>
+PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
+    - Destination tile location must be `TileType::Vec` or `TileType::Mat`.
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - Runtime: all `src.GetShape(dim)` values and `dst.GetValidRow()/GetValidCol()` must be `> 0`.
+    - `TileType::Vec` loads only support matching layouts: ND->ND, DN->DN, NZ->NZ.
+    - `TileType::Mat` loads support: ND->ND, DN->DN, NZ->NZ, plus ND->NZ and DN->ZN.
+    - For ND->NZ or DN->ZN: `GlobalData::staticShape[0..2] == 1` and `TileData::SFractalSize == 512`.
+    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
+- **实现检查 (A5)**:
+    - `sizeof(TileData::DType)` must be `1`, `2`, `4`, or `8` bytes, and must match `sizeof(GlobalData::DType)`.
+    - For `int64_t/uint64_t`, `TileData::PadVal` must be `PadValue::Null` or `PadValue::Zero`.
+    - `TileType::Vec` loads require one of the following layout pairs:
+        - ND with row-major + `SLayout::NoneBox` (ND->ND),
+        - DN with col-major + `SLayout::NoneBox` (DN->DN),
+        - NZ with `SLayout::RowMajor` (NZ->NZ).
+    - For row-major ND->ND with compile-time-known shapes, `TileData::ValidCol` must equal `GlobalData::staticShape[4]`, and `TileData::ValidRow` must equal the product of `GlobalData::staticShape[0..3]`.
+    - `TileType::Mat` loads are additionally constrained by `TLoadCubeCheck` (e.g., only specific ND/DN/NZ conversions and L1-size limits).
+    - `TileType::Mat` loads also handle MX-format loads: `MX_A_ZZ/MX_A_ND/MX_A_DN` to ZZ for scaleA, and `MX_B_NN/MX_B_ND/MX_B_DN` to NN for scaleB.
+        - For `MX_A_ZZ/MX_B_NN`: `GlobalData::staticShape[3] == 16` and `GlobalData::staticShape[4] == 2`.
+        - For `MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN`: `GlobalData::staticShape[0] == 1`, `GlobalData::staticShape[1] == 1`, and `GlobalData::staticShape[4] == 2`.
+        - For scaleA, `dst.GetValidCol() % 2 == 0`.
+        - For scaleB, `dst.GetValidRow() % 2 == 0`.
+
+- **有效区域**:
+    - The implementation uses `dst.GetValidRow()` / `dst.GetValidCol()` as the transfer size.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_auto(__gm__ T* in) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gin(in);
+  TileT t;
+  TLOAD(t, gin);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_manual(__gm__ T* in) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gin(in);
+  TileT t;
+  TASSIGN(t, 0x1000);
+  TLOAD(t, gin);
+}
+```
diff --git a/designs/outerCube/PTOISA/TLOG.md b/designs/outerCube/PTOISA/TLOG.md
new file mode 100644
index 00000000..46148513
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLOG.md
@@ -0,0 +1,114 @@
+# TLOG
+
+
+## Tile Operation Diagram
+
+![TLOG tile operation](../figures/isa/TLOG.svg)
+
+## Introduction
+
+Elementwise natural logarithm of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \log(\mathrm{src}_{i,j}) $$
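+
+As a reading aid for the formula above, here is a minimal host-side sketch. `ref_tlog` is a hypothetical name; the tile is assumed to be a row-major `float` buffer with row pitch `cols`, and elements outside the valid region are left untouched, which is an assumption of this sketch. Non-positive inputs follow the host `std::log`, which may differ from the target-defined behavior.
+
+```cpp
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+// Applies the natural logarithm to every element of the valid region.
+// dst and src share the same row pitch (cols).
+void ref_tlog(std::vector<float> &dst, const std::vector<float> &src,
+              std::size_t cols, std::size_t validRows, std::size_t validCols) {
+  for (std::size_t i = 0; i < validRows; ++i) {
+    for (std::size_t j = 0; j < validCols; ++j) {
+      dst[i * cols + j] = std::log(src[i * cols + j]);
+    }
+  }
+}
+```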
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tlog %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TLOG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be `float` or `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Domain behavior (e.g., `log(<=0)`) is target-defined.
+
+
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TLOG(out, x);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tlog %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TLOG_zh.md b/designs/outerCube/PTOISA/TLOG_zh.md
new file mode 100644
index 00000000..67f5924b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLOG_zh.md
@@ -0,0 +1,85 @@
+# TLOG
+
+## 指令示意图
+
+![TLOG tile operation](../figures/isa/TLOG.svg)
+
+## 简介
+
+Tile 的逐元素自然对数。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \log(\mathrm{src}_{i,j}) $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tlog %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.tlog %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.tlog ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TLOG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (NPU)**:
+    - `TileData::DType` must be `float` or `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **有效区域**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Domain behavior (e.g., `log(<=0)`) is target-defined.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TLOG(out, x);
+}
+```
diff --git a/designs/outerCube/PTOISA/TLRELU.md b/designs/outerCube/PTOISA/TLRELU.md
new file mode 100644
index 00000000..3f66bc86
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLRELU.md
@@ -0,0 +1,104 @@
+# TLRELU
+
+
+## Tile Operation Diagram
+
+![TLRELU tile operation](../figures/isa/TLRELU.svg)
+
+## Introduction
+
+Leaky ReLU with a scalar slope.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \begin{cases} \mathrm{src}_{i,j} & \text{if } \mathrm{src}_{i,j} > 0 \\ \mathrm{src}_{i,j} \cdot \mathrm{slope} & \text{otherwise} \end{cases} $$
+
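+The piecewise definition above can be written as a short host-side reference. This is a sketch under stated assumptions: `ref_tlrelu` is a hypothetical name, tiles are row-major `float` buffers with row pitch `cols`, and elements outside the valid region are left untouched.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Leaky ReLU over the valid region: positive inputs pass through,
+// all other inputs are multiplied by slope.
+void ref_tlrelu(std::vector<float> &dst, const std::vector<float> &src,
+                std::size_t cols, std::size_t validRows, std::size_t validCols,
+                float slope) {
+  for (std::size_t i = 0; i < validRows; ++i) {
+    for (std::size_t j = 0; j < validCols; ++j) {
+      const float x = src[i * cols + j];
+      dst[i * cols + j] = (x > 0.0f) ? x : x * slope;
+    }
+  }
+}
+```
+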
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tlrelu %src, %slope : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TLRELU(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `half`, `float16_t`, `float`, `float32_t` (floating-point types only).
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `half`, `float` (floating-point types only).
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst` and `src` must have the same valid row/col.
+    - Slope scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TLRELU(out, x, 0.1f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tlrelu %src, %slope : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TLRELU_zh.md b/designs/outerCube/PTOISA/TLRELU_zh.md
new file mode 100644
index 00000000..7fc11ff7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TLRELU_zh.md
@@ -0,0 +1,104 @@
+# TLRELU
+
+## 指令示意图
+
+![TLRELU tile operation](../figures/isa/TLRELU.svg)
+
+## 简介
+
+带标量斜率的 Leaky ReLU。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \begin{cases} \mathrm{src}_{i,j} & \text{if } \mathrm{src}_{i,j} > 0 \\ \mathrm{src}_{i,j} \cdot \mathrm{slope} & \text{otherwise} \end{cases} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tlrelu %src, %slope : !pto.tile<...>, f32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TLRELU(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `TileData::DType` 必须是以下之一：`half`、`float16_t`、`float`、`float32_t`（仅浮点类型）。
+    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
+- **实现检查 (A5)**:
+    - `TileData::DType` 必须是以下之一：`half`、`float`（仅浮点类型）。
+    - Tile 布局必须是行主序（`TileData::isRowMajor`）。
+- **通用约束**:
+    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
+    - 静态有效边界：`TileData::ValidRow <= TileData::Rows` 且 `TileData::ValidCol <= TileData::Cols`。
+    - 运行时：`dst` 和 `src` 的有效行列数必须相同。
+    - 斜率标量类型必须与 Tile 数据类型一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TLRELU(out, x, 0.1f);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tlrelu %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tlrelu %src, %slope : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tlrelu ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMATMUL.md b/designs/outerCube/PTOISA/TMATMUL.md
new file mode 100644
index 00000000..3d79405b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL.md
@@ -0,0 +1,159 @@
+# TMATMUL
+
+
+## Tile Operation Diagram
+
+![TMATMUL tile operation](../figures/isa/TMATMUL.svg)
+
+## Introduction
+
+Matrix multiply (GEMM) producing an accumulator/output tile.
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N` (output elements in the effective matmul domain):
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+
+Exact accumulator behavior and datatype promotion are target/implementation-defined.
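+
+For comparison against device results, the sum above can be evaluated with a naive host-side reference. This is a minimal sketch: `ref_tmatmul` is a hypothetical name, `A` (`M x K`) and `B` (`K x N`) are assumed to be dense row-major `float` buffers, and accumulation is done in `float` in increasing `k` order. Since accumulator behavior is target-defined, small numerical differences from hardware should be expected.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Naive reference GEMM: C[i][j] = sum over k of A[i][k] * B[k][j].
+std::vector<float> ref_tmatmul(const std::vector<float> &a, const std::vector<float> &b,
+                               std::size_t M, std::size_t K, std::size_t N) {
+  std::vector<float> c(M * N, 0.0f);
+  for (std::size_t i = 0; i < M; ++i) {
+    for (std::size_t j = 0; j < N; ++j) {
+      float acc = 0.0f;
+      for (std::size_t k = 0; k < K; ++k) {
+        acc += a[i * K + k] * b[k * N + j];
+      }
+      c[i * N + j] = acc;
+    }
+  }
+  return c;
+}
+```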
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
+    - Tile locations: `TileLeft::Loc == Left`, `TileRight::Loc == Right`, `TileRes::Loc == Acc`.
+    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+    - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+    - If `float`: supports `half/bfloat16_t/float` and selected fp8 pairs (target-defined).
+    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
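+
+The static shape rule and the runtime range rule above can be sketched in isolation. `MockTile`, `ShapesCompatible`, and `ValidExtent` are hypothetical names introduced here for illustration; the real PTO tile classes and their checks are not reproduced.
+
+```cpp
+#include <cstddef>
+
+// Hypothetical stand-in for a tile type; real PTO tiles carry more metadata.
+template <std::size_t R, std::size_t C>
+struct MockTile {
+  static constexpr std::size_t Rows = R;
+  static constexpr std::size_t Cols = C;
+};
+
+// Static rule: Left.Rows == Res.Rows, Left.Cols == Right.Rows, Right.Cols == Res.Cols.
+template <typename TileRes, typename TileLeft, typename TileRight>
+constexpr bool ShapesCompatible() {
+  return TileLeft::Rows == TileRes::Rows && TileLeft::Cols == TileRight::Rows &&
+         TileRight::Cols == TileRes::Cols;
+}
+
+// Runtime rule: each of m, k, n must lie in [1, 4095].
+constexpr bool ValidExtent(std::size_t v) { return v >= 1 && v <= 4095; }
+
+// A 16x64 left tile times a 64x32 right tile fits a 16x32 result tile.
+static_assert(ShapesCompatible<MockTile<16, 32>, MockTile<16, 64>, MockTile<64, 32>>(),
+              "shapes line up");
+// A mismatch in the shared K dimension is rejected at compile time.
+static_assert(!ShapesCompatible<MockTile<16, 32>, MockTile<16, 64>, MockTile<32, 32>>(),
+              "K mismatch is rejected");
+```
+
+Expressing the shape rule as a `constexpr` predicate keeps the failure at compile time, in the same spirit as the static checks listed above.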
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c;
+  TMATMUL(c, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TMATMUL(c, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_ACC.md b/designs/outerCube/PTOISA/TMATMUL_ACC.md
new file mode 100644
index 00000000..e16b45c6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_ACC.md
@@ -0,0 +1,147 @@
+# TMATMUL_ACC
+
+## Tile Operation Diagram
+
+![TMATMUL_ACC tile operation](../figures/isa/TMATMUL_ACC.svg)
+
+## Introduction
+
+Matrix multiply with accumulator input (fused accumulate).
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N`:
+
+$$ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
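+
+A minimal host-side sketch of this accumulate form follows. The helper `ref_matmul_acc` and the row-major `std::vector<float>` layout are assumptions for illustration, not the PTO API.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical reference model, not part of PTO.
+// Computes c1 = c0 + A * B with row-major A (M x K), B (K x N), c0 (M x N).
+std::vector<float> ref_matmul_acc(const std::vector<float>& c0, const std::vector<float>& a,
+                                  const std::vector<float>& b, std::size_t M, std::size_t K,
+                                  std::size_t N) {
+  std::vector<float> c1(c0);  // start from the incoming accumulator
+  for (std::size_t i = 0; i < M; ++i) {
+    for (std::size_t j = 0; j < N; ++j) {
+      for (std::size_t k = 0; k < K; ++k) {
+        c1[i * N + j] += a[i * K + k] * b[k * N + j];
+      }
+    }
+  }
+  return c1;
+}
+```
+
+Mathematically, splitting `K` into chunks and feeding each partial result back in as `c0` yields the same sum as a single multiply over the full `K`; floating-point rounding on a real target may differ.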
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase = AccPhase::Unspecified, typename TileRes, typename TileLeft, typename TileRight,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Constraints
+
+- All constraints from `TMATMUL` apply to the `(cOutMatrix, aMatrix, bMatrix)` triple.
+- **Implementation notes (A2A3/A5)**:
+    - `TMATMUL_ACC_IMPL` uses `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, and `bMatrix.GetValidCol()` for `m/k/n`.
+    - `cInMatrix` is not validated by explicit assertions in the current implementations (target-defined behavior).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TMATMUL_ACC(c1, c0, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TMATMUL_ACC(c1, c0, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_ACC_zh.md b/designs/outerCube/PTOISA/TMATMUL_ACC_zh.md
new file mode 100644
index 00000000..a7868230
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_ACC_zh.md
@@ -0,0 +1,120 @@
+# TMATMUL_ACC
+
+## Tile Operation Diagram
+
+![TMATMUL_ACC tile operation](../figures/isa/TMATMUL_ACC.svg)
+
+## Introduction
+
+Matrix multiply with accumulator input (fused accumulate).
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N`:
+
+$$ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc1 = tmatmul.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c_out = pto.tmatmul.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase = AccPhase::Unspecified, typename TileRes, typename TileLeft, typename TileRight,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_ACC(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Constraints
+
+- All constraints from `TMATMUL` apply to the `(cOutMatrix, aMatrix, bMatrix)` triple.
+- **Implementation notes (A2A3/A5)**:
+    - `TMATMUL_ACC_IMPL` uses `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, and `bMatrix.GetValidCol()` for `m/k/n`.
+    - `cInMatrix` is not validated by explicit assertions in the current implementations (target-defined behavior).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TMATMUL_ACC(c1, c0, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c0, c1;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c0, 0x3000);
+  TASSIGN(c1, 0x4000);
+  TMATMUL_ACC(c1, c0, a, b);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_BIAS.md b/designs/outerCube/PTOISA/TMATMUL_BIAS.md
new file mode 100644
index 00000000..e3326e46
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_BIAS.md
@@ -0,0 +1,153 @@
+# TMATMUL_BIAS
+
+## Tile Operation Diagram
+
+![TMATMUL_BIAS tile operation](../figures/isa/TMATMUL_BIAS.svg)
+
+## Introduction
+
+Matrix multiply with bias add.
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N`:
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} $$
+
+Bias broadcasting behavior is implementation-defined.
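+
+The formula above can be sketched on the host as follows. The helper `ref_matmul_bias` and the row-major `std::vector<float>` layout are assumptions for illustration, not the PTO API. The sketch follows the formula literally and adds `bias[j]` to every row; actual broadcasting on a target remains implementation-defined.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical reference model, not part of PTO.
+// a is M x K, b is K x N (row-major); bias is a single row of N values.
+std::vector<float> ref_matmul_bias(const std::vector<float>& a, const std::vector<float>& b,
+                                   const std::vector<float>& bias, std::size_t M,
+                                   std::size_t K, std::size_t N) {
+  std::vector<float> c(M * N, 0.0f);
+  for (std::size_t i = 0; i < M; ++i) {
+    for (std::size_t j = 0; j < N; ++j) {
+      float acc = 0.0f;
+      for (std::size_t k = 0; k < K; ++k) {
+        acc += a[i * K + k] * b[k * N + j];
+      }
+      c[i * N + j] = acc + bias[j];  // Bias[0][j] is shared by all rows i
+    }
+  }
+  return c;
+}
+```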
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Constraints
+
+- All constraints from `TMATMUL` apply to the `(cMatrix, aMatrix, bMatrix)` triple.
+- **Bias constraints (A2A3)**:
+    - `TileBias::DType` must match `TileRes::DType`.
+    - `TileBias::Loc == TileType::Bias` and `TileBias::Rows == 1`.
+- **Bias constraints (A5)**:
+    - `TileBias::DType` must match `TileRes::DType`.
+    - `TileBias::Loc == TileType::Bias`, `TileBias::Rows == 1`, and `TileBias::isRowMajor`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // DType matches TileRes::DType (float), per Constraints
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TMATMUL_BIAS(c, a, b, bias);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // DType matches TileRes::DType (float), per Constraints
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TMATMUL_BIAS(c, a, b, bias);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_BIAS_zh.md b/designs/outerCube/PTOISA/TMATMUL_BIAS_zh.md
new file mode 100644
index 00000000..daa5bad0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_BIAS_zh.md
@@ -0,0 +1,126 @@
+# TMATMUL_BIAS
+
+## Tile Operation Diagram
+
+![TMATMUL_BIAS tile operation](../figures/isa/TMATMUL_BIAS.svg)
+
+## Introduction
+
+Matrix multiply with bias add.
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N`:
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} $$
+
+Bias broadcasting behavior is implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.bias %a, %b, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.bias ins(%a, %b, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename TileBias,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_BIAS(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Constraints
+
+- All constraints from `TMATMUL` apply to the `(cMatrix, aMatrix, bMatrix)` triple.
+- **Bias constraints (A2A3)**:
+    - `TileBias::DType` must match `TileRes::DType`.
+    - `TileBias::Loc == TileType::Bias` and `TileBias::Rows == 1`.
+- **Bias constraints (A5)**:
+    - `TileBias::DType` must match `TileRes::DType`.
+    - `TileBias::Loc == TileType::Bias`, `TileBias::Rows == 1`, and `TileBias::isRowMajor`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // DType matches TileRes::DType (float), per Constraints
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TMATMUL_BIAS(c, a, b, bias);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using Bias = Tile<TileType::Bias, float, 1, 16>;  // DType matches TileRes::DType (float), per Constraints
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TMATMUL_BIAS(c, a, b, bias);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_MX.md b/designs/outerCube/PTOISA/TMATMUL_MX.md
new file mode 100644
index 00000000..935d4c1e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_MX.md
@@ -0,0 +1,202 @@
+# TMATMUL_MX
+
+## Tile Operation Diagram
+
+![TMATMUL_MX tile operation](../figures/isa/TMATMUL_MX.svg)
+
+## Introduction
+
+Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
+
+This instruction is currently implemented on A5 (see `include/pto/npu/a5/TMatmul.hpp`).
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+Conceptually, the result corresponds to a matrix multiply over the effective matmul domain (`0 <= i < M`, `0 <= j < N`), with the scaling tiles `aScaleMatrix` / `bScaleMatrix` configuring implementation-defined mixed-precision behavior:
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+
+The exact role of `aScaleMatrix` / `bScaleMatrix` (and any dequant/quant semantics) is target-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous forms (conceptual):
+
+```text
+%c = tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c_out = tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c = tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
+-> !pto.tile<...>
+%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%c :  !pto.tile_buf<...>)
+pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
+-> !pto.tile<...>
+%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%c :  !pto.tile_buf<...>)
+pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A5)**:
+    - `m/k/n` are taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`.
+    - Static legality checks are enforced via `CheckMadMxValid<...>()` (types, shapes, fractals, and scaling tile legality).
+- **Bias form**:
+    - `TileBias::DType` must be `float` and `TileBias::Loc == TileType::Bias` with `TileBias::Rows == 1` (A5 checks via `static_assert`).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<float8_e5m2_t, 16, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using Bias = Tile<TileType::Bias, float, 1, 32>;
+  using C = TileAcc<float, 16, 32>;
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  Bias bias;
+  C c;
+  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<float8_e5m2_t, 16, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using Bias = Tile<TileType::Bias, float, 1, 32>;
+  using C = TileAcc<float, 16, 32>;
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(scaleA, GetScaleAddr(a.data()));
+  TASSIGN(scaleB, GetScaleAddr(b.data()));
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_MX_zh.md b/designs/outerCube/PTOISA/TMATMUL_MX_zh.md
new file mode 100644
index 00000000..738035e6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_MX_zh.md
@@ -0,0 +1,173 @@
+# TMATMUL_MX
+
+## Tile Operation Diagram
+
+![TMATMUL_MX tile operation](../figures/isa/TMATMUL_MX.svg)
+
+## Introduction
+
+Matrix multiply (GEMM) with additional scaling tiles for mixed-precision / quantized matmul on supported targets.
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+Conceptually, the result corresponds to a matrix multiply over the effective matmul domain (`0 <= i < M`, `0 <= j < N`), with the scaling tiles `aScaleMatrix` / `bScaleMatrix` configuring implementation-defined mixed-precision behavior:
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+
+The exact role of `aScaleMatrix` / `bScaleMatrix` (and any dequant/quant semantics) is target-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous forms (conceptual):
+
+```text
+%c = tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c_out = tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+%c = tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
+-> !pto.tile<...>
+%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%c :  !pto.tile_buf<...>)
+pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%c = pto.tmatmul.mx %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>)
+-> !pto.tile<...>
+%c_out = pto.tmatmul.mx.acc %c_in, %a, %a_scale, %b, %b_scale : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+%c = pto.tmatmul.mx.bias %a, %a_scale, %b, %b_scale, %bias : (!pto.tile<...>, !pto.tile<...>,
+!pto.tile<...>, !pto.tile<...>, !pto.tile<...>)  -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmatmul.mx ins(%a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%c :  !pto.tile_buf<...>)
+pto.tmatmul.mx.acc ins(%c_in, %a, %a_scale, %b, %b_scale : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)
+pto.tmatmul.mx.bias ins(%a, %a_scale, %b, %b_scale, %bias : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>,
+!pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, WaitEvents &... events);
+
+template <typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight, typename TileRightScale,
+          typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileLeftScale, typename TileRight,
+          typename TileRightScale, typename TileBias, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL_MX(TileRes &cMatrix, TileLeft &aMatrix, TileLeftScale &aScaleMatrix, TileRight &bMatrix, TileRightScale &bScaleMatrix, TileBias &biasData, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A5)**:
+    - `m/k/n` are taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`.
+    - Static legality checks are enforced via `CheckMadMxValid<...>()` (types, shapes, fractals, and scaling tile legality).
+- **Bias form**:
+    - `TileBias::DType` must be `float` and `TileBias::Loc == TileType::Bias` with `TileBias::Rows == 1` (A5 checks via `static_assert`).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<float8_e5m2_t, 16, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using Bias = Tile<TileType::Bias, float, 1, 32>;
+  using C = TileAcc<float, 16, 32>;
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  Bias bias;
+  C c;
+  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<float8_e5m2_t, 16, 64>;
+  using B = TileRight<float8_e5m2_t, 64, 32>;
+  using ScaleA = TileLeftScale<float8_e8m0_t, 16, 2>;
+  using ScaleB = TileRightScale<float8_e8m0_t, 2, 32>;
+  using Bias = Tile<TileType::Bias, float, 1, 32>;
+  using C = TileAcc<float, 16, 32>;
+  A a;
+  B b;
+  ScaleA scaleA;
+  ScaleB scaleB;
+  Bias bias;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(scaleA, GetScaleAddr(a.data()));
+  TASSIGN(scaleB, GetScaleAddr(b.data()));
+  TASSIGN(bias, 0x3000);
+  TASSIGN(c, 0x4000);
+  TMATMUL_MX(c, a, scaleA, b, scaleB, bias);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMATMUL_zh.md b/designs/outerCube/PTOISA/TMATMUL_zh.md
new file mode 100644
index 00000000..dbe425a8
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMATMUL_zh.md
@@ -0,0 +1,132 @@
+# TMATMUL
+
+## Tile Operation Diagram
+
+![TMATMUL tile operation](../figures/isa/TMATMUL.svg)
+
+## Introduction
+
+Matrix multiply (GEMM) producing an accumulator/output tile.
+
+## Math Interpretation
+
+Let:
+
+- `M = aMatrix.GetValidRow()`
+- `K = aMatrix.GetValidCol()`
+- `N = bMatrix.GetValidCol()`
+
+For `0 <= i < M` and `0 <= j < N` (output elements in the effective matmul domain):
+
+$$ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} $$
+
+Exact accumulator behavior and datatype promotion are target/implementation-defined.
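+
+As a small worked instance of the formula above (the values are hypothetical and chosen only for illustration), take `M = K = N = 2`:
+
+$$ \mathrm{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad \mathrm{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}, \quad \mathrm{C} = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix} $$
+
+This sketch assumes exact arithmetic. On a real target the result may differ in the low-order bits, because accumulation order and datatype promotion are implementation-defined.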
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%acc = tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%c = pto.tmatmul %a, %b : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmatmul ins(%a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+
+template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
+PTO_INST RecordEvent TMATMUL(TileRes &cMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported `(CType, AType, BType)` triples:
+        - `(int32_t, int8_t, int8_t)`
+        - `(float, half, half)`
+        - `(float, float, float)`
+        - `(float, bfloat16_t, bfloat16_t)`
+    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
+    - Tile locations: `TileLeft::Loc == Left`, `TileRight::Loc == Right`, `TileRes::Loc == Acc`.
+    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
+- **Implementation checks (A5)**:
+    - Accumulator type must be `int32_t` or `float`.
+        - If `int32_t`: `AType == int8_t` and `BType == int8_t`.
+        - If `float`: supports `half/bfloat16_t/float` and selected fp8 pairs (target-defined).
+    - Static shape constraints: `TileLeft::Rows == TileRes::Rows`, `TileLeft::Cols == TileRight::Rows`, `TileRight::Cols == TileRes::Cols`.
+    - Fractal/layout constraints are enforced:
+        - Left: `Loc == Left`, `!isRowMajor`, `SFractal == RowMajor`
+        - Right: `Loc == Right`, `isRowMajor`, `SFractal == ColMajor`
+        - Acc: `Loc == Acc`, `!isRowMajor`, `SFractal == RowMajor`
+    - Runtime: `m/k/n` (taken from `aMatrix.GetValidRow()`, `aMatrix.GetValidCol()`, `bMatrix.GetValidCol()`) must be in `[1, 4095]`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c;
+  TMATMUL(c, a, b);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using A = TileLeft<half, 16, 16>;
+  using B = TileRight<half, 16, 16>;
+  using C = TileAcc<float, 16, 16>;
+  A a;
+  B b;
+  C c;
+  TASSIGN(a, 0x1000);
+  TASSIGN(b, 0x2000);
+  TASSIGN(c, 0x3000);
+  TMATMUL(c, a, b);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMAX.md b/designs/outerCube/PTOISA/TMAX.md
new file mode 100644
index 00000000..b46691b1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMAX.md
@@ -0,0 +1,135 @@
+# TMAX
+
+## Tile Operation Diagram
+
+![TMAX tile operation](../figures/isa/TMAX.svg)
+
+## Introduction
+
+Elementwise maximum of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
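+
+For illustration, with a 2x2 valid region and hypothetical input values:
+
+$$ \mathrm{src0} = \begin{pmatrix} 1 & -2 \\ 3 & 0 \end{pmatrix}, \quad \mathrm{src1} = \begin{pmatrix} 0 & 5 \\ 3 & -1 \end{pmatrix} \quad \Rightarrow \quad \mathrm{dst} = \begin{pmatrix} 1 & 5 \\ 3 & 0 \end{pmatrix} $$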
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmax %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMAX(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMAX(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmax %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMAXS.md b/designs/outerCube/PTOISA/TMAXS.md
new file mode 100644
index 00000000..268aa2ff
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMAXS.md
@@ -0,0 +1,104 @@
+# TMAXS
+
+## Tile Operation Diagram
+
+![TMAXS tile operation](../figures/isa/TMAXS.svg)
+
+## Introduction
+
+Elementwise max of a tile and a scalar: `max(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
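+
+As a worked instance with hypothetical values, take `scalar = 0` over a 2x2 valid region:
+
+$$ \mathrm{src} = \begin{pmatrix} -1.5 & 2 \\ 0.5 & -3 \end{pmatrix}, \quad \mathrm{scalar} = 0 \quad \Rightarrow \quad \mathrm{dst} = \begin{pmatrix} 0 & 2 \\ 0.5 & 0 \end{pmatrix} $$
+
+With a zero scalar the operation coincides with the usual ReLU, $\max(x, 0)$.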
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMAXS(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst` and `src` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TMAXS(out, x, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMAXS_zh.md b/designs/outerCube/PTOISA/TMAXS_zh.md
new file mode 100644
index 00000000..a77d73c8
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMAXS_zh.md
@@ -0,0 +1,104 @@
+# TMAXS
+
+## Tile Operation Diagram
+
+![TMAXS tile operation](../figures/isa/TMAXS.svg)
+
+## Introduction
+
+Elementwise max of a tile and a scalar: `max(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMAXS(TileDataDst& dst, TileDataSrc& src, typename TileDataSrc::DType scalar, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`, `bfloat16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst` and `src` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TMAXS(out, x, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmaxs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmaxs %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tmaxs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMAX_zh.md b/designs/outerCube/PTOISA/TMAX_zh.md
new file mode 100644
index 00000000..ee11f117
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMAX_zh.md
@@ -0,0 +1,108 @@
+# TMAX
+
+## Tile Operation Diagram
+
+![TMAX tile operation](../figures/isa/TMAX.svg)
+
+## Introduction
+
+Elementwise maximum of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmax %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMAX(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMAX(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMIN.md b/designs/outerCube/PTOISA/TMIN.md
new file mode 100644
index 00000000..5feca244
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMIN.md
@@ -0,0 +1,135 @@
+# TMIN
+
+## Tile Operation Diagram
+
+![TMIN tile operation](../figures/isa/TMIN.svg)
+
+## Introduction
+
+Elementwise minimum of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
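+
+For illustration, with a 2x2 valid region and hypothetical input values:
+
+$$ \mathrm{src0} = \begin{pmatrix} 1 & -2 \\ 3 & 0 \end{pmatrix}, \quad \mathrm{src1} = \begin{pmatrix} 0 & 5 \\ 3 & -1 \end{pmatrix} \quad \Rightarrow \quad \mathrm{dst} = \begin{pmatrix} 0 & -2 \\ 3 & -1 \end{pmatrix} $$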
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmin %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMIN(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMIN(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmin %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMINS.md b/designs/outerCube/PTOISA/TMINS.md
new file mode 100644
index 00000000..0ded1cc5
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMINS.md
@@ -0,0 +1,121 @@
+# TMINS
+
+## Tile Operation Diagram
+
+![TMINS tile operation](../figures/isa/TMINS.svg)
+
+## Introduction
+
+Elementwise minimum of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \min(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
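+
+As a worked instance with hypothetical values, take `scalar = 0` over a 2x2 valid region:
+
+$$ \mathrm{src} = \begin{pmatrix} -1.5 & 2 \\ 0.5 & -3 \end{pmatrix}, \quad \mathrm{scalar} = 0 \quad \Rightarrow \quad \mathrm{dst} = \begin{pmatrix} -1.5 & 0 \\ 0 & -3 \end{pmatrix} $$
+
+In other words, the scalar acts as an upper clamp: every element greater than `scalar` is replaced by it.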
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmins %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Runtime: `src.GetValidCol() == dst.GetValidCol()`.
+- **Common constraints**:
+    - `dst` and `src` must use the same element type.
+    - Scalar type must match the tile data type.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMINS(dst, src, 0.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMINS(dst, src, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmins %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMINS_zh.md b/designs/outerCube/PTOISA/TMINS_zh.md
new file mode 100644
index 00000000..c53c5ddb
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMINS_zh.md
@@ -0,0 +1,121 @@
+# TMINS
+
+## Tile Operation Diagram
+
+![TMINS tile operation](../figures/isa/TMINS.svg)
+
+## Introduction
+
+Elementwise minimum of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \min(\mathrm{src}_{i,j}, \mathrm{scalar}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tmins %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMINS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Runtime: `src.GetValidCol() == dst.GetValidCol()`.
+- **Common constraints**:
+    - `dst` and `src` must use the same element type.
+    - Scalar type must match the tile data type.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMINS(dst, src, 0.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMINS(dst, src, 0.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmins %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmins %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tmins ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMIN_zh.md b/designs/outerCube/PTOISA/TMIN_zh.md
new file mode 100644
index 00000000..33ecee2e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMIN_zh.md
@@ -0,0 +1,108 @@
+# TMIN
+
+## Tile Operation Diagram
+
+![TMIN tile operation](../figures/isa/TMIN.svg)
+
+## Introduction
+
+Elementwise minimum of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmin %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMIN(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMIN(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMOV.md b/designs/outerCube/PTOISA/TMOV.md
new file mode 100644
index 00000000..6fa16879
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMOV.md
@@ -0,0 +1,161 @@
+# TMOV
+
+## Tile Operation Diagram
+
+![TMOV tile operation](../figures/isa/TMOV.svg)
+
+## Introduction
+
+Move/copy between tiles, optionally applying implementation-defined conversion modes selected by template parameters and overloads.
+
+`TMOV` is used for:
+
+- Vec -> Vec moves
+- Mat -> Left/Right/Bias/Scaling/Scale (Microscaling) moves (target-dependent)
+- Acc -> Vec moves (target-dependent)
+
+## Math Interpretation
+
+Conceptually copies or transforms elements from `src` into `dst` over the valid region. Exact transformation depends on the selected mode and target.
+
+For the pure copy case:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+The PTO AS design recommends splitting `TMOV` into a family of ops:
+
+```text
+%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
+%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
+%bias  = tmov.m2b %mat  : !pto.tile<...> -> !pto.tile<...>
+%scale = tmov.m2s %mat  : !pto.tile<...> -> !pto.tile<...>
+%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
+%v1    = tmov.v2v %v0   : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, AccToVecMode mode,
+          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Shape rules:
+        - Shapes must match: `SrcTileData::Rows == DstTileData::Rows` and `SrcTileData::Cols == DstTileData::Cols`.
+    - Supported location pairs (compile-time checked):
+        - `Mat -> Left/Right/Bias/Scaling`
+        - `Vec -> Vec`
+        - `Acc -> Mat`
+    - Additional checks by path:
+        - `Acc -> Mat`: additional fractal and dtype constraints are enforced (for example, `Acc` uses an NZ-like fractal, `Mat` uses a 512B fractal, and only specific dtype conversions are allowed).
+- **Implementation checks (A5)**:
+    - Shape rules:
+        - For `Mat -> Left/Right/Bias/Scaling/Scale`, shapes must match.
+        - For `Vec -> Vec` and `Vec -> Mat`, the effective copy region may be determined by the valid rows/cols of source and destination.
+    - Supported location pairs include (target-dependent):
+        - `Mat -> Left/Right/Bias/Scaling/Scale`
+        - `Vec -> Vec/Mat`
+        - `Acc -> Vec/Mat`
+    - `Acc -> Vec` supports additional `AccToVecMode` forms; some forms also take `FpTileData` or `preQuantScalar`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMOV(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT mat;
+  DstT left;
+  TASSIGN(mat, 0x1000);
+  TASSIGN(left, 0x2000);
+  TMOV(left, mat);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
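
The pure-copy case in `TMOV.md` above (`dst[i][j] = src[i][j]` over the valid region, with matching static shapes) can be sketched as a host-side reference model. This is a minimal sketch, not the PTO implementation: `RefTile` and `ref_tmov_copy` are hypothetical names, and taking the copy region as the intersection of the source and destination valid regions is an assumption based on the A5 note that the effective region "may be determined by the valid rows/cols of source and destination".

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tile; this is NOT the pto::Tile type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct RefTile {
  static constexpr std::size_t kRows = Rows;
  static constexpr std::size_t kCols = Cols;
  std::array<T, Rows * Cols> data{};  // row-major storage, zero-initialized
  std::size_t validRow = Rows;        // runtime valid region, at most Rows x Cols
  std::size_t validCol = Cols;
  T &at(std::size_t i, std::size_t j) { return data[i * Cols + j]; }
  const T &at(std::size_t i, std::size_t j) const { return data[i * Cols + j]; }
};

// Reference model of the pure-copy case only (no conversion modes).
template <typename Dst, typename Src>
void ref_tmov_copy(Dst &dst, const Src &src) {
  // Mirrors the documented shape rule: static Rows/Cols must match.
  static_assert(Dst::kRows == Src::kRows && Dst::kCols == Src::kCols,
                "TMOV pure copy: static shapes must match");
  // Assumption: copy only where both valid regions overlap.
  const std::size_t rows = std::min(dst.validRow, src.validRow);
  const std::size_t cols = std::min(dst.validCol, src.validCol);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      dst.at(i, j) = src.at(i, j);
    }
  }
}
```

Elements outside the assumed copy region are left untouched, so a model like this is only useful for checking the in-region values produced by a real `TMOV`.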
diff --git a/designs/outerCube/PTOISA/TMOV_FP.md b/designs/outerCube/PTOISA/TMOV_FP.md
new file mode 100644
index 00000000..620431ff
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMOV_FP.md
@@ -0,0 +1,141 @@
+# TMOV_FP
+
+
+## Tile Operation Diagram
+
+![TMOV_FP tile operation](../figures/isa/TMOV_FP.svg)
+
+## Introduction
+
+Move/convert from an accumulator tile into a destination tile, using a scaling (`fp`) tile for vector quantization parameters.
+
+`TMOV_FP` is a named wrapper around the `TMOV_IMPL(..., fp)` path and is part of the `TMOV` family (see [TMOV](TMOV.md)).
+
+## Math Interpretation
+
+Conceptually converts each element using an implementation-defined quantization/dequantization configuration derived from `fp`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - The fp path is only supported for accumulator conversion and is validated by internal compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+- **Implementation checks (A5)**:
+    - Validated by `CheckTMovAccValid(...)` and related compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+    - Destination location is target-dependent (`Vec` or `Mat` are supported in the fp path).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using AccT = TileAcc<float, 16, 16>;
+  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
+
+  AccT acc;
+  DstT dst;
+  FpT fp;
+  TMOV_FP(dst, acc, fp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using AccT = TileAcc<float, 16, 16>;
+  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
+
+  AccT acc;
+  DstT dst;
+  FpT fp;
+  TASSIGN(acc, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(fp,  0x3000);
+  TMOV_FP(dst, acc, fp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMOV_FP_zh.md b/designs/outerCube/PTOISA/TMOV_FP_zh.md
new file mode 100644
index 00000000..28ed5751
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMOV_FP_zh.md
@@ -0,0 +1,112 @@
+# TMOV_FP
+
+## 指令示意图
+
+![TMOV_FP tile operation](../figures/isa/TMOV_FP.svg)
+
+## 简介
+
+使用缩放 (`fp`) Tile 作为向量量化参数，将累加器 Tile 移动/转换到目标 Tile。
+
+## 数学语义
+
+Conceptually converts each element using an implementation-defined quantization/dequantization configuration derived from `fp`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename FpTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - The fp path is only supported for accumulator conversion and is validated by internal compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+- **实现检查 (A5)**:
+    - Validated by `CheckTMovAccValid(...)` and related compile-time checks in `TMOV_IMPL(dst, src, fp)`.
+    - `FpTileData::Loc` must be `TileType::Scaling` (`static_assert`).
+    - Destination location is target-dependent (`Vec` or `Mat` are supported in the fp path).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using AccT = TileAcc<float, 16, 16>;
+  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
+
+  AccT acc;
+  DstT dst;
+  FpT fp;
+  TMOV_FP(dst, acc, fp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using AccT = TileAcc<float, 16, 16>;
+  using DstT = Tile<TileType::Vec, int8_t, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, 16, SLayout::NoneBox>;
+
+  AccT acc;
+  DstT dst;
+  FpT fp;
+  TASSIGN(acc, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(fp,  0x3000);
+  TMOV_FP(dst, acc, fp);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMOV_zh.md b/designs/outerCube/PTOISA/TMOV_zh.md
new file mode 100644
index 00000000..92f4edd2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMOV_zh.md
@@ -0,0 +1,161 @@
+# TMOV
+
+## 指令示意图
+
+![TMOV tile operation](../figures/isa/TMOV.svg)
+
+## 简介
+
+在 Tile 之间移动/复制，可选通过模板参数和重载选择实现定义的转换模式。
+
+`TMOV` 用于：
+
+- Vec -> Vec 移动
+- Mat -> Left/Right/Bias/Scaling/Scale（微缩放）移动（取决于目标）
+- Acc -> Vec 移动（取决于目标）
+
+## 数学语义
+
+概念上在有效区域上将元素从 `src` 复制或转换到 `dst`。确切的转换取决于所选模式和目标。
+
+对于纯复制情况：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+PTO AS 设计建议将 `TMOV` 拆分为一系列操作：
+
+```text
+%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
+%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
+%bias  = tmov.m2b %mat  : !pto.tile<...> -> !pto.tile<...>
+%scale = tmov.m2s %mat  : !pto.tile<...> -> !pto.tile<...>
+%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
+%v1    = tmov.v2v %v0   : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp` 和 `include/pto/common/constants.hpp`：
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename FpTileData, AccToVecMode mode,
+          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 形状规则：
+        - 形状必须匹配：`SrcTileData::Rows == DstTileData::Rows` 且 `SrcTileData::Cols == DstTileData::Cols`。
+    - 支持的位置对（编译时检查）：
+        - `Mat -> Left/Right/Bias/Scaling`
+        - `Vec -> Vec`
+        - `Acc -> Mat`
+    - 按路径附加检查如下：
+        - `Acc -> Mat`：会额外检查分形与数据类型约束（例如 `Acc` 使用类 NZ 分形，`Mat` 使用 512B 分形，且仅允许特定的数据类型转换）。
+- **实现检查 (A5)**:
+    - 形状规则：
+        - 对于 `Mat -> Left/Right/Bias/Scaling/Scale`，形状必须匹配。
+        - 对于 `Vec -> Vec` 和 `Vec -> Mat`，实际复制区域可能由源和目的的有效行/列共同决定。
+    - 支持的位置对包括（取决于目标）：
+        - `Mat -> Left/Right/Bias/Scaling/Scale`
+        - `Vec -> Vec/Mat`
+        - `Acc -> Vec/Mat`
+    - `Acc -> Vec` 还支持额外的 `AccToVecMode` 形式；其中部分形式还会结合 `FpTileData` 或 `preQuantScalar` 使用。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMOV(dst, src);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Mat, float, 16, 16, BLayout::RowMajor, 16, 16, SLayout::ColMajor>;
+  using DstT = TileLeft<float, 16, 16>;
+  SrcT mat;
+  DstT left;
+  TASSIGN(mat, 0x1000);
+  TASSIGN(left, 0x2000);
+  TMOV(left, mat);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TMRGSORT.md b/designs/outerCube/PTOISA/TMRGSORT.md
new file mode 100644
index 00000000..f76f1889
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMRGSORT.md
@@ -0,0 +1,155 @@
+# TMRGSORT
+
+
+## Tile Operation Diagram
+
+![TMRGSORT tile operation](../figures/isa/TMRGSORT.svg)
+
+## Introduction
+
+Merge sort for multiple sorted lists (implementation-defined element format and layout).
+
+## Math Interpretation
+
+Merges sorted input lists into `dst`. Ordering, element format (e.g., value/index pairs), and the meaning of executed counts depend on the implementation.
+
+$$ \mathrm{dst} = \mathrm{merge}(\mathrm{src}_0, \mathrm{src}_1, \ldots) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form (conceptual):
+
+```text
+%dst, %executed = tmrgsort %src0, %src1 {exhausted = false}
+    : !pto.tile<...>, !pto.tile<...> -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
+ : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
+pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
+ : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
+pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
+          typename Src2TileData, typename Src3TileData, bool exhausted, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, Src3TileData &src3, WaitEvents &... events);
+
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
+          typename Src2TileData, bool exhausted, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, WaitEvents &... events);
+
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData, bool exhausted,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, SrcTileData &src, uint32_t blockLen, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3/A5)**:
+    - Element type must be `half` or `float` and must match across `dst/tmp/src*` tiles.
+    - All tiles must be `TileType::Vec`, row-major, and have `Rows == 1` (list stored in a single row).
+    - UB memory usage is checked (compile-time and runtime) against target limits (single `Cols` across inputs plus `tmp`/`dst`).
+- **Single-list variant (`TMRGSORT(dst, src, blockLen)`)**:
+    - `blockLen` must be a multiple of 64 (as checked by the implementation).
+    - `src.GetValidCol()` must be an integer multiple of `blockLen * 4`.
+    - `repeatTimes = src.GetValidCol() / (blockLen * 4)` must be in `[1, 255]`.
+- **Multi-list variants**:
+    - `tmp` is required and `executedNumList` is written by the implementation; supported list counts and exact semantics are target-defined.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 1, 256>;
+  using DstT = Tile<TileType::Vec, float, 1, 256>;
+  SrcT src;
+  DstT dst;
+  TMRGSORT(dst, src, /*blockLen=*/64);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 1, 256>;
+  using DstT = Tile<TileType::Vec, float, 1, 256>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMRGSORT(dst, src, /*blockLen=*/64);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMRGSORT_zh.md b/designs/outerCube/PTOISA/TMRGSORT_zh.md
new file mode 100644
index 00000000..1b3dcc36
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMRGSORT_zh.md
@@ -0,0 +1,128 @@
+# TMRGSORT
+
+## 指令示意图
+
+![TMRGSORT tile operation](../figures/isa/TMRGSORT.svg)
+
+## 简介
+
+用于多个已排序列表的归并排序（实现定义的元素格式和布局）。
+
+## 数学语义
+
+Merges sorted input lists into `dst`. Ordering, element format (e.g., value/index pairs), and the meaning of executed counts depend on the implementation.
+
+$$ \mathrm{dst} = \mathrm{merge}(\mathrm{src}_0, \mathrm{src}_1, \ldots) $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form (conceptual):
+
+```text
+%dst, %executed = tmrgsort %src0, %src1 {exhausted = false}
+    : !pto.tile<...>, !pto.tile<...> -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
+ : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
+pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tmrgsort %src, %blockLen : (!pto.tile<...>, dtype) -> !pto.tile<...>
+%dst, %executed = pto.tmrgsort %src0, %src1, %src2, %src3 {exhausted = false}
+ : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, vector<4xi16>)
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tmrgsort ins(%src, %blockLen : !pto.tile_buf<...>, dtype)  outs(%dst : !pto.tile_buf<...>)
+pto.tmrgsort ins(%src0, %src1, %src2, %src3 {exhausted = false} : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>)
+outs(%dst, %executed : !pto.tile_buf<...>, vector<4xi16>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
+          typename Src2TileData, typename Src3TileData, bool exhausted, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, Src3TileData &src3, WaitEvents &... events);
+
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData,
+          typename Src2TileData, bool exhausted, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, Src2TileData &src2, WaitEvents &... events);
+
+template <typename DstTileData, typename TmpTileData, typename Src0TileData, typename Src1TileData, bool exhausted,
+          typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, MrgSortExecutedNumList &executedNumList, TmpTileData &tmp, Src0TileData &src0, Src1TileData &src1, WaitEvents &... events);
+
+template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
+PTO_INST RecordEvent TMRGSORT(DstTileData &dst, SrcTileData &src, uint32_t blockLen, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3/A5)**:
+    - Element type must be `half` or `float` and must match across `dst/tmp/src*` tiles.
+    - All tiles must be `TileType::Vec`, row-major, and have `Rows == 1` (list stored in a single row).
+    - UB memory usage is checked (compile-time and runtime) against target limits (single `Cols` across inputs plus `tmp`/`dst`).
+- **Single-list variant (`TMRGSORT(dst, src, blockLen)`)**:
+    - `blockLen` must be a multiple of 64 (as checked by the implementation).
+    - `src.GetValidCol()` must be an integer multiple of `blockLen * 4`.
+    - `repeatTimes = src.GetValidCol() / (blockLen * 4)` must be in `[1, 255]`.
+- **Multi-list variants**:
+    - `tmp` is required and `executedNumList` is written by the implementation; supported list counts and exact semantics are target-defined.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 1, 256>;
+  using DstT = Tile<TileType::Vec, float, 1, 256>;
+  SrcT src;
+  DstT dst;
+  TMRGSORT(dst, src, /*blockLen=*/64);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 1, 256>;
+  using DstT = Tile<TileType::Vec, float, 1, 256>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMRGSORT(dst, src, /*blockLen=*/64);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMUL.md b/designs/outerCube/PTOISA/TMUL.md
new file mode 100644
index 00000000..b4cfc637
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMUL.md
@@ -0,0 +1,135 @@
+# TMUL
+
+
+## Tile Operation Diagram
+
+![TMUL tile operation](../figures/isa/TMUL.svg)
+
+## Introduction
+
+Elementwise multiply of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmul %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMUL(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMUL(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmul %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMULS.md b/designs/outerCube/PTOISA/TMULS.md
new file mode 100644
index 00000000..c163afb3
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMULS.md
@@ -0,0 +1,134 @@
+# TMULS
+
+
+## Tile Operation Diagram
+
+![TMULS tile operation](../figures/isa/TMULS.svg)
+
+## Introduction
+
+Elementwise multiply a tile by a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \cdot \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmuls %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMULS(dst, src, 2.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMULS(dst, src, 2.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tmuls %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TMULS_zh.md b/designs/outerCube/PTOISA/TMULS_zh.md
new file mode 100644
index 00000000..20425d19
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMULS_zh.md
@@ -0,0 +1,107 @@
+# TMULS
+
+## Tile Operation Diagram
+
+![TMULS tile operation](../figures/isa/TMULS.svg)
+
+## Introduction
+
+Elementwise multiply a tile by a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \cdot \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmuls %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmuls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmuls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TMULS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TMULS(dst, src, 2.0f);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TMULS(dst, src, 2.0f);
+}
+```
diff --git a/designs/outerCube/PTOISA/TMUL_zh.md b/designs/outerCube/PTOISA/TMUL_zh.md
new file mode 100644
index 00000000..14b6351a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TMUL_zh.md
@@ -0,0 +1,108 @@
+# TMUL
+
+## Tile Operation Diagram
+
+![TMUL tile operation](../figures/isa/TMUL.svg)
+
+## Introduction
+
+Elementwise multiplication of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tmul %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tmul %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `uint32_t`, `float`, `int16_t`, `uint16_t`, `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TMUL(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TMUL(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TNEG.md b/designs/outerCube/PTOISA/TNEG.md
new file mode 100644
index 00000000..c29ae21b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TNEG.md
@@ -0,0 +1,103 @@
+# TNEG
+
+
+## Tile Operation Diagram
+
+![TNEG tile operation](../figures/isa/TNEG.svg)
+
+## Introduction
+
+Elementwise negation of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = -\mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tneg %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TNEG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TNEG(out, x);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tneg %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TNEG_zh.md b/designs/outerCube/PTOISA/TNEG_zh.md
new file mode 100644
index 00000000..ae533466
--- /dev/null
+++ b/designs/outerCube/PTOISA/TNEG_zh.md
@@ -0,0 +1,76 @@
+# TNEG
+
+## Tile Operation Diagram
+
+![TNEG tile operation](../figures/isa/TNEG.svg)
+
+## Introduction
+
+Elementwise negation of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = -\mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tneg %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tneg %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tneg ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TNEG(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TNEG(out, x);
+}
+```
diff --git a/designs/outerCube/PTOISA/TNOT.md b/designs/outerCube/PTOISA/TNOT.md
new file mode 100644
index 00000000..84c439fd
--- /dev/null
+++ b/designs/outerCube/PTOISA/TNOT.md
@@ -0,0 +1,119 @@
+# TNOT
+
+
+## Tile Operation Diagram
+
+![TNOT tile operation](../figures/isa/TNOT.svg)
+
+## Introduction
+
+Elementwise bitwise NOT of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \sim\mathrm{src}_{i,j} $$
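+
+As a worked illustration (values chosen for this example only), for a `uint16_t` element:
+
+$$ \mathrm{src}_{i,j} = \texttt{0x00FF} \;\Rightarrow\; \mathrm{dst}_{i,j} = \sim\texttt{0x00FF} = \texttt{0xFF00} $$
+
+For a signed type such as `int16_t`, assuming the usual two's-complement representation, the same operation satisfies `~x == -x - 1`, so an element holding `5` becomes `-6`.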
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tnot %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TNOT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileT x, out;
+  TNOT(out, x);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tnot %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TNOT_zh.md b/designs/outerCube/PTOISA/TNOT_zh.md
new file mode 100644
index 00000000..3c88eabb
--- /dev/null
+++ b/designs/outerCube/PTOISA/TNOT_zh.md
@@ -0,0 +1,89 @@
+# TNOT
+
+## Tile Operation Diagram
+
+![TNOT tile operation](../figures/isa/TNOT.svg)
+
+## Introduction
+
+Elementwise bitwise NOT of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \sim\mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tnot %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tnot %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tnot ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TNOT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileT x, out;
+  TNOT(out, x);
+}
+```
diff --git a/designs/outerCube/PTOISA/TOR.md b/designs/outerCube/PTOISA/TOR.md
new file mode 100644
index 00000000..ece1123c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TOR.md
@@ -0,0 +1,103 @@
+# TOR
+
+
+## Tile Operation Diagram
+
+![TOR tile operation](../figures/isa/TOR.svg)
+
+## Introduction
+
+Elementwise bitwise OR of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;|\; \mathrm{src1}_{i,j} $$
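+
+As a worked illustration (values chosen for this example only), for a single pair of integer elements:
+
+$$ \mathrm{src0}_{i,j} = \texttt{0b1100}, \quad \mathrm{src1}_{i,j} = \texttt{0b1010} \;\Rightarrow\; \mathrm{dst}_{i,j} = \texttt{0b1100} \;|\; \texttt{0b1010} = \texttt{0b1110} $$
+
+In decimal this is `12 | 10 = 14`: a result bit is set when it is set in either input.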
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tor %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TOR(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are 1-byte or 2-byte integral types.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A5)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT a, b, out;
+  TOR(out, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tor %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TORS.md b/designs/outerCube/PTOISA/TORS.md
new file mode 100644
index 00000000..8bd53389
--- /dev/null
+++ b/designs/outerCube/PTOISA/TORS.md
@@ -0,0 +1,106 @@
+# TORS
+
+
+## Tile Operation Diagram
+
+![TORS tile operation](../figures/isa/TORS.svg)
+
+## Introduction
+
+Elementwise bitwise OR of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;|\; \mathrm{scalar} $$
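+
+As a worked illustration (values chosen for this example only), for a `uint16_t` element and `scalar = 0x00FF`:
+
+$$ \mathrm{src}_{i,j} = \texttt{0x1234} \;\Rightarrow\; \mathrm{dst}_{i,j} = \texttt{0x1234} \;|\; \texttt{0x00FF} = \texttt{0x12FF} $$
+
+The low byte is forced to all ones while the high byte passes through unchanged.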
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tors %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Intended for integral element types.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Implementation checks (A5)**:
+    - Intended for integral element types supported by `TEXPANDS` and `TOR`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TORS(dst, src, 0xffu);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tors %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TORS_zh.md b/designs/outerCube/PTOISA/TORS_zh.md
new file mode 100644
index 00000000..65c938fb
--- /dev/null
+++ b/designs/outerCube/PTOISA/TORS_zh.md
@@ -0,0 +1,106 @@
+# TORS
+
+## Tile Operation Diagram
+
+![TORS tile operation](../figures/isa/TORS.svg)
+
+## Introduction
+
+Elementwise bitwise OR of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \;|\; \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tors %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TORS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
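+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Intended for integral element types.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Implementation checks (A5)**:
+    - Intended for integral element types supported by `TEXPANDS` and `TOR`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - In manual mode, setting the source tile and destination tile to the same memory is unsupported.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.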
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TORS(dst, src, 0xffu);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tors %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TOR_zh.md b/designs/outerCube/PTOISA/TOR_zh.md
new file mode 100644
index 00000000..8ad38d80
--- /dev/null
+++ b/designs/outerCube/PTOISA/TOR_zh.md
@@ -0,0 +1,103 @@
+# TOR
+
+## Tile Operation Diagram
+
+![TOR tile operation](../figures/isa/TOR.svg)
+
+## Introduction
+
+Elementwise bitwise OR of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \;|\; \mathrm{src1}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tor %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TOR(TileData &dst, TileData &src0, TileData &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are 1-byte or 2-byte integral types.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A5)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT a, b, out;
+  TOR(out, a, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tor %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TPACK.md b/designs/outerCube/PTOISA/TPACK.md
new file mode 100644
index 00000000..b96033de
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPACK.md
@@ -0,0 +1,40 @@
+# TPACK
+
+## Tile Operation Diagram
+
+![TPACK tile operation](../figures/isa/TPACK.svg)
+
+## Introduction
+
+Pack or convert tile elements into a narrower destination representation.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpack ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpack ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TPACK_zh.md b/designs/outerCube/PTOISA/TPACK_zh.md
new file mode 100644
index 00000000..71a90d2c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPACK_zh.md
@@ -0,0 +1,41 @@
+# TPACK
+
+## Tile Operation Diagram
+
+![TPACK tile operation](../figures/isa/TPACK.svg)
+
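+## Introduction
+
+Pack or convert tile elements into a narrower destination representation.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.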
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpack ...
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpack ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TPARTADD.md b/designs/outerCube/PTOISA/TPARTADD.md
new file mode 100644
index 00000000..8409953f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTADD.md
@@ -0,0 +1,137 @@
+# TPARTADD
+
+
+## Tile Operation Diagram
+
+![TPARTADD tile operation](../figures/isa/TPARTADD.svg)
+
+## Introduction
+
+Partial elementwise add with implementation-defined handling of mismatched valid regions.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
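+
+As a worked illustration (values chosen for this example only), suppose `dst` and `src0` both have a `2 x 3` valid region while `src1` is valid only in row 0:
+
+$$
+\mathrm{src0} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, \quad
+\mathrm{src1} = \begin{pmatrix} 10 & 20 & 30 \end{pmatrix}
+\;\Rightarrow\;
+\mathrm{dst} = \begin{pmatrix} 11 & 22 & 33 \\ 4 & 5 & 6 \end{pmatrix}
+$$
+
+Row 0 falls under the first case (both inputs defined); row 1 falls under the second case (only `src0` defined).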
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **Implementation checks (A5)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Runtime: if `dst` has a zero valid region, the op returns early.
+    - Only certain partial-validity patterns are handled (e.g., one source's valid region equals `dst`'s while the other is smaller in valid rows or valid columns); other patterns are not supported (target-defined behavior).
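+
+The A2A3 runtime rule on valid regions can be expressed as a small predicate. This is only an illustrative sketch: `ValidRegion`, `same_region`, `fits_within`, and `a2a3_partial_shape_ok` are hypothetical names that do not come from the PTO headers, and the sketch assumes that "not greater than" is compared per dimension.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+
+// Hypothetical valid-region descriptor (not a PTO type).
+struct ValidRegion {
+  std::size_t rows;
+  std::size_t cols;
+};
+
+inline bool same_region(const ValidRegion &a, const ValidRegion &b) {
+  return a.rows == b.rows && a.cols == b.cols;
+}
+
+// Assumption: "not greater than" is checked per dimension.
+inline bool fits_within(const ValidRegion &a, const ValidRegion &b) {
+  return a.rows <= b.rows && a.cols <= b.cols;
+}
+
+// True when one source matches dst exactly and the other is no larger than dst.
+inline bool a2a3_partial_shape_ok(const ValidRegion &dst,
+                                  const ValidRegion &src0,
+                                  const ValidRegion &src1) {
+  return (same_region(src0, dst) && fits_within(src1, dst)) ||
+         (same_region(src1, dst) && fits_within(src0, dst));
+}
+```
+
+A caller could evaluate such a predicate before issuing the op to avoid hitting the runtime assertion on shape combinations the backend rejects.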
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTADD(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTADD(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPARTADD_zh.md b/designs/outerCube/PTOISA/TPARTADD_zh.md
new file mode 100644
index 00000000..dc0569b5
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTADD_zh.md
@@ -0,0 +1,110 @@
+# TPARTADD
+
+## 指令示意图
+
+![TPARTADD tile operation](../figures/isa/TPARTADD.svg)
+
+## 简介
+
+部分逐元素加法，对不匹配的有效区域具有实现定义的处理方式。
+
+## 数学语义
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tpartadd %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **实现检查 (A5)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, `int32_t`, `half`, `float`, `bfloat16_t`.
+    - Runtime: if `dst` has a zero valid region, the op returns early.
+    - Only certain partial-validity patterns are handled (e.g., one source's valid region equals `dst`'s while the other is smaller in valid rows or valid columns); other patterns are not supported (target-defined behavior).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTADD(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTADD(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPARTMAX.md b/designs/outerCube/PTOISA/TPARTMAX.md
new file mode 100644
index 00000000..a0762900
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMAX.md
@@ -0,0 +1,137 @@
+# TPARTMAX
+
+
+## Tile Operation Diagram
+
+![TPARTMAX tile operation](../figures/isa/TPARTMAX.svg)
+
+## Introduction
+
+Partial elementwise max with implementation-defined handling of mismatched valid regions.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **Implementation checks (A5)**:
+    - `dst/src0/src1` element types must be identical and must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - Runtime: if any of `src0/src1/dst` has a zero valid region, the op returns early.
+    - Requires the valid regions of `src0` and `src1` to be no larger than `dst`'s valid region in both dimensions; other patterns are not supported (target-defined behavior).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMAX(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMAX(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPARTMAX_zh.md b/designs/outerCube/PTOISA/TPARTMAX_zh.md
new file mode 100644
index 00000000..a2916a9f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMAX_zh.md
@@ -0,0 +1,110 @@
+# TPARTMAX
+
+## 指令示意图
+
+![TPARTMAX tile operation](../figures/isa/TPARTMAX.svg)
+
+## 简介
+
+部分逐元素最大值，对不匹配的有效区域具有实现定义的处理方式。
+
+## 数学语义
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\max(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tpartmax %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmax %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **实现检查 (A5)**:
+    - `dst/src0/src1` element types must be identical and must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - Runtime: if any of `src0/src1/dst` has a zero valid region, the op returns early.
+    - Requires the valid regions of `src0` and `src1` to be no larger than `dst`'s valid region in both dimensions; other patterns are not supported (target-defined behavior).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMAX(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMAX(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPARTMIN.md b/designs/outerCube/PTOISA/TPARTMIN.md
new file mode 100644
index 00000000..fd9a8502
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMIN.md
@@ -0,0 +1,137 @@
+# TPARTMIN
+
+
+## Tile Operation Diagram
+
+![TPARTMIN tile operation](../figures/isa/TPARTMIN.svg)
+
+## Introduction
+
+Partial elementwise min with implementation-defined handling of mismatched valid regions.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **Implementation checks (A5)**:
+    - `dst/src0/src1` element types must be identical and must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - Runtime: if any of `src0/src1/dst` has a zero valid region, the op returns early.
+    - Requires the valid regions of `src0` and `src1` to be no larger than `dst`'s valid region in both dimensions; other patterns are not supported (target-defined behavior).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMIN(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMIN(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPARTMIN_zh.md b/designs/outerCube/PTOISA/TPARTMIN_zh.md
new file mode 100644
index 00000000..89b0603a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMIN_zh.md
@@ -0,0 +1,110 @@
+# TPARTMIN
+
+## 指令示意图
+
+![TPARTMIN tile operation](../figures/isa/TPARTMIN.svg)
+
+## 简介
+
+部分逐元素最小值，对不匹配的有效区域具有实现定义的处理方式。
+
+## 数学语义
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\min(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tpartmin %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmin %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `dst/src0/src1` element types must be identical, and must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - All three tiles must be row-major (`isRowMajor`).
+    - Runtime: if `dst.GetValidRow() == 0` or `dst.GetValidCol() == 0`, the op returns early.
+    - Runtime: at least one input's valid region must match `dst`'s valid region, and the other input's valid region must be no larger than `dst`'s valid region; otherwise the implementation asserts.
+- **实现检查 (A5)**:
+    - `dst/src0/src1` element types must be identical and must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - Runtime: if any of `src0/src1/dst` has a zero valid region, the op returns early.
+    - Requires the valid regions of `src0` and `src1` to be no larger than `dst`'s valid region in both dimensions; other patterns are not supported (target-defined behavior).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMIN(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMIN(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPARTMUL.md b/designs/outerCube/PTOISA/TPARTMUL.md
new file mode 100644
index 00000000..d0da3546
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMUL.md
@@ -0,0 +1,129 @@
+# TPARTMUL
+
+
+## Tile Operation Diagram
+
+![TPARTMUL tile operation](../figures/isa/TPARTMUL.svg)
+
+## Introduction
+
+Partial elementwise multiply with implementation-defined handling of mismatched valid regions.
+
+## Math Interpretation
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- Element type/layout legality follows backend checks and is analogous to `TPARTADD` / `TPARTMAX` / `TPARTMIN`.
+- Destination valid region defines the result domain.
+- Partial-validity handling is implementation-defined for unsupported shape combinations.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMUL(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMUL(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPARTMUL_zh.md b/designs/outerCube/PTOISA/TPARTMUL_zh.md
new file mode 100644
index 00000000..ca6c30c5
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPARTMUL_zh.md
@@ -0,0 +1,102 @@
+# TPARTMUL
+
+## 指令示意图
+
+![TPARTMUL tile operation](../figures/isa/TPARTMUL.svg)
+
+## 简介
+
+部分逐元素乘法，对有效区域不一致的处理为实现定义。
+
+## 数学语义
+
+For each element `(i, j)` in the destination valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{if both inputs are defined at } (i,j) \\
+\mathrm{src0}_{i,j} & \text{if only src0 is defined at } (i,j) \\
+\mathrm{src1}_{i,j} & \text{if only src1 is defined at } (i,j)
+\end{cases}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpartmul %src0, %src1 : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TPARTMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- Element type/layout legality follows backend checks and is analogous to `TPARTADD` / `TPARTMAX` / `TPARTMIN`.
+- Destination valid region defines the result domain.
+- Partial-validity handling is implementation-defined for unsupported shape combinations.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TPARTMUL(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TPARTMUL(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPOP.md b/designs/outerCube/PTOISA/TPOP.md
new file mode 100644
index 00000000..1d49b220
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPOP.md
@@ -0,0 +1,40 @@
+# TPOP
+
+## Tile Operation Diagram
+
+![TPOP tile operation](../figures/isa/TPOP.svg)
+
+## Introduction
+
+Pop a tile from a pipe or FIFO consumer endpoint.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpop ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpop ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TPOP_zh.md b/designs/outerCube/PTOISA/TPOP_zh.md
new file mode 100644
index 00000000..1c6909d2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPOP_zh.md
@@ -0,0 +1,41 @@
+# TPOP
+
+## 指令示意图
+
+![TPOP tile operation](../figures/isa/TPOP.svg)
+
+## 简介
+
+从 pipe 或 FIFO 的消费者端弹出一个 Tile。
+
+## 数学语义
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## 汇编语法
+
+PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tpop ...
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tpop ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`.
+
+## 约束
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## 示例
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TPREFETCH.md b/designs/outerCube/PTOISA/TPREFETCH.md
new file mode 100644
index 00000000..ab7fc128
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPREFETCH.md
@@ -0,0 +1,94 @@
+# TPREFETCH
+
+
+## Tile Operation Diagram
+
+![TPREFETCH tile operation](../figures/isa/TPREFETCH.svg)
+
+## Introduction
+
+Prefetch data from global memory into a tile-local cache/buffer (implementation-defined). This is typically used to reduce latency before a subsequent `TLOAD`.
+
+Note: unlike most PTO instructions, `TPREFETCH` does **not** implicitly call `TSYNC(events...)` in the C++ wrapper.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData>
+PTO_INST RecordEvent TPREFETCH(TileData &dst, GlobalData &src);
+```
+
+## Constraints
+
+- Semantics and caching behavior are target/implementation-defined.
+- Some targets may ignore prefetches or treat them as hints.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPREFETCH_zh.md b/designs/outerCube/PTOISA/TPREFETCH_zh.md
new file mode 100644
index 00000000..2e22e6ff
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPREFETCH_zh.md
@@ -0,0 +1,65 @@
+# TPREFETCH
+
+## 指令示意图
+
+![TPREFETCH tile operation](../figures/isa/TPREFETCH.svg)
+
+## 简介
+
+将数据从全局内存预取到 Tile 本地缓存/缓冲区（提示）。
+
+## 数学语义
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tprefetch %src : !pto.global<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData>
+PTO_INST RecordEvent TPREFETCH(TileData &dst, GlobalData &src);
+```
+
+## 约束
+
+- Semantics and caching behavior are target/implementation-defined.
+- Some targets may ignore prefetches or treat them as hints.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TPRELU.md b/designs/outerCube/PTOISA/TPRELU.md
new file mode 100644
index 00000000..74c5ef05
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPRELU.md
@@ -0,0 +1,106 @@
+# TPRELU
+
+
+## Tile Operation Diagram
+
+![TPRELU tile operation](../figures/isa/TPRELU.svg)
+
+## Introduction
+
+Elementwise PReLU (parametric ReLU) with a per-element slope tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} & \text{if } \mathrm{src0}_{i,j} > 0 \\
+\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j} & \text{otherwise}
+\end{cases}
+$$
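+
+As a quick check of the formula, here is a minimal scalar sketch in plain C++. It is not the PTO implementation: `ref_prelu` is a hypothetical helper that operates on flat vectors, with `x` standing in for `src0` and `slope` for `src1`.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Elementwise PReLU on flat buffers: positive inputs pass through,
+// all other inputs are scaled by the matching slope element.
+inline std::vector<float> ref_prelu(const std::vector<float> &x,
+                                    const std::vector<float> &slope) {
+  assert(x.size() == slope.size());
+  std::vector<float> out(x.size());
+  for (std::size_t k = 0; k < x.size(); ++k) {
+    out[k] = (x[k] > 0.0f) ? x[k] : x[k] * slope[k];
+  }
+  return out;
+}
+```
+
+For example, inputs `{2, -4}` with slopes `{0.5, 0.25}` give `{2, -1}`: the positive element is unchanged and the negative one is multiplied by its slope.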
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tprelu %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TPRELU(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- A3 requires temporary space (`tmp`) for the computation; A5 does not use it.
+- On A3, the two source Tiles, the destination Tile, and the temporary space must occupy non-overlapping memory ranges.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, slope, out, tmp;
+  TPRELU(out, x, slope, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tprelu %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TPRELU_zh.md b/designs/outerCube/PTOISA/TPRELU_zh.md
new file mode 100644
index 00000000..1f5b2a8a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPRELU_zh.md
@@ -0,0 +1,79 @@
+# TPRELU
+
+## Tile Operation Diagram
+
+![TPRELU tile operation](../figures/isa/TPRELU.svg)
+
+## Introduction
+
+Elementwise parametric ReLU (PReLU) with a per-element slope Tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = (\mathrm{src0}_{i,j} > 0) ? \mathrm{src0}_{i,j} : (\mathrm{src0}_{i,j} \cdot \mathrm{src1}_{i,j}) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tprelu %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tprelu %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprelu ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TPRELU(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- A3 requires temporary space (`tmp`) for the computation; A5 does not use it.
+- On A3, the two source Tiles, the destination Tile, and the temporary space must occupy non-overlapping memory ranges.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, slope, out, tmp;
+  TPRELU(out, x, slope, tmp);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPRINT.md b/designs/outerCube/PTOISA/TPRINT.md
new file mode 100644
index 00000000..b92db8fe
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPRINT.md
@@ -0,0 +1,157 @@
+# TPRINT
+
+## Tile Operation Diagram
+
+![TPRINT tile operation](../figures/isa/TPRINT.svg)
+
+## Introduction
+
+Print the contents of a Tile or GlobalTensor for debugging purposes directly from device code.
+
+The `TPRINT` instruction outputs the logical view of data stored in a Tile or GlobalTensor. It supports common data types (e.g., `float`, `half`, `int8`, `uint32`) and multiple memory layouts (`ND`, `DN`, `NZ` for GlobalTensor; vector tiles for on-chip buffers).
+
+> **Important**:
+> - This instruction is **for development and debugging ONLY**.
+> - It incurs **significant runtime overhead** and **must not be used in production kernels**.
+> - Output may be **truncated** if it exceeds the internal print buffer.
+> - **Requires CCE compilation option `-D_DEBUG --cce-enable-print`** (see [Behavior](#behavior)).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+```text
+tprint %src : !pto.tile<...> | !pto.global<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPRINT(TileData &src, WaitEvents &... events);
+```
+
+### Supported Types for `TileData`
+- **Tile**: Must be a vector tile (`TileType::Vec`) with supported element type.
+- **GlobalTensor**: Must use layout `ND`, `DN`, or `NZ`, and have a supported element type.
+
+## Constraints
+
+- **Supported element types**:
+    - Floating-point: `float`, `half`
+    - Signed integers: `int8_t`, `int16_t`, `int32_t`
+    - Unsigned integers: `uint8_t`, `uint16_t`, `uint32_t`
+- **For Tiles**: `TileData::Loc == TileType::Vec` (only vector tiles are printable).
+- **For GlobalTensor**: Layout must be one of `Layout::ND`, `Layout::DN`, or `Layout::NZ`.
+
+## Behavior
+- **Mandatory Compilation Flag**:
+
+  On A2/A3/A5 devices, `TPRINT` uses `cce::printf` to emit output via the device-to-host debug channel. **You must enable the CCE option `-D_DEBUG --cce-enable-print`**.
+
+- **Buffer Limitation**:
+
+  The internal print buffer of `cce::printf` is limited in size. If the output exceeds this buffer, a warning message such as `"Warning: out of bound! try best to print"` may appear, and **only partial data will be printed**.
+
+- **Synchronization**:
+
+  Automatically inserts a `pipe_barrier(PIPE_ALL)` before printing to ensure all prior operations complete and data is consistent.
+
+- **Formatting**:
+
+    - Floating-point values: printed as `%6.2f`
+    - Integer values: printed as `%6d`
+    - For `GlobalTensor`, due to data size and buffer limitations, only elements within its logical shape (defined by `Shape`) are printed.
+    - For `Tile`, invalid regions (beyond `validRows`/`validCols`) are still printed but marked with a `|` separator when partial validity is specified.
+
+## Examples
+
+### Print a Tile
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugTile(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
+  srcTileData srcTile;
+  TASSIGN(srcTile, 0x0);
+
+  TLOAD(srcTile, srcGlobal);
+  TPRINT(srcTile);
+}
+```
+
+### Print a GlobalTensor
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  TPRINT(srcGlobal);
+}
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+```
+
+### PTO Assembly Form
+
+```text
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+# AS Level 2 (DPS)
+pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
+```
diff --git a/designs/outerCube/PTOISA/TPRINT_zh.md b/designs/outerCube/PTOISA/TPRINT_zh.md
new file mode 100644
index 00000000..1dbacc50
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPRINT_zh.md
@@ -0,0 +1,113 @@
+# TPRINT
+
+## Tile Operation Diagram
+
+![TPRINT tile operation](../figures/isa/TPRINT.svg)
+
+## Introduction
+
+Debug/print the elements of a Tile (implementation-defined).
+
+Print the contents of a Tile or GlobalTensor for debugging purposes directly from device code.
+
+The `TPRINT` instruction outputs the logical view of data stored in a Tile or GlobalTensor. It supports common data types (e.g., `float`, `half`, `int8`, `uint32`) and multiple memory layouts (`ND`, `DN`, `NZ` for GlobalTensor; vector tiles for on-chip buffers).
+
+> **Important**:
+> - This instruction is **for development and debugging ONLY**.
+> - It incurs **significant runtime overhead** and **must not be used in production kernels**.
+> - Output may be **truncated** if it exceeds the internal print buffer.
+> - **Requires CCE compilation option `-D_DEBUG --cce-enable-print`** (see [Behavior](#behavior)).
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+```text
+tprint %src : !pto.tile<...> | !pto.global<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tprint %src : !pto.tile<...> | !pto.partition_tensor_view<MxNxdtype> -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tprint ins(%src : !pto.tile_buf<...> | !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPRINT(TileData &src, WaitEvents &... events);
+```
+
+### Supported Types for `TileData`
+- **Tile**: Must be a vector tile (`TileType::Vec`) with supported element type.
+- **GlobalTensor**: Must use layout `ND`, `DN`, or `NZ`, and have a supported element type.
+
+## Constraints
+
+- **Supported element types**:
+    - Floating-point: `float`, `half`
+    - Signed integers: `int8_t`, `int16_t`, `int32_t`
+    - Unsigned integers: `uint8_t`, `uint16_t`, `uint32_t`
+- **For Tiles**: `TileData::Loc == TileType::Vec` (only vector tiles are printable).
+- **For GlobalTensor**: Layout must be one of `Layout::ND`, `Layout::DN`, or `Layout::NZ`.
+
+## Examples
+
+### Print a Tile
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugTile(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  using srcTileData = Tile<TileType::Vec, float, 16, 16>;
+  srcTileData srcTile;
+  TASSIGN(srcTile, 0x0);
+
+  TLOAD(srcTile, srcGlobal);
+  TPRINT(srcTile);
+}
+```
+
+### Print a GlobalTensor
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+PTO_INTERNAL void DebugGlobalTensor(__gm__ float *src) {
+  using ValidSrcShape = TileShape2D<float, 16, 16>;
+  using NDSrcShape = BaseShape2D<float, 32, 32>;
+  using GlobalDataSrc = GlobalTensor<float, ValidSrcShape, NDSrcShape>;
+  GlobalDataSrc srcGlobal(src);
+
+  TPRINT(srcGlobal);
+}
+```
diff --git a/designs/outerCube/PTOISA/TPUSH.md b/designs/outerCube/PTOISA/TPUSH.md
new file mode 100644
index 00000000..1a1b8c4c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPUSH.md
@@ -0,0 +1,40 @@
+# TPUSH
+
+## Tile Operation Diagram
+
+![TPUSH tile operation](../figures/isa/TPUSH.svg)
+
+## Introduction
+
+Push a tile into a pipe or FIFO producer endpoint.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tpush ...
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tpush ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TPUSH_zh.md b/designs/outerCube/PTOISA/TPUSH_zh.md
new file mode 100644
index 00000000..f6ebfadc
--- /dev/null
+++ b/designs/outerCube/PTOISA/TPUSH_zh.md
@@ -0,0 +1,41 @@
+# TPUSH
+
+## Tile Operation Diagram
+
+![TPUSH tile operation](../figures/isa/TPUSH.svg)
+
+## Introduction
+
+Push a Tile into a pipe or FIFO producer endpoint.
+
+## Math Interpretation
+
+Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tpush ...
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tpush ins(...) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`.
+
+## Constraints
+
+Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+
+## Examples
+
+See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
diff --git a/designs/outerCube/PTOISA/TQUANT.md b/designs/outerCube/PTOISA/TQUANT.md
new file mode 100644
index 00000000..945abb5e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TQUANT.md
@@ -0,0 +1,94 @@
+# TQUANT
+
+## Tile Operation Diagram
+
+![TQUANT tile operation](../figures/isa/TQUANT.svg)
+
+## Introduction
+
+Quantize an FP32 tile into a lower-precision format (e.g. FP8), producing auxiliary exponent/scaling/max tiles. The quantization mode is a compile-time template parameter (`quant_type`).
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataExp, typename TileDataMax,
+          typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, WaitEvents &... events);
+
+template <auto quant_type, auto store_mode, typename TileDataOut, typename TileDataSrc, typename TileDataExp,
+          typename TileDataMax, typename TileDataIdx, typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, TileDataExp *exp_zz, TileDataIdx *vgather_idx, WaitEvents &... events);
+
+template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataPara, typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataPara &scale, TileDataPara *offset = nullptr, WaitEvents &... events);
+```
+
+## Constraints
+
+- This instruction is currently implemented for specific targets (see `include/pto/npu/*/TQuant.hpp`).
+- Input type requirements and output tile types are mode/target-dependent.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TQUANT_zh.md b/designs/outerCube/PTOISA/TQUANT_zh.md
new file mode 100644
index 00000000..85d6d359
--- /dev/null
+++ b/designs/outerCube/PTOISA/TQUANT_zh.md
@@ -0,0 +1,67 @@
+# TQUANT
+
+## Tile Operation Diagram
+
+![TQUANT tile operation](../figures/isa/TQUANT.svg)
+
+## Introduction
+
+Quantize a Tile (for example, FP32 to FP8), producing exponent/scaling/max outputs.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tquant %src, %qp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataExp, typename TileDataMax,
+          typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, WaitEvents &... events);
+
+template <auto quant_type, auto store_mode, typename TileDataOut, typename TileDataSrc, typename TileDataExp,
+          typename TileDataMax, typename TileDataIdx, typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataExp *exp, TileDataMax *max, TileDataSrc *scaling, TileDataExp *exp_zz, TileDataIdx *vgather_idx, WaitEvents &... events);
+
+template <auto quant_type, typename TileDataOut, typename TileDataSrc, typename TileDataPara, typename... WaitEvents>
+PTO_INST RecordEvent TQUANT(TileDataOut &dst, TileDataSrc &src, TileDataPara &scale, TileDataPara *offset = nullptr, WaitEvents &... events);
+```
+
+## Constraints
+
+- This instruction is currently implemented for specific targets (see `include/pto/npu/*/TQuant.hpp`).
+- Input type requirements and output tile types are mode/target-dependent.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TRANDOM.md b/designs/outerCube/PTOISA/TRANDOM.md
new file mode 100644
index 00000000..f7aa18a6
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRANDOM.md
@@ -0,0 +1,122 @@
+# TRANDOM
+
+## Tile Operation Diagram
+
+![TRANDOM tile operation](../figures/isa/TRANDOM.svg)
+
+## Introduction
+
+Generates random numbers in the destination tile using a counter-based cipher algorithm.
+
+## Math Interpretation
+
+This instruction implements a counter-based random number generator. For each element in the valid region, it generates
+pseudo-random values based on a key and counter state using a cipher-like transformation with configurable rounds.
+
+The algorithm uses:
+
+- 128-bit state (4 × 32-bit counters)
+- 64-bit key (2 × 32-bit words)
+- ChaCha-like quarter-round operations with vector instructions
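+
+For orientation, the standard ChaCha quarter-round on four 32-bit words $(a, b, c, d)$ is written out here. This is a sketch under an assumption: the text above only says "ChaCha-like", so the exact operations and rotation amounts used by `TRANDOM` are not specified and may differ. Here $\boxplus$ is addition modulo $2^{32}$, $\oplus$ is bitwise XOR, and $\lll n$ is a left rotation by $n$ bits:
+
+$$ \begin{aligned} a &\leftarrow a \boxplus b, & d &\leftarrow (d \oplus a) \lll 16, \\ c &\leftarrow c \boxplus d, & b &\leftarrow (b \oplus c) \lll 12, \\ a &\leftarrow a \boxplus b, & d &\leftarrow (d \oplus a) \lll 8, \\ c &\leftarrow c \boxplus d, & b &\leftarrow (b \oplus c) \lll 7 \end{aligned} $$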
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+trandom %dst, %key, %counter : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/npu/a5/TRandom.hpp`:
+
+```cpp
+template <uint16_t Rounds = 10, typename DstTile>
+PTO_INST void TRANDOM_IMPL(DstTile &dst, TRandomKey &key, TRandomCounter &counter);
+```
+
+## Constraints
+
+- **Implementation checks (A5)**:
+  - `DstTile::DType` must be one of: `int32_t`, `uint32_t`.
+  - Tile layout must be row-major (`DstTile::isRowMajor`).
+  - `Rounds` must be either 7 or 10 (default: 10).
+  - `key` and `counter` must not be null.
+- **Valid region**:
+  - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT dst;
+  TRandomKey key = {0x01234, 0x56789};
+  TRandomCounter counter = {0, 0, 0, 0};
+  TRANDOM_IMPL(dst, key, counter);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT dst;
+  TRandomKey key = {0x01234, 0x56789};
+  TRandomCounter counter = {0, 0, 0, 0};
+  TASSIGN(dst, 0x0);
+  TRANDOM_IMPL<10>(dst, key, counter);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x3000)
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+trandom %dst, %key, %counter : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRANDOM_zh.md b/designs/outerCube/PTOISA/TRANDOM_zh.md
new file mode 100644
index 00000000..d6dfc933
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRANDOM_zh.md
@@ -0,0 +1,122 @@
+# TRANDOM
+
+## Tile Operation Diagram
+
+![TRANDOM tile operation](../figures/isa/TRANDOM.svg)
+
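+## Introduction
+
+Generates random numbers in the destination Tile using a counter-based cipher algorithm.
+
+## Math Interpretation
+
+This instruction implements a counter-based random number generator. For each element in the valid region, it generates
+pseudo-random values based on a key and counter state using a cipher-like transformation with configurable rounds.
+
+The algorithm uses:
+
+- 128-bit state (4 × 32-bit counters)
+- 64-bit key (2 × 32-bit words)
+- ChaCha-like quarter-round operations with vector instructions
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form: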
+
+```text
+trandom %dst, %key, %counter : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/npu/a5/TRandom.hpp`:
+
+```cpp
+template <uint16_t Rounds = 10, typename DstTile>
+PTO_INST void TRANDOM_IMPL(DstTile &dst, TRandomKey &key, TRandomCounter &counter);
+```
+
+## Constraints
+
+- **Implementation checks (A5)**:
+  - `DstTile::DType` must be one of: `int32_t`, `uint32_t`.
+  - Tile layout must be row-major (`DstTile::isRowMajor`).
+  - `Rounds` must be either 7 or 10 (default: 10).
+  - `key` and `counter` must not be null.
+- **Valid region**:
+  - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT dst;
+  TRandomKey key = {0x01234, 0x56789};
+  TRandomCounter counter = {0, 0, 0, 0};
+  TRANDOM_IMPL(dst, key, counter);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT dst;
+  TRandomKey key = {0x01234, 0x56789};
+  TRandomCounter counter = {0, 0, 0, 0};
+  TASSIGN(dst, 0x0);
+  TRANDOM_IMPL<10>(dst, key, counter);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x3000)
+%dst = pto.trandom %key, %counter : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+trandom %dst, %key, %counter : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trandom ins(%key, %counter : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRECIP.md b/designs/outerCube/PTOISA/TRECIP.md
new file mode 100644
index 00000000..4b592b63
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRECIP.md
@@ -0,0 +1,113 @@
+# TRECIP
+
+## Tile Operation Diagram
+
+![TRECIP tile operation](../figures/isa/TRECIP.svg)
+
+## Introduction
+
+Elementwise reciprocal of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{1}{\mathrm{src}_{i,j}} $$
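+
+A small worked instance of the formula above (illustrative values only). A zero input is deliberately left out, since division-by-zero behavior is target-defined per the Constraints section:
+
+$$ \mathrm{src}_{i,j} = 4 \Rightarrow \mathrm{dst}_{i,j} = \frac{1}{4} = 0.25, \qquad \mathrm{src}_{i,j} = -0.5 \Rightarrow \mathrm{dst}_{i,j} = \frac{1}{-0.5} = -2 $$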
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trecip %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRECIP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - On A3, `TRECIP` does not support in-place operation: the source and destination Tiles must not share the same memory.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TRECIP(out, x);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trecip %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRECIP_zh.md b/designs/outerCube/PTOISA/TRECIP_zh.md
new file mode 100644
index 00000000..f93f9086
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRECIP_zh.md
@@ -0,0 +1,86 @@
+# TRECIP
+
+## Tile Operation Diagram
+
+![TRECIP tile operation](../figures/isa/TRECIP.svg)
+
+## Introduction
+
+Elementwise reciprocal of a Tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{1}{\mathrm{src}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trecip %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trecip %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trecip ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRECIP(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - On A3, `TRECIP` does not support in-place operation: the source and destination Tiles must not share the same memory.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TRECIP(out, x);
+}
+```
diff --git a/designs/outerCube/PTOISA/TRELU.md b/designs/outerCube/PTOISA/TRELU.md
new file mode 100644
index 00000000..0a85211e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRELU.md
@@ -0,0 +1,116 @@
+# TRELU
+
+## Tile Operation Diagram
+
+![TRELU tile operation](../figures/isa/TRELU.svg)
+
+## Introduction
+
+Elementwise ReLU of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, 0) $$
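+
+Two illustrative evaluations of the formula above, one on each side of zero (the values are chosen for this example only):
+
+$$ \mathrm{src}_{i,j} = -3 \Rightarrow \mathrm{dst}_{i,j} = \max(-3, 0) = 0, \qquad \mathrm{src}_{i,j} = 2.5 \Rightarrow \mathrm{dst}_{i,j} = \max(2.5, 0) = 2.5 $$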
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trelu %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRELU(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TRELU(out, x);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trelu %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRELU_zh.md b/designs/outerCube/PTOISA/TRELU_zh.md
new file mode 100644
index 00000000..7024ffd7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRELU_zh.md
@@ -0,0 +1,89 @@
+# TRELU
+
+## Tile Operation Diagram
+
+![TRELU tile operation](../figures/isa/TRELU.svg)
+
+## Introduction
+
+Elementwise ReLU of a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \max(\mathrm{src}_{i,j}, 0) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trelu %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trelu %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trelu ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRELU(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `half`, `float`, `int32_t`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src/dst` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TRELU(out, x);
+}
+```
diff --git a/designs/outerCube/PTOISA/TREM.md b/designs/outerCube/PTOISA/TREM.md
new file mode 100644
index 00000000..88968044
--- /dev/null
+++ b/designs/outerCube/PTOISA/TREM.md
@@ -0,0 +1,115 @@
+# TREM
+
+
+## Tile Operation Diagram
+
+![TREM tile operation](../figures/isa/TREM.svg)
+
+## Introduction
+
+Elementwise remainder of two tiles. The result takes the sign of the divisor.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \bmod \mathrm{src1}_{i,j}$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trem %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation Checks (A2A3)**:
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - Supported element types: `float` and `int32_t`.
+    - `dst`, `src0`, and `src1` must be vector tiles.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow() > 0` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol() > 0`.
+    - **tmp Buffer Requirements**:
+      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
+      - `tmp.GetValidRow() >= 1` (at least 1 row)
+      - Data type must match `TileDataDst::DType`.
+- **Implementation Checks (A5)**:
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
+    - `dst`, `src0`, and `src1` must be vector tiles.
+    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for all tiles.
+    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow()` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol()`.
+    - Note: tmp parameter is accepted but not validated or used on A5.
+- **Division by Zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid Region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **For `int32_t` Inputs (A2A3 Only)**: Both `src0` and `src1` elements must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT out, a, b;
+  Tile<TileType::Vec, int32_t, 16, 16> tmp;
+  TREM(out, a, b, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trem %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TREMS.md b/designs/outerCube/PTOISA/TREMS.md
new file mode 100644
index 00000000..03c323be
--- /dev/null
+++ b/designs/outerCube/PTOISA/TREMS.md
@@ -0,0 +1,115 @@
+# TREMS
+
+
+## Tile Operation Diagram
+
+![TREMS tile operation](../figures/isa/TREMS.svg)
+
+## Introduction
+
+Elementwise remainder of a tile by a scalar: `remainder(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \bmod \mathrm{scalar}$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trems %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar,
+                           TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation Checks (A2A3)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types: `float` and `int32_t`.
+    - `dst` and `src` must be vector tiles.
+    - `dst` and `src` must be row-major.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow() > 0` and `dst.GetValidCol() == src.GetValidCol() > 0`.
+    - **tmp Buffer Requirements**:
+      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
+      - `tmp.GetValidRow() >= 1` (at least 1 row)
+      - Data type must match `TileDataDst::DType`.
+- **Implementation Checks (A5)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
+    - Note: tmp parameter is accepted but not validated or used on A5.
+- **Division by Zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid Region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **For `int32_t` Inputs (A2A3 Only)**: Both `src` elements and `scalar` must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  Tile<TileType::Vec, float, 16, 16> tmp;
+  TREMS(out, x, 3.0f, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trems %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TREMS_zh.md b/designs/outerCube/PTOISA/TREMS_zh.md
new file mode 100644
index 00000000..b3e3bb51
--- /dev/null
+++ b/designs/outerCube/PTOISA/TREMS_zh.md
@@ -0,0 +1,115 @@
+# TREMS
+
+## Tile Operation Diagram
+
+![TREMS tile operation](../figures/isa/TREMS.svg)
+
+## Introduction
+
+Elementwise remainder of a tile by a scalar: `remainder(src, scalar)`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \bmod \mathrm{scalar}$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = trems %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar,
+                           TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation Checks (A2A3)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types: `float` and `int32_t`.
+    - `dst` and `src` must be vector tiles.
+    - `dst` and `src` must be row-major.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow() > 0` and `dst.GetValidCol() == src.GetValidCol() > 0`.
+    - **tmp Buffer Requirements**:
+      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
+      - `tmp.GetValidRow() >= 1` (at least 1 row)
+      - Data type must match `TileDataDst::DType`.
+- **Implementation Checks (A5)**:
+    - `dst` and `src` must use the same element type.
+    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `dst.GetValidRow() == src.GetValidRow()` and `dst.GetValidCol() == src.GetValidCol()`.
+    - Note: tmp parameter is accepted but not validated or used on A5.
+- **Division by Zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid Region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **For `int32_t` Inputs (A2A3 Only)**: Both `src` elements and `scalar` must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  Tile<TileType::Vec, float, 16, 16> tmp;
+  TREMS(out, x, 3.0f, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trems %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trems %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.trems ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TREM_zh.md b/designs/outerCube/PTOISA/TREM_zh.md
new file mode 100644
index 00000000..5011131c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TREM_zh.md
@@ -0,0 +1,113 @@
+# TREM
+
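+## Tile Operation Diagram
+
+![TREM tile operation](../figures/isa/TREM.svg)
+
+## Introduction
+
+Elementwise remainder of two tiles. The result takes the sign of the divisor.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \bmod \mathrm{src1}_{i,j}$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form: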
+
+```text
+%dst = trem %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation Checks (A2A3)**:
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - Supported element types: `float` and `int32_t`.
+    - `dst`, `src0`, and `src1` must be vector tiles.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow() > 0` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol() > 0`.
+    - **tmp Buffer Requirements**:
+      - `tmp.GetValidCol() >= dst.GetValidCol()` (at least as many columns as dst)
+      - `tmp.GetValidRow() >= 1` (at least 1 row)
+      - Data type must match `TileDataDst::DType`.
+- **Implementation Checks (A5)**:
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - Supported element types: `float`, `int32_t`, `uint32_t`, `half`, `int16_t`, and `uint16_t`.
+    - `dst`, `src0`, and `src1` must be vector tiles.
+    - Static valid bounds: `ValidRow <= Rows` and `ValidCol <= Cols` for all tiles.
+    - Runtime: `dst.GetValidRow() == src0.GetValidRow() == src1.GetValidRow()` and `dst.GetValidCol() == src0.GetValidCol() == src1.GetValidCol()`.
+    - Note: tmp parameter is accepted but not validated or used on A5.
+- **Division by Zero**:
+    - Behavior is target-defined; the CPU simulator asserts in debug builds.
+- **Valid Region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **For `int32_t` Inputs (A2A3 Only)**: Both `src0` and `src1` elements must be in the range `[-2^24, 2^24]` (i.e., `[-16777216, 16777216]`) to ensure exact conversion to float32 during computation.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, int32_t, 16, 16>;
+  TileT out, a, b;
+  Tile<TileType::Vec, int32_t, 16, 16> tmp;
+  TREM(out, a, b, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trem %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRESHAPE.md b/designs/outerCube/PTOISA/TRESHAPE.md
new file mode 100644
index 00000000..ef5ca45d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRESHAPE.md
@@ -0,0 +1,115 @@
+# TRESHAPE
+
+
+## Tile Operation Diagram
+
+![TRESHAPE tile operation](../figures/isa/TRESHAPE.svg)
+
+## Introduction
+
+Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
+
+This is a *bitwise* reshape: it does not change values, it only changes how the same byte buffer is viewed.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+```text
+%dst = treshape %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TRESHAPE(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Enforced by `TRESHAPE_IMPL`:
+
+- **Tile type must match**: `TileDataIn::Loc == TileDataOut::Loc`.
+- **Total byte size must match**: `sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`.
+- **No boxed/non-boxed conversion**:
+    - cannot reshape between `SLayout::NoneBox` and boxed layouts.
+
+## Notes
+
+- **CPU simulation**: implemented as a byte-for-byte copy into `dst`.
+- **A2/A3**: implemented as an alias (`TASSIGN_IMPL(dst, src.data())`), so `dst` and `src` refer to the same underlying storage.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using Src = Tile<TileType::Vec, float, 16, 16>;
+  using Dst = Tile<TileType::Vec, float, 8, 32>;
+  static_assert(Src::Numel == Dst::Numel);
+
+  Src src;
+  Dst dst;
+  TRESHAPE(dst, src);
+}
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = treshape %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRESHAPE_zh.md b/designs/outerCube/PTOISA/TRESHAPE_zh.md
new file mode 100644
index 00000000..616c08b8
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRESHAPE_zh.md
@@ -0,0 +1,81 @@
+# TRESHAPE
+
+## Tile Operation Diagram
+
+![TRESHAPE tile operation](../figures/isa/TRESHAPE.svg)
+
+## Introduction
+
+Reinterpret a tile as another tile type/shape while preserving the underlying bytes.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+```text
+%dst = treshape %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.treshape %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.treshape ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename... WaitEvents>
+PTO_INST RecordEvent TRESHAPE(TileDataOut &dst, TileDataIn &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Enforced by `TRESHAPE_IMPL`:
+
+- **Tile type must match**: `TileDataIn::Loc == TileDataOut::Loc`.
+- **Total byte size must match**: `sizeof(InElem) * InNumel == sizeof(OutElem) * OutNumel`.
+- **No boxed/non-boxed conversion**:
+    - cannot reshape between `SLayout::NoneBox` and boxed layouts.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using Src = Tile<TileType::Vec, float, 16, 16>;
+  using Dst = Tile<TileType::Vec, float, 8, 32>;
+  static_assert(Src::Numel == Dst::Numel);
+
+  Src src;
+  Dst dst;
+  TRESHAPE(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TROWARGMAX.md b/designs/outerCube/PTOISA/TROWARGMAX.md
new file mode 100644
index 00000000..470e0bda
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWARGMAX.md
@@ -0,0 +1,148 @@
+# TROWARGMAX
+
+
+## Tile Operation Diagram
+
+![TROWARGMAX tile operation](../figures/isa/TROWARGMAX.svg)
+
+## Introduction
+
+Get the column index of the maximum element for each row.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \operatorname*{argmax}_{0 \le j < C} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`:
+    - **Compact Mode**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`; `ROWS` must be 32b-aligned.
+    - **Traditional Mode**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+- A5:
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+  - The `tmp` tile is not used; it is accepted only for interface compatibility.
+
+### About temporary tile `tmp` for A3
+
+* The `tmp` tile is used only when `srcValidCol > ElementPerRepeat`; when `srcValidCol <= ElementPerRepeat` it is not used.
+* The `tmp` tile has the same number of rows as `src`.
+* When `src` is small, simply make the `tmp` tile the same size as `src`.
+* The required stride of the `tmp` tile can be computed from the `validCol` of `src` using the following formula:
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWARGMAX(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWARGMAX(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWARGMAX_zh.md b/designs/outerCube/PTOISA/TROWARGMAX_zh.md
new file mode 100644
index 00000000..01da87a4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWARGMAX_zh.md
@@ -0,0 +1,147 @@
+# TROWARGMAX
+
+## Tile Operation Diagram
+
+![TROWARGMAX tile operation](../figures/isa/TROWARGMAX.svg)
+
+## Introduction
+
+Get the column index of the maximum element for each row.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \operatorname*{argmax}_{0 \le j < C} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`:
+    - **Compact Mode**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`; `ROWS` must be 32b-aligned.
+    - **Traditional Mode**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+- A5:
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+  - The `tmp` tile is not used; it is accepted only for interface compatibility.
+
+### About temporary tile `tmp` for A3
+
+* The `tmp` tile is used only when `srcValidCol > ElementPerRepeat`; when `srcValidCol <= ElementPerRepeat` it is not used.
+* The `tmp` tile has the same number of rows as `src`.
+* The required stride of the `tmp` tile can be computed from the `validCol` of `src` using the following formula:
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWARGMAX(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWARGMAX(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowargmax %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.trowargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWARGMIN.md b/designs/outerCube/PTOISA/TROWARGMIN.md
new file mode 100644
index 00000000..e1d99930
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWARGMIN.md
@@ -0,0 +1,148 @@
+# TROWARGMIN
+
+
+## Tile Operation Diagram
+
+![TROWARGMIN tile operation](../figures/isa/TROWARGMIN.svg)
+
+## Introduction
+
+Get the column index of the minimum element for each row.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
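+
+As a small illustration (the values are made up, and the row has no repeated minimum, so no tie-breaking rule is assumed), take a single row with `C = 4`:
+
+$$
+\mathrm{src}_{i,\cdot} = (3.0,\ 0.5,\ 4.0,\ 2.0) \quad\Rightarrow\quad \mathrm{dst}_{i,0} = 1
+$$
+
+The smallest value, `0.5`, sits at column index `1`, so the instruction writes the index `1` rather than the value `0.5`.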
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/grammar/PTO-AS.md`.
+
+Synchronous form:
+
+```text
+%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile layout of `dst`:
+    - **Compact Mode**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`; `ROWS` must be 32b aligned.
+    - **Traditional Mode**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - Runtime valid checks:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+- A5:
+  - Source data types: `half` or `float`.
+  - Destination data types: `uint32_t` or `int32_t`.
+  - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+  - The `tmp` temporary tile is not used; the operand is accepted only for interface compatibility.
+
+### About temporary tile `tmp` for A3
+
+* `tmp` is unused when `srcValidCol <= ElementPerRepeat` and required when `srcValidCol > ElementPerRepeat`.
+* `tmp` must have the same number of rows as `src`.
+* When `src` is small, the simplest choice is to give `tmp` the same size as `src`.
+* The stride required for `tmp` can be computed from the `validCol` of `src` with the following formula:
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
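+
+As a worked example, assume `float` elements with `elementPerRepeat = 64` and `elementPerBlock = 8`. These two values are illustrative assumptions, not taken from this document; substitute the values for your target. With `validCol = 640`:
+
+```text
+repeats = ceil(640 / 64) = 10
+stride  = ceil(10 * 2 / 8) * 8 + ceil(10 / 8) * 8
+        = 3 * 8 + 2 * 8
+        = 40
+```
+
+Both terms are rounded up to a whole number of blocks before they are added, which is why the result is 40 rather than 30.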
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWARGMIN(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWARGMIN(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWARGMIN_zh.md b/designs/outerCube/PTOISA/TROWARGMIN_zh.md
new file mode 100644
index 00000000..4dae67f1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWARGMIN_zh.md
@@ -0,0 +1,147 @@
+# TROWARGMIN
+
+## 指令示意图
+
+![TROWARGMIN tile operation](../figures/isa/TROWARGMIN.svg)
+
+## 简介
+
+获取每行最小值对应列索引。
+
+## 数学语义
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \underset{0 \le j < C}{\operatorname{argmin}} \; \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 `docs/grammar/PTO-AS.md`.
+
+同步形式：
+
+```text
+%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- A2A3:
+  - Tile location: `dst` and `src` must be `TileType::Vec`.
+  - Tile 布局 of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+  - Tile 布局 of `dst`:
+    - **紧凑模式**：DN 布局的一维 Tile，例如 `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`，此时ROWS要做到32b对齐。
+    - **传统模式**：ND 布局的二维 Tile，例如 `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`。
+  - 源数据类型: `half` or `float`.
+  - 目标数据类型：`uint32_t` or `int32_t`.
+  - 运行期有效区域检查:
+    - `srcValidCol != 0` and `srcValidRow != 0`.
+- A5:
+  - 源数据类型: `half` or `float`.
+  - 目标数据类型：`uint32_t` or `int32_t`.
+  - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+  - `tmp`临时Tile不使用，仅做兼容。
+
+### A3 `tmp`临时Tile相关说明
+
+* `tmp`临时Tile在`srcValidCol <= ElementPerRepeat`时不使用，`srcValidCol > ElementPerRepeat`时需要使用。
+* `tmp` tile的行数和`src` tile的行数相同。
+* 按以下公式根据`src` tile的`validCol`算出`tmp` tile所需stride：
+
+```text
+repeats = ceil(validCol / elementPerRepeat)
+stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
+```
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWARGMIN(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, uint32_t, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWARGMIN(dst, src, tmp);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowargmin %src : !pto.tile<...> -> !pto.tile<...>
+# IR Level 2 (DPS)
+pto.trowargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPAND.md b/designs/outerCube/PTOISA/TROWEXPAND.md
new file mode 100644
index 00000000..55d385d1
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPAND.md
@@ -0,0 +1,132 @@
+# TROWEXPAND
+
+
+## Tile Operation Diagram
+
+![TROWEXPAND tile operation](../figures/isa/TROWEXPAND.svg)
+
+## Introduction
+
+Broadcast the first element of each source row across the destination row.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,0} $$
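+
+For instance, with `R = 2`, `C = 3`, and made-up values:
+
+$$
+\mathrm{src} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
+\quad\Rightarrow\quad
+\mathrm{dst} = \begin{bmatrix} 1 & 1 & 1 \\ 4 & 4 & 4 \end{bmatrix}
+$$
+
+Only column 0 of `src` contributes to the result; the remaining source columns are ignored.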
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- Tile Type: `dst` and `src` must be `TileType::Vec`.
+- Tile layout: ND fractal (`isRowMajor` and `SLayout::NoneBox`) for both `src` and `dst`.
+- Data type: on A2A3 and A5, the element type must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- Runtime valid checks:
+    - A2A3: returns early if any of `dstValidRow`, `dstValidCol`, `srcValidRow`, `srcValidCol` is zero.
+    - A5: asserts `srcValidRow == dstValidRow` and asserts `srcValidRow != 0 && srcValidCol != 0`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TROWEXPAND(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TROWEXPAND(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWEXPANDADD.md b/designs/outerCube/PTOISA/TROWEXPANDADD.md
new file mode 100644
index 00000000..8f8b334f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDADD.md
@@ -0,0 +1,106 @@
+# TROWEXPANDADD
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDADD tile operation](../figures/isa/TROWEXPANDADD.svg)
+
+## Introduction
+
+Row-wise broadcast add: add a per-row scalar vector `src1` to each row of `src0`.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_i
+$$
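+
+For instance, with `R = 2`, `C = 2`, made-up values, and per-row scalars $s_0 = 10$, $s_1 = 20$:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
+\quad\Rightarrow\quad
+\mathrm{dst} = \begin{bmatrix} 1 + 10 & 2 + 10 \\ 3 + 20 & 4 + 20 \end{bmatrix}
+= \begin{bmatrix} 11 & 12 \\ 23 & 24 \end{bmatrix}
+$$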
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWEXPANDADD_zh.md b/designs/outerCube/PTOISA/TROWEXPANDADD_zh.md
new file mode 100644
index 00000000..6170b26b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDADD_zh.md
@@ -0,0 +1,79 @@
+# TROWEXPANDADD
+
+## 指令示意图
+
+![TROWEXPANDADD tile operation](../figures/isa/TROWEXPANDADD.svg)
+
+## 简介
+
+行广播加法：加上一个每行标量向量。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + s_i
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandadd %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDADD(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TROWEXPANDDIV.md b/designs/outerCube/PTOISA/TROWEXPANDDIV.md
new file mode 100644
index 00000000..5a116155
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDDIV.md
@@ -0,0 +1,126 @@
+# TROWEXPANDDIV
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDDIV tile operation](../figures/isa/TROWEXPANDDIV.svg)
+
+## Introduction
+
+Row-wise broadcast divide: divide each row of `src0` by a per-row scalar vector `src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{0,i}} $$
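+
+For instance, with a 2 x 2 valid region and made-up values, where the divisor taken from `src1` is 2 for row 0 and 3 for row 1:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 2 & 4 \\ 9 & 3 \end{bmatrix}
+\quad\Rightarrow\quad
+\mathrm{dst} = \begin{bmatrix} 2/2 & 4/2 \\ 9/3 & 3/3 \end{bmatrix}
+= \begin{bmatrix} 1 & 2 \\ 3 & 1 \end{bmatrix}
+$$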
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
+    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+    - Mode 2: `src1` is expected to provide **32 bytes data per row**.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDDIV(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDDIV(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPANDDIV_zh.md b/designs/outerCube/PTOISA/TROWEXPANDDIV_zh.md
new file mode 100644
index 00000000..0f5523ea
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDDIV_zh.md
@@ -0,0 +1,126 @@
+# TROWEXPANDDIV
+
+## 指令示意图
+
+![TROWEXPANDDIV tile operation](../figures/isa/TROWEXPANDDIV.svg)
+
+## 简介
+
+行广播除法：将 `src0` 的每一行除以一个每行标量向量 `src1`。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{0,i}} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDDIV(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
+    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
+    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
+    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
+    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDDIV(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDDIV(dst, src0, src1);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowexpanddiv %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPANDEXPDIF.md b/designs/outerCube/PTOISA/TROWEXPANDEXPDIF.md
new file mode 100644
index 00000000..d5551a18
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDEXPDIF.md
@@ -0,0 +1,106 @@
+# TROWEXPANDEXPDIF
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDEXPDIF tile operation](../figures/isa/TROWEXPANDEXPDIF.svg)
+
+## Introduction
+
+Row-wise exp-diff: compute `exp(src0 - src1)` where `src1` provides one scalar per row.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_i)
+$$
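+
+For instance, with `R = 2`, `C = 2`, and made-up values, choosing each $s_i$ as the maximum of its row ($s_0 = 2$, $s_1 = 3$):
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix}
+\quad\Rightarrow\quad
+\mathrm{dst} = \begin{bmatrix} e^{-1} & e^{0} \\ e^{-3} & e^{0} \end{bmatrix}
+\approx \begin{bmatrix} 0.368 & 1 \\ 0.050 & 1 \end{bmatrix}
+$$
+
+Using the row maximum for $s_i$ is only an illustrative choice here; it keeps every exponent non-positive, so every result lies in $(0, 1]$.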
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWEXPANDEXPDIF_zh.md b/designs/outerCube/PTOISA/TROWEXPANDEXPDIF_zh.md
new file mode 100644
index 00000000..7a98cd6f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDEXPDIF_zh.md
@@ -0,0 +1,79 @@
+# TROWEXPANDEXPDIF
+
+## 指令示意图
+
+![TROWEXPANDEXPDIF tile operation](../figures/isa/TROWEXPANDEXPDIF.svg)
+
+## 简介
+
+行指数差运算：计算 exp(src0 - src1)，其中 src1 为每行标量。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \exp(\mathrm{src0}_{i,j} - s_i)
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandexpdif %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDEXPDIF(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMAX.md b/designs/outerCube/PTOISA/TROWEXPANDMAX.md
new file mode 100644
index 00000000..88ddf649
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMAX.md
@@ -0,0 +1,106 @@
+# TROWEXPANDMAX
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDMAX tile operation](../figures/isa/TROWEXPANDMAX.svg)
+
+## Introduction
+
+Row-wise broadcast max: take `max(src0, src1)` where `src1` provides one scalar per row.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_i)
+$$
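+
+For instance, with `R = 2`, `C = 2`, made-up values, and per-row scalars $s_0 = 3$, $s_1 = 4$:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 5 \\ 7 & 2 \end{bmatrix}
+\quad\Rightarrow\quad
+\mathrm{dst} = \begin{bmatrix} \max(1, 3) & \max(5, 3) \\ \max(7, 4) & \max(2, 4) \end{bmatrix}
+= \begin{bmatrix} 3 & 5 \\ 7 & 4 \end{bmatrix}
+$$
+
+In effect, each row of `src0` is clamped from below by that row's scalar.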
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMAX_zh.md b/designs/outerCube/PTOISA/TROWEXPANDMAX_zh.md
new file mode 100644
index 00000000..f763dfa3
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMAX_zh.md
@@ -0,0 +1,79 @@
+# TROWEXPANDMAX
+
+## 指令示意图
+
+![TROWEXPANDMAX tile operation](../figures/isa/TROWEXPANDMAX.svg)
+
+## 简介
+
+行广播最大值：与每行标量向量取最大值。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \max(\mathrm{src0}_{i,j}, s_i)
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmax %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMAX(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMIN.md b/designs/outerCube/PTOISA/TROWEXPANDMIN.md
new file mode 100644
index 00000000..e981807c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMIN.md
@@ -0,0 +1,106 @@
+# TROWEXPANDMIN
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDMIN tile operation](../figures/isa/TROWEXPANDMIN.svg)
+
+## Introduction
+
+Row-wise broadcast min: take `min(src0, src1)` where `src1` provides one scalar per row.
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_i)
+$$
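+
+As a small worked instance of the formula above (the values are illustrative assumptions, not taken from any test suite), take `R = 2`, `C = 3`, and per-row scalars `s_0 = 3`, `s_1 = 6`:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 5 & 2 \\ 4 & 0 & 7 \end{bmatrix}, \quad
+s = \begin{bmatrix} 3 \\ 6 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 6 \end{bmatrix}
+$$
+
+Every entry of row `i` is compared against the same scalar `s_i`, so row 0 is bounded above by 3 and row 1 by 6.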
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMIN_zh.md b/designs/outerCube/PTOISA/TROWEXPANDMIN_zh.md
new file mode 100644
index 00000000..e3895fd9
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMIN_zh.md
@@ -0,0 +1,79 @@
+# TROWEXPANDMIN
+
+## 指令示意图
+
+![TROWEXPANDMIN tile operation](../figures/isa/TROWEXPANDMIN.svg)
+
+## 简介
+
+行广播最小值：与每行标量向量取最小值。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `s_i` be the per-row scalar taken from `src1` (one value per row).
+
+For `0 <= i < R` and `0 <= j < C`:
+
+$$
+\mathrm{dst}_{i,j} = \min(\mathrm{src0}_{i,j}, s_i)
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmin %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMIN(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`
+- `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+- Tile 形状/布局约束 (compile-time): `TileDataDst::isRowMajor`.
+- Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+- Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+- Exact layout/fractal constraints are target-specific; see backend headers under `include/pto/npu/*/TRowExpand*.hpp`.
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMUL.md b/designs/outerCube/PTOISA/TROWEXPANDMUL.md
new file mode 100644
index 00000000..76a11fb4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMUL.md
@@ -0,0 +1,126 @@
+# TROWEXPANDMUL
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDMUL tile operation](../figures/isa/TROWEXPANDMUL.svg)
+
+## Introduction
+
+Row-wise broadcast multiply: multiply each row of `src0` by a per-row scalar vector `src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{0,i} $$
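+
+As a small worked instance of the formula above (the values are illustrative assumptions), take a valid region of 2 rows by 3 columns with per-row scalars $\mathrm{src1}_{0,0} = 2$ and $\mathrm{src1}_{0,1} = -1$:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 1 \cdot 2 & 2 \cdot 2 & 3 \cdot 2 \\ 4 \cdot (-1) & 5 \cdot (-1) & 6 \cdot (-1) \end{bmatrix}
+= \begin{bmatrix} 2 & 4 & 6 \\ -4 & -5 & -6 \end{bmatrix}
+$$
+
+Every entry of row `i` is scaled by the same scalar, so the scalar index depends only on the row.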
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
+    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+    - Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDMUL(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDMUL(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPANDMUL_zh.md b/designs/outerCube/PTOISA/TROWEXPANDMUL_zh.md
new file mode 100644
index 00000000..f27d717f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDMUL_zh.md
@@ -0,0 +1,126 @@
+# TROWEXPANDMUL
+
+## 指令示意图
+
+![TROWEXPANDMUL tile operation](../figures/isa/TROWEXPANDMUL.svg)
+
+## 简介
+
+行广播乘法：将 `src0` 的每一行乘以一个每行标量向量 `src1`。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \cdot \mathrm{src1}_{0,i} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDMUL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
+    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
+    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
+    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
+    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDMUL(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDMUL(dst, src0, src1);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowexpandmul %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPANDSUB.md b/designs/outerCube/PTOISA/TROWEXPANDSUB.md
new file mode 100644
index 00000000..15f38477
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDSUB.md
@@ -0,0 +1,126 @@
+# TROWEXPANDSUB
+
+
+## Tile Operation Diagram
+
+![TROWEXPANDSUB tile operation](../figures/isa/TROWEXPANDSUB.svg)
+
+## Introduction
+
+Row-wise broadcast subtract: subtract a per-row scalar vector `src1` from each row of `src0`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{0,i} $$
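+
+As a small worked instance of the formula above (the values are illustrative assumptions), take a valid region of 2 rows by 3 columns with per-row scalars $\mathrm{src1}_{0,0} = 1$ and $\mathrm{src1}_{0,1} = 4$:
+
+$$
+\mathrm{src0} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst} = \begin{bmatrix} 1 - 1 & 2 - 1 & 3 - 1 \\ 4 - 4 & 5 - 4 & 6 - 4 \end{bmatrix}
+= \begin{bmatrix} 0 & 1 & 2 \\ 0 & 1 & 2 \end{bmatrix}
+$$
+
+Every entry of row `i` has the same scalar subtracted, which is the pattern used, for example, when subtracting a per-row maximum before an exponential.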
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType` (compile-time).
+    - `TileDataDst::DType`, `TileDataSrc0::DType`, `TileDataSrc1::DType` must be one of: `half`, `float`.
+    - Tile shape/layout constraint (compile-time): `TileDataDst::isRowMajor`.
+    - Mode 1: `src1` is expected to provide **one scalar per row** (i.e., its valid shape must cover `R` values).
+    - Mode 2: `src1` is expected to provide **32 bytes of data per row**.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDSUB(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDSUB(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPANDSUB_zh.md b/designs/outerCube/PTOISA/TROWEXPANDSUB_zh.md
new file mode 100644
index 00000000..d1dad849
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPANDSUB_zh.md
@@ -0,0 +1,126 @@
+# TROWEXPANDSUB
+
+## 指令示意图
+
+![TROWEXPANDSUB tile operation](../figures/isa/TROWEXPANDSUB.svg)
+
+## 简介
+
+行广播减法：从 `src0` 的每一行中减去一个每行标量向量 `src1`。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{0,i} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPANDSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查**:
+    - `TileDataDst::DType == TileDataSrc0::DType == TileDataSrc1::DType`（编译时）。
+    - `TileDataDst::DType`、`TileDataSrc0::DType`、`TileDataSrc1::DType` 必须是以下之一：`half`、`float`。
+    - Tile 形状/布局约束（编译时）：`TileDataDst::isRowMajor`。
+    - 模式 1：`src1` 预期提供**每行一个标量**（即，其有效形状必须覆盖 `R` 个值）。
+    - 模式 2：`src1` 预期提供**每行 32 字节数据**。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TROWEXPANDSUB(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, half, 16, 16>;
+  using RowVecT = Tile<TileType::Vec, half, 16, 1, BLayout::ColMajor, 1, DYNAMIC, SLayout::NoneBox>;
+
+  TileT src0, dst;
+  RowVecT src1(16);
+  TASSIGN(src0, 0x1000);
+  TASSIGN(dst,  0x2000);
+  TASSIGN(src1, 0x3000);
+  TROWEXPANDSUB(dst, src0, src1);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowexpandsub %src0, %src1 : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TROWEXPAND_zh.md b/designs/outerCube/PTOISA/TROWEXPAND_zh.md
new file mode 100644
index 00000000..d0dd216d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWEXPAND_zh.md
@@ -0,0 +1,105 @@
+# TROWEXPAND
+
+## 指令示意图
+
+![TROWEXPAND tile operation](../figures/isa/TROWEXPAND.svg)
+
+## 简介
+
+将每个源行的第一个元素广播到目标行中。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. For `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,0} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowexpand %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowexpand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TROWEXPAND(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- Tile Type: `dst` and `src` must be `TileType::Vec`.
+- Tile 布局: ND fractal (`isRowMajor` and `SLayout::NoneBox`) for both `src` and `dst`.
+- Data type: on A2A3/A5, the element type must be one of `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, or `float`.
+- 运行期有效区域检查:
+    - A2A3: returns early if any of `dstValidRow`, `dstValidCol`, `srcValidRow`, `srcValidCol` is zero.
+    - A5: asserts `srcValidRow == dstValidRow` and asserts `srcValidRow != 0 && srcValidCol != 0`.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TROWEXPAND(dst, src);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TROWEXPAND(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TROWMAX.md b/designs/outerCube/PTOISA/TROWMAX.md
new file mode 100644
index 00000000..67c2254c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWMAX.md
@@ -0,0 +1,135 @@
+# TROWMAX
+
+
+## Tile Operation Diagram
+
+![TROWMAX tile operation](../figures/isa/TROWMAX.svg)
+
+## Introduction
+
+Reduce each row by taking the maximum across columns.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
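+
+As a small worked instance of the formula above (the values are illustrative assumptions), take `R = 2` and `C = 3`:
+
+$$
+\mathrm{src} = \begin{bmatrix} 1 & 5 & 2 \\ 4 & 0 & 7 \end{bmatrix}
+\;\Longrightarrow\;
+\mathrm{dst}_{0,0} = \max(1, 5, 2) = 5, \quad
+\mathrm{dst}_{1,0} = \max(4, 0, 7) = 7
+$$
+
+Only column 0 of `dst` is defined by the reduction, one value per valid source row.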
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile layout of `dst`:
+        - **Compact Mode**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`; `ROWS` must be 32b aligned.
+        - **Traditional Mode**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - Runtime valid-region checks:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the output valid row count must match the input valid row count).
+- A5:
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWMAX(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWMAX(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWMAX_zh.md b/designs/outerCube/PTOISA/TROWMAX_zh.md
new file mode 100644
index 00000000..1795309c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWMAX_zh.md
@@ -0,0 +1,135 @@
+# TROWMAX
+
+## 指令示意图
+
+![TROWMAX tile operation](../figures/isa/TROWMAX.svg)
+
+## 简介
+
+通过取列间最大值来归约每一行。
+
+## 数学语义
+
+设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
+
+$$ \mathrm{dst}_{i,0} = \max_{0 \le j < C} \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
+```
+降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWMAX(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- A2A3:
+    - Tile 位置：`dst` 和 `src` 必须是 `TileType::Vec`。
+    - `src` 的 Tile 布局：ND 分形（`isRowMajor` 且 `SLayout::NoneBox`）。
+    - `dst` 的 Tile 布局：
+        - **紧凑模式**：DN 布局的一维 Tile，例如 `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`，此时ROWS要做到32b对齐。
+        - **传统模式**：ND 布局的二维 Tile，例如 `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`。
+    - 数据类型：`half` 或 `float`。
+    - 数据类型一致性：`dst.DType == src.DType`。
+    - 运行期有效区域检查：
+        - `srcValidCol != 0` 且 `srcValidRow != 0`。
+        - `srcValidRow == dstValidRow`（输出有效行数必须与输入有效行数匹配）。
+- A5:
+    - 数据类型：`half` 或 `float`。
+    - 数据类型一致性：`dst.DType == src.DType`。
+    - 实现中对 `validRow/validCol` 无显式运行时断言；循环使用 `src.GetValidRow()` 和 `src.GetValidCol()`。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWMAX(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWMAX(dst, src, tmp);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowmax %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWMIN.md b/designs/outerCube/PTOISA/TROWMIN.md
new file mode 100644
index 00000000..ab60bbe7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWMIN.md
@@ -0,0 +1,135 @@
+# TROWMIN
+
+
+## Tile Operation Diagram
+
+![TROWMIN tile operation](../figures/isa/TROWMIN.svg)
+
+## Introduction
+
+Reduce each row by taking the minimum across columns.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
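+
+As a small worked illustration (the values are invented for this example, not taken from a test case), take `R = 2` and `C = 3`:
+
+$$ \mathrm{src} = \begin{bmatrix} 3 & -1 & 4 \\ 2 & 7 & 0 \end{bmatrix} \;\Rightarrow\; \mathrm{dst}_{0,0} = \min(3, -1, 4) = -1, \quad \mathrm{dst}_{1,0} = \min(2, 7, 0) = 0 $$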
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile layout of `dst`:
+        - **Compact mode**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`; `ROWS` must be 32b aligned.
+        - **Traditional mode**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - Runtime valid checks:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the number of valid output rows must match the number of valid input rows).
+- A5:
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWMIN(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWMIN(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWMIN_zh.md b/designs/outerCube/PTOISA/TROWMIN_zh.md
new file mode 100644
index 00000000..8cd11fc2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWMIN_zh.md
@@ -0,0 +1,135 @@
+# TROWMIN
+
+## 指令示意图
+
+![TROWMIN tile operation](../figures/isa/TROWMIN.svg)
+
+## 简介
+
+通过取列间最小值来归约每一行。
+
+## 数学语义
+
+设 `R = src.GetValidRow()`，`C = src.GetValidCol()`。对 `0 <= i < R`：
+
+$$ \mathrm{dst}_{i,0} = \min_{0 \le j < C} \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
+```
+降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- A2A3:
+    - Tile 位置：`dst` 和 `src` 必须是 `TileType::Vec`。
+    - `src` 的 Tile 布局：ND 分形（`isRowMajor` 且 `SLayout::NoneBox`）。
+    - `dst` 的 Tile 布局：
+        - **紧凑模式**：DN 布局的一维 Tile，例如 `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`，此时ROWS要做到32b对齐。
+        - **传统模式**：ND 布局的二维 Tile，例如 `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`。
+    - 数据类型：`half` 或 `float`。
+    - 数据类型一致性：`dst.DType == src.DType`。
+    - 运行期有效区域检查：
+        - `srcValidCol != 0` 且 `srcValidRow != 0`。
+        - `srcValidRow == dstValidRow`（输出有效行数必须与输入有效行数匹配）。
+- A5:
+    - 数据类型：`half` 或 `float`。
+    - 数据类型一致性：`dst.DType == src.DType`。
+    - 实现中对 `validRow/validCol` 无显式运行时断言；循环使用 `src.GetValidRow()` 和 `src.GetValidCol()`。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWMIN(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWMIN(dst, src, tmp);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = trowmin %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWPROD.md b/designs/outerCube/PTOISA/TROWPROD.md
new file mode 100644
index 00000000..dfd49a18
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWPROD.md
@@ -0,0 +1,150 @@
+# TROWPROD
+
+
+## Tile Operation Diagram
+
+![TROWPROD tile operation](../figures/isa/TROWPROD.svg)
+
+## Introduction
+
+Reduce each row by multiplying across columns.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \prod_{j=0}^{C-1} \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWPROD(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile layout of `dst`: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`.
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - Runtime valid checks:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the number of valid output rows must match the number of valid input rows).
+    - `tmp` must have the same shape as `src`.
+
+## Implementation Notes
+
+Unlike TROWSUM, which uses the `vcadd`/`vcgadd` instructions, TROWPROD performs a binary reduction with `vmul`, because no `vcmul` instruction is available on A2A3. The implementation:
+
+1. Multiplies adjacent repeat pairs and stores the results in `tmp`.
+2. Iteratively performs a binary multiplication reduction on `tmp`.
+3. Continues until each row has only one element.
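+
+As an element-level sketch of this scheme (an illustration only: the actual implementation works on vector repeats, and the exact pairing order is an assumption here), a single row with `C = 8` elements `a_0, ..., a_7` reduces in three multiplication steps:
+
+$$ \begin{aligned} \text{step 1:}\quad & t^{(1)}_k = a_{2k} \cdot a_{2k+1}, && k = 0, \dots, 3 \\ \text{step 2:}\quad & t^{(2)}_k = t^{(1)}_{2k} \cdot t^{(1)}_{2k+1}, && k = 0, 1 \\ \text{step 3:}\quad & \mathrm{dst}_{i,0} = t^{(2)}_0 \cdot t^{(2)}_1 = \prod_{j=0}^{7} a_j \end{aligned} $$
+
+Because floating-point multiplication is not associative, a tree-shaped reduction like this can differ in the last bits from a left-to-right product of the same row.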
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWPROD(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWPROD(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWPROD_zh.md b/designs/outerCube/PTOISA/TROWPROD_zh.md
new file mode 100644
index 00000000..7863b14d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWPROD_zh.md
@@ -0,0 +1,115 @@
+# TROWPROD
+
+## 指令示意图
+
+![TROWPROD tile operation](../figures/isa/TROWPROD.svg)
+
+## 简介
+
+对每一行沿列方向连乘归约。
+
+## 数学语义
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \prod_{j=0}^{C-1} \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowprod %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowprod ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.trowprod %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.trowprod ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWPROD(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile 布局 of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile 布局 of `dst`: DN layout Tile of 1D, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`
+    - 数据类型: `half`, `float`.
+    - 数据类型一致性: `dst.DType == src.DType`.
+    - 运行期有效区域检查:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the output valid row must match the input valid row).
+    - `tmp` must have the same shape as `src`.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWPROD(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWPROD(dst, src, tmp);
+}
+```
diff --git a/designs/outerCube/PTOISA/TROWSUM.md b/designs/outerCube/PTOISA/TROWSUM.md
new file mode 100644
index 00000000..44944d24
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWSUM.md
@@ -0,0 +1,147 @@
+# TROWSUM
+
+
+## Tile Operation Diagram
+
+![TROWSUM tile operation](../figures/isa/TROWSUM.svg)
+
+## Introduction
+
+Reduce each row by summing across columns.
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
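+
+As a small worked illustration (the values are invented for this example), take `R = 2` and `C = 4`:
+
+$$ \mathrm{src} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 0.5 & -0.5 & 2 & 0 \end{bmatrix} \;\Rightarrow\; \mathrm{dst}_{0,0} = 1 + 2 + 3 + 4 = 10, \quad \mathrm{dst}_{1,0} = 0.5 - 0.5 + 2 + 0 = 2 $$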
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+Implementation checks (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile layout of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile layout of `dst`:
+        - **Recommended**: a 1D tile in DN layout, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`.
+        - **To be removed**: a 2D tile in ND layout, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`.
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - Runtime valid checks:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the number of valid output rows must match the number of valid input rows).
+- A5:
+    - Data types: `half` or `float`.
+    - DType consistency: `dst.DType == src.DType`.
+    - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWSUM(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWSUM(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TROWSUM_zh.md b/designs/outerCube/PTOISA/TROWSUM_zh.md
new file mode 100644
index 00000000..527f4e02
--- /dev/null
+++ b/designs/outerCube/PTOISA/TROWSUM_zh.md
@@ -0,0 +1,120 @@
+# TROWSUM
+
+## 指令示意图
+
+![TROWSUM tile operation](../figures/isa/TROWSUM.svg)
+
+## 简介
+
+通过对列求和来归约每一行。
+
+## 数学语义
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. For `0 <= i < R`:
+
+$$ \mathrm{dst}_{i,0} = \sum_{j=0}^{C-1} \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trowsum %src : !pto.tile<...> -> !pto.tile<...>
+```
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.trowsum %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.trowsum ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TROWSUM(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+实现检查 (NPU):
+
+- A2A3:
+    - Tile location: `dst` and `src` must be `TileType::Vec`.
+    - Tile 布局 of `src`: ND fractal (`isRowMajor` and `SLayout::NoneBox`).
+    - Tile 布局 of `dst`:
+        - **推荐**: DN layout Tile of 1D, e.g., `Tile<TileType::Vec, T, ROWS, 1, BLayout::ColMajor, ValidRows, 1>`
+        - **将移除**: ND layout Tile of 2D, e.g., `Tile<TileType::Vec, T, ROWS, COLS, BLayout::RowMajor, ValidRows, 1>`
+    - 数据类型: `half` or `float`.
+    - 数据类型一致性: `dst.DType == src.DType`.
+    - 运行期有效区域检查:
+        - `srcValidCol != 0` and `srcValidRow != 0`.
+        - `srcValidRow == dstValidRow` (the output valid row must match the input valid row).
+- A5:
+    - 数据类型: `half` or `float`.
+    - 数据类型一致性: `dst.DType == src.DType`.
+    - No explicit runtime assertions on `validRow/validCol` in the implementation; the loops use `src.GetValidRow()` and `src.GetValidCol()`.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TROWSUM(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 1, BLayout::ColMajor>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TROWSUM(dst, src, tmp);
+}
+```
diff --git a/designs/outerCube/PTOISA/TRSQRT.md b/designs/outerCube/PTOISA/TRSQRT.md
new file mode 100644
index 00000000..10e58013
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRSQRT.md
@@ -0,0 +1,134 @@
+# TRSQRT
+
+
+## Tile Operation Diagram
+
+![TRSQRT tile operation](../figures/isa/TRSQRT.svg)
+
+## Introduction
+
+Elementwise reciprocal square root.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \frac{1}{\sqrt{\mathrm{src}_{i,j}}} $$
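+
+As a small worked illustration with strictly positive inputs (the values are invented for this example):
+
+$$ \mathrm{src} = \begin{bmatrix} 4 & 0.25 \\ 1 & 16 \end{bmatrix} \;\Rightarrow\; \mathrm{dst} = \begin{bmatrix} 1/\sqrt{4} & 1/\sqrt{0.25} \\ 1/\sqrt{1} & 1/\sqrt{16} \end{bmatrix} = \begin{bmatrix} 0.5 & 2 \\ 1 & 0.25 \end{bmatrix} $$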
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = trsqrt %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - The `tmp` buffer must be at least 32 bytes. When `tmp` is provided, the high-precision version is executed.
+    - `TileData::DType` must be `float` or `half`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Behavior is target-defined (e.g., for `src == 0` or negative inputs).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TRSQRT(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TRSQRT(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = trsqrt %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TRSQRT_zh.md b/designs/outerCube/PTOISA/TRSQRT_zh.md
new file mode 100644
index 00000000..9b942dee
--- /dev/null
+++ b/designs/outerCube/PTOISA/TRSQRT_zh.md
@@ -0,0 +1,107 @@
+# TRSQRT
+
+## 指令示意图
+
+![TRSQRT tile operation](../figures/isa/TRSQRT.svg)
+
+## 简介
+
+逐元素倒数平方根。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \frac{1}{\sqrt{\mathrm{src}_{i,j}}} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = trsqrt %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1（SSA）
+
+```text
+%dst = pto.trsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.trsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TRSQRT(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (NPU)**:
+    - The `tmp` buffer must be at least 32 bytes. When tmp is provided, the high-precision version is executed.
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile 布局 must be row-major (`TileData::isRowMajor`).
+- **有效区域**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Behavior is target-defined (e.g., for `src == 0` or negative inputs).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TRSQRT(dst, src);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TRSQRT(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSCATTER.md b/designs/outerCube/PTOISA/TSCATTER.md
new file mode 100644
index 00000000..05d0dd23
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSCATTER.md
@@ -0,0 +1,139 @@
+# TSCATTER
+
+
+## Tile Operation Diagram
+
+![TSCATTER tile operation](../figures/isa/TSCATTER.svg)
+
+## Introduction
+
+Scatter source elements into a destination tile using per-element flattened destination offsets.
+
+## Math Interpretation
+
+For each source element `(i, j)`, let `k = idx[i,j]` and write:
+
+$$ \mathrm{dst\_flat}_{k} = \mathrm{src}_{i,j} $$
+
+Here `dst_flat` denotes the destination tile viewed as a single linear storage sequence. `TSCATTER` does **not** interpret `idx[i,j]` as a destination row selector. On the standard row-major tile layout, this is equivalent to writing the `k`-th flattened destination element.
+
+If multiple elements map to the same destination location, the final value is implementation-defined (last writer wins in the current implementation).
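+
+A small worked illustration, assuming a row-major 2×3 destination tile and a 2×2 source with no colliding offsets (all values are invented for this example):
+
+$$ \mathrm{src} = \begin{bmatrix} 10 & 20 \\ 30 & 40 \end{bmatrix}, \quad \mathrm{idx} = \begin{bmatrix} 3 & 0 \\ 5 & 2 \end{bmatrix} \;\Rightarrow\; \mathrm{dst\_flat}_{3} = 10,\ \mathrm{dst\_flat}_{0} = 20,\ \mathrm{dst\_flat}_{5} = 30,\ \mathrm{dst\_flat}_{2} = 40 $$
+
+Viewed as a 2×3 row-major tile, flattened offset `k` lands at row `k / 3` (integer division) and column `k % 3`, so the result is:
+
+$$ \mathrm{dst} = \begin{bmatrix} 20 & \cdot & 40 \\ 10 & \cdot & 30 \end{bmatrix} $$
+
+Here the dots mark the two elements (flattened offsets 1 and 4) that no source element targets in this example.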
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS, typename TileDataI, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
+  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
+  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
+  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
+  - No bounds checks are enforced on `indexes` values.
+  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
+  - `TileDataD::DType` and `TileDataS::DType` must be the same.
+  - When the size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
+  - When the size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
+  - When the size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
+- **Implementation checks (A5)**:
+  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
+  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
+  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
+  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
+  - No bounds checks are enforced on `indexes` values.
+  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
+  - `TileDataD::DType` and `TileDataS::DType` must be the same.
+  - When the size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
+  - When the size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
+  - When the size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
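+
+The index-width rules in the lists above can be sketched as a compile-time check. This is a minimal illustration only: `required_index_bytes` and `index_width_ok` are hypothetical helpers, and the actual checks inside the PTO headers may be written differently.
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+
+// Hypothetical helper, not a PTO API: index element size required
+// for a given data element size, per the constraint list.
+constexpr std::size_t required_index_bytes(std::size_t data_bytes) {
+  // 4-byte data needs a 4-byte index; 2-byte and 1-byte data need a 2-byte index.
+  return data_bytes == 4 ? 4 : 2;
+}
+
+// True when IndexT has the width the constraints require for DataT.
+template <typename DataT, typename IndexT>
+constexpr bool index_width_ok() {
+  return sizeof(IndexT) == required_index_bytes(sizeof(DataT));
+}
+
+static_assert(index_width_ok<float, std::uint32_t>(), "4-byte data pairs with a 4-byte index");
+static_assert(!index_width_ok<float, std::uint16_t>(), "4-byte data rejects a 2-byte index");
+static_assert(index_width_ok<std::int8_t, std::uint16_t>(), "1-byte data pairs with a 2-byte index");
+```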
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 4-byte data requires a 4-byte index type
+  TileT src, dst;
+  IdxT idx;
+  TSCATTER(dst, src, idx);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 4-byte data requires a 4-byte index type
+  TileT src, dst;
+  IdxT idx;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(idx, 0x3000);
+  TSCATTER(dst, src, idx);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSCATTER_zh.md b/designs/outerCube/PTOISA/TSCATTER_zh.md
new file mode 100644
index 00000000..30b828ac
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSCATTER_zh.md
@@ -0,0 +1,112 @@
+# TSCATTER
+
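+## Tile Operation Diagram
+
+![TSCATTER tile operation](../figures/isa/TSCATTER.svg)
+
+## Introduction
+
+Scatter source elements into a destination tile using per-element flattened destination offsets.
+
+## Math Interpretation
+
+For each source element `(i, j)`, let `k = idx[i,j]` and write: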
+
+$$ \mathrm{dst\_flat}_{k} = \mathrm{src}_{i,j} $$
+
+Here `dst_flat` denotes the destination tile viewed as a single linear storage sequence. `TSCATTER` does **not** interpret `idx[i,j]` as a destination row selector. On the standard row-major tile layout, this is equivalent to writing the `k`-th flattened destination element.
+
+If multiple elements map to the same destination location, the final value is implementation-defined (last writer wins in the current implementation).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tscatter %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tscatter %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataD, typename TileDataS, typename TileDataI, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(TileDataD &dst, TileDataS &src, TileDataI &indexes, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
+  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
+  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
+  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
+  - No bounds checks are enforced on `indexes` values.
+  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
+  - `TileDataD::DType` and `TileDataS::DType` must be the same.
+  - When the size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
+  - When the size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
+  - When the size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
+- **Implementation checks (A5)**:
+  - `TileDataD::Loc`, `TileDataS::Loc`, `TileDataI::Loc` must be `TileType::Vec`.
+  - `TileDataD::DType`, `TileDataS::DType` must be one of: `int32_t`, `int16_t`, `int8_t`, `half`, `float32_t`, `uint32_t`, `uint16_t`, `uint8_t`, `bfloat16_t`.
+  - `TileDataI::DType` must be one of: `int16_t`, `int32_t`, `uint16_t` or `uint32_t`.
+  - `indexes` values are interpreted as flattened destination element offsets in destination tile storage order.
+  - No bounds checks are enforced on `indexes` values.
+  - Static valid bounds: `TileDataD::ValidRow <= TileDataD::Rows`, `TileDataD::ValidCol <= TileDataD::Cols`, `TileDataS::ValidRow <= TileDataS::Rows`, `TileDataS::ValidCol <= TileDataS::Cols`, `TileDataI::ValidRow <= TileDataI::Rows`, `TileDataI::ValidCol <= TileDataI::Cols`.
+  - `TileDataD::DType` and `TileDataS::DType` must be the same.
+  - When the size of `TileDataD::DType` is 4 bytes, the size of `TileDataI::DType` must be 4 bytes.
+  - When the size of `TileDataD::DType` is 2 bytes, the size of `TileDataI::DType` must be 2 bytes.
+  - When the size of `TileDataD::DType` is 1 byte, the size of `TileDataI::DType` must be 2 bytes.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 4-byte data requires a 4-byte index type
+  TileT src, dst;
+  IdxT idx;
+  TSCATTER(dst, src, idx);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 4-byte data requires a 4-byte index type
+  TileT src, dst;
+  IdxT idx;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(idx, 0x3000);
+  TSCATTER(dst, src, idx);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSEL.md b/designs/outerCube/PTOISA/TSEL.md
new file mode 100644
index 00000000..c1090b2a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSEL.md
@@ -0,0 +1,140 @@
+# TSEL
+
+
+## Tile Operation Diagram
+
+![TSEL tile operation](../figures/isa/TSEL.svg)
+
+## Introduction
+
+Select between two tiles using a mask tile (per-element selection).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
+\mathrm{src1}_{i,j} & \text{otherwise}
+\end{cases}
+$$
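+
+The case split above can be sketched as a host-side reference model. This is an illustration under stated assumptions, not the PTO implementation: `select_reference` is a hypothetical name, tiles are modelled as flat `std::vector`s of equal length, and the mask is modelled as one `bool` per element because the packed predicate layout is target-defined.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side reference model; not part of the PTO API.
+// dst[e] = mask[e] ? src0[e] : src1[e] for every element e.
+// Assumes mask, src0, and src1 all have the same length.
+template <typename T>
+std::vector<T> select_reference(const std::vector<bool> &mask,
+                                const std::vector<T> &src0,
+                                const std::vector<T> &src1) {
+  std::vector<T> dst(src0.size());
+  for (std::size_t e = 0; e < dst.size(); ++e) {
+    dst[e] = mask[e] ? src0[e] : src1[e];
+  }
+  return dst;
+}
+```
+
+With `mask = {true, false, true}`, `src0 = {1, 2, 3}`, and `src1 = {9, 8, 7}`, this model returns `{1, 8, 3}`.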
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename MaskTile, typename TmpTile, typename... WaitEvents>
+PTO_INST RecordEvent TSEL(TileData &dst, MaskTile &selMask, TileData &src0, TileData &src1, TmpTile &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - The selection domain is `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - The selection domain is `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
+  TileT src0, src1, dst;
+  MaskT mask(16, 2);
+  TmpT tmp;
+  TSEL(dst, mask, src0, src1, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
+  TileT src0, src1, dst;
+  MaskT mask(16, 2);
+  TmpT tmp;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TASSIGN(mask, 0x4000);
+  TASSIGN(tmp,  0x5000);
+  TSEL(dst, mask, src0, src1, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSELS.md b/designs/outerCube/PTOISA/TSELS.md
new file mode 100644
index 00000000..a68226f3
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSELS.md
@@ -0,0 +1,157 @@
+# TSELS
+
+## Tile Operation Diagram
+
+![TSELS tile operation](../figures/isa/TSELS.svg)
+
+## Introduction
+
+Select per element between a source tile and a scalar using a mask tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
+\mathrm{scalar} & \text{otherwise}
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsels %mask, %src, %scalar : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataMask, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `sizeof(TileDataDst::DType)` must be `2` or `4` bytes.
+    - Supported data types are `half`, `float16_t`, `float`, and `float32_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be row-major.
+    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `sizeof(TileDataDst::DType)` may be `1`, `2`, or `4` bytes.
+    - Supported data types are `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, and `float`.
+    - `dst` and `src` must use the same element type.
+    - `dst`, `mask`, and `src` must be row-major.
+    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileDst = Tile<TileType::Vec, float, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
+  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TileMask mask(16, 2);
+  float scalar = 0.0f;
+  TSELS(dst, mask, src, tmp, scalar);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileDst = Tile<TileType::Vec, float, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
+  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TileMask mask(16, 2);
+  float scalar = 0.0f;
+  TASSIGN(src, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TASSIGN(mask, 0x4000);
+  TSELS(dst, mask, src, tmp, scalar);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsels %mask, %src, %scalar : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSELS_zh.md b/designs/outerCube/PTOISA/TSELS_zh.md
new file mode 100644
index 00000000..31d6d7e2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSELS_zh.md
@@ -0,0 +1,147 @@
+# TSELS
+
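+## Tile Operation Diagram
+
+![TSELS tile operation](../figures/isa/TSELS.svg)
+
+## Introduction
+
+Select per element between a source tile and a scalar using a mask tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region: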
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
+\mathrm{scalar} & \text{otherwise}
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tsels %mask, %src, %scalar : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataMask, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TSELS(TileDataDst &dst, TileDataMask &mask, TileDataSrc &src, TileDataTmp &tmp, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `sizeof(TileDataDst::DType)` must be `2` or `4` bytes.
+    - Supported data types are `half`, `float16_t`, `float`, and `float32_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be row-major.
+    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `sizeof(TileDataDst::DType)` may be `1`, `2`, or `4` bytes.
+    - Supported data types are `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, and `float`.
+    - `dst` and `src` must use the same element type.
+    - `dst`, `mask`, and `src` must be row-major.
+    - Runtime: `src.GetValidRow()/GetValidCol()` must match `dst.GetValidRow()/GetValidCol()`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileDst = Tile<TileType::Vec, float, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
+  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TileMask mask(16, 2);
+  float scalar = 0.0f;
+  TSELS(dst, mask, src, tmp, scalar);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileDst = Tile<TileType::Vec, float, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, float, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, float, 16, 16>;
+  using TileMask = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TileMask mask(16, 2);
+  float scalar = 0.0f;
+  TASSIGN(src, 0x1000);
+  TASSIGN(tmp, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TASSIGN(mask, 0x4000);
+  TSELS(dst, mask, src, tmp, scalar);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsels %mask, %src, %scalar : (!pto.tile<...>, !pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsels %mask, %src, %scalar : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsels ins(%mask, %src, %scalar : !pto.tile_buf<...>, !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSEL_zh.md b/designs/outerCube/PTOISA/TSEL_zh.md
new file mode 100644
index 00000000..fac4e34b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSEL_zh.md
@@ -0,0 +1,140 @@
+# TSEL
+
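+## Tile Operation Diagram
+
+![TSEL tile operation](../figures/isa/TSEL.svg)
+
+## Introduction
+
+Select between two tiles using a mask tile (per-element selection).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region: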
+
+$$
+\mathrm{dst}_{i,j} =
+\begin{cases}
+\mathrm{src0}_{i,j} & \text{if } \mathrm{mask}_{i,j}\ \text{is true} \\
+\mathrm{src1}_{i,j} & \text{otherwise}
+\end{cases}
+$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS_zh.md).
+
+Synchronous form:
+
+```text
+%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename MaskTile, typename TmpTile, typename... WaitEvents>
+PTO_INST RecordEvent TSEL(TileData &dst, MaskTile &selMask, TileData &src0, TileData &src1, TmpTile &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - The selection domain is `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `sizeof(TileData::DType)` must be `2` or `4` bytes.
+    - `TileData::DType` must be one of: `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `half`, `bfloat16_t`, `float`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - The selection domain is `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Mask encoding**:
+    - The mask tile is interpreted as packed predicate bits in a target-defined layout.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
+  TileT src0, src1, dst;
+  MaskT mask(16, 2);
+  TmpT tmp;
+  TSEL(dst, mask, src0, src1, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using MaskT = Tile<TileType::Vec, uint8_t, 16, 32, BLayout::RowMajor, -1, -1>;
+  using TmpT = Tile<TileType::Vec, uint32_t, 1, 16>;
+  TileT src0, src1, dst;
+  MaskT mask(16, 2);
+  TmpT tmp;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TASSIGN(mask, 0x4000);
+  TASSIGN(tmp,  0x5000);
+  TSEL(dst, mask, src0, src1, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsel %mask, %src0, %src1 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsel %mask, %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsel ins(%mask, %src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSETFMATRIX.md b/designs/outerCube/PTOISA/TSETFMATRIX.md
new file mode 100644
index 00000000..a7679c4e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETFMATRIX.md
@@ -0,0 +1,89 @@
+# TSETFMATRIX
+
+
+## Tile Operation Diagram
+
+![TSETFMATRIX tile operation](../figures/isa/TSETFMATRIX.svg)
+
+## Introduction
+
+Set the FMATRIX register(s) used by IMG2COL-like operations from an `Img2colTileConfig` (target/implementation-defined).
+
+## See also
+
+- IMG2COL instruction: `docs/isa/TIMG2COL.md`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSETFMATRIX(ConvTileData &src, WaitEvents&... events);
+```
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+```
+
+### PTO Assembly Form
+
+```text
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+# AS Level 2 (DPS)
+pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
+```
diff --git a/designs/outerCube/PTOISA/TSETFMATRIX_zh.md b/designs/outerCube/PTOISA/TSETFMATRIX_zh.md
new file mode 100644
index 00000000..ee314082
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETFMATRIX_zh.md
@@ -0,0 +1,58 @@
+# TSETFMATRIX
+
+## Tile Operation Diagram
+
+![TSETFMATRIX tile operation](../figures/isa/TSETFMATRIX.svg)
+
+## Introduction
+
+Set the FMATRIX register(s) used by IMG2COL-like operations.
+
+## Math Interpretation
+
+Unless otherwise specified, semantics are defined over the valid region and target-dependent behavior is marked as implementation-defined.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+pto.tsetfmatrix %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsetfmatrix ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSETFMATRIX(ConvTileData &src, WaitEvents&... events);
+```
+
+## Constraints
+
+Type/layout/location/shape legality is backend-dependent; treat implementation-specific notes as normative for that backend.
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TSETHF32MODE.md b/designs/outerCube/PTOISA/TSETHF32MODE.md
new file mode 100644
index 00000000..9d27bd40
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETHF32MODE.md
@@ -0,0 +1,62 @@
+# TSETHF32MODE
+
+## Tile Operation Diagram
+
+![TSETHF32MODE tile operation](../figures/isa/TSETHF32MODE.svg)
+
+## Introduction
+
+Configure HF32 transform mode (implementation-defined).
+
+This instruction controls backend-specific HF32 transformation behavior used by supported compute paths.
+
+## Math Interpretation
+
+No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+Schematic form:
+
+```text
+tsethf32mode {enable = true, mode = ...}
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tsethf32mode {enable = true, mode = ...}
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsethf32mode ins({enable = true, mode = ...}) outs()
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <bool isEnable, RoundMode hf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
+PTO_INST RecordEvent TSETHF32MODE(WaitEvents &... events);
+```
+
+## Constraints
+
+- Available only when the corresponding backend capability macro is enabled.
+- Exact mode values and hardware behavior are target-defined.
+- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_enable_hf32() {
+  TSETHF32MODE<true, RoundMode::CAST_ROUND>();
+}
+```
diff --git a/designs/outerCube/PTOISA/TSETHF32MODE_zh.md b/designs/outerCube/PTOISA/TSETHF32MODE_zh.md
new file mode 100644
index 00000000..674bbedb
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETHF32MODE_zh.md
@@ -0,0 +1,61 @@
+# TSETHF32MODE
+
+## Tile Operation Diagram
+
+![TSETHF32MODE tile operation](../figures/isa/TSETHF32MODE.svg)
+
+## Introduction
+
+Configure HF32 transform mode (implementation-defined).
+
+## Math Interpretation
+
+No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+Schematic form:
+
+```text
+tsethf32mode {enable = true, mode = ...}
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tsethf32mode {enable = true, mode = ...}
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsethf32mode ins({enable = true, mode = ...}) outs()
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <bool isEnable, RoundMode hf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
+PTO_INST RecordEvent TSETHF32MODE(WaitEvents &... events);
+```
+
+## Constraints
+
+- Available only when the corresponding backend capability macro is enabled.
+- Exact mode values and hardware behavior are target-defined.
+- This instruction has control-state side effects and should be ordered appropriately relative to dependent compute instructions.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_enable_hf32() {
+  TSETHF32MODE<true, RoundMode::CAST_ROUND>();
+}
+```
diff --git a/designs/outerCube/PTOISA/TSETTF32MODE.md b/designs/outerCube/PTOISA/TSETTF32MODE.md
new file mode 100644
index 00000000..b366e0d2
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETTF32MODE.md
@@ -0,0 +1,62 @@
+# TSETTF32MODE
+
+## Tile Operation Diagram
+
+![TSETTF32MODE tile operation](../figures/isa/TSETTF32MODE.svg)
+
+## Introduction
+
+Configure TF32 transform mode (implementation-defined).
+
+This instruction controls backend-specific TF32 transformation behavior used by supported compute paths.
+
+## Math Interpretation
+
+No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
+
+## Assembly Syntax
+
+PTO-AS form: see `docs/assembly/PTO-AS.md`.
+
+Schematic form:
+
+```text
+tsettf32mode {enable = true, mode = ...}
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tsettf32mode {enable = true, mode = ...}
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tsettf32mode ins({enable = true, mode = ...}) outs()
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
+PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
+```
+
+## Constraints
+
+- Available only when the corresponding backend capability macro is enabled.
+- Exact mode values and hardware behavior are target-defined.
+- This instruction has control-state side effects; issue it before any compute instruction that depends on the mode it sets.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_enable_tf32() {
+  TSETTF32MODE<true, RoundMode::CAST_ROUND>();  // isEnable = true, tf32TransMode = CAST_ROUND
+}
+```
diff --git a/designs/outerCube/PTOISA/TSETTF32MODE_zh.md b/designs/outerCube/PTOISA/TSETTF32MODE_zh.md
new file mode 100644
index 00000000..3bb8aaac
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSETTF32MODE_zh.md
@@ -0,0 +1,61 @@
+# TSETTF32MODE
+
+## 指令示意图
+
+![TSETTF32MODE tile operation](../figures/isa/TSETTF32MODE.svg)
+
+## 简介
+
+设置 TF32 变换模式（实现定义）。
+
+## 数学语义
+
+No direct tensor arithmetic is produced by this instruction. It updates target mode state used by subsequent instructions.
+
+## 汇编语法
+
+PTO-AS 形式：参见 `docs/assembly/PTO-AS.md`.
+
+Schematic form:
+
+```text
+tsettf32mode {enable = true, mode = ...}
+```
+
+### AS Level 1（SSA）
+
+```text
+pto.tsettf32mode {enable = true, mode = ...}
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tsettf32mode ins({enable = true, mode = ...}) outs()
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <bool isEnable, RoundMode tf32TransMode = RoundMode::CAST_ROUND, typename... WaitEvents>
+PTO_INST RecordEvent TSETTF32MODE(WaitEvents &... events);
+```
+
+## 约束
+
+- Available only when the corresponding backend capability macro is enabled.
+- Exact mode values and hardware behavior are target-defined.
+- This instruction has control-state side effects; issue it before any compute instruction that depends on the mode it sets.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+void example_enable_tf32() {
+  TSETTF32MODE<true, RoundMode::CAST_ROUND>();
+}
+```
diff --git a/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING.md b/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING.md
new file mode 100644
index 00000000..7f3b513a
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING.md
@@ -0,0 +1,106 @@
+# TSET_IMG2COL_PADDING
+
+## Tile Operation Diagram
+
+![TSET_IMG2COL_PADDING tile operation](../figures/isa/TSET_IMG2COL_PADDING.svg)
+
+## Introduction
+
+Set IMG2COL padding metadata from an IMG2COL configuration tile (implementation-defined).
+
+## Math Interpretation
+
+No direct tensor arithmetic is produced by this instruction. It updates IMG2COL padding control state consumed by subsequent data-movement operations.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+tset_img2col_padding %cfg
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tset_img2col_padding %cfg
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tset_img2col_padding ins(%cfg) outs()
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
+
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
+```
+
+For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
+
+## Constraints
+
+- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
+- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
+- The exact padding fields updated by this instruction are implementation-defined.
+- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_set_img2col_padding(Img2colTileConfig<uint64_t>& cfg) {
+  TSET_IMG2COL_PADDING(cfg);  // update IMG2COL padding control state from cfg
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+```
+
+### PTO Assembly Form
+
+```text
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+# AS Level 2 (DPS)
+pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
+```
diff --git a/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING_zh.md b/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING_zh.md
new file mode 100644
index 00000000..5d4c2e1d
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSET_IMG2COL_PADDING_zh.md
@@ -0,0 +1,80 @@
+# TSET_IMG2COL_PADDING
+
+## 指令示意图
+
+![TSET_IMG2COL_PADDING tile operation](../figures/isa/TSET_IMG2COL_PADDING.svg)
+
+## 简介
+
+从 IMG2COL 配置 Tile 设置 IMG2COL 填充元数据。
+
+## 数学语义
+
+No direct tensor arithmetic is produced by this instruction. It updates IMG2COL padding control state consumed by subsequent data-movement operations.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+tset_img2col_padding %cfg
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tset_img2col_padding %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tset_img2col_padding ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+### IR Level 1（SSA）
+
+```text
+pto.tset_img2col_padding %cfg
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.tset_img2col_padding ins(%cfg) outs()
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
+
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_PADDING(ConvTileData &src, WaitEvents &... events);
+```
+
+For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
+
+## 约束
+
+- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
+- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
+- The exact padding fields updated by this instruction are implementation-defined.
+- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_set_img2col_padding(Img2colTileConfig<uint64_t>& cfg) {
+  TSET_IMG2COL_PADDING(cfg);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSET_IMG2COL_RPT.md b/designs/outerCube/PTOISA/TSET_IMG2COL_RPT.md
new file mode 100644
index 00000000..6911e358
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSET_IMG2COL_RPT.md
@@ -0,0 +1,106 @@
+# TSET_IMG2COL_RPT
+
+## Tile Operation Diagram
+
+![TSET_IMG2COL_RPT tile operation](../figures/isa/TSET_IMG2COL_RPT.svg)
+
+## Introduction
+
+Set IMG2COL repeat metadata from an IMG2COL configuration tile (implementation-defined).
+
+## Math Interpretation
+
+No direct tensor arithmetic is produced by this instruction. It updates IMG2COL control state used by subsequent data-movement operations.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+tset_img2col_rpt %cfg
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tset_img2col_rpt %cfg
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tset_img2col_rpt ins(%cfg) outs()
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
+
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
+```
+
+For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
+
+## Constraints
+
+- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
+- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
+- The exact register/metadata fields updated by this instruction are implementation-defined.
+- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_set_img2col_rpt(Img2colTileConfig<uint64_t>& cfg) {
+  TSET_IMG2COL_RPT(cfg);  // update IMG2COL repeat control state from cfg
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+```
+
+### PTO Assembly Form
+
+```text
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+# AS Level 2 (DPS)
+pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
+```
diff --git a/designs/outerCube/PTOISA/TSET_IMG2COL_RPT_zh.md b/designs/outerCube/PTOISA/TSET_IMG2COL_RPT_zh.md
new file mode 100644
index 00000000..059534d0
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSET_IMG2COL_RPT_zh.md
@@ -0,0 +1,80 @@
+# TSET_IMG2COL_RPT
+
+## 指令示意图
+
+![TSET_IMG2COL_RPT tile operation](../figures/isa/TSET_IMG2COL_RPT.svg)
+
+## 简介
+
+从 IMG2COL 配置 Tile 设置 IMG2COL 重复次数元数据。
+
+## 数学语义
+
+No direct tensor arithmetic is produced by this instruction. It updates IMG2COL control state used by subsequent data-movement operations.
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Schematic form:
+
+```text
+tset_img2col_rpt %cfg
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tset_img2col_rpt %cfg : !pto.fmatrix_config -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tset_img2col_rpt ins(%cfg : !pto.fmatrix_config) outs()
+```
+
+### IR Level 1（SSA）
+
+```text
+pto.tset_img2col_rpt %cfg
+```
+
+### IR Level 2（DPS）
+
+```text
+pto.tset_img2col_rpt ins(%cfg) outs()
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename ConvTileData, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
+
+template <typename ConvTileData, SetFmatrixMode FmatrixMode = SetFmatrixMode::FMATRIX_A_MANUAL, typename... WaitEvents>
+PTO_INST RecordEvent TSET_IMG2COL_RPT(ConvTileData &src, WaitEvents &... events);
+```
+
+For `MEMORY_BASE` targets, an overload without `SetFmatrixMode` is also provided.
+
+## 约束
+
+- This instruction is backend-specific and available only for backends that expose IMG2COL configuration state.
+- `src` must be a valid IMG2COL configuration tile type accepted by the backend implementation.
+- The exact register/metadata fields updated by this instruction are implementation-defined.
+- Use this instruction before dependent `TIMG2COL` operations in the same execution stream.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_set_img2col_rpt(Img2colTileConfig<uint64_t>& cfg) {
+  TSET_IMG2COL_RPT(cfg);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSHL.md b/designs/outerCube/PTOISA/TSHL.md
new file mode 100644
index 00000000..8f3cb121
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHL.md
@@ -0,0 +1,103 @@
+# TSHL
+
+
+## Tile Operation Diagram
+
+![TSHL tile operation](../figures/isa/TSHL.svg)
+
+## Introduction
+
+Elementwise shift-left: each element of `src0` is shifted left by the corresponding element of `src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \ll \mathrm{src1}_{i,j} $$
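+
+As a worked illustration of the formula above (the values are chosen here for illustration, assuming a 2×2 valid region and an unsigned element type such as `uint32_t` that is wide enough for no bits to be shifted out):
+
+$$ \mathrm{src0} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\quad \mathrm{src1} = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \;\Rightarrow\; \mathrm{dst} = \begin{bmatrix} 1 \ll 0 & 2 \ll 1 \\ 3 \ll 2 & 4 \ll 3 \end{bmatrix} = \begin{bmatrix} 1 & 4 \\ 12 & 32 \end{bmatrix} $$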
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tshl %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSHL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A5)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
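+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 16x16 vector tile of uint32_t
+  TileT x, sh, out;  // x: values, sh: per-element shift amounts, out: result
+  TSHL(out, x, sh);  // out[i][j] = x[i][j] << sh[i][j]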
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tshl %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHLS.md b/designs/outerCube/PTOISA/TSHLS.md
new file mode 100644
index 00000000..d792f7cc
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHLS.md
@@ -0,0 +1,108 @@
+# TSHLS
+
+
+## Tile Operation Diagram
+
+![TSHLS tile operation](../figures/isa/TSHLS.svg)
+
+## Introduction
+
+Elementwise shift-left of a tile: every element is shifted left by the same scalar shift amount.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \ll \mathrm{scalar} $$
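+
+As a worked illustration of the formula above (the values are chosen here for illustration, assuming a 2×2 valid region, an unsigned element type wide enough for no bits to be shifted out, and $\mathrm{scalar} = 2$):
+
+$$ \mathrm{src} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \;\Rightarrow\; \mathrm{dst} = \begin{bmatrix} 1 \ll 2 & 2 \ll 2 \\ 3 \ll 2 & 4 \ll 2 \end{bmatrix} = \begin{bmatrix} 4 & 8 \\ 12 & 16 \end{bmatrix} $$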
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tshls %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSHLS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - The scalar shift amount must be non-negative (zero or positive).
+- **Implementation checks (A5)**:
+    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - The scalar shift amount must be non-negative (zero or positive).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TSHLS(dst, src, 0x2);  // dst[i][j] = src[i][j] << 2
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tshls %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHLS_zh.md b/designs/outerCube/PTOISA/TSHLS_zh.md
new file mode 100644
index 00000000..12e456e7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHLS_zh.md
@@ -0,0 +1,108 @@
+# TSHLS
+
+## 指令示意图
+
+![TSHLS tile operation](../figures/isa/TSHLS.svg)
+
+## 简介
+
+Tile 按标量逐元素左移。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \ll \mathrm{scalar} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tshls %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSHLS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 `int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+    - 标量仅支持零和正值。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+    - 标量仅支持零和正值。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TSHLS(dst, src, 0x2);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshls %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tshls %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tshls ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHL_zh.md b/designs/outerCube/PTOISA/TSHL_zh.md
new file mode 100644
index 00000000..02fb6bc5
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHL_zh.md
@@ -0,0 +1,103 @@
+# TSHL
+
+## 指令示意图
+
+![TSHL tile operation](../figures/isa/TSHL.svg)
+
+## 简介
+
+两个 Tile 的逐元素左移。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \ll \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tshl %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSHL(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT x, sh, out;
+  TSHL(out, x, sh);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshl %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tshl %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tshl ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHR.md b/designs/outerCube/PTOISA/TSHR.md
new file mode 100644
index 00000000..0ddd9b4f
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHR.md
@@ -0,0 +1,103 @@
+# TSHR
+
+
+## Tile Operation Diagram
+
+![TSHR tile operation](../figures/isa/TSHR.svg)
+
+## Introduction
+
+Elementwise shift-right: each element of `src0` is shifted right by the corresponding element of `src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \gg \mathrm{src1}_{i,j} $$
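+
+As a worked illustration of the formula above (the values are chosen here for illustration, assuming a 2×2 valid region and an unsigned element type, so that arithmetic and logical right shifts give the same result):
+
+$$ \mathrm{src0} = \begin{bmatrix} 16 & 9 \\ 8 & 255 \end{bmatrix},\quad \mathrm{src1} = \begin{bmatrix} 4 & 0 \\ 1 & 3 \end{bmatrix} \;\Rightarrow\; \mathrm{dst} = \begin{bmatrix} 16 \gg 4 & 9 \gg 0 \\ 8 \gg 1 & 255 \gg 3 \end{bmatrix} = \begin{bmatrix} 1 & 9 \\ 4 & 31 \end{bmatrix} $$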
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tshr %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSHR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A5)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must use the same element type.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - Runtime: `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
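+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;  // 16x16 vector tile of uint32_t
+  TileT x, sh, out;  // x: values, sh: per-element shift amounts, out: result
+  TSHR(out, x, sh);  // out[i][j] = x[i][j] >> sh[i][j]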
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tshr %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHRS.md b/designs/outerCube/PTOISA/TSHRS.md
new file mode 100644
index 00000000..bf8fa11b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHRS.md
@@ -0,0 +1,108 @@
+# TSHRS
+
+
+## Tile Operation Diagram
+
+![TSHRS tile operation](../figures/isa/TSHRS.svg)
+
+## Introduction
+
+Elementwise shift-right of a tile: every element is shifted right by the same scalar shift amount.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \gg \mathrm{scalar} $$
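+
+As a worked illustration of the formula above (the values are chosen here for illustration, assuming a 2×2 valid region, an unsigned element type, and $\mathrm{scalar} = 2$; bits shifted out on the right are discarded):
+
+$$ \mathrm{src} = \begin{bmatrix} 4 & 8 \\ 12 & 17 \end{bmatrix} \;\Rightarrow\; \mathrm{dst} = \begin{bmatrix} 4 \gg 2 & 8 \gg 2 \\ 12 \gg 2 & 17 \gg 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} $$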
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tshrs %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are `int32_t`, `int`, `int16_t`, `uint32_t`, `unsigned int`, and `uint16_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - The scalar shift amount must be non-negative (zero or positive).
+- **Implementation checks (A5)**:
+    - Supported element types are `int32_t`, `int16_t`, `int8_t`, `uint32_t`, `uint16_t`, and `uint8_t`.
+    - `dst` and `src` must use the same element type.
+    - `dst` and `src` must be vector tiles.
+    - Static valid bounds must satisfy `ValidRow <= Rows` and `ValidCol <= Cols` for both tiles.
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`.
+    - The scalar shift amount must be non-negative (zero or positive).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TSHRS(dst, src, 0x2);  // dst[i][j] = src[i][j] >> 2
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tshrs %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHRS_zh.md b/designs/outerCube/PTOISA/TSHRS_zh.md
new file mode 100644
index 00000000..12330812
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHRS_zh.md
@@ -0,0 +1,108 @@
+# TSHRS
+
+## 指令示意图
+
+![TSHRS tile operation](../figures/isa/TSHRS.svg)
+
+## 简介
+
+Tile 按标量逐元素右移。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \gg \mathrm{scalar} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tshrs %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSHRS(TileDataDst &dst, TileDataSrc &src, typename TileDataDst::DType scalar, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 `int32_t`、`int`、`int16_t`、`uint32_t`、`unsigned int` 和 `uint16_t`。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+    - 标量仅支持零和正值。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 `int32_t`、`int16_t`、`int8_t`、`uint32_t`、`uint16_t` 和 `uint8_t`。
+    - `dst` 和 `src` 必须使用相同的元素类型。
+    - `dst` 和 `src` 必须是向量 Tile。
+    - 两个 Tile 的静态有效边界都必须满足 `ValidRow <= Rows` 且 `ValidCol <= Cols`。
+    - 运行时：`src.GetValidRow() == dst.GetValidRow()` 且 `src.GetValidCol() == dst.GetValidCol()`。
+    - 标量仅支持零和正值。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint16_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint16_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TSHRS(dst, src, 0x2);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshrs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tshrs %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.tshrs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSHR_zh.md b/designs/outerCube/PTOISA/TSHR_zh.md
new file mode 100644
index 00000000..2a170903
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSHR_zh.md
@@ -0,0 +1,103 @@
+# TSHR
+
+## 指令示意图
+
+![TSHR tile operation](../figures/isa/TSHR.svg)
+
+## 简介
+
+两个 Tile 的逐元素右移。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \gg \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tshr %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSHR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+    - `dst`、`src0` 和 `src1` 必须使用相同的元素类型。
+    - `dst`、`src0` 和 `src1` 必须是行主序。
+    - 运行时：`src0.GetValidRow()/GetValidCol()` 和 `src1.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileT x, sh, out;
+  TSHR(out, x, sh);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tshr %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tshr %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tshr ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSORT32.md b/designs/outerCube/PTOISA/TSORT32.md
new file mode 100644
index 00000000..08948410
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSORT32.md
@@ -0,0 +1,150 @@
+# TSORT32
+
+## Tile Operation Diagram
+
+![TSORT32 tile operation](../figures/isa/TSORT32.svg)
+
+## Introduction
+
+Sort each 32-element block of `src` together with the corresponding indices from `idx`, and write the sorted value-index pairs into `dst`.
+
+## Math Interpretation
+
+For each row, `TSORT32` processes `src` in independent 32-element blocks. Let block `b` cover columns `32b ... 32b+31`, and let `n_b = min(32, C - 32b)` be the valid element count of that block.
+
+For each valid element in the block, form a pair
+
+$$
+(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k}), \quad 0 \le k < n_b
+$$
+
+Then sort the pairs by value and write the sorted value-index pairs to `dst`. The exact packing layout in `dst` is target-defined, but semantically the output of each block is the reordered sequence
+
+$$
+[(v_{\pi(0)}, i_{\pi(0)}), (v_{\pi(1)}, i_{\pi(1)}), \ldots, (v_{\pi(n_b-1)}, i_{\pi(n_b-1)})]
+$$
+
+where `π` is the permutation produced by the implementation for that 32-element block.
+
+Notes:
+
+- `idx` is an input tile, not an output tile.
+- `dst` stores sorted value-index pairs, not just sorted values.
+- The CPU simulation sorts in descending order by value, and for equal values keeps smaller indices first.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename IdxTileData>
+PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
+
+template <typename DstTileData, typename SrcTileData, typename IdxTileData, typename TmpTileData>
+PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
+```
+
+## Constraints
+
+- `TSORT32` does not take `WaitEvents&...` and does not call `TSYNC(...)` internally; synchronize explicitly if needed.
+- `idx` is a required input operand in both overloads; it provides the indices that are permuted together with `src`.
+- **Implementation checks (A2A3/A5)**:
+    - `DstTileData::DType` must be `half` or `float`.
+    - `SrcTileData::DType` must match `DstTileData::DType`.
+    - `IdxTileData::DType` must be `uint32_t`.
+    - `dst/src/idx` tile location must be `TileType::Vec`, and all must be row-major (`isRowMajor`).
+- **Valid region**:
+    - The implementation uses `dst.GetValidRow()` as the row count.
+    - The implementation uses `src.GetValidCol()` to determine how many elements participate in sorting in each row.
+    - Sorting is performed independently per 32-element block; the 4-argument overload additionally supports non-32-aligned tails with `tmp`.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 1, 32>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
+  using DstT = Tile<TileType::Vec, float, 1, 64>;
+  SrcT src;
+  IdxT idx;
+  DstT dst;
+  TSORT32(dst, src, idx);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 1, 32>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
+  using DstT = Tile<TileType::Vec, float, 1, 64>;
+  SrcT src;
+  IdxT idx;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(idx, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TSORT32(dst, src, idx);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %arg2, @tile(0x3000)
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
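The block-sort semantics in TSORT32's Math Interpretation section can be sketched on the host as follows. This is a plain C++ reference model, not the PTO intrinsic: `tsort32_row_reference` and `ValueIndex` are hypothetical names, the result is returned as a vector of pairs because the real `dst` packing is target-defined, and the ordering (descending by value, smaller index first on ties) is the CPU-simulation rule from the Notes. NaN inputs are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side model of one row of TSORT32; not the PTO API.
using ValueIndex = std::pair<float, std::uint32_t>;

std::vector<ValueIndex> tsort32_row_reference(const std::vector<float> &src,
                                              const std::vector<std::uint32_t> &idx) {
  assert(src.size() == idx.size());
  constexpr std::size_t kBlock = 32;
  const std::size_t cols = src.size();
  std::vector<ValueIndex> out;
  out.reserve(cols);
  for (std::size_t base = 0; base < cols; base += kBlock) {
    // n_b = min(32, C - 32b): valid element count of this block.
    const std::size_t n_b = std::min(kBlock, cols - base);
    for (std::size_t k = 0; k < n_b; ++k) {
      out.emplace_back(src[base + k], idx[base + k]);
    }
    // Sort only the pairs of the current block; blocks are independent.
    std::sort(out.begin() + static_cast<std::ptrdiff_t>(base), out.end(),
              [](const ValueIndex &a, const ValueIndex &b) {
                if (a.first != b.first) return a.first > b.first;  // descending by value
                return a.second < b.second;                        // tie: smaller index first
              });
  }
  return out;
}
```

A model like this can serve as a golden reference when checking simulator output, provided the comparison first unpacks `dst` according to the target's layout.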
diff --git a/designs/outerCube/PTOISA/TSORT32_zh.md b/designs/outerCube/PTOISA/TSORT32_zh.md
new file mode 100644
index 00000000..2229f183
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSORT32_zh.md
@@ -0,0 +1,150 @@
+# TSORT32
+
+## 指令示意图
+
+![TSORT32 tile operation](../figures/isa/TSORT32.svg)
+
+## 简介
+
+对 `src` 的每个 32 元素块，与 `idx` 中对应的索引一起进行排序，并将排序后的值-索引对写入 `dst`。
+
+## 数学语义
+
+对每一行，`TSORT32` 会按独立的 32 元素块处理 `src`。设第 `b` 个块覆盖列 `32b ... 32b+31`，该块的有效元素数为 `n_b = min(32, C - 32b)`。
+
+对于块中的每个有效元素，先构造一个二元组：
+
+$$
+(v_k, i_k) = (\mathrm{src}_{r,32b+k}, \mathrm{idx}_{r,32b+k}), \quad 0 \le k < n_b
+$$
+
+然后按值对这些二元组排序，并将排序后的值-索引对写入 `dst`。`dst` 中的具体打包布局由目标实现定义，但从语义上看，每个块的输出可表示为：
+
+$$
+[(v_{\pi(0)}, i_{\pi(0)}), (v_{\pi(1)}, i_{\pi(1)}), \ldots, (v_{\pi(n_b-1)}, i_{\pi(n_b-1)})]
+$$
+
+其中 `π` 是该 32 元素块对应的排序置换。
+
+说明：
+
+- `idx` 是输入 Tile，不是输出 Tile。
+- `dst` 保存的是排序后的值-索引对，而不只是排序后的值。
+- 在 CPU 仿真实现中，按值降序排序；当值相同时，索引较小者优先。
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename DstTileData, typename SrcTileData, typename IdxTileData>
+PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
+
+template <typename DstTileData, typename SrcTileData, typename IdxTileData, typename TmpTileData>
+PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
+```
+
+## 约束
+
+- `TSORT32` 不接受 `WaitEvents&...` 参数，也不在内部调用 `TSYNC(...)`；如有需要请显式同步。
+- `idx` 在两个重载中都是必需的输入操作数；它提供与 `src` 一起参与重排的索引。
+- **实现检查 (A2A3/A5)**:
+    - `DstTileData::DType` 必须是 `half` 或 `float`。
+    - `SrcTileData::DType` 必须与 `DstTileData::DType` 匹配。
+    - `IdxTileData::DType` 必须是 `uint32_t`。
+    - `dst`/`src`/`idx` Tile 位置必须是 `TileType::Vec`，且都必须是行主序（`isRowMajor`）。
+- **有效区域**:
+    - 实现使用 `dst.GetValidRow()` 作为行数。
+    - 实现使用 `src.GetValidCol()` 确定每行参与排序的元素数量。
+    - 排序按独立的 32 元素块进行；4 参数重载额外通过 `tmp` 支持非 32 对齐尾块。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 1, 32>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
+  using DstT = Tile<TileType::Vec, float, 1, 64>;
+  SrcT src;
+  IdxT idx;
+  DstT dst;
+  TSORT32(dst, src, idx);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 1, 32>;
+  using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
+  using DstT = Tile<TileType::Vec, float, 1, 64>;
+  SrcT src;
+  IdxT idx;
+  DstT dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(idx, 0x2000);
+  TASSIGN(dst, 0x3000);
+  TSORT32(dst, src, idx);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+# pto.tassign %arg2, @tile(0x3000)
+%dst = pto.tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tsort32 %src, %idx : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSQRT.md b/designs/outerCube/PTOISA/TSQRT.md
new file mode 100644
index 00000000..881fb3ec
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSQRT.md
@@ -0,0 +1,130 @@
+# TSQRT
+
+
+## Tile Operation Diagram
+
+![TSQRT tile operation](../figures/isa/TSQRT.svg)
+
+## Introduction
+
+Elementwise square root.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \sqrt{\mathrm{src}_{i,j}} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsqrt %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Behavior is target-defined (e.g., for negative inputs).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TSQRT(dst, src);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TSQRT(dst, src);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsqrt %src : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
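To illustrate the valid-region rule in TSQRT's Constraints, here is a minimal host-side sketch. `RefTile` and `tsqrt_reference` are hypothetical stand-ins, not PTO types. The model assumes a row-major `float` tile and iterates only over `dst`'s valid rows and columns. It leaves elements outside that region untouched, which is a choice of this sketch and not something the page specifies. Negative inputs produce whatever `std::sqrt` returns on the host, whereas the page marks that case as target-defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a row-major tile; not a PTO type.
struct RefTile {
  std::size_t rows;      // static tile rows
  std::size_t cols;      // static tile columns (row stride)
  std::size_t validRow;  // runtime valid rows
  std::size_t validCol;  // runtime valid columns
  std::vector<float> data;  // rows * cols elements, row-major
};

void tsqrt_reference(RefTile &dst, const RefTile &src) {
  // Runtime check from the Constraints section: valid regions must match.
  assert(src.validRow == dst.validRow && src.validCol == dst.validCol);
  // Static bounds: the valid region must fit inside the tile.
  assert(dst.validRow <= dst.rows && dst.validCol <= dst.cols);
  assert(src.validRow <= src.rows && src.validCol <= src.cols);
  // dst's valid region is the iteration domain.
  for (std::size_t i = 0; i < dst.validRow; ++i) {
    for (std::size_t j = 0; j < dst.validCol; ++j) {
      dst.data[i * dst.cols + j] = std::sqrt(src.data[i * src.cols + j]);
    }
  }
}
```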
diff --git a/designs/outerCube/PTOISA/TSQRT_zh.md b/designs/outerCube/PTOISA/TSQRT_zh.md
new file mode 100644
index 00000000..555c0308
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSQRT_zh.md
@@ -0,0 +1,103 @@
+# TSQRT
+
+## 指令示意图
+
+![TSQRT tile operation](../figures/isa/TSQRT.svg)
+
+## 简介
+
+逐元素平方根。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \sqrt{\mathrm{src}_{i,j}} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tsqrt %src : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tsqrt %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tsqrt ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSQRT(TileDataDst &dst, TileDataSrc &src, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (NPU)**:
+    - `TileData::DType` must be one of: `float` or `half`;
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`);
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`;
+    - Runtime: `src.GetValidRow() == dst.GetValidRow()` and `src.GetValidCol() == dst.GetValidCol()`;
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **有效区域**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+- **Domain / NaN**:
+    - Behavior is target-defined (e.g., for negative inputs).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TSQRT(dst, src);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src, dst;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TSQRT(dst, src);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSTORE.md b/designs/outerCube/PTOISA/TSTORE.md
new file mode 100644
index 00000000..e6e017c7
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSTORE.md
@@ -0,0 +1,158 @@
+# TSTORE
+
+
+## Tile Operation Diagram
+
+![TSTORE tile operation](../figures/isa/TSTORE.svg)
+
+## Introduction
+
+Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters.
+
+## Math Interpretation
+
+Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
+
+$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tstore %t1, %sv_out[%c0, %c0]
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, WaitEvents &... events);
+
+template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+
+template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+The `preQuantScalar` and `TSTORE_FP` quantized-store overloads are only legal for `TileType::Acc` on current A2/A3 and A5 backends. They do not provide a native vec-tile quantized store contract.
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+  - Source tile location must be one of: `TileType::Vec`, `TileType::Mat`, `TileType::Acc`.
+  - Runtime: all `dst.GetShape(dim)` values and `src.GetValidRow()/GetValidCol()` must be `> 0`.
+  - For `TileType::Vec` / `TileType::Mat`:
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
+    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
+    - A2/A3 does not expose a native vec quantized-store path. Frontends that need `vec -> GM` dtype conversion or quantization MUST first materialize the converted vec tile (for example via `TCVT`) and then issue a same-dtype `TSTORE`.
+  - For `TileType::Acc` (including quantized/atomic variants):
+    - Destination layout must be ND or NZ.
+    - Source dtype must be `int32_t` or `float`.
+    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
+    - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
+    - Runtime: `1 <= src.GetValidCol() <= 4095`.
+- **Implementation checks (A5)**:
+  - Source tile location must be `TileType::Vec` or `TileType::Acc` (no `Mat` store on this target).
+  - For `TileType::Vec`:
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, `hifloat8_t`, `float4_e1m2x2_t`, `float4_e2m1x2_t`.
+    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
+    - Additional alignment constraints are enforced (e.g., for ND the row-major width in bytes must be a multiple of 32; for DN the column-major height in bytes must be a multiple of 32, with special-case exceptions).
+  - For `TileType::Acc`:
+    - Destination layout must be ND or NZ; source dtype must be `int32_t` or `float`.
+    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
+    - Static shape constraints match A2A3 for rows/cols; `AtomicAdd` additionally restricts destination dtype to supported atomic types.
+- **Valid region**:
+  - The implementation uses `src.GetValidRow()` / `src.GetValidCol()` as the transfer size.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_auto(__gm__ T* out) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gout(out);
+  TileT t;
+  TSTORE(gout, t);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_manual(__gm__ T* out) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gout(out);
+  TileT t;
+  TASSIGN(t, 0x1000);
+  TSTORE<TileT, GTensor, AtomicType::AtomicAdd>(gout, t);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### PTO Assembly Form
+
+```text
+tstore %t1, %sv_out[%c0, %c0]
+# IR Level 2 (DPS)
+pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
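The 2D view in TSTORE's Math Interpretation, `dst[r0 + i, c0 + j] = src[i, j]`, can be sketched on the host as a strided copy. The function below is a hypothetical reference model, not the PTO intrinsic. It assumes an ND (row-major) destination and a row-major `float` tile, with no atomic or quantized variants. The transfer size comes from the source tile's valid rows and columns, as stated under "Valid region".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a plain ND TSTORE; not the PTO API.
// `gm` is a row-major 2D buffer with `gmCols` elements per row, and
// (r0, c0) is the base offset of the destination window.
void tstore_reference(std::vector<float> &gm, std::size_t gmCols,
                      std::size_t r0, std::size_t c0,
                      const std::vector<float> &tile, std::size_t tileCols,
                      std::size_t validRow, std::size_t validCol) {
  // Runtime check from the Constraints section: valid sizes must be > 0.
  assert(validRow > 0 && validCol > 0);
  // The valid region must fit in both the tile and the destination window.
  assert(validCol <= tileCols && validRow * tileCols <= tile.size());
  assert(c0 + validCol <= gmCols && (r0 + validRow) * gmCols <= gm.size());
  for (std::size_t i = 0; i < validRow; ++i) {
    for (std::size_t j = 0; j < validCol; ++j) {
      gm[(r0 + i) * gmCols + (c0 + j)] = tile[i * tileCols + j];
    }
  }
}
```

Elements of `gm` outside the `validRow x validCol` window are left unchanged by this model.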
diff --git a/designs/outerCube/PTOISA/TSTORE_FP.md b/designs/outerCube/PTOISA/TSTORE_FP.md
new file mode 100644
index 00000000..942e1abe
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSTORE_FP.md
@@ -0,0 +1,147 @@
+# TSTORE_FP
+
+
+## Tile Operation Diagram
+
+![TSTORE_FP tile operation](../figures/isa/TSTORE_FP.svg)
+
+## Introduction
+
+Store an accumulator tile into global memory using a scaling (`fp`) tile for vector quantization parameters.
+
+`TSTORE_FP` is the fp-quantization overload of `TSTORE` (see `docs/isa/TSTORE.md`).
+
+## Math Interpretation
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually (2D view, with a base offset), for `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tstore.fp %src, %fp, %sv_out[%c0, %c0]
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+### IR Level 1 (SSA)
+
+```text
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### IR Level 2 (DPS)
+
+```text
+pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
+          ReluPreMode reluPreMode = ReluPreMode::NoRelu, typename... WaitEvents>
+PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - The fp store path is implemented via `TSTORE_IMPL(dst, src, fp)` and uses the same accumulator-to-GM legality checks as quantized accumulator stores:
+        - Destination layout must be ND or NZ.
+        - Source dtype must be `int32_t` or `float`.
+        - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
+        - Runtime: `1 <= src.GetValidCol() <= 4095`.
+    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
+- **Implementation checks (A5)**:
+    - Implemented via `TSTORE_IMPL(dst, src, fp)` and validated by `CheckStaticAcc<..., true>()` for the accumulator path (ND/NZ only, `int32_t/float` source dtype, rows/cols ranges).
+    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto(__gm__ int8_t* out) {
+  using AccT = TileAcc<float, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
+
+  GT gout(out);
+  AccT acc;
+  FpT fp(16);
+  TSTORE_FP(gout, acc, fp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual(__gm__ int8_t* out) {
+  using AccT = TileAcc<float, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
+
+  GT gout(out);
+  AccT acc;
+  FpT fp(16);
+  TASSIGN(acc, 0x1000);
+  TASSIGN(fp,  0x2000);
+  TSTORE_FP(gout, acc, fp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### PTO Assembly Form
+
+```text
+tstore.fp %src, %fp, %sv_out[%c0, %c0]
+# AS Level 2 (DPS)
+pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
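The accumulator shape bounds in TSTORE_FP's A2A3 constraints can be written as a small compile-time predicate. This is a sketch only: `RefLayout`, `acc_store_shape_ok`, `acc_store_runtime_ok`, and `assert_acc_store_ok` are hypothetical names and are not the checks inside `TSTORE_IMPL`. It also assumes the `Cols % 16 == 0` requirement applies to the NZ case only, which is how the constraint sentence reads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the destination layout tag; not a PTO type.
enum class RefLayout { ND, NZ };

// Static shape bounds: 1 <= Cols <= 4095;
// ND: 1 <= Rows <= 8192; NZ: 1 <= Rows <= 65535 and Cols % 16 == 0.
constexpr bool acc_store_shape_ok(RefLayout layout, std::size_t rows, std::size_t cols) {
  return cols >= 1 && cols <= 4095 &&
         (layout == RefLayout::ND
              ? (rows >= 1 && rows <= 8192)
              : (rows >= 1 && rows <= 65535 && cols % 16 == 0));
}

// Runtime bound: 1 <= src.GetValidCol() <= 4095.
constexpr bool acc_store_runtime_ok(std::size_t validCol) {
  return validCol >= 1 && validCol <= 4095;
}

// Debug-build guard combining the static and runtime bounds.
inline void assert_acc_store_ok(RefLayout layout, std::size_t rows, std::size_t cols,
                                std::size_t validCol) {
  assert(acc_store_shape_ok(layout, rows, cols));
  assert(acc_store_runtime_ok(validCol));
}

// The 16x16 accumulator from the Examples section satisfies the ND bounds.
static_assert(acc_store_shape_ok(RefLayout::ND, 16, 16), "16x16 ND accumulator is in range");
```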
diff --git a/designs/outerCube/PTOISA/TSTORE_FP_zh.md b/designs/outerCube/PTOISA/TSTORE_FP_zh.md
new file mode 100644
index 00000000..53e9a64e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSTORE_FP_zh.md
@@ -0,0 +1,118 @@
+# TSTORE_FP
+
+## 指令示意图
+
+![TSTORE_FP tile operation](../figures/isa/TSTORE_FP.svg)
+
+## 简介
+
+使用缩放 (`fp`) Tile 作为向量量化参数，将累加器 Tile 存储到全局内存。
+
+## 数学语义
+
+Let `R = src.GetValidRow()` and `C = src.GetValidCol()`. Conceptually (2D view, with a base offset), for `0 <= i < R` and `0 <= j < C`:
+
+$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+tstore.fp %src, %fp, %sv_out[%c0, %c0]
+```
+
+### AS Level 1 (SSA)
+
+```text
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+### AS Level 1（SSA）
+
+```text
+pto.tstore.fp %src, %fp, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tstore.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
+          ReluPreMode reluPreMode = ReluPreMode::NoRelu, typename... WaitEvents>
+PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - The fp store path is implemented via `TSTORE_IMPL(dst, src, fp)` and uses the same accumulator-to-GM legality checks as quantized accumulator stores:
+        - Destination layout must be ND or NZ.
+        - Source dtype must be `int32_t` or `float`.
+        - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
+        - Runtime: `1 <= src.GetValidCol() <= 4095`.
+    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
+- **实现检查 (A5)**:
+    - Implemented via `TSTORE_IMPL(dst, src, fp)` and validated by `CheckStaticAcc<..., true>()` for the accumulator path (ND/NZ only, `int32_t/float` source dtype, rows/cols ranges).
+    - No explicit `static_assert` is enforced on `FpTileData` (the implementation uses `fp` to set FPC state).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto(__gm__ int8_t* out) {
+  using AccT = TileAcc<float, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
+
+  GT gout(out);
+  AccT acc;
+  FpT fp(16);
+  TSTORE_FP(gout, acc, fp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual(__gm__ int8_t* out) {
+  using AccT = TileAcc<float, 16, 16>;
+  using FpT = Tile<TileType::Scaling, uint64_t, 1, 16, BLayout::RowMajor, 1, DYNAMIC, SLayout::NoneBox>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<int8_t, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<int8_t, GShape, GStride, Layout::ND>;
+
+  GT gout(out);
+  AccT acc;
+  FpT fp(16);
+  TASSIGN(acc, 0x1000);
+  TASSIGN(fp,  0x2000);
+  TSTORE_FP(gout, acc, fp);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSTORE_zh.md b/designs/outerCube/PTOISA/TSTORE_zh.md
new file mode 100644
index 00000000..1ae44856
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSTORE_zh.md
@@ -0,0 +1,131 @@
+# TSTORE
+
+## 指令示意图
+
+![TSTORE tile operation](../figures/isa/TSTORE.svg)
+
+## 简介
+
+将 Tile 中的数据存储到 GlobalTensor (GM)，可选使用原子写入或量化参数。
+
+## 数学语义
+
+Notation depends on the `GlobalTensor` shape/stride and the `Tile` layout. Conceptually (2D view, with a base offset):
+
+$$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+tstore %t1, %sv_out[%c0, %c0]
+```
+
+### AS Level 1（SSA）
+
+```text
+pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp` and `include/pto/common/constants.hpp`:
+
+```cpp
+template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, WaitEvents &... events);
+
+template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, uint64_t preQuantScalar, WaitEvents &... events);
+
+template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
+          typename... WaitEvents>
+PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, WaitEvents &... events);
+```
+
+The `preQuantScalar` and `TSTORE_FP` quantized-store overloads are only legal for `TileType::Acc` on current A2/A3 and A5 backends. They do not provide a native vec-tile quantized store contract.
+
+## 约束
+
+- **实现检查 (A2A3)**:
+  - Source tile location must be one of: `TileType::Vec`, `TileType::Mat`, `TileType::Acc`.
+  - Runtime: all `dst.GetShape(dim)` values and `src.GetValidRow()/GetValidCol()` must be `> 0`.
+  - For `TileType::Vec` / `TileType::Mat`:
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`.
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
+    - For `int64_t/uint64_t`, only ND->ND or DN->DN are supported.
+    - A2/A3 does not expose a native vec quantized-store path. Frontends that need `vec -> GM` dtype conversion or quantization MUST first materialize the converted vec tile (for example via `TCVT`) and then issue a same-dtype `TSTORE`.
+  - For `TileType::Acc` (including quantized/atomic variants):
+    - Destination layout must be ND or NZ.
+    - Source dtype must be `int32_t` or `float`.
+    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
+    - Static shape constraints: `1 <= TileData::Cols <= 4095`; if ND then `1 <= TileData::Rows <= 8192`; if NZ then `1 <= TileData::Rows <= 65535` and `TileData::Cols % 16 == 0`.
+    - Runtime: `1 <= src.GetValidCol() <= 4095`.
+- **实现检查 (A5)**:
+  - Source tile location must be `TileType::Vec` or `TileType::Acc` (no `Mat` store on this target).
+  - For `TileType::Vec`:
+    - `sizeof(TileData::DType) == sizeof(GlobalData::DType)`.
+    - `TileData::DType` must be one of: `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `half`, `bfloat16_t`, `float`, `float8_e4m3_t`, `float8_e5m2_t`, `hifloat8_t`, `float4_e1m2x2_t`, `float4_e2m1x2_t`.
+    - Layouts must match ND/DN/NZ (or a special case where `TileData::Rows == 1` or `TileData::Cols == 1`).
+    - Additional alignment constraints are enforced (e.g., for ND the row-major width in bytes must be a multiple of 32; for DN the column-major height in bytes must be a multiple of 32, with special-case exceptions).
+  - For `TileType::Acc`:
+    - Destination layout must be ND or NZ; source dtype must be `int32_t` or `float`.
+    - When not using quantization, destination dtype must be `__gm__ int32_t/float/half/bfloat16_t`.
+    - Static shape constraints match A2A3 for rows/cols; `AtomicAdd` additionally restricts destination dtype to supported atomic types.
+- **有效区域**:
+  - The implementation uses `src.GetValidRow()` / `src.GetValidCol()` as the transfer size.
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_auto(__gm__ T* out) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gout(out);
+  TileT t;
+  TSTORE(gout, t);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_manual(__gm__ T* out) {
+  using TileT = Tile<TileType::Vec, T, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+  GTensor gout(out);
+  TileT t;
+  TASSIGN(t, 0x1000);
+  TSTORE<TileT, GTensor, AtomicType::AtomicAdd>(gout, t);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSUB.md b/designs/outerCube/PTOISA/TSUB.md
new file mode 100644
index 00000000..5f20459b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUB.md
@@ -0,0 +1,135 @@
+# TSUB
+
+
+## Tile Operation Diagram
+
+![TSUB tile operation](../figures/isa/TSUB.svg)
+
+## Introduction
+
+Elementwise subtraction of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} $$
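+
+As a quick sanity check of the formula, here is a worked instance. The $2 \times 2$ valid region and the operand values are made up purely for illustration and are not taken from any PTO test:
+
+$$ \begin{pmatrix} 5 & 7 \\ 9 & 11 \end{pmatrix} - \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 5 \\ 6 & 7 \end{pmatrix} $$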
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsub %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`, `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TSUB(dst, src0, src1);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TSUB(dst, src0, src1);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsub %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSUBC.md b/designs/outerCube/PTOISA/TSUBC.md
new file mode 100644
index 00000000..aa0ef697
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBC.md
@@ -0,0 +1,103 @@
+# TSUBC
+
+
+## Tile Operation Diagram
+
+![TSUBC tile operation](../figures/isa/TSUBC.svg)
+
+## Introduction
+
+Elementwise ternary op: `src0 - src1 + src2`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
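+
+A worked instance may help when checking results by hand. The $2 \times 2$ valid region and the operand values below are illustrative assumptions, not values from the PTO sources:
+
+$$ \begin{pmatrix} 5 & 7 \\ 9 & 11 \end{pmatrix} - \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 10 & 20 \\ 30 & 40 \end{pmatrix} = \begin{pmatrix} 14 & 25 \\ 36 & 47 \end{pmatrix} $$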
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSUBC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c, out;
+  TSUBC(out, a, b, c);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSUBC_zh.md b/designs/outerCube/PTOISA/TSUBC_zh.md
new file mode 100644
index 00000000..100e1832
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBC_zh.md
@@ -0,0 +1,76 @@
+# TSUBC
+
+## 指令示意图
+
+![TSUBC tile operation](../figures/isa/TSUBC.svg)
+
+## 简介
+
+三元逐元素运算：`src0 - src1 + src2`。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} + \mathrm{src2}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tsubc %src0, %src1, %src2 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsubc %src0, %src1, %src2 : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsubc ins(%src0, %src1, %src2 : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSUBC(TileData &dst, TileData &src0, TileData &src1, TileData &src2, WaitEvents &... events);
+```
+
+## 约束
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c, out;
+  TSUBC(out, a, b, c);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSUBS.md b/designs/outerCube/PTOISA/TSUBS.md
new file mode 100644
index 00000000..4ce309b9
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBS.md
@@ -0,0 +1,105 @@
+# TSUBS
+
+
+## Tile Operation Diagram
+
+![TSUBS tile operation](../figures/isa/TSUBS.svg)
+
+## Introduction
+
+Elementwise subtraction of a scalar from a tile.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} - \mathrm{scalar} $$
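+
+To make the broadcast of the scalar concrete, consider a worked instance. The $2 \times 2$ valid region, the tile values, and $\mathrm{scalar} = 1.5$ are assumptions chosen only for illustration:
+
+$$ \begin{pmatrix} 5 - 1.5 & 7 - 1.5 \\ 9 - 1.5 & 11 - 1.5 \end{pmatrix} = \begin{pmatrix} 3.5 & 5.5 \\ 7.5 & 9.5 \end{pmatrix} $$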
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsubs %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSUBS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `int`, `int16_t`, `half`, `float16_t`, `float`, `float32_t`.
+    - Tile location must be vector (`TileDataDst::Loc == TileType::Vec` and `TileDataSrc::Loc == TileType::Vec`).
+    - Static valid bounds: `TileDataDst::ValidRow <= TileDataDst::Rows`, `TileDataDst::ValidCol <= TileDataDst::Cols`, `TileDataSrc::ValidRow <= TileDataSrc::Rows`, and `TileDataSrc::ValidCol <= TileDataSrc::Cols`.
+    - Runtime: `src0.GetValidRow() == dst.GetValidRow()` and `src0.GetValidCol() == dst.GetValidCol()`.
+- **Common constraints**:
+    - `dst` and `src0` must use the same element type.
+    - Scalar type must match `TileDataSrc::DType`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TSUBS(out, x, 1.0f);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsubs %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSUBSC.md b/designs/outerCube/PTOISA/TSUBSC.md
new file mode 100644
index 00000000..c04cf390
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBSC.md
@@ -0,0 +1,116 @@
+# TSUBSC
+
+
+## Tile Operation Diagram
+
+![TSUBSC tile operation](../figures/isa/TSUBSC.svg)
+
+## Introduction
+
+Elementwise fused op: `src0 - scalar + src1`.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{scalar} + \mathrm{src1}_{i,j} $$
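+
+A small worked instance of the fused formula follows. The $2 \times 2$ valid region, the tile values, and $\mathrm{scalar} = 2$ are illustrative assumptions rather than values from the PTO sources:
+
+$$ \begin{pmatrix} 5 & 7 \\ 9 & 11 \end{pmatrix} - 2 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 7 \\ 10 & 13 \end{pmatrix} $$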
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSUBSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
+                            WaitEvents&... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Implementation checks (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, out;
+  TSUBSC(out, a, 2.0f, b);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+# AS Level 2 (DPS)
+pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TSUBSC_zh.md b/designs/outerCube/PTOISA/TSUBSC_zh.md
new file mode 100644
index 00000000..3b34e713
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBSC_zh.md
@@ -0,0 +1,89 @@
+# TSUBSC
+
+## 指令示意图
+
+![TSUBSC tile operation](../figures/isa/TSUBSC.svg)
+
+## 简介
+
+融合逐元素运算：`src0 - scalar + src1`。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{scalar} + \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tsubsc %src0, %scalar, %src1 : !pto.tile<...>, f32, !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsubsc %src0, %scalar, %src1 : (!pto.tile<...>, dtype, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsubsc ins(%src0, %scalar, %src1 : !pto.tile_buf<...>, dtype, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSUBSC(TileData& dst, TileData& src0, typename TileData::DType scalar, TileData& src1,
+                            WaitEvents&... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **实现检查 (A5)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+- **Common constraints**:
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `dst`, `src0` and `src1` must have the same valid row/col.
+    - Scalar type must match the Tile data type.
+- **有效区域**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, out;
+  TSUBSC(out, a, 2.0f, b);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSUBS_zh.md b/designs/outerCube/PTOISA/TSUBS_zh.md
new file mode 100644
index 00000000..5441db2b
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBS_zh.md
@@ -0,0 +1,105 @@
+# TSUBS
+
+## 指令示意图
+
+![TSUBS tile operation](../figures/isa/TSUBS.svg)
+
+## 简介
+
+从 Tile 中逐元素减去一个标量。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} - \mathrm{scalar} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = tsubs %src, %scalar : !pto.tile<...>, f32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSUBS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
+    - Tile 位置必须是向量（`TileData::Loc == TileType::Vec`）。
+    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
+- **实现检查 (A5)**:
+    - `TileData::DType` 必须是以下之一：`int32_t`、`int`、`int16_t`、`half`、`float16_t`、`float`、`float32_t`。
+    - Tile 位置必须是向量（`TileDataDst::Loc == TileType::Vec` 且 `TileDataSrc::Loc == TileType::Vec`）。
+    - 静态有效边界：`TileDataDst::ValidRow <= TileDataDst::Rows`、`TileDataDst::ValidCol <= TileDataDst::Cols`、`TileDataSrc::ValidRow <= TileDataSrc::Rows`，且 `TileDataSrc::ValidCol <= TileDataSrc::Cols`。
+    - 运行时：`src0.GetValidRow() == dst.GetValidRow()` 且 `src0.GetValidCol() == dst.GetValidCol()`。
+- **通用约束**:
+    - `dst` 和 `src0` 必须使用相同的元素类型。
+    - 标量类型必须与 `TileDataSrc::DType` 一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT x, out;
+  TSUBS(out, x, 1.0f);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.tsubs %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = tsubs %src, %scalar : !pto.tile<...>, f32
+# AS Level 2 (DPS)
+pto.tsubs ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TSUBVIEW.md b/designs/outerCube/PTOISA/TSUBVIEW.md
new file mode 100644
index 00000000..c423eb63
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBVIEW.md
@@ -0,0 +1,89 @@
+# TSUBVIEW
+
+
+## Tile Operation Diagram
+
+![TSUBVIEW tile operation](../figures/isa/TSUBVIEW.svg)
+
+## Introduction
+
+Reinterpret a tile as a subtile of another tile.
+
+## Math Interpretation
+
+- `rowIdx`: starting row index of the `dst` subtile within the valid region of `src`.
+- `colIdx`: starting column index of the `dst` subtile within the valid region of `src`.
+
+For each element `(i, j)` in the valid region of `dst`:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
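+
+As an illustration of the index mapping, assume zero-based indices, a `src` tile with a $4 \times 64$ valid region, and a `dst` tile with a $2 \times 32$ valid region (the shapes used in the Examples section). Taking $\mathrm{rowIdx} = 2$ and $\mathrm{colIdx} = 32$ gives
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{2 + i,\, 32 + j}, \qquad 0 \le i < 2,\; 0 \le j < 32 $$
+
+so $\mathrm{dst}_{0,0}$ corresponds to $\mathrm{src}_{2,32}$ and $\mathrm{dst}_{1,31}$ corresponds to $\mathrm{src}_{3,63}$, the last element of the source valid region under these assumptions.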
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### IR Level 1 (SSA)
+
+TODO
+
+### IR Level 2 (DPS)
+
+TODO
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSUBVIEW(TileDataDst &dst, TileDataSrc &src, uint16_t rowIdx, uint16_t colIdx, WaitEvents&... events);
+```
+
+## Constraints
+
+Enforced by `TSUBVIEW_IMPL`:
+
+- **Tile type must match**: `TileDataSrc::Loc == TileDataDst::Loc`.
+- **Both tiles must have the same static capacity**: `TileDataSrc::Rows == TileDataDst::Rows` and `TileDataSrc::Cols == TileDataDst::Cols`.
+- **Both tiles must have the same BLayout**: `TileDataSrc::BFractal == TileDataDst::BFractal`.
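+- **The source valid region must be at least as large as the destination's**: the source tile's `validRow` and `validCol` must each be greater than or equal to the destination tile's `validRow` and `validCol`.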
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using Src = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 4, 64>;
+  using Dst = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 2, 32>;
+
+  Src src;
+  Dst dst0;
+  Dst dst1;
+  Dst dst2;
+  Dst dst3;
+
+  // e.g. split into four 2x32 subtiles
+  TSUBVIEW(dst0, src, 0, 0);
+  TSUBVIEW(dst1, src, 0, 32);
+  TSUBVIEW(dst2, src, 2, 0);
+  TSUBVIEW(dst3, src, 2, 32);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+TODO
+
+### Manual Mode
+
+TODO
+
+### PTO Assembly Form
+
+TODO
+
diff --git a/designs/outerCube/PTOISA/TSUBVIEW_zh.md b/designs/outerCube/PTOISA/TSUBVIEW_zh.md
new file mode 100644
index 00000000..42f56e89
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUBVIEW_zh.md
@@ -0,0 +1,87 @@
+# TSUBVIEW
+
+## Tile操作图例
+
+![TSUBVIEW tile operation](../figures/isa/TSUBVIEW.svg)
+
+## 简介
+
+表达一个Tile是另一个Tile的subview。
+
+## 数学表达
+
+- `rowIdx`: 在`src`的有效区域内的起始行的索引。
+- `colIdx`: 在`src`的有效区域内的起始列的索引。
+
+对于`dst`中有效区域内的每一个元素`(i, j)`：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{\mathrm{rowIdx} + i,\mathrm{colIdx} + j} $$
+
+## 汇编语法
+
+PTO-AS form: 详见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### IR Level 1 (SSA)
+
+TODO
+
+### IR Level 2 (DPS)
+
+TODO
+
+## C++ Intrinsic
+
+定义在 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents>
+PTO_INST RecordEvent TSUBVIEW(TileDataDst &dst, TileDataSrc &src, uint16_t rowIdx, uint16_t colIdx, WaitEvents&... events);
+```
+
+## 限制
+
+规定在`TSUBVIEW_IMPL`中:
+
+- **Tile类型必须相同**: `TileDataSrc::Loc == TileDataDst::Loc`.
+- **输入和输出Tile的静态shape必须相同**: `TileDataSrc::Rows == TileDataDst::Rows` and `TileDataSrc::Cols == TileDataDst::Cols`.
+- **输入和输出Tile的BLayout必须相同**: `TileDataSrc::BFractal == TileDataDst::BFractal`.
+- **src的validRow和validCol必须大于等于dst的validRow和validCol**
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using Src = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 4, 64>;
+  using Dst = Tile<TileType::Vec, float, 4, 64, BLayout::RowMajor, 2, 32>;
+
+  Src src;
+  Dst dst0;
+  Dst dst1;
+  Dst dst2;
+  Dst dst3;
+
+  // e.g. split into four 2x32 subtiles
+  TSUBVIEW(dst0, src, 0, 0);
+  TSUBVIEW(dst1, src, 0, 32);
+  TSUBVIEW(dst2, src, 2, 0);
+  TSUBVIEW(dst3, src, 2, 32);
+}
+```
+
+## ASM示例
+
+### Auto模式
+
+TODO
+
+### Manual模式
+
+TODO
+
+### PTO汇编格式
+
+TODO
diff --git a/designs/outerCube/PTOISA/TSUB_zh.md b/designs/outerCube/PTOISA/TSUB_zh.md
new file mode 100644
index 00000000..4a69783c
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSUB_zh.md
@@ -0,0 +1,108 @@
+# TSUB
+
+## 指令示意图
+
+![TSUB tile operation](../figures/isa/TSUB.svg)
+
+## 简介
+
+两个 Tile 的逐元素减法。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} - \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+同步形式：
+
+```text
+%dst = tsub %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.tsub %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.tsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
+PTO_INST RecordEvent TSUB(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `TileData::DType` must be one of: `int32_t`, `int16_t`, `half`, `float`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **实现检查 (A5)**:
+    - `TileData::DType` must be one of: `uint32_t`, `int32_t`, `uint16_t`, `int16_t`, `uint8_t`,  `int8_t`, `float`, `half`.
+    - Tile layout must be row-major (`TileData::isRowMajor`).
+    - Tile location must be vector (`TileData::Loc == TileType::Vec`).
+    - Static valid bounds: `TileData::ValidRow <= TileData::Rows` and `TileData::ValidCol <= TileData::Cols`.
+    - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+- **有效区域**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain; `src0/src1` are assumed to be compatible (not validated by explicit runtime checks in this op).
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TSUB(dst, src0, src1);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT src0, src1, dst;
+  TASSIGN(src0, 0x1000);
+  TASSIGN(src1, 0x2000);
+  TASSIGN(dst,  0x3000);
+  TSUB(dst, src0, src1);
+}
+```
diff --git a/designs/outerCube/PTOISA/TSYNC.md b/designs/outerCube/PTOISA/TSYNC.md
new file mode 100644
index 00000000..7764c600
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSYNC.md
@@ -0,0 +1,143 @@
+# TSYNC
+
+
+## Tile Operation Diagram
+
+![TSYNC tile operation](../figures/isa/TSYNC.svg)
+
+## Introduction
+
+Synchronize PTO execution:
+
+- `TSYNC(events...)` waits on a set of explicit event tokens.
+- `TSYNC<Op>()` inserts a pipeline barrier for a single vector op class.
+
+Many intrinsics in `include/pto/common/pto_instr.hpp` call `TSYNC(events...)` internally before issuing the instruction.
+
+## Math Interpretation
+
+Not applicable.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Event operand form:
+
+```text
+tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
+```
+
+Single-op barrier form:
+
+```text
+tsync.op #pto.op<TADD>
+```
+
+### AS Level 1 (SSA)
+
+```text
+// Level 1 (SSA) does not support explicit synchronization primitives.
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.record_event[src_op, dst_op, eventID]
+// Supported ops: TLOAD, TSTORE_ACC, TSTORE_VEC, TMOV_M2L, TMOV_M2S, TMOV_M2B, TMOV_M2V, TMOV_V2M, TMATMUL, TVEC
+pto.wait_event[src_op, dst_op, eventID]
+// Supported ops: TLOAD, TSTORE_ACC, TSTORE_VEC, TMOV_M2L, TMOV_M2S, TMOV_M2B, TMOV_M2V, TMOV_V2M, TMATMUL, TVEC
+pto.barrier(op)
+// Supported ops: TVEC, TMATMUL
+```
+
+In the current PTO-DSL front-end flow, `record_event` and `wait_event` should
+be treated as low-level TSYNC forms. Front-end kernels SHOULD normally stay free
+of explicit event wiring and rely on `ptoas --enable-insert-sync`.
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <Op OpCode>
+PTO_INST void TSYNC();
+
+template <typename... WaitEvents>
+PTO_INST void TSYNC(WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (`TSYNC<Op>()`)**:
+    - `TSYNC_IMPL<Op>()` only supports vector-pipeline ops (`static_assert(pipe == PIPE_V)` in `include/pto/common/event.hpp`).
+- **`TSYNC(events...)` semantics**:
+    - `TSYNC(events...)` calls `WaitAllEvents(events...)`, which invokes `events.Wait()` on each event token. In auto mode, this is a no-op.
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto(__gm__ float* in) {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<float, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<float, GShape, GStride, Layout::ND>;
+
+  GT gin(in);
+  TileT t;
+  Event<Op::TLOAD, Op::TADD> e;
+  e = TLOAD(t, gin);
+  TSYNC(e);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+  Event<Op::TADD, Op::TSTORE_VEC> e;
+  e = TADD(c, a, b);
+  TSYNC<Op::TADD>();
+  TSYNC(e);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%result = pto.tsync ...
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%result = pto.tsync ...
+```
+
+### PTO Assembly Form
+
+```text
+tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
+# AS Level 2 (DPS)
+pto.record_event[src_op, dst_op, eventID]
+```
+
diff --git a/designs/outerCube/PTOISA/TSYNC_zh.md b/designs/outerCube/PTOISA/TSYNC_zh.md
new file mode 100644
index 00000000..2d686d55
--- /dev/null
+++ b/designs/outerCube/PTOISA/TSYNC_zh.md
@@ -0,0 +1,141 @@
+# TSYNC
+
+## 指令示意图
+
+![TSYNC tile operation](../figures/isa/TSYNC.svg)
+
+## 简介
+
+同步 PTO 执行（等待事件或插入每操作流水线屏障）。
+
+- `TSYNC(events...)` 等待一组显式事件令牌。
+- `TSYNC<Op>()` 为单个向量操作类插入流水线屏障。
+
+`include/pto/common/pto_instr.hpp` 中的许多内建函数在发射指令前会在内部调用 `TSYNC(events...)`。
+
+## 数学语义
+
+不适用。
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+Event operand form:
+
+```text
+tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
+```
+
+Single-op barrier form:
+
+```text
+tsync.op #pto.op<TADD>
+```
+
+### AS Level 1（SSA）
+
+```text
+// Level 1 (SSA) does not support explicit synchronization primitives.
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.record_event[src_op, dst_op, eventID]
+// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
+pto.wait_event[src_op, dst_op, eventID]
+// 支持的op：TLOAD， TSTORE_ACC，TSTORE_VEC，TMOV_M2L，TMOV_M2S，TMOV_M2B，TMOV_M2V，TMOV_V2M，TMATMUL，TVEC
+pto.barrier(op)
+// 支持的op：TVEC,TMATMUL
+```
+
+在当前 PTO-DSL 前端流程中，`record_event` 和 `wait_event` 应视为 TSYNC 的低层形式。
+前端 kernel 通常不应手工编写事件连线，而应依赖 `ptoas --enable-insert-sync`
+自动插入同步。
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <Op OpCode>
+PTO_INST void TSYNC();
+
+template <typename... WaitEvents>
+PTO_INST void TSYNC(WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查（`TSYNC<Op>()`）**:
+  - `TSYNC_IMPL<Op>()` 仅支持向量流水线操作（`include/pto/common/event.hpp` 中通过 `static_assert(pipe == PIPE_V)` 强制执行）。
+- **`TSYNC(events...)` 语义**:
+  - `TSYNC(events...)` 调用 `WaitAllEvents(events...)`，后者对每个事件令牌调用 `events.Wait()`。在auto模式下是no-op。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto(__gm__ float* in) {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  using GShape = Shape<1, 1, 1, 16, 16>;
+  using GStride = BaseShape2D<float, 16, 16, Layout::ND>;
+  using GT = GlobalTensor<float, GShape, GStride, Layout::ND>;
+
+  GT gin(in);
+  TileT t;
+  Event<Op::TLOAD, Op::TADD> e;
+  e = TLOAD(t, gin);
+  TSYNC(e);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using TileT = Tile<TileType::Vec, float, 16, 16>;
+  TileT a, b, c;
+  Event<Op::TADD, Op::TSTORE_VEC> e;
+  e = TADD(c, a, b);
+  TSYNC<Op::TADD>();
+  TSYNC(e);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%result = pto.tsync ...
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%result = pto.tsync ...
+```
+
+### PTO 汇编形式
+
+```text
+tsync %e0, %e1 : !pto.event<...>, !pto.event<...>
+# AS Level 2 (DPS)
+pto.record_event[src_op, dst_op, eventID]
+```
diff --git a/designs/outerCube/PTOISA/TTRANS.md b/designs/outerCube/PTOISA/TTRANS.md
new file mode 100644
index 00000000..f2738063
--- /dev/null
+++ b/designs/outerCube/PTOISA/TTRANS.md
@@ -0,0 +1,144 @@
+# TTRANS
+
+
+## Tile Operation Diagram
+
+![TTRANS tile operation](../figures/isa/TTRANS.svg)
+
+## Introduction
+
+Transpose with an implementation-defined temporary tile.
+
+## Math Interpretation
+
+For a 2D tile, over the effective transpose domain:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
+
+Exact shape/layout and the transpose domain depend on the target (see Constraints).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit `tmp` operand.
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`.
+    - Source layout must be row-major (`TileDataSrc::isRowMajor`).
+    - Element size must be `1`, `2`, or `4` bytes.
+    - Supported element types are restricted per element width:
+        - 4 bytes: `uint32_t`, `int32_t`, `float`
+        - 2 bytes: `uint16_t`, `int16_t`, `half`, `bfloat16_t`
+        - 1 byte: `uint8_t`, `int8_t`
+    - The transpose size is taken from `src.GetValidRow()` / `src.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`.
+    - 32-byte alignment constraints are enforced on the major dimension of both input and output (row-major checks `Cols * sizeof(T) % 32 == 0`, col-major checks `Rows * sizeof(T) % 32 == 0`).
+    - Supported element types are restricted per element width:
+        - 4 bytes: `uint32_t`, `int32_t`, `float`
+        - 2 bytes: `uint16_t`, `int16_t`, `half`, `bfloat16_t`
+        - 1 byte: `uint8_t`, `int8_t`
+    - The implementation operates over the static tile shape (`TileDataSrc::Rows/Cols`) and does not consult `GetValidRow/GetValidCol`.
+- **Temporary tile**:
+    - The C++ API requires `tmp`, but some implementations may not use it.
+- **ConvTile**:
+    - Transpose of ConvTile is supported for `TileType::Vec`. Element size must be `1`, `2`, or `4` bytes. Supported element types are `uint32_t`, `int32_t`, `float`, `uint16_t`, `int16_t`, `half`, `bfloat16_t`, `uint8_t`, and `int8_t`.
+    - Format transformation from `NCHW` to `NC1HWC0` is supported, where `C1 == (C + C0 - 1)/C0`. `HW` must satisfy the alignment constraint, stated as `H*W*sizeof(T)==0`. `C0` is the `c0_size`, with `C0 * sizeof(T) == 32`; `C0` can also be 4.
+    - Format transformation from `NC1HWC0` to `FRACTAL_Z` is supported, where `N1 == (N + N0 - 1)/N0`. `N0` should be 16.
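+
+As a worked instance of the two ceiling divisions in the ConvTile constraints, assume `float` elements (`sizeof(T) = 4`) with `C = 20` channels and `N = 40` batches. These sizes are illustrative assumptions, not values taken from the implementation, and the divisions are read as integer divisions:
+
+$$ C_0 = 32 / 4 = 8, \qquad C_1 = \lfloor (20 + 8 - 1) / 8 \rfloor = \lfloor 27 / 8 \rfloor = 3 $$
+
+$$ N_0 = 16, \qquad N_1 = \lfloor (40 + 16 - 1) / 16 \rfloor = \lfloor 55 / 16 \rfloor = 3 $$
+
+Under these assumptions, `C1 * C0 = 24` channel slots cover the 20 channels and `N1 * N0 = 48` batch slots cover the 40 batches.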
+
+## Examples
+
+### Auto
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TTRANS(dst, src, tmp);
+}
+```
+
+### Manual
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TTRANS(dst, src, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TTRANS_zh.md b/designs/outerCube/PTOISA/TTRANS_zh.md
new file mode 100644
index 00000000..370e6082
--- /dev/null
+++ b/designs/outerCube/PTOISA/TTRANS_zh.md
@@ -0,0 +1,144 @@
+# TTRANS
+
+## 指令示意图
+
+![TTRANS tile operation](../figures/isa/TTRANS.svg)
+
+## 简介
+
+使用实现定义的临时 Tile 进行转置。
+
+## 数学语义
+
+对于二维 Tile，在有效转置域上：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
+
+确切的形状/布局及转置域取决于目标硬件（参见约束）。
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+降低时可能引入内部临时 Tile；C++ 内建接口需要显式传入 `tmp` 操作数。
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
+    - 源布局必须是行主序（`TileDataSrc::isRowMajor`）。
+    - 元素大小必须是 `1`、`2` 或 `4` 字节。
+    - 支持的元素类型按元素宽度限制如下：
+        - 4 字节：`uint32_t`、`int32_t`、`float`
+        - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
+        - 1 字节：`uint8_t`、`int8_t`
+    - 转置大小取自 `src.GetValidRow()` / `src.GetValidCol()`。
+- **实现检查 (A5)**:
+    - `sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType)`。
+    - 对输入和输出的主维度强制执行 32 字节对齐约束（行主序检查 `Cols * sizeof(T) % 32 == 0`，列主序检查 `Rows * sizeof(T) % 32 == 0`）。
+    - 支持的元素类型按元素宽度限制如下：
+        - 4 字节：`uint32_t`、`int32_t`、`float`
+        - 2 字节：`uint16_t`、`int16_t`、`half`、`bfloat16_t`
+        - 1 字节：`uint8_t`、`int8_t`
+    - 实现在静态 Tile 形状（`TileDataSrc::Rows/Cols`）上运算，不参考 `GetValidRow/GetValidCol`。
+- **临时 Tile**:
+    - C++ API 需要 `tmp`，但某些实现可能不使用它。
+- **ConvTile**:
+    - 支持在`TileType::Vec`上的ConvTile的格式转换。其元素大小必须是 `1`、`2` 或 `4` 字节。元素类型限制为`uint32_t`、`int32_t`、`float`、`uint16_t`、`int16_t`、`half`、`bfloat16_t`、`uint8_t`、`int8_t`。
+    - 支持ConvTile从`NCHW`到`NC1HWC0`的变换，其中`C1 == (C + C0 - 1)/C0`，HW满足对齐要求，即`H*W*sizeof(T)==0`. C0对应`c0_size`, 即`C0 * sizeof(T) == 32`。C0也可以为4。
+    - 支持ConvTile从`NC1HWC0`到`FRACTAL_Z`的变换, 其中`N1 == (N + N0 - 1)/N0`。N0为16。
+
+## 示例
+
+### 自动（Auto）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_auto() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TTRANS(dst, src, tmp);
+}
+```
+
+### 手动（Manual）
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example_manual() {
+  using SrcT = Tile<TileType::Vec, float, 16, 16>;
+  using DstT = Tile<TileType::Vec, float, 16, 16>;
+  using TmpT = Tile<TileType::Vec, float, 16, 16>;
+  SrcT src;
+  DstT dst;
+  TmpT tmp;
+  TASSIGN(src, 0x1000);
+  TASSIGN(dst, 0x2000);
+  TASSIGN(tmp, 0x3000);
+  TTRANS(dst, src, tmp);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TTRI.md b/designs/outerCube/PTOISA/TTRI.md
new file mode 100644
index 00000000..d1b01b3e
--- /dev/null
+++ b/designs/outerCube/PTOISA/TTRI.md
@@ -0,0 +1,98 @@
+# TTRI
+
+
+## Tile Operation Diagram
+
+![TTRI tile operation](../figures/isa/TTRI.svg)
+
+## Introduction
+
+Generate a (lower/upper) triangular mask tile with ones and zeros. The triangular orientation is controlled by the compile-time template parameter `isUpperOrLower` (0 = lower, 1 = upper).
+
+## Math Interpretation
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `d = diagonal`.
+
+Lower-triangular (`isUpperOrLower=0`) conceptually produces:
+
+$$
+\mathrm{dst}_{i,j} = \begin{cases}1 & j \le i + d \\\\ 0 & \text{otherwise}\end{cases}
+$$
+
+Upper-triangular (`isUpperOrLower=1`) conceptually produces:
+
+$$
+\mathrm{dst}_{i,j} = \begin{cases}0 & j < i + d \\\\ 1 & \text{otherwise}\end{cases}
+$$
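+
+As a worked instance of the two definitions above, assume a `3 x 3` valid region (`R = C = 3`) and `d = 0`; the sizes are chosen only for illustration:
+
+$$
+\mathrm{lower} = \begin{pmatrix}1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1\end{pmatrix}, \qquad \mathrm{upper} = \begin{pmatrix}1 & 1 & 1 \\\\ 0 & 1 & 1 \\\\ 0 & 0 & 1\end{pmatrix}
+$$
+
+With `d = 1`, the lower-triangular condition `j <= i + 1` also admits the first superdiagonal, so the first row becomes `1 1 0` and the remaining two rows become all ones.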
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, int isUpperOrLower, typename... WaitEvents>
+PTO_INST RecordEvent TTRI(TileData &dst, int diagonal, WaitEvents &... events);
+```
+
+## Constraints
+
+- `isUpperOrLower` must be `0` (lower) or `1` (upper).
+- Destination tile must be row-major on some targets (see `include/pto/npu/*/TTri.hpp`).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## Examples
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+# AS Level 2 (DPS)
+pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
diff --git a/designs/outerCube/PTOISA/TTRI_zh.md b/designs/outerCube/PTOISA/TTRI_zh.md
new file mode 100644
index 00000000..cf82b414
--- /dev/null
+++ b/designs/outerCube/PTOISA/TTRI_zh.md
@@ -0,0 +1,71 @@
+# TTRI
+
+## 指令示意图
+
+![TTRI tile operation](../figures/isa/TTRI.svg)
+
+## 简介
+
+生成三角（下/上）掩码 Tile。
+
+## 数学语义
+
+Let `R = dst.GetValidRow()` and `C = dst.GetValidCol()`. Let `d = diagonal`.
+
+Lower-triangular (`isUpperOrLower=0`) conceptually produces:
+
+$$
+\mathrm{dst}_{i,j} = \begin{cases}1 & j \le i + d \\\\ 0 & \text{otherwise}\end{cases}
+$$
+
+Upper-triangular (`isUpperOrLower=1`) conceptually produces:
+
+$$
+\mathrm{dst}_{i,j} = \begin{cases}0 & j < i + d \\\\ 1 & \text{otherwise}\end{cases}
+$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS Specification](../assembly/PTO-AS.md).
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.ttri %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileData, int isUpperOrLower, typename... WaitEvents>
+PTO_INST RecordEvent TTRI(TileData &dst, int diagonal, WaitEvents &... events);
+```
+
+## 约束
+
+- `isUpperOrLower` must be `0` (lower) or `1` (upper).
+- Destination tile must be row-major on some targets (see `include/pto/npu/*/TTri.hpp`).
+
+## 示例
+
+See related examples in `docs/isa/` and `docs/coding/tutorials/`.
diff --git a/designs/outerCube/PTOISA/TXOR.md b/designs/outerCube/PTOISA/TXOR.md
new file mode 100644
index 00000000..ab22c828
--- /dev/null
+++ b/designs/outerCube/PTOISA/TXOR.md
@@ -0,0 +1,110 @@
+# TXOR
+
+
+## Tile Operation Diagram
+
+![TXOR tile operation](../figures/isa/TXOR.svg)
+
+## Introduction
+
+Elementwise bitwise XOR of two tiles.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \oplus \mathrm{src1}_{i,j} $$
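+
+As a worked instance of the formula for a single element, assume the low four bits of `src0` and `src1` are `1100` and `1010` (values chosen only for illustration):
+
+$$ 1100_2 \oplus 1010_2 = 0110_2 $$
+
+Each result bit is 1 exactly where the two source bits differ, so in decimal this is `12 ^ 10 = 6`.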
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = txor %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TXOR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- The op iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `dst`, `src0`, and `src1` element types must match.
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match `dst`.
+- **Implementation checks (A2A3)**:
+    - `dst`, `src0`, `src1`, and `tmp` element types must match.
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, and `int16_t`.
+    - `dst`, `src0`, `src1`, and `tmp` must be row-major.
+    - `src0`, `src1`, and `tmp` valid shapes must match `dst`.
+    - In manual mode, `dst`, `src0`, `src1`, and `tmp` must not overlap in memory.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc0 = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc1 = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileDst dst;
+  TileSrc0 src0;
+  TileSrc1 src1;
+  TileTmp tmp;
+  TXOR(dst, src0, src1, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = txor %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TXORS.md b/designs/outerCube/PTOISA/TXORS.md
new file mode 100644
index 00000000..9caa91d4
--- /dev/null
+++ b/designs/outerCube/PTOISA/TXORS.md
@@ -0,0 +1,105 @@
+# TXORS
+
+
+## Tile Operation Diagram
+
+![TXORS tile operation](../figures/isa/TXORS.svg)
+
+## Introduction
+
+Elementwise bitwise XOR of a tile and a scalar.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \oplus \mathrm{scalar} $$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+%dst = txors %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1 (SSA)
+
+```text
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2 (DPS)
+
+```text
+pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TXORS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## Constraints
+
+- **Implementation checks (A2A3)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, and `int16_t`.
+    - `dst`, `src`, and `tmp` must use the same element type.
+    - In manual mode, source, destination, and temporary storage must not overlap in memory.
+- **Implementation checks (A5)**:
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst` and `src` element types must match.
+    - `src.GetValidRow()/GetValidCol()` must match `dst`.
+- **Valid region**:
+    - The op uses `dst.GetValidRow()` / `dst.GetValidCol()` as the iteration domain.
+
+## Examples
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TXORS(dst, src, 0x1u, tmp);
+}
+```
+
+## ASM Form Examples
+
+### Auto Mode
+
+```text
+# Auto mode: compiler/runtime-managed placement and scheduling.
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly before issuing the instruction.
+# Optional for tile operands:
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = txors %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TXORS_zh.md b/designs/outerCube/PTOISA/TXORS_zh.md
new file mode 100644
index 00000000..ebc0ed69
--- /dev/null
+++ b/designs/outerCube/PTOISA/TXORS_zh.md
@@ -0,0 +1,105 @@
+# TXORS
+
+## 指令示意图
+
+![TXORS tile operation](../figures/isa/TXORS.svg)
+
+## 简介
+
+Tile 与标量的逐元素按位异或。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \oplus \mathrm{scalar} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = txors %src, %scalar : !pto.tile<...>, i32
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
+PTO_INST RecordEvent TXORS(TileDataDst &dst, TileDataSrc &src0, typename TileDataSrc::DType scalar, TileDataTmp &tmp, WaitEvents &... events);
+```
+
+## 约束
+
+- **实现检查 (A2A3)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t` 和 `int16_t`。
+    - `dst`、`src` 和 `tmp` 必须使用相同的元素类型。
+    - 在手动模式下，源、目标和临时存储的内存区域不得重叠。
+- **实现检查 (A5)**:
+    - 支持的元素类型为 `uint8_t`、`int8_t`、`uint16_t`、`int16_t`、`uint32_t` 和 `int32_t`。
+    - `dst` 和 `src` 的元素类型必须一致。
+    - `src.GetValidRow()/GetValidCol()` 必须与 `dst` 一致。
+- **有效区域**:
+    - 该操作使用 `dst.GetValidRow()` / `dst.GetValidCol()` 作为迭代域。
+
+## 示例
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileDst dst;
+  TileSrc src;
+  TileTmp tmp;
+  TXORS(dst, src, 0x1u, tmp);
+}
+```
+
+## 汇编示例（ASM）
+
+### 自动模式
+
+```text
+# 自动模式：由编译器/运行时负责资源放置与调度。
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### 手动模式
+
+```text
+# 手动模式：先显式绑定资源，再发射指令。
+# 可选（当该指令包含 tile 操作数时）：
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.txors %src, %scalar : (!pto.tile<...>, dtype) -> !pto.tile<...>
+```
+
+### PTO 汇编形式
+
+```text
+%dst = txors %src, %scalar : !pto.tile<...>, i32
+# AS Level 2 (DPS)
+pto.txors ins(%src, %scalar : !pto.tile_buf<...>, dtype) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/TXOR_zh.md b/designs/outerCube/PTOISA/TXOR_zh.md
new file mode 100644
index 00000000..ee3126fc
--- /dev/null
+++ b/designs/outerCube/PTOISA/TXOR_zh.md
@@ -0,0 +1,110 @@
+# TXOR
+
+## 指令示意图
+
+![TXOR tile operation](../figures/isa/TXOR.svg)
+
+## 简介
+
+两个 Tile 的逐元素按位异或。
+
+## 数学语义
+
+对每个元素 `(i, j)` 在有效区域内：
+
+$$ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} \oplus \mathrm{src1}_{i,j} $$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+%dst = txor %src0, %src1 : !pto.tile<...>
+```
+
+### AS Level 1（SSA）
+
+```text
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### AS Level 2（DPS）
+
+```text
+pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/common/pto_instr.hpp`：
+
+```cpp
+template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename TileDataTmp,
+          typename... WaitEvents>
+PTO_INST RecordEvent TXOR(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);
+```
+
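+## Constraints
+
+- The operation iterates over `dst.GetValidRow()` / `dst.GetValidCol()`.
+- **Implementation checks (A5)**:
+    - `dst`, `src0`, and `src1` must have the same element type.
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, `int16_t`, `uint32_t`, and `int32_t`.
+    - `dst`, `src0`, and `src1` must be row-major.
+    - `src0.GetValidRow()/GetValidCol()` and `src1.GetValidRow()/GetValidCol()` must match those of `dst`.
+- **Implementation checks (A2A3)**:
+    - `dst`, `src0`, `src1`, and `tmp` must have the same element type.
+    - Supported element types are `uint8_t`, `int8_t`, `uint16_t`, and `int16_t`.
+    - `dst`, `src0`, `src1`, and `tmp` must be row-major.
+    - The valid shapes of `src0`, `src1`, and `tmp` must match that of `dst`.
+    - In manual mode, the memory regions of `dst`, `src0`, `src1`, and `tmp` must not overlap.
+
+## Examples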
+
+```cpp
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+void example() {
+  using TileDst = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc0 = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileSrc1 = Tile<TileType::Vec, uint32_t, 16, 16>;
+  using TileTmp = Tile<TileType::Vec, uint32_t, 16, 16>;
+  TileDst dst;
+  TileSrc0 src0;
+  TileSrc1 src1;
+  TileTmp tmp;
+  TXOR(dst, src0, src1, tmp);
+}
+```
+
+## Assembly Examples (ASM)
+
+### Auto Mode
+
+```text
+# Auto mode: the compiler/runtime handles resource placement and scheduling.
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### Manual Mode
+
+```text
+# Manual mode: bind resources explicitly first, then issue the instruction.
+# Optional (when the instruction has tile operands):
+# pto.tassign %arg0, @tile(0x1000)
+# pto.tassign %arg1, @tile(0x2000)
+%dst = pto.txor %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
+```
+
+### PTO Assembly Form
+
+```text
+%dst = txor %src0, %src1 : !pto.tile<...>
+# AS Level 2 (DPS)
+pto.txor ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
+```
+
diff --git a/designs/outerCube/PTOISA/comm/README.md b/designs/outerCube/PTOISA/comm/README.md
new file mode 100644
index 00000000..01bba656
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/README.md
@@ -0,0 +1,130 @@
+# PTO Communication ISA Reference
+
+This directory contains the per-instruction reference for the PTO Communication ISA.
+
+- Source of truth (C++ intrinsics): `include/pto/comm/pto_comm_inst.hpp`
+- Type definitions: `include/pto/comm/comm_types.hpp`
+
+## Point-to-Point Communication (Synchronous)
+- [**TPUT**](TPUT.md): Remote write (GM → UB → GM)
+- [**TGET**](TGET.md): Remote read (GM → UB → GM)
+
+## Point-to-Point Communication (Asynchronous)
+- [**TPUT_ASYNC**](TPUT_ASYNC.md): Asynchronous remote write (GM → DMA engine → GM)
+- [**TGET_ASYNC**](TGET_ASYNC.md): Asynchronous remote read (GM → DMA engine → GM)
+
+## Signal-Based Synchronization
+- [**TNOTIFY**](TNOTIFY.md): Send notification to remote NPU
+- [**TWAIT**](TWAIT.md): Blocking wait for signal condition
+- [**TTEST**](TTEST.md): Non-blocking test signal condition
+
+## Collective Communication
+
+- [**TGATHER**](TGATHER.md): Gather data from all ranks
+- [**TSCATTER**](TSCATTER.md): Scatter data to all ranks
+- [**TREDUCE**](TREDUCE.md): Reduce data from all ranks to local
+- [**TBROADCAST**](TBROADCAST.md): Broadcast from current NPU to all ranks
+
+## Type Definitions
+
+### NotifyOp
+
+Operation type for `TNOTIFY`:
+
+| Value | Description |
+|-------|-------------|
+| `NotifyOp::Set` | Direct set (`signal = value`) |
+| `NotifyOp::AtomicAdd` | Atomic add (`signal += value`) |
+
+### WaitCmp
+
+Comparison operators for `TWAIT` and `TTEST`:
+
+| Value | Description |
+|-------|-------------|
+| `WaitCmp::EQ` | Equal (`==`) |
+| `WaitCmp::NE` | Not equal (`!=`) |
+| `WaitCmp::GT` | Greater than (`>`) |
+| `WaitCmp::GE` | Greater or equal (`>=`) |
+| `WaitCmp::LT` | Less than (`<`) |
+| `WaitCmp::LE` | Less or equal (`<=`) |
+
+```cpp
+// Usage (unified runtime parameter style):
+comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
+comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
+comm::TTEST(signal, 1, comm::WaitCmp::GE);
+```
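+
+The semantics in the `NotifyOp` and `WaitCmp` tables can be sketched with a small host-side model. Everything here is hypothetical (`RefNotifyOp`, `RefWaitCmp`, `RefNotify`, `RefTest`), the signal is assumed to be a 32-bit integer, and atomicity and blocking are not modeled; it only shows which condition `TWAIT` waits for and `TTEST` polls.
+
+```cpp
+#include <cstdint>
+
+// Hypothetical host-side mirrors of the enums documented above.
+enum class RefNotifyOp { Set, AtomicAdd };
+enum class RefWaitCmp { EQ, NE, GT, GE, LT, LE };
+
+// Apply a notification to a signal word (atomicity is not modeled here).
+inline void RefNotify(std::int32_t &signal, std::int32_t value, RefNotifyOp op) {
+    if (op == RefNotifyOp::Set) {
+        signal = value;   // signal = value
+    } else {
+        signal += value;  // signal += value
+    }
+}
+
+// Evaluate the comparison `signal <cmp> value`.
+inline bool RefTest(std::int32_t signal, std::int32_t value, RefWaitCmp cmp) {
+    switch (cmp) {
+        case RefWaitCmp::EQ: return signal == value;
+        case RefWaitCmp::NE: return signal != value;
+        case RefWaitCmp::GT: return signal > value;
+        case RefWaitCmp::GE: return signal >= value;
+        case RefWaitCmp::LT: return signal < value;
+        case RefWaitCmp::LE: return signal <= value;
+    }
+    return false;
+}
+```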
+
+### ReduceOp
+
+Reduction operators for `TREDUCE`:
+
+| Value | Description |
+|-------|-------------|
+| `ReduceOp::Sum` | Element-wise sum |
+| `ReduceOp::Max` | Element-wise maximum |
+| `ReduceOp::Min` | Element-wise minimum |
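+
+As a rough illustration of what "element-wise" means for these operators, the sketch below combines one buffer per rank on the host. `RefReduceOp` and `RefReduce` are hypothetical names, `int` elements are an assumption, and all buffers are assumed to have the same length; this is not the `TREDUCE` implementation.
+
+```cpp
+#include <algorithm>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical host-side mirror of the reduction operators.
+enum class RefReduceOp { Sum, Max, Min };
+
+// Combine one equally sized buffer per rank, element by element.
+std::vector<int> RefReduce(const std::vector<std::vector<int>> &perRank, RefReduceOp op) {
+    if (perRank.empty()) {
+        return {};
+    }
+    std::vector<int> acc = perRank[0];  // start from rank 0's data
+    for (std::size_t r = 1; r < perRank.size(); ++r) {
+        for (std::size_t k = 0; k < acc.size(); ++k) {
+            const int v = perRank[r][k];
+            switch (op) {
+                case RefReduceOp::Sum: acc[k] += v; break;
+                case RefReduceOp::Max: acc[k] = std::max(acc[k], v); break;
+                case RefReduceOp::Min: acc[k] = std::min(acc[k], v); break;
+            }
+        }
+    }
+    return acc;
+}
+```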
+
+### AtomicType
+
+Atomic operation type for `TPUT` (defined in `include/pto/common/constants.hpp`):
+
+| Value | Description |
+|-------|-------------|
+| `AtomicType::AtomicNone` | No atomic operation (default) |
+| `AtomicType::AtomicAdd` | Atomic add operation |
+
+### DmaEngine
+
+DMA backend selection for `TPUT_ASYNC` and `TGET_ASYNC`:
+
+| Value | Description |
+|-------|-------------|
+| `DmaEngine::SDMA` | SDMA engine (supports 2D transfer) |
+| `DmaEngine::URMA` | URMA engine (supports 1D transfer; not yet implemented) |
+
+### AsyncEvent
+
+Returned by `TPUT_ASYNC` / `TGET_ASYNC`. Use it to wait for or poll transfer completion:
+
+```cpp
+struct AsyncEvent {
+    uint64_t handle;
+    DmaEngine engine;
+
+    bool valid() const;                        // true if handle != 0
+    bool Wait(const AsyncSession &session) const; // block until transfer completes
+    bool Test(const AsyncSession &session) const; // non-blocking completion check
+};
+```
+
+### AsyncSession
+
+Engine-agnostic session for async DMA operations. Build once, pass to all async calls:
+
+```cpp
+comm::AsyncSession session;
+comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);
+```
+
+Defined in `include/pto/comm/async/async_types.hpp`. See [TPUT_ASYNC](TPUT_ASYNC.md) for construction details and parameters.
+
+### ParallelGroup
+
+Wrapper for collective communication across multiple NPUs:
+
+```cpp
+template <typename GlobalData>
+struct ParallelGroup {
+    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
+    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
+    // depending on the collective instruction.
+    GlobalData *tensors;
+    int nranks;   // Number of ranks
+    int rootIdx;  // Root NPU's rank index
+
+    // Factory function (recommended): build from an existing tensor array.
+    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
+};
+```
diff --git a/designs/outerCube/PTOISA/comm/README_zh.md b/designs/outerCube/PTOISA/comm/README_zh.md
new file mode 100644
index 00000000..ee826e32
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/README_zh.md
@@ -0,0 +1,131 @@
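+# PTO Communication ISA Reference
+
+This directory contains the per-instruction reference for the PTO Communication ISA.
+
+- Source of truth (C++ intrinsics): `include/pto/comm/pto_comm_inst.hpp`
+- Type definitions: `include/pto/comm/comm_types.hpp`
+
+## Point-to-Point Communication (Synchronous)
+- [**TPUT**](TPUT_zh.md): Remote write (GM → UB → GM)
+- [**TGET**](TGET_zh.md): Remote read (GM → UB → GM)
+
+## Point-to-Point Communication (Asynchronous)
+- [**TPUT_ASYNC**](TPUT_ASYNC_zh.md): Asynchronous remote write (GM → DMA engine → GM)
+- [**TGET_ASYNC**](TGET_ASYNC_zh.md): Asynchronous remote read (GM → DMA engine → GM)
+
+## Signal-Based Synchronization
+- [**TNOTIFY**](TNOTIFY_zh.md): Send a notification to a remote NPU
+- [**TWAIT**](TWAIT_zh.md): Block until a signal condition is met
+- [**TTEST**](TTEST_zh.md): Non-blocking test of a signal condition
+
+## Collective Communication
+
+- [**TGATHER**](TGATHER_zh.md): Gather data from all ranks
+- [**TSCATTER**](TSCATTER_zh.md): Scatter data to all ranks
+- [**TREDUCE**](TREDUCE_zh.md): Reduce data from all ranks to local
+- [**TBROADCAST**](TBROADCAST_zh.md): Broadcast data from the current NPU to all ranks
+
+## Type Definitions
+
+### NotifyOp
+
+Operation type for `TNOTIFY`:
+
+| Value | Description |
+|-------|-------------|
+| `NotifyOp::Set` | Direct set (`signal = value`) |
+| `NotifyOp::AtomicAdd` | Atomic add (`signal += value`) |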
+
+### WaitCmp
+
+Comparison operators for `TWAIT` and `TTEST`:
+
+| Value | Description |
+|-------|-------------|
+| `WaitCmp::EQ` | Equal (`==`) |
+| `WaitCmp::NE` | Not equal (`!=`) |
+| `WaitCmp::GT` | Greater than (`>`) |
+| `WaitCmp::GE` | Greater or equal (`>=`) |
+| `WaitCmp::LT` | Less than (`<`) |
+| `WaitCmp::LE` | Less or equal (`<=`) |
+
+```cpp
+// Usage (unified runtime parameter style):
+comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
+comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
+comm::TTEST(signal, 1, comm::WaitCmp::GE);
+```
+
+### ReduceOp
+
+Reduction operators for `TREDUCE`:
+
+| Value | Description |
+|-------|-------------|
+| `ReduceOp::Sum` | Element-wise sum |
+| `ReduceOp::Max` | Element-wise maximum |
+| `ReduceOp::Min` | Element-wise minimum |
+
+### AtomicType
+
+Atomic operation type for `TPUT` (defined in `include/pto/common/constants.hpp`):
+
+| Value | Description |
+|-------|-------------|
+| `AtomicType::AtomicNone` | No atomic operation (default) |
+| `AtomicType::AtomicAdd` | Atomic add operation |
+
+### DmaEngine
+
+DMA backend selection for `TPUT_ASYNC` and `TGET_ASYNC`:
+
+| Value | Description |
+|-------|-------------|
+| `DmaEngine::SDMA` | SDMA engine (supports 2D transfer) |
+| `DmaEngine::URMA` | URMA engine (supports 1D transfer; not yet implemented) |
+
+### AsyncEvent
+
+Returned by `TPUT_ASYNC` / `TGET_ASYNC`. Use it to wait for or poll transfer completion:
+
+```cpp
+struct AsyncEvent {
+    uint64_t handle;
+    DmaEngine engine;
+
+    bool valid() const;                        // true if handle != 0
+    bool Wait(const AsyncSession &session) const; // block until transfer completes
+    bool Test(const AsyncSession &session) const; // non-blocking completion check
+};
+```
+
+### AsyncSession
+
+Engine-agnostic session for async DMA operations. Build once, pass to all async calls:
+
+```cpp
+comm::AsyncSession session;
+comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);
+```
+
+Defined in `include/pto/comm/async/async_types.hpp`. See [TPUT_ASYNC](TPUT_ASYNC_zh.md) for construction parameters.
+
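+### ParallelGroup
+
+Wrapper for collective communication across multiple NPUs:
+
+```cpp
+template <typename GlobalData>
+struct ParallelGroup {
+    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
+    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
+    // depending on the semantics of the collective instruction.
+    GlobalData *tensors;
+    int nranks;   // Number of ranks
+    int rootIdx;  // Root NPU's rank index
+
+    // Factory function (recommended): build from an existing tensor array.
+    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
+};
+```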
+
diff --git a/designs/outerCube/PTOISA/comm/TBROADCAST.md b/designs/outerCube/PTOISA/comm/TBROADCAST.md
new file mode 100644
index 00000000..1f0f613d
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TBROADCAST.md
@@ -0,0 +1,122 @@
+# TBROADCAST
+
+## Introduction
+
+Broadcasts data from the current NPU to all ranks in the parallel group. The calling NPU is the root, and its data is copied to all other NPUs.
+
+Only the root needs to execute `TBROADCAST`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TBROADCAST` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB (Unified Buffer) tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+After the operation:
+
+$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
+
+where $N$ is the number of ranks and `root` is the calling NPU.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic broadcast (single staging tile)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                                TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong broadcast (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- **Memory constraints**:
+    - `srcGlobalData` must point to local memory (current NPU).
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[k]` must refer to rank `k`'s destination buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the broadcast root.
+    - All destination tensors are assumed to have the same shape and strides.
+- **Chunked mode constraints** (when data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
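+
+The chunked-mode constraints can be illustrated with a host-side planner that splits the DIM_3 × DIM_4 plane into tile-sized chunks. This is a sketch under stated assumptions, not the library's chunking code: `RefChunk` and `RefPlanChunks` are hypothetical names, the row-then-column visiting order is assumed, and a `dynamicValid` flag stands in for a tile with `DYNAMIC` ValidRow/ValidCol.
+
+```cpp
+#include <algorithm>
+#include <cstddef>
+#include <stdexcept>
+#include <vector>
+
+// One rectangular chunk of the row/column plane (DIM_3 x DIM_4).
+struct RefChunk {
+    std::size_t row0, col0;  // top-left corner
+    std::size_t rows, cols;  // chunk extent (may be a partial tail)
+};
+
+// Enumerate chunks of a rows x cols region for a tileRows x tileCols staging tile.
+// dynamicValid == false models static ValidRow/ValidCol, which cannot express a
+// partial tail chunk, so both extents must divide evenly.
+std::vector<RefChunk> RefPlanChunks(std::size_t rows, std::size_t cols,
+                                    std::size_t tileRows, std::size_t tileCols,
+                                    bool dynamicValid) {
+    if (tileRows == 0 || tileCols == 0) {
+        throw std::invalid_argument("empty tile");
+    }
+    if (!dynamicValid && (rows % tileRows != 0 || cols % tileCols != 0)) {
+        throw std::invalid_argument("static valid shape requires divisible extents");
+    }
+    std::vector<RefChunk> plan;
+    for (std::size_t r = 0; r < rows; r += tileRows) {
+        for (std::size_t c = 0; c < cols; c += tileCols) {
+            plan.push_back({r, c, std::min(tileRows, rows - r), std::min(tileCols, cols - c)});
+        }
+    }
+    return plan;
+}
+```
+
+For example, a 40 × 48 region with a 16 × 32 tile is only plannable in the dynamic case, because neither extent divides evenly.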
+
+## Examples
+
+### Basic Broadcast
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
+    // Tile dimensions can differ from tensor dimensions.
+    // The 2D sliding chunked path automatically tiles both row and column.
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GTensor(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor srcG(my_data);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    // Current NPU broadcasts its data to all others
+    comm::TBROADCAST(group, srcG, stagingTile);
+}
+```
+
+### Ping-Pong Broadcast (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk with TSTORE of the current chunk.
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
+
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GPerRank srcG(my_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TBROADCAST(group, srcG, pingTile, pongTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TBROADCAST_zh.md b/designs/outerCube/PTOISA/comm/TBROADCAST_zh.md
new file mode 100644
index 00000000..a82808a5
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TBROADCAST_zh.md
@@ -0,0 +1,123 @@
+# TBROADCAST
+
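+## Introduction
+
+Broadcasts data from the current NPU to all ranks in the parallel group. The calling NPU is the root, and its data is copied to all other NPUs.
+
+Only the root needs to execute `TBROADCAST`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TBROADCAST` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB (Unified Buffer) tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+After the operation:
+
+$$ \mathrm{dst}^{(k)}_{i,j} = \mathrm{src}^{(\text{root})}_{i,j} \quad \forall k \in [0, N) $$
+
+where $N$ is the number of ranks and `root` is the calling NPU.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
+
+Synchronous form: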
+
+```text
+tbroadcast %group, %src : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic broadcast (single staging tile)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                                TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong broadcast (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TBROADCAST(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                                TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- **Memory constraints**:
+    - `srcGlobalData` must point to local memory (current NPU).
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[k]` must refer to rank `k`'s destination buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the broadcast root.
+    - All destination tensors are assumed to have the same shape and strides.
+- **Chunked mode constraints** (when data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Broadcast
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void broadcast(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
+    // Tile dimensions can differ from tensor dimensions.
+    // The 2D sliding chunked path automatically tiles both row and column.
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                 BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GTensor(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor srcG(my_data);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    // Current NPU broadcasts its data to all others
+    comm::TBROADCAST(group, srcG, stagingTile);
+}
+```
+
+### Ping-Pong Broadcast (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk with TSTORE of the current chunk.
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void broadcast_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* my_data, int my_rank) {
+
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GPerRank srcG(my_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TBROADCAST(group, srcG, pingTile, pongTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TGATHER.md b/designs/outerCube/PTOISA/comm/TGATHER.md
new file mode 100644
index 00000000..a5d3b35e
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGATHER.md
@@ -0,0 +1,128 @@
+# TGATHER
+
+## Introduction
+
+Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.
+
+Only the root needs to execute `TGATHER`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TGATHER` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions.
+
+## Math Interpretation
+
+Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
+
+$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
+
+The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
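+
+The index mapping above can be checked with a small host-side reference. This is only a sketch: `RefGather` is a hypothetical name, the outer dimensions are assumed to be 1 (so each rank holds a single row-major `H × W` block), and `float` elements are an arbitrary choice.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Concatenate N per-rank H x W row-major buffers along the row dimension (DIM_3).
+// The result is an (N * H) x W row-major buffer with dst[r * H + i][j] = src_r[i][j].
+std::vector<float> RefGather(const std::vector<std::vector<float>> &perRank,
+                             std::size_t H, std::size_t W) {
+    const std::size_t N = perRank.size();
+    std::vector<float> dst(N * H * W);
+    for (std::size_t r = 0; r < N; ++r) {
+        for (std::size_t i = 0; i < H; ++i) {
+            for (std::size_t j = 0; j < W; ++j) {
+                dst[(r * H + i) * W + j] = perRank[r][i * W + j];
+            }
+        }
+    }
+    return dst;
+}
+```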
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic gather (single staging tile)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                             TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong gather (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalDstData::RawDType`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to local memory (current NPU) and be large enough to hold the concatenated result from all ranks. Specifically, `dstGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
+    - If `dstGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are written; remaining rows are left unchanged.
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the gather root.
+    - All source tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
+- **Chunked mode constraints** (when source data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Gather (Single Staging Tile)
+
+Each rank contributes `ROWS × COLS` data. The root collects them into `NRANKS * ROWS` rows.
+The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    comm::TGATHER(group, dstG, stagingTile);
+}
+```
+
+### Ping-Pong Gather (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    // Tile can be smaller than the data in both dimensions
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TGATHER(group, dstG, pingTile, pongTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TGATHER_zh.md b/designs/outerCube/PTOISA/comm/TGATHER_zh.md
new file mode 100644
index 00000000..699e50b1
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGATHER_zh.md
@@ -0,0 +1,121 @@
+# TGATHER
+
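+## Introduction
+
+Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.
+
+Only the root needs to execute `TGATHER`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TGATHER` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding, the same mechanism used by other PTO-COMM instructions.
+
+## Math Interpretation
+
+Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
+
+$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
+
+The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
+
+Synchronous form: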
+
+```text
+tgather %group, %dst : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic gather (single staging tile)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                             TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong gather (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGATHER(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
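+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalDstData::RawDType`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to local memory (current NPU) and be large enough to hold the concatenated result from all ranks. Specifically, `dstGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
+    - If `dstGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are written; remaining rows are left unchanged.
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the gather root.
+    - All source tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
+- **Chunked mode constraints** (when source data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Gather (Single Staging Tile)
+
+Each rank contributes `ROWS × COLS` data. The root collects them into `NRANKS * ROWS` rows.
+The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data; in that case the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.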
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+    comm::TGATHER(group, dstG, stagingTile);
+}
+```
+
+### Ping-Pong Gather (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GResult  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GResult dstG(result);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TGATHER(group, dstG, pingTile, pongTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TGET.md b/designs/outerCube/PTOISA/comm/TGET.md
new file mode 100644
index 00000000..0e000561
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGET.md
@@ -0,0 +1,109 @@
+# TGET
+
+## Introduction
+
+Remote read operation: reads data from a remote NPU into local memory. The data is staged through a UB tile that serves as an intermediate buffer.
+
+When the GlobalTensor exceeds the UB tile capacity, TGET automatically performs **2D sliding**: it chunks rows (DIM_3) and columns (DIM_4) so that each chunk fits into the tile, and iterates over all outer dimensions (DIM_0, DIM_1, DIM_2).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
+
+Data flow: `srcGlobalData (remote GM)` → `stagingTileData (UB)` → `dstGlobalData (local GM)`
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`
+
+### Single-tile (auto-chunking)
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, WaitEvents&... events);
+```
+
+### Ping-pong double buffering
+
+Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
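+
+To see why a second staging tile helps, the following host-side sketch compares a serial schedule with a ping-pong schedule. It is a simplified timing model under stated assumptions (one load engine, one store engine, a fixed cost per chunk); `serialTime` and `pingPongTime` are hypothetical helpers, not PTO functions.
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <vector>
+
+// Total time when each chunk is fully loaded and stored before the next one starts.
+inline long serialTime(int chunks, long loadT, long storeT) {
+    return static_cast<long>(chunks) * (loadT + storeT);
+}
+
+// Total time with two staging buffers: load i may start once load i-1 has
+// finished and the buffer it reuses (last used by chunk i-2) has been stored.
+inline long pingPongTime(int chunks, long loadT, long storeT) {
+    assert(chunks >= 0 && loadT >= 0 && storeT >= 0);
+    std::vector<long> loadEnd(chunks), storeEnd(chunks);
+    for (int i = 0; i < chunks; ++i) {
+        long prevLoad  = (i >= 1) ? loadEnd[i - 1] : 0;
+        long bufFree   = (i >= 2) ? storeEnd[i - 2] : 0;
+        loadEnd[i]     = std::max(prevLoad, bufFree) + loadT;
+        long prevStore = (i >= 1) ? storeEnd[i - 1] : 0;
+        storeEnd[i]    = std::max(loadEnd[i], prevStore) + storeT;
+    }
+    return chunks > 0 ? storeEnd[chunks - 1] : 0;
+}
+```
+
+With 4 chunks, a load cost of 3 and a store cost of 2, this model gives 20 time units for the serial schedule and 14 for the ping-pong schedule.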
+
+## Constraints
+
+- **Type constraints**:
+    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
+- **Memory constraints**:
+    - `srcGlobalData` must point to a remote address (on the source NPU).
+    - `dstGlobalData` must point to a local address (on the current NPU).
+    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in the Unified Buffer (UB).
+- **Valid region**:
+    - Transfer size is determined by `GlobalTensor` shape (auto-chunked to fit tile).
+- **Ping-pong**:
+    - `pingTile` and `pongTile` must have the same type and dimensions.
+    - The two tiles must reside at non-overlapping UB offsets.
+
+## Examples
+
+### Basic Usage
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_tget(__gm__ T* local_data, __gm__ T* remote_addr) {
+    using TileT = Tile<TileType::Vec, T, 16, 16>;
+    using GShape = Shape<1, 1, 1, 16, 16>;
+    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+    /*
+    If the GlobalTensor is larger than the UB tile, TGET performs 2D sliding automatically, e.g.:
+    using GShape = Shape<1, 1, 1, 4096, 4096>;
+    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
+    */
+    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+    GTensor srcG(remote_addr);
+    GTensor dstG(local_data);
+    TileT stagingTile;
+    TASSIGN(stagingTile, 0);
+
+    // Basic remote read
+    comm::TGET(dstG, srcG, stagingTile);
+}
+```
+
+### Ping-pong Double Buffering
+
+```cpp
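+// Fragment: dstG and srcG are GlobalTensor objects built as in Basic Usage (with T = float).
+// TileT is assumed here to be a 64 x 64 Vec tile with dynamic valid rows/cols (-1, -1),
+// so that the valid extent can be passed to the constructor.
+using TileT = Tile<TileType::Vec, float, 64, 64, BLayout::RowMajor, -1, -1>;
+
+// Size of one tile in UB, rounded up to a multiple of 1024 bytes
+constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
+TileT pingTile(64, 64);
+TileT pongTile(64, 64);
+TASSIGN(pingTile, 0);
+TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region
+
+// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
+comm::TGET(dstG, srcG, pingTile, pongTile);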
+```
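+
+The offset arithmetic in the ping-pong example can be checked on the host. The sketch below is an illustration under stated assumptions: `alignUp`, `rangesOverlap`, and `pongOffsetAfter` are hypothetical helpers, and the 1024-byte rounding simply mirrors the example rather than a documented hardware requirement.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+
+// Rounds bytes up to the next multiple of alignment (alignment must be > 0).
+constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
+    return ((bytes + alignment - 1) / alignment) * alignment;
+}
+
+// True when the byte ranges [a, a + aLen) and [b, b + bLen) share any byte.
+constexpr bool rangesOverlap(std::size_t a, std::size_t aLen,
+                             std::size_t b, std::size_t bLen) {
+    return a < b + bLen && b < a + aLen;
+}
+
+// Returns a UB offset for the pong tile, placed right after the ping tile.
+inline std::size_t pongOffsetAfter(std::size_t pingOffset, std::size_t tileBytes,
+                                   std::size_t alignment) {
+    assert(alignment > 0);
+    return pingOffset + alignUp(tileBytes, alignment);
+}
+
+// Offsets for a pair of 64 x 64 float tiles, mirroring the example above.
+constexpr std::size_t kTileBytes  = 64 * 64 * sizeof(float);
+constexpr std::size_t kPingOffset = 0;
+constexpr std::size_t kPongOffset = alignUp(kTileBytes, 1024);
+
+static_assert(!rangesOverlap(kPingOffset, kTileBytes, kPongOffset, kTileBytes),
+              "ping and pong tiles must not overlap in UB");
+```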
diff --git a/designs/outerCube/PTOISA/comm/TGET_ASYNC.md b/designs/outerCube/PTOISA/comm/TGET_ASYNC.md
new file mode 100644
index 00000000..9a576051
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGET_ASYNC.md
@@ -0,0 +1,130 @@
+# TGET_ASYNC
+
+## Introduction
+
+`TGET_ASYNC` is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an `AsyncEvent` immediately.
+
+Data flow:
+
+`srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)`
+
+## Template Parameter
+
+- `engine`:
+    - `DmaEngine::SDMA` (default)
+    - `DmaEngine::URMA` (not yet implemented)
+
+> **Important (SDMA path)**  
+> `TGET_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.  
+> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`.
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
+`AsyncSession` is an engine-agnostic session object. Build it once with
+`BuildAsyncSession<engine>()`, then pass it to all async calls and event waits.
+The `engine` template parameter selects the DMA backend at compile time, which keeps the
+code forward-compatible with future engines (URMA, CCU, etc.).
+
+## AsyncSession Construction
+
+Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`:
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
+The engine template parameter selects the backend (currently only SDMA).
+
+Parameters with defaults:
+
+| Parameter | Default | Description |
+|---|---|---|
+| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
+| `baseConfig` | `{32*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
+
+## Constraints
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- The SDMA path requires the source tensor to be a **flat, contiguous, logical 1D** tensor
+- `workspace` must be a valid GM pointer allocated by the host-side `SdmaWorkspaceManager`
+
+If the 1D contiguous requirement is not met, the current implementation returns an invalid async event (`handle == 0`).
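+
+The exact check used by the implementation is not shown in this document. As a rough guide, the sketch below tests the pattern used in the Example section: the four outer extents are 1 and the innermost dimension has unit stride. `Dims5` and `isFlat1D` are hypothetical names, and the predicate is an assumption about what "flat contiguous logical 1D" means, not the library's definition.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstdint>
+
+using Dims5 = std::array<std::int64_t, 5>;
+
+// Illustrative predicate only: true when the four outer extents are 1 and
+// the innermost dimension is non-empty with unit stride.
+inline bool isFlat1D(const Dims5 &shape, const Dims5 &stride) {
+    for (int d = 0; d < 4; ++d) {
+        if (shape[d] != 1) return false;
+    }
+    return shape[4] > 0 && stride[4] == 1;
+}
+```
+
+A caller could run such a check before issuing the transfer instead of relying on the `handle == 0` result.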
+
+## scratchTile Role
+
+`scratchTile` is **not** used to hold transferred payload data.  
+It is converted to `TmpBuffer` and used as temporary UB workspace for:
+
+- writing/reading SDMA control words (flag, sq_tail, channel_info)
+- polling event completion flags
+- committing queue tail during completion
+
+The real payload path remains remote GM -> DMA engine -> local GM; `scratchTile` is only for control/synchronization metadata.
+
+## scratchTile Type and Size Constraints
+
+- must be a `pto::Tile` type
+- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
+- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
+
+Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
+
+## Completion Semantics
+
+Use `AsyncEvent` to synchronize:
+
+- `event.Wait(session)` — blocks until the transfer is complete
+
+After `Wait` returns successfully, all data has been written into `dstGlobalData`.
+
+## Example
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(localDst, shape, stride);
+    GT srcG(remoteSrc, shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TGET_ASYNC_zh.md b/designs/outerCube/PTOISA/comm/TGET_ASYNC_zh.md
new file mode 100644
index 00000000..8f9a6383
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGET_ASYNC_zh.md
@@ -0,0 +1,126 @@
+# TGET_ASYNC
+
+## 简介
+
+`TGET_ASYNC` 是异步远程读原语。它启动一次从远端 GM 到本地 GM 的传输，并立即返回 `AsyncEvent`。
+
+数据流：
+
+`srcGlobalData（远端 GM）` → DMA 引擎 → `dstGlobalData（本地 GM）`
+
+## 模板参数
+
+- `engine`：
+    - `DmaEngine::SDMA`（默认）
+    - `DmaEngine::URMA`（待实现）
+
+> **注意（SDMA 路径）**
+> `TGET_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
+> 当前 SDMA 异步实现不支持非一维或非连续布局。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
+`AsyncSession` 是引擎无关的会话对象。使用 `BuildAsyncSession<engine>()` 构建一次后，传递给所有异步调用和事件等待。模板参数 `engine` 在编译期选择 DMA 后端，使代码对未来引擎（URMA、CCU 等）保持前向兼容。
+
+## AsyncSession 构建
+
+使用 `include/pto/comm/async/async_event_impl.hpp` 中的 `BuildAsyncSession`：
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
+带默认值的参数说明：
+
+| 参数 | 默认值 | 说明 |
+|---|---|---|
+| `syncId` | `0` | MTE3/MTE2 管道同步事件 ID（0-7）。若 kernel 在相同 ID 上使用了其他管道屏障，则需覆盖此值。|
+| `baseConfig` | `{32*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`。适用于大多数单队列传输场景。|
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA 通道组索引。默认内部使用 `get_block_idx()` 映射到当前 AI Core。多 block 或自定义通道映射场景下需覆盖此值。|
+
+## 约束
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- SDMA 路径要求源 tensor 为**扁平连续的逻辑一维**
+- workspace 必须是由主机侧 `SdmaWorkspaceManager` 分配的有效 GM 指针
+
+若不满足一维连续要求，当前实现返回无效 async event（`handle == 0`）。
+
+## scratchTile 的作用
+
+`scratchTile` **不是**用于传输数据负载的暂存缓冲区。
+它被转换为 `TmpBuffer`，用作临时 UB 工作区，用于：
+
+- 写入/读取 SDMA 控制字（flag、sq_tail、channel_info）
+- 轮询事件完成标志
+- 完成时提交队列尾部
+
+实际数据路径为远端 GM → DMA 引擎 → 本地 GM；`scratchTile` 仅用于控制和同步元数据。
+
+## scratchTile 类型与大小约束
+
+- 必须是 `pto::Tile` 类型
+- 必须是 UB/Vec tile（`ScratchTile::Loc == TileType::Vec`）
+- 可用字节数至少为 `sizeof(uint64_t)`（8 字节）
+
+推荐使用：`Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>`（256B）。
+
+## 完成语义
+
+使用 `AsyncEvent` 同步：
+
+- `event.Wait(session)` — 阻塞直到传输完成
+
+wait 成功后，读入 `dstGlobalData` 的数据已全部就绪。
+
+## 示例
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(localDst,  shape, stride);
+    GT srcG(remoteSrc, shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TGET_zh.md b/designs/outerCube/PTOISA/comm/TGET_zh.md
new file mode 100644
index 00000000..2c39826f
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TGET_zh.md
@@ -0,0 +1,106 @@
+# TGET
+
+## 简介
+
+远程读操作：将远端 NPU 的数据读取到本地内存。数据通过 UB Tile 作为中间暂存缓冲区进行传输。
+
+当 GlobalTensor 超出 UB Tile 容量时，TGET 将自动执行**二维滑动**——沿行（DIM_3）和列（DIM_4）分块以适配 Tile，并遍历所有外层维度（DIM_0、DIM_1、DIM_2）。
+
+## 数学语义
+
+对有效区域内每个元素 `(i, j)`：
+
+$$\mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j}$$
+
+数据流：`srcGlobalData（远端 GM）` → `stagingTileData（UB）` → `dstGlobalData（本地 GM）`
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
+
+同步形式：
+
+```text
+tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
+```
+
+降级时会为 GM→UB→GM 数据路径引入 UB 暂存 Tile；C++ 内建接口需要显式传入 `stagingTileData`（或 `pingTile` / `pongTile`）操作数。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`
+
+### 单 Tile（自动分块）
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, WaitEvents&... events);
+```
+
+### 乒乓双缓冲
+
+使用两个暂存 Tile，将相邻块的 TLOAD 与 TSTORE 重叠执行，隐藏 DMA 传输延迟。
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## 约束
+
+- **类型约束**：
+    - `GlobalSrcData::RawDType` 必须等于 `GlobalDstData::RawDType`。
+    - `TileData::DType` 必须等于 `GlobalSrcData::RawDType`。
+    - `GlobalSrcData::layout` 必须等于 `GlobalDstData::layout`。
+- **内存约束**：
+    - `srcGlobalData` 必须指向远端地址（源 NPU）。
+    - `dstGlobalData` 必须指向本地地址（当前 NPU）。
+    - `stagingTileData` / `pingTile` / `pongTile` 必须预先在统一缓冲区中分配。
+- **有效区域**：
+    - 传输大小由 `GlobalTensor` 的形状决定（自动分块以适配 Tile）。
+- **乒乓约束**：
+    - `pingTile` 和 `pongTile` 必须具有相同的类型和维度。
+    - 必须位于不重叠的 UB 偏移处。
+
+## 示例
+
+### 基础用法
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_tget(__gm__ T* local_data, __gm__ T* remote_addr) {
+    using TileT   = Tile<TileType::Vec, T, 16, 16>;
+    using GShape  = Shape<1, 1, 1, 16, 16>;
+    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+    GTensor srcG(remote_addr);
+    GTensor dstG(local_data);
+    TileT stagingTile;
+    TASSIGN(stagingTile, 0);
+
+    // 基础远程读
+    comm::TGET(dstG, srcG, stagingTile);
+}
+```
+
+### 乒乓双缓冲
+
+```cpp
+constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
+TileT pingTile(64, 64);
+TileT pongTile(64, 64);
+TASSIGN(pingTile, 0);
+TASSIGN(pongTile, tileUBBytes);  // 不重叠的 UB 区域
+
+// 将 TLOAD[i+1] 与 TSTORE[i] 重叠执行以提升流水线利用率
+comm::TGET(dstG, srcG, pingTile, pongTile);
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TNOTIFY.md b/designs/outerCube/PTOISA/comm/TNOTIFY.md
new file mode 100644
index 00000000..6bb80aa4
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TNOTIFY.md
@@ -0,0 +1,100 @@
+# TNOTIFY
+
+## Introduction
+
+Sends a flag notification to a remote NPU. It is used for lightweight synchronization between NPUs without transferring bulk data.
+
+## Math Interpretation
+
+For `NotifyOp::Set`:
+
+$$ \mathrm{signal}^{\mathrm{remote}} = \mathrm{value} $$
+
+For `NotifyOp::AtomicAdd`:
+
+$$ \mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{atomic}) $$
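+
+The two update rules can be modelled on the host with `std::atomic`. This is only an analogy for the semantics above, not device code: `NotifyOpModel` and `notifyModel` are hypothetical stand-ins for `NotifyOp` and `TNOTIFY`, and the signal word lives in ordinary host memory.
+
+```cpp
+#include <atomic>
+#include <cassert>
+#include <cstdint>
+
+// Local stand-in for the two notify operations; not pto::comm::NotifyOp.
+enum class NotifyOpModel { Set, AtomicAdd };
+
+// Applies the notification to a signal word held in host memory.
+inline void notifyModel(std::atomic<std::int32_t> &signal, std::int32_t value,
+                        NotifyOpModel op) {
+    switch (op) {
+        case NotifyOpModel::Set:
+            signal.store(value);      // signal = value
+            break;
+        case NotifyOpModel::AtomicAdd:
+            signal.fetch_add(value);  // signal += value, as one atomic step
+            break;
+    }
+}
+```
+
+`Set` discards the previous value, so it suits one-shot flags; `AtomicAdd` preserves contributions from several notifiers, so it suits counters.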
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+```text
+tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
+tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
+- **Memory constraints**:
+    - `dstSignalData` must point to a remote address (on the target NPU).
+    - `dstSignalData` should be 4-byte aligned.
+- **Operation semantics**:
+    - `NotifyOp::Set`: Direct store to remote memory.
+    - `NotifyOp::AtomicAdd`: Hardware atomic add using `st_atomic` instruction.
+
+## Examples
+
+### Basic Set Notification
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void notify_set(__gm__ int32_t* remote_signal) {
+    comm::Signal sig(remote_signal);
+    
+    // Set remote signal to 1
+    comm::TNOTIFY(sig, 1, comm::NotifyOp::Set);
+}
+```
+
+### Atomic Counter Increment
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void atomic_increment(__gm__ int32_t* remote_counter) {
+    comm::Signal counter(remote_counter);
+    
+    // Atomically add 1 to remote counter
+    comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
+}
+```
+
+### Producer-Consumer Pattern
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// Producer: notify when data is ready
+void producer(__gm__ int32_t* remote_flag) {
+    // ... produce data ...
+    
+    comm::Signal flag(remote_flag);
+    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
+}
+
+// Consumer: wait for data
+void consumer(__gm__ int32_t* local_flag) {
+    comm::Signal flag(local_flag);
+    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+    
+    // ... consume data ...
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TNOTIFY_zh.md b/designs/outerCube/PTOISA/comm/TNOTIFY_zh.md
new file mode 100644
index 00000000..d484bcfe
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TNOTIFY_zh.md
@@ -0,0 +1,101 @@
+# TNOTIFY
+
+## 简介
+
+向远端 NPU 发送标志通知。用于 NPU 之间的轻量级同步，无需传输大量数据。
+
+## 数学语义
+
+`NotifyOp::Set` 时：
+
+$$\mathrm{signal}^{\mathrm{remote}} = \mathrm{value}$$
+
+`NotifyOp::AtomicAdd` 时：
+
+$$\mathrm{signal}^{\mathrm{remote}} \mathrel{+}= \mathrm{value} \quad (\text{原子操作})$$
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
+
+```text
+tnotify %signal_remote, %value {op = #pto.notify_op<Set>} : (!pto.memref<i32>, i32)
+tnotify %signal_remote, %value {op = #pto.notify_op<AtomicAdd>} : (!pto.memref<i32>, i32)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST void TNOTIFY(GlobalSignalData &dstSignalData, int32_t value, NotifyOp op, WaitEvents&... events);
+```
+
+## 约束
+
+- **类型约束**：
+    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
+- **内存约束**：
+    - `dstSignalData` 必须指向远端地址（目标 NPU）。
+    - `dstSignalData` 应 4 字节对齐。
+- **操作语义**：
+    - `NotifyOp::Set`：直接存储到远端内存。
+    - `NotifyOp::AtomicAdd`：使用 `st_atomic` 指令执行硬件原子加。
+
+## 示例
+
+### 基础 Set 通知
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void notify_set(__gm__ int32_t* remote_signal) {
+    comm::Signal sig(remote_signal);
+
+    // 将远端信号置为 1
+    comm::TNOTIFY(sig, 1, comm::NotifyOp::Set);
+}
+```
+
+### 原子计数器自增
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void atomic_increment(__gm__ int32_t* remote_counter) {
+    comm::Signal counter(remote_counter);
+
+    // 对远端计数器原子加 1
+    comm::TNOTIFY(counter, 1, comm::NotifyOp::AtomicAdd);
+}
+```
+
+### 生产者-消费者模式
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// 生产者：数据就绪后发送通知
+void producer(__gm__ int32_t* remote_flag) {
+    // ... 生产数据 ...
+
+    comm::Signal flag(remote_flag);
+    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
+}
+
+// 消费者：等待数据就绪
+void consumer(__gm__ int32_t* local_flag) {
+    comm::Signal flag(local_flag);
+    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+
+    // ... 消费数据 ...
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TPUT.md b/designs/outerCube/PTOISA/comm/TPUT.md
new file mode 100644
index 00000000..0c16a735
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TPUT.md
@@ -0,0 +1,131 @@
+# TPUT
+
+## Introduction
+
+Remote write operation: writes local data to a remote NPU's memory. Data is transferred through a UB tile that serves as an intermediate staging buffer.
+
+When the GlobalTensor exceeds the UB tile capacity, TPUT automatically performs **2D sliding** — chunking rows (DIM_3) and columns (DIM_4) to fit each chunk into the tile, iterating over all outer dimensions (DIM_0, DIM_1, DIM_2).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j} $$
+
+Data flow: `srcGlobalData (local GM)` → `stagingTileData (UB)` → `dstGlobalData (remote GM)`
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`
+
+### Single-tile (auto-chunking)
+
+```cpp
+template <AtomicType atomicType = AtomicType::AtomicNone,
+          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, WaitEvents&... events);
+```
+
+### Ping-pong double buffering
+
+Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
+
+```cpp
+template <AtomicType atomicType = AtomicType::AtomicNone,
+          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+### Runtime atomic type
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to a remote address (on the target NPU).
+    - `srcGlobalData` must point to a local address (on the current NPU).
+    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in the Unified Buffer (UB).
+- **Valid region**:
+    - Transfer size is determined by `GlobalTensor` shape (auto-chunked to fit tile).
+- **Atomic operation**:
+    - `atomicType` supports `AtomicNone` and `AtomicAdd`.
+- **Ping-pong**:
+    - `pingTile` and `pongTile` must have the same type and dimensions.
+    - The two tiles must reside at non-overlapping UB offsets.
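+
+The effect of the two supported atomic modes on the destination can be modelled on the host. This is an interpretation, assuming `AtomicAdd` accumulates each source element into the destination while `AtomicNone` overwrites it; `AtomicModel` and `putModel` are hypothetical stand-ins, not the PTO types.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Local stand-in for the two supported modes; not pto::AtomicType.
+enum class AtomicModel { None, Add };
+
+// Host-side model of the element-wise effect on the destination buffer.
+inline void putModel(std::vector<float> &dst, const std::vector<float> &src,
+                     AtomicModel mode) {
+    assert(dst.size() == src.size());
+    for (std::size_t i = 0; i < src.size(); ++i) {
+        if (mode == AtomicModel::Add) {
+            dst[i] += src[i];  // AtomicAdd: accumulate into the existing value
+        } else {
+            dst[i] = src[i];   // AtomicNone: plain overwrite
+        }
+    }
+}
+```
+
+Under this reading, `AtomicAdd` is the mode to choose when several NPUs contribute partial results to the same remote buffer.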
+
+## Examples
+
+### Basic Usage
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
+    using TileT = Tile<TileType::Vec, T, 16, 16>;
+    using GShape = Shape<1, 1, 1, 16, 16>;
+    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+    /*
+    If the GlobalTensor is larger than the UB tile, TPUT performs 2D sliding automatically, e.g.:
+    using GShape = Shape<1, 1, 1, 4096, 4096>;
+    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
+    */
+    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+    GTensor srcG(local_data);
+    GTensor dstG(remote_addr);
+    TileT stagingTile;
+    TASSIGN(stagingTile, 0);
+
+    // Basic remote write
+    comm::TPUT(dstG, srcG, stagingTile);
+
+    // Remote write with atomic add
+    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
+}
+```
+
+### Ping-pong Double Buffering
+
+```cpp
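+// Fragment: dstG and srcG are GlobalTensor objects built as in Basic Usage (with T = float).
+// TileT is assumed here to be a 64 x 64 Vec tile with dynamic valid rows/cols (-1, -1),
+// so that the valid extent can be passed to the constructor.
+using TileT = Tile<TileType::Vec, float, 64, 64, BLayout::RowMajor, -1, -1>;
+
+// Size of one tile in UB, rounded up to a multiple of 1024 bytes
+constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
+TileT pingTile(64, 64);
+TileT pongTile(64, 64);
+TASSIGN(pingTile, 0);
+TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region
+
+// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
+comm::TPUT(dstG, srcG, pingTile, pongTile);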
+```
+
+### Runtime Atomic Type
+
+```cpp
+// Select atomic type at runtime instead of compile-time template parameter
+comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
+```
\ No newline at end of file
diff --git a/designs/outerCube/PTOISA/comm/TPUT_ASYNC.md b/designs/outerCube/PTOISA/comm/TPUT_ASYNC.md
new file mode 100644
index 00000000..3d8a16aa
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TPUT_ASYNC.md
@@ -0,0 +1,132 @@
+# TPUT_ASYNC
+
+## Introduction
+
+`TPUT_ASYNC` is an asynchronous remote write primitive. It starts a transfer from local GM to remote GM and returns an `AsyncEvent` immediately.
+
+Data flow:
+
+`srcGlobalData (local GM) -> DMA engine -> dstGlobalData (remote GM)`
+
+
+## Template Parameter
+
+- `engine`:
+    - `DmaEngine::SDMA` (default)
+    - `DmaEngine::URMA` (not yet implemented)
+
+> **Important (SDMA path)**  
+> `TPUT_ASYNC` with `DmaEngine::SDMA` currently supports **only flat contiguous logical 1D tensors**.  
+> Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
+
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`.
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
+`AsyncSession` is an engine-agnostic session object. Build it once with
+`BuildAsyncSession<engine>()`, then pass it to all async calls and event waits.
+The `engine` template parameter selects the DMA backend at compile time, which keeps the
+code forward-compatible with future engines (URMA, CCU, etc.).
+
+## AsyncSession Construction
+
+Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`:
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
+The engine template parameter selects the backend (currently only SDMA).
+
+Parameters with defaults:
+
+| Parameter | Default | Description |
+|---|---|---|
+| `syncId` | `0` | MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
+| `baseConfig` | `{32*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
+
+## Constraints
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- The SDMA path requires the source tensor to be a **flat, contiguous, logical 1D** tensor
+- `workspace` must be a valid GM pointer allocated by the host-side `SdmaWorkspaceManager`
+
+If the 1D contiguous requirement is not met, the current implementation returns an invalid async event (`handle == 0`).
+
+## scratchTile Role
+
+`scratchTile` is **not** the payload staging buffer for user data.  
+It is converted to `TmpBuffer` and used as temporary UB workspace for:
+
+- writing/reading SDMA control words (flag, sq_tail, channel_info)
+- polling event completion flags
+- committing queue tail during completion
+
+Data payload moves between GM buffers directly; `scratchTile` only supports control and synchronization metadata.
+
+## scratchTile Type and Size Constraints
+
+- must be a `pto::Tile` type
+- must be UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
+- available bytes must be at least `sizeof(uint64_t)` (8 bytes)
+
+Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
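+
+The size constraint can be expressed as a compile-time check. The sketch below is host-side C++ for illustration: `kUbAlignSizeModel` is an assumed value inferred from the "(256B)" note above (the real constant is `comm::sdma::UB_ALIGN_SIZE`), and `tileBytes` and `scratchLargeEnough` are hypothetical helpers.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+
+// Assumed value, inferred from the "(256B)" note; stands in for comm::sdma::UB_ALIGN_SIZE.
+constexpr std::size_t kUbAlignSizeModel = 256;
+
+// Bytes occupied by a rows x cols tile of ElemT.
+template <typename ElemT>
+constexpr std::size_t tileBytes(std::size_t rows, std::size_t cols) {
+    return rows * cols * sizeof(ElemT);
+}
+
+// True when a scratch tile of this size can hold at least one 64-bit control word.
+constexpr bool scratchLargeEnough(std::size_t bytes) {
+    return bytes >= sizeof(std::uint64_t);
+}
+
+static_assert(scratchLargeEnough(tileBytes<std::uint8_t>(1, kUbAlignSizeModel)),
+              "recommended scratch tile satisfies the 8-byte minimum");
+```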
+
+## Completion Semantics
+
+Use `AsyncEvent` to synchronize:
+
+- `event.Wait(session)` — blocks until the transfer is complete
+
+After `Wait` returns successfully, all writes to `dstGlobalData` are complete.
+
+## Example
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(remoteDst, shape, stride);
+    GT srcG(localSrc, shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TPUT_ASYNC_zh.md b/designs/outerCube/PTOISA/comm/TPUT_ASYNC_zh.md
new file mode 100644
index 00000000..ceb83763
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TPUT_ASYNC_zh.md
@@ -0,0 +1,126 @@
+# TPUT_ASYNC
+
+## 简介
+
+`TPUT_ASYNC` 是异步远程写原语。它启动一次从本地 GM 到远端 GM 的传输，并立即返回 `AsyncEvent`。
+
+数据流：
+
+`srcGlobalData（本地 GM）` → DMA 引擎 → `dstGlobalData（远端 GM）`
+
+## 模板参数
+
+- `engine`：
+    - `DmaEngine::SDMA`（默认）
+    - `DmaEngine::URMA`（待实现）
+
+> **注意（SDMA 路径）**
+> `TPUT_ASYNC` 配合 `DmaEngine::SDMA` 目前**仅支持扁平连续的逻辑一维 tensor**。
+> 当前 SDMA 异步实现不支持非一维或非连续布局。
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA,
+          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
+PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                               const AsyncSession &session, WaitEvents &... events);
+```
+
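+`AsyncSession` is an engine-agnostic session object. Build it once with `BuildAsyncSession<engine>()`, then pass it to all asynchronous calls and event waits. The `engine` template parameter selects the DMA backend at compile time, which keeps the code forward-compatible with future engines (URMA, CCU, etc.).
+
+## Building an AsyncSession
+
+Use `BuildAsyncSession` from `include/pto/comm/async/async_event_impl.hpp`: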
+
+```cpp
+template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
+PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
+                                    __gm__ uint8_t *workspace,
+                                    AsyncSession &session,
+                                    uint32_t syncId = 0,
+                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
+                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
+```
+
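+Parameters with default values:
+
+| Parameter | Default | Description |
+|---|---|---|
+| `syncId` | `0` | MTE3/MTE2 pipe synchronization event ID (0-7). Override it if the kernel uses other pipe barriers on the same ID. |
+| `baseConfig` | `{32*1024, 0, 1}` | `{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers. |
+| `channelGroupIdx` | `kAutoChannelGroupIdx` | SDMA channel group index. By default it is mapped internally to the current AI Core via `get_block_idx()`. Override it for multi-block or custom channel-mapping scenarios. |
+
+## Constraints
+
+- `GlobalSrcData::RawDType == GlobalDstData::RawDType`
+- `GlobalSrcData::layout == GlobalDstData::layout`
+- The SDMA path requires the source tensor to be **flat, contiguous, and logically one-dimensional**
+- `workspace` must be a valid GM pointer allocated by the host-side `SdmaWorkspaceManager`
+
+If the 1-D contiguous requirement is not met, the current implementation returns an invalid async event (`handle == 0`).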
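+
+The flat, contiguous, logically 1-D requirement can be sanity-checked on the host before a kernel is launched. The sketch below is an illustration only: `IsFlatContiguous1D` is a hypothetical helper, not part of the PTO headers, and the criterion it encodes (all outer extents equal to 1, innermost stride equal to 1, in element units) is an assumption inferred from the example in this document rather than the implementation's actual check.
+
+```cpp
+#include <array>
+#include <cstdint>
+
+// Hypothetical host-side helper (not a PTO API): returns true when a 5-D
+// shape/stride pair, in element units, describes a flat contiguous 1-D run.
+// Assumption: only the innermost dimension may exceed 1, and its stride is 1.
+inline bool IsFlatContiguous1D(const std::array<int64_t, 5> &shape,
+                               const std::array<int64_t, 5> &stride)
+{
+    for (int d = 0; d < 4; ++d) {
+        if (shape[d] != 1) {
+            return false;  // an outer dimension carries data: not logically 1-D
+        }
+    }
+    return shape[4] > 0 && stride[4] == 1;  // innermost elements are adjacent
+}
+```
+
+Under these assumptions, a shape of `(1, 1, 1, 1, 1024)` with an innermost stride of 1 passes, while a `(1, 1, 1, 2, 512)` shape or an innermost stride of 2 is rejected.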
+
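+## Role of scratchTile
+
+`scratchTile` is **not** a staging buffer for the user data payload.
+It is converted to a `TmpBuffer` and used as a temporary UB workspace for:
+
+- writing and reading SDMA control words (flag, sq_tail, channel_info)
+- polling the event completion flag
+- committing the queue tail on completion
+
+The actual data payload moves directly between GM buffers; `scratchTile` is used only for control and synchronization metadata.
+
+## scratchTile Type and Size Constraints
+
+- Must be a `pto::Tile` type
+- Must be a UB/Vec tile (`ScratchTile::Loc == TileType::Vec`)
+- Must provide at least `sizeof(uint64_t)` (8 bytes) of usable space
+
+Recommended: `Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>` (256B).
+
+## Completion Semantics
+
+Synchronize with `AsyncEvent`:
+
+- `event.Wait(session)`: blocks until the transfer completes
+
+After a successful wait, all writes to `dstGlobalData` have completed.
+
+## Example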
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/common/pto_tile.hpp>
+
+using namespace pto;
+
+template <typename T>
+__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
+                                 __gm__ uint8_t *sdmaWorkspace)
+{
+    using ShapeDyn  = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
+    using GT        = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
+    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
+
+    ShapeDyn shape(1, 1, 1, 1, 1024);
+    StrideDyn stride(1024, 1024, 1024, 1024, 1);
+    GT dstG(remoteDst, shape, stride);
+    GT srcG(localSrc,  shape, stride);
+
+    ScratchTile scratchTile;
+    TASSIGN(scratchTile, 0x0);
+
+    comm::AsyncSession session;
+    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
+        return;
+    }
+
+    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
+    (void)event.Wait(session);
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TPUT_zh.md b/designs/outerCube/PTOISA/comm/TPUT_zh.md
new file mode 100644
index 00000000..ad4a2da7
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TPUT_zh.md
@@ -0,0 +1,128 @@
+# TPUT
+
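+## Introduction
+
+Remote write operation: writes local data to the memory of a remote NPU. Data is transferred through a UB tile that serves as an intermediate staging buffer.
+
+When the GlobalTensor exceeds the UB tile capacity, TPUT automatically performs **2D sliding**: it chunks along rows (DIM_3) and columns (DIM_4) to fit the tile and iterates over all outer dimensions (DIM_0, DIM_1, DIM_2).
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j}$$
+
+Data flow: `srcGlobalData (local GM)` → `stagingTileData (UB)` → `dstGlobalData (remote GM)`
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
+
+Synchronous form: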
+
+```text
+tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`.
+
+### Single Tile (Auto-Chunking)
+
+```cpp
+template <AtomicType atomicType = AtomicType::AtomicNone,
+          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, WaitEvents&... events);
+```
+
+### Ping-Pong Double Buffering
+
+Uses two staging tiles to overlap the TLOAD and TSTORE of adjacent chunks, hiding DMA transfer latency.
+
+```cpp
+template <AtomicType atomicType = AtomicType::AtomicNone,
+          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+### Runtime Atomic Type
+
+```cpp
+template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
+                          TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
+```
+
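+## Constraints
+
+- **Type constraints**:
+    - `GlobalSrcData::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+    - `GlobalSrcData::layout` must equal `GlobalDstData::layout`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to a remote address (on the target NPU).
+    - `srcGlobalData` must point to a local address (on the current NPU).
+    - `stagingTileData` / `pingTile` / `pongTile` must be pre-allocated in the Unified Buffer.
+- **Valid region**:
+    - The transfer size is determined by the shape of the `GlobalTensor` (automatically chunked to fit the tile).
+- **Atomic operations**:
+    - `atomicType` supports `AtomicNone` and `AtomicAdd`.
+- **Ping-pong constraints**:
+    - `pingTile` and `pongTile` must have the same type and dimensions.
+    - They must reside at non-overlapping UB offsets.
+
+## Examples
+
+### Basic Usage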
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+#include <pto/pto-inst.hpp>
+
+using namespace pto;
+
+template <typename T>
+void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
+    using TileT   = Tile<TileType::Vec, T, 16, 16>;
+    using GShape  = Shape<1, 1, 1, 16, 16>;
+    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
+    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
+
+    GTensor srcG(local_data);
+    GTensor dstG(remote_addr);
+    TileT stagingTile;
+    TASSIGN(stagingTile, 0);
+
+    // Basic remote write
+    comm::TPUT(dstG, srcG, stagingTile);
+
+    // Remote write with atomic add
+    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
+}
+```
+
+### Ping-Pong Double Buffering
+
+```cpp
+constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
+TileT pingTile(64, 64);
+TileT pongTile(64, 64);
+TASSIGN(pingTile, 0);
+TASSIGN(pongTile, tileUBBytes);  // non-overlapping UB region
+
+// Overlap TLOAD[i+1] with TSTORE[i] to improve pipeline utilization
+comm::TPUT(dstG, srcG, pingTile, pongTile);
+```
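+
+The `tileUBBytes` expression above rounds the tile footprint up to a multiple of 1024 bytes so that `pongTile` starts past the end of `pingTile`. A minimal sketch of the same arithmetic as a reusable helper follows; `AlignUp` is a hypothetical name, and the 1024-byte granule is simply the value used in the snippet above, not a documented hardware requirement.
+
+```cpp
+#include <cstddef>
+
+// Hypothetical helper: round `bytes` up to the next multiple of `granule`.
+constexpr std::size_t AlignUp(std::size_t bytes, std::size_t granule)
+{
+    return ((bytes + granule - 1) / granule) * granule;
+}
+
+// A 64x64 tile of 4-byte elements occupies 16384 bytes, already 1024-aligned.
+static_assert(AlignUp(64 * 64 * 4, 1024) == 16384, "aligned size");
+// A 10x10 tile of 4-byte elements (400 bytes) is padded to one 1024-byte granule.
+static_assert(AlignUp(10 * 10 * 4, 1024) == 1024, "padded size");
+```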
+
+### Runtime Atomic Type
+
+```cpp
+// Select the atomic type at runtime rather than via a compile-time template parameter
+comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TREDUCE.md b/designs/outerCube/PTOISA/comm/TREDUCE.md
new file mode 100644
index 00000000..1a2e2c7e
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TREDUCE.md
@@ -0,0 +1,118 @@
+# TREDUCE
+
+## Introduction
+
+Reduce operation: gather data from multiple remote NPUs and perform element-wise reduction locally.
+
+Only the root needs to execute `TREDUCE`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TREDUCE` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the reduction is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j} $$
+
+where $N$ is the number of ranks and $\oplus$ is the reduction operation (sum, max, min, etc.).
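+
+As a point of reference, the element-wise semantics can be modelled on the host in plain C++. This is only a sketch for checking results: `ReferenceReduce` and the local `ReduceOp` enum are hypothetical stand-ins rather than the PTO types, and each rank's buffer is assumed to be a `std::vector` of the same length.
+
+```cpp
+#include <algorithm>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical stand-in for the reduction selector (not comm::ReduceOp).
+enum class ReduceOp { Sum, Max, Min };
+
+// Host-side reference: dst[i] = src[0][i] (+) src[1][i] (+) ... (+) src[N-1][i].
+// Assumes at least one rank and that every rank's buffer has the same length.
+template <typename T>
+std::vector<T> ReferenceReduce(const std::vector<std::vector<T>> &src, ReduceOp op)
+{
+    std::vector<T> dst(src.at(0));  // start from rank 0's data
+    for (std::size_t r = 1; r < src.size(); ++r) {
+        for (std::size_t i = 0; i < dst.size(); ++i) {
+            const T v = src[r][i];
+            switch (op) {
+                case ReduceOp::Sum: dst[i] = dst[i] + v; break;
+                case ReduceOp::Max: dst[i] = std::max(dst[i], v); break;
+                case ReduceOp::Min: dst[i] = std::min(dst[i], v); break;
+            }
+        }
+    }
+    return dst;
+}
+```
+
+For three ranks holding `{1, 5, 3}`, `{4, 2, 6}` and `{0, 7, 1}`, a sum reduction yields `{5, 14, 10}` and a max reduction yields `{4, 7, 6}`.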
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
+treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces internal accumulator and receive tiles for the reduce pipeline; the C++ intrinsic requires explicit `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic reduce (accumulator + receive tile)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData, 
+                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
+
+// Ping-pong reduce (accumulator + ping + pong tiles for double buffering)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                              TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
+                              ReduceOp op, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalDstData::RawDType`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to a local address (on the current NPU).
+    - `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) must be pre-allocated UB tiles.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the reduce root.
+    - All source tensors are assumed to have the same shape and strides.
+- **Chunked mode constraints** (when data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Reduce Sum
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int SIZE, int NRANKS>
+void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>, 
+                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    // Stack-allocated tensors
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GTensor(group_addrs[i]);
+    }
+    
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
+}
+```
+
+### Max Reduce
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int SIZE, int NRANKS>
+void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>, 
+                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GTensor(group_addrs[i]);
+    }
+    
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TREDUCE_zh.md b/designs/outerCube/PTOISA/comm/TREDUCE_zh.md
new file mode 100644
index 00000000..5f165efe
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TREDUCE_zh.md
@@ -0,0 +1,112 @@
+# TREDUCE
+
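+## Introduction
+
+Reduce operation: gather data from multiple remote NPUs and perform element-wise reduction locally.
+
+Only the root needs to execute `TREDUCE`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `TREDUCE` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the reduction is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+For each element `(i, j)` in the valid region:
+
+$$\mathrm{dst}^{\mathrm{local}}_{i,j} = \bigoplus_{r=0}^{N-1} \mathrm{src}^{(r)}_{i,j}$$
+
+where $N$ is the number of ranks and $\oplus$ is the reduction operation (sum, max, min, etc.).
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
+
+Synchronous form: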
+
+```text
+treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
+treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces internal accumulator and receive tiles for the reduce pipeline; the C++ intrinsic requires explicit `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic reduce (accumulator + receive tile)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                              TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
+
+// Ping-pong reduce (accumulator + ping + pong tiles for double buffering)
+template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
+                              TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
+                              ReduceOp op, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalDstData::RawDType`.
+    - `TileData::DType` must equal `GlobalDstData::RawDType`.
+- **Memory constraints**:
+    - `dstGlobalData` must point to local memory (on the current NPU).
+    - `accTileData`, `recvTileData` (or `accTileData`, `pingTileData`, `pongTileData`) must be pre-allocated UB tiles.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s source buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the reduce root.
+    - All source tensors are assumed to have the same shape and strides.
+- **Chunked mode constraints** (when data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Reduce Sum
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int SIZE, int NRANKS>
+void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
+                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
+}
+```
+
+### Max Reduce
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int SIZE, int NRANKS>
+void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
+    using TileT   = Tile<TileType::Vec, T, 1, SIZE>;
+    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
+                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
+
+    GTensor tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GTensor(group_addrs[i]);
+
+    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
+    GTensor dstG(result);
+    TileT accTile, recvTile;
+    comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TSCATTER.md b/designs/outerCube/PTOISA/comm/TSCATTER.md
new file mode 100644
index 00000000..66e00ca3
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TSCATTER.md
@@ -0,0 +1,126 @@
+# TSCATTER
+
+## Introduction
+
+Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `TGATHER`.
+
+Only the root needs to execute `TSCATTER`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TSCATTER` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:
+
+$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
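+
+The index mapping above can be modelled on the host to validate results. The following is a sketch under stated assumptions, not the device implementation: `ReferenceScatter` is a hypothetical helper, the outer dimensions $D_0$, $D_1$, $D_2$ are taken as 1, and the source is a dense row-major buffer holding at least $N \times H \times W$ elements.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Host-side reference for the row split (hypothetical helper, not a PTO API).
+// `src` holds at least nRanks*h rows of w elements in row-major order; rank r
+// receives rows [r*h, (r+1)*h). Any rows past nRanks*h are ignored.
+template <typename T>
+std::vector<std::vector<T>> ReferenceScatter(const std::vector<T> &src, std::size_t nRanks,
+                                             std::size_t h, std::size_t w)
+{
+    std::vector<std::vector<T>> dst(nRanks, std::vector<T>(h * w));
+    for (std::size_t r = 0; r < nRanks; ++r) {
+        for (std::size_t i = 0; i < h; ++i) {
+            for (std::size_t j = 0; j < w; ++j) {
+                dst[r][i * w + j] = src[(r * h + i) * w + j];
+            }
+        }
+    }
+    return dst;
+}
+```
+
+With two ranks, `h = 2`, `w = 2` and a source of `{0, 1, 2, 3, 4, 5, 6, 7}`, rank 0 receives `{0, 1, 2, 3}` and rank 1 receives `{4, 5, 6, 7}`.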
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+Synchronous form:
+
+```text
+tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic scatter (single staging tile)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                              TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong scatter (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- **Memory constraints**:
+    - `srcGlobalData` must point to local memory (current NPU) and be large enough to hold data for all ranks. Specifically, `srcGlobalData.GetShape(DIM_3)` must be $\geq N \times H$ where $H$ is each rank's `GetShape(DIM_3)`.
+    - If `srcGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are read; remaining rows are ignored.
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s destination buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the scatter root.
+    - All destination tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
+- **Chunked mode constraints** (when per-rank data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Scatter (Single Staging Tile)
+
+Root has `NRANKS * ROWS` rows of width `COLS`. Each rank receives `ROWS × COLS`, split along DIM_3.
+The tile size (`TILE_ROWS × TILE_COLS`) can be smaller than the per-rank data — when it is, the implementation automatically chunks the transfer along both DIM_3 and DIM_4 via 2D sliding.
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+
+    comm::TSCATTER(group, srcG, stagingTile);
+}
+```
+
+### Ping-Pong Scatter (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    // Tile can be smaller than the data in both dimensions
+    using TileT = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) {
+        tensors[i] = GPerRank(group_addrs[i]);
+    }
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TSCATTER(group, srcG, pingTile, pongTile);
+}
+```
diff --git a/designs/outerCube/PTOISA/comm/TSCATTER_zh.md b/designs/outerCube/PTOISA/comm/TSCATTER_zh.md
new file mode 100644
index 00000000..c353beab
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TSCATTER_zh.md
@@ -0,0 +1,120 @@
+# TSCATTER
+
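+## Introduction
+
+Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `TGATHER`.
+
+Only the root needs to execute `TSCATTER`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `TSCATTER` on non-root ranks is undefined behavior.
+
+**Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
+
+## Math Interpretation
+
+The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:
+
+$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS_zh.md).
+
+Synchronous form: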
+
+```text
+tscatter %group, %src : (!pto.group<...>, !pto.memref<...>)
+```
+
+Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit `stagingTileData` (or `pingTile` / `pongTile`) operand(s).
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+// Basic scatter (single staging tile)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                              TileData &stagingTileData, WaitEvents&... events);
+
+// Ping-pong scatter (double buffering with two staging tiles)
+template <typename ParallelGroupType, typename GlobalSrcData, typename TileData, typename... WaitEvents>
+PTO_INST RecordEvent TSCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &srcGlobalData,
+                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
+```
+
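+## Constraints
+
+- **Type constraints**:
+    - `ParallelGroup::value_type::RawDType` must equal `GlobalSrcData::RawDType`.
+    - `TileData::DType` must equal `GlobalSrcData::RawDType`.
+- **Memory constraints**:
+    - `srcGlobalData` must point to local memory (on the current NPU) and be large enough to hold data for all ranks. Specifically, `srcGlobalData.GetShape(DIM_3)` must be $\geq N \times H$, where $H$ is each rank's `GetShape(DIM_3)`.
+    - If `srcGlobalData.GetShape(DIM_3) > N × H`, only the first `N × H` rows are read; the remaining rows are ignored.
+    - `stagingTileData` (or `pingTile` / `pongTile`) must be pre-allocated in UB.
+- **ParallelGroup constraints**:
+    - `parallelGroup.tensors[r]` must refer to rank `r`'s destination buffer (remote GM as seen by the root).
+    - `parallelGroup.GetRootIdx()` identifies the calling NPU as the scatter root.
+    - All destination tensors are assumed to have the same shape and strides; behavior is undefined if they differ.
+- **Chunked mode constraints** (when per-rank data exceeds a single UB tile):
+    - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
+    - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
+
+## Examples
+
+### Basic Scatter (Single Staging Tile)
+
+Root has `NRANKS * ROWS` rows of width `COLS`. Each rank receives `ROWS × COLS`, split along DIM_3.
+The tile size can be smaller than the per-rank data; in that case the implementation automatically chunks the transfer via 2D sliding.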
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT stagingTile(TILE_ROWS, TILE_COLS);
+    comm::TSCATTER(group, srcG, stagingTile);
+}
+```
+
+### Ping-Pong Scatter (Double Buffering)
+
+Uses two UB tiles to overlap TLOAD of the next chunk (MTE2) with TSTORE of the current chunk (MTE3).
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+template <typename T, int ROWS, int COLS, int TILE_ROWS, int TILE_COLS, int NRANKS>
+void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int my_rank) {
+    using TileT    = Tile<TileType::Vec, T, TILE_ROWS, TILE_COLS, BLayout::RowMajor, -1, -1>;
+    using GPerRank = GlobalTensor<T, Shape<1,1,1,ROWS,COLS>,
+                                  BaseShape2D<T, ROWS, COLS, Layout::ND>, Layout::ND>;
+    using GSource  = GlobalTensor<T, Shape<1,1,1,NRANKS*ROWS,COLS>,
+                                  BaseShape2D<T, NRANKS*ROWS, COLS, Layout::ND>, Layout::ND>;
+
+    GPerRank tensors[NRANKS];
+    for (int i = 0; i < NRANKS; ++i) tensors[i] = GPerRank(group_addrs[i]);
+
+    comm::ParallelGroup<GPerRank> group(tensors, NRANKS, my_rank);
+    GSource srcG(local_data);
+    TileT pingTile(TILE_ROWS, TILE_COLS);
+    TileT pongTile(TILE_ROWS, TILE_COLS);
+    // Ping-pong: overlaps TLOAD and TSTORE for better throughput
+    comm::TSCATTER(group, srcG, pingTile, pongTile);
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TTEST.md b/designs/outerCube/PTOISA/comm/TTEST.md
new file mode 100644
index 00000000..184f1f43
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TTEST.md
@@ -0,0 +1,150 @@
+# TTEST
+
+## Introduction
+
+Performs a non-blocking test of whether one or more signals satisfy a comparison condition. Returns `true` if the condition is satisfied and `false` otherwise. Intended for polling-based synchronization, either with a timeout or interleaved with other work.
+
+Supports a single signal or a multi-dimensional signal tensor (up to 5-D, with the shape derived from the GlobalTensor). For a tensor, returns `true` only if all signals satisfy the condition.
+
+## Math Interpretation
+
+Test and return result:
+
+Single signal:
+
+$$ \mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
+
+Signal tensor (all must satisfy):
+
+$$ \mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}) $$
+
+where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+```text
+%result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
+%result = ttest %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32) -> i1
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
+- **Memory constraints**:
+    - `signalData` must point to a local address (on the current NPU).
+- **Return value**:
+    - Returns `true` if the condition is satisfied, `false` otherwise.
+    - For a signal tensor, returns `true` only if all signals satisfy the condition.
+- **Shape semantics**:
+    - For a single signal, the shape is `<1,1,1,1,1>`.
+    - For a signal tensor, the shape determines the multi-dimensional region (up to 5-D) to test.
+- **Comparison operators** (WaitCmp):
+  | Value | Condition |
+  |-------|-----------|
+  | `EQ` | `signal == cmpValue` |
+  | `NE` | `signal != cmpValue` |
+  | `GT` | `signal > cmpValue` |
+  | `GE` | `signal >= cmpValue` |
+  | `LT` | `signal < cmpValue` |
+  | `LE` | `signal <= cmpValue` |
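+
+The comparison table and the all-signals rule can be captured in a small host-side reference, which may help when unit-testing polling logic without hardware. This is a sketch only: the local `WaitCmp` enum, `Compare`, and `ReferenceTest` are hypothetical stand-ins for illustration and do not touch GM or the real `comm::TTEST`.
+
+```cpp
+#include <cstdint>
+#include <vector>
+
+// Hypothetical stand-in for comm::WaitCmp, for host-side illustration only.
+enum class WaitCmp { EQ, NE, GT, GE, LT, LE };
+
+// Evaluates one row of the comparison table for a single signal value.
+inline bool Compare(int32_t signal, int32_t cmpValue, WaitCmp cmp)
+{
+    switch (cmp) {
+        case WaitCmp::EQ: return signal == cmpValue;
+        case WaitCmp::NE: return signal != cmpValue;
+        case WaitCmp::GT: return signal > cmpValue;
+        case WaitCmp::GE: return signal >= cmpValue;
+        case WaitCmp::LT: return signal < cmpValue;
+        case WaitCmp::LE: return signal <= cmpValue;
+    }
+    return false;  // not reached for valid enumerators
+}
+
+// Tensor form: true only if every signal satisfies the condition.
+inline bool ReferenceTest(const std::vector<int32_t> &signals, int32_t cmpValue, WaitCmp cmp)
+{
+    for (int32_t s : signals) {
+        if (!Compare(s, cmpValue, cmp)) {
+            return false;  // a single failing signal makes the whole test false
+        }
+    }
+    return true;
+}
+```
+
+For example, signals `{3, 5}` satisfy `GE` against 3 but not `GT`, because the first signal equals the comparison value.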
+
+## Examples
+
+### Basic Test
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+bool check_ready(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+    
+    // Check if signal == 1
+    return comm::TTEST(sig, 1, comm::WaitCmp::EQ);
+}
+```
+
+### Test Signal Matrix
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// Test if all signals from a 4x8 dense grid of workers are ready
+bool check_worker_grid(__gm__ int32_t* signal_matrix) {
+    comm::Signal2D<4, 8> grid(signal_matrix);
+    
+    // Returns true only if all 32 signals == 1
+    return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
+}
+```
+
+### Polling with Timeout
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
+    comm::Signal sig(local_signal);
+    
+    for (int i = 0; i < max_iterations; ++i) {
+        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ)) {
+            return true;  // Signal received
+        }
+        // Could do other work here between polls
+    }
+    return false;  // Timeout
+}
+```
+
+### Progress-Based Polling
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
+    comm::Signal counter(local_counter);
+    
+    while (!comm::TTEST(counter, expected_count, comm::WaitCmp::GE)) {
+        // Do some useful work while waiting
+        // ...
+    }
+    // All expected signals received
+}
+```
+
+### Compare TWAIT vs TTEST
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void compare_wait_test(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+
+    // Blocking: spins until signal == 1
+    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
+
+    // Non-blocking: returns immediately with result
+    bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
+}
+```
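Note on `TTEST.md`: the return-value rule (a tensor test is the conjunction over all elements) and the polling-with-timeout example can be modeled on the host. What follows is a minimal Python sketch under stated assumptions, not the PTO implementation: `WAIT_CMP`, `ttest`, and `poll_with_timeout` are hypothetical names, the signal tensor is flattened to a plain sequence of integers, and `read_signals` is an assumed callback that stands in for reading the local signal memory.

```python
import operator
from typing import Callable, Dict, Iterable

# Hypothetical host-side stand-in for the WaitCmp table in the doc.
WAIT_CMP: Dict[str, Callable[[int, int], bool]] = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "GT": operator.gt,
    "GE": operator.ge,
    "LT": operator.lt,
    "LE": operator.le,
}


def ttest(signals: Iterable[int], cmp_value: int, cmp: str) -> bool:
    """Non-blocking test: True only if every signal satisfies the comparison."""
    compare = WAIT_CMP[cmp]
    return all(compare(s, cmp_value) for s in signals)


def poll_with_timeout(
    read_signals: Callable[[], Iterable[int]],
    cmp_value: int,
    cmp: str,
    max_iterations: int,
) -> bool:
    """Call ttest at most max_iterations times; False models a timeout."""
    for _ in range(max_iterations):
        if ttest(read_signals(), cmp_value, cmp):
            return True
        # Other work could be interleaved here between polls.
    return False
```

With this model, a 4x8 grid in which one worker has not yet signalled tests as not ready: `ttest([1] * 31 + [0], 1, "EQ")` evaluates to `False`.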
diff --git a/designs/outerCube/PTOISA/comm/TTEST_zh.md b/designs/outerCube/PTOISA/comm/TTEST_zh.md
new file mode 100644
index 00000000..1b71740a
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TTEST_zh.md
@@ -0,0 +1,151 @@
+# TTEST
+
+## 简介
+
+非阻塞检测信号是否满足比较条件。满足则返回 `true`，否则返回 `false`。适用于基于轮询的同步（含超时）或与其他工作交错执行的场景。
+
+支持单个信号或多维信号 tensor（最高 5 维，形状由 GlobalTensor 决定）。对于 tensor，仅当**所有**信号均满足条件时才返回 `true`。
+
+## 数学语义
+
+检测并返回结果：
+
+单个信号：
+
+$$\mathrm{result} = (\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue})$$
+
+信号 tensor（所有元素均须满足）：
+
+$$\mathrm{result} = \bigwedge_{d_0, d_1, d_2, d_3, d_4} (\mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue})$$
+
+其中 `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
+
+```text
+%result = ttest %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32) -> i1
+%result = ttest %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32) -> i1
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST bool TTEST(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
+```
+
+## 约束
+
+- **类型约束**：
+    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
+- **内存约束**：
+    - `signalData` 必须指向本地地址（当前 NPU）。
+- **返回值**：
+    - 条件满足时返回 `true`，否则返回 `false`。
+    - 对于信号 tensor，仅当所有信号均满足条件时才返回 `true`。
+- **形状语义**：
+    - 单个信号：形状为 `<1,1,1,1,1>`。
+    - 信号 tensor：形状决定要检测的多维区域（最高 5 维）。
+- **比较运算符**（WaitCmp）：
+  | 值 | 条件 |
+  |-------|--------|
+  | `EQ` | `signal == cmpValue` |
+  | `NE` | `signal != cmpValue` |
+  | `GT` | `signal > cmpValue` |
+  | `GE` | `signal >= cmpValue` |
+  | `LT` | `signal < cmpValue` |
+  | `LE` | `signal <= cmpValue` |
+
+## 示例
+
+### 基础检测
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+bool check_ready(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+
+    // 检测 signal == 1
+    return comm::TTEST(sig, 1, comm::WaitCmp::EQ);
+}
+```
+
+### 检测信号矩阵
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// 检测 4x8 网格中所有 worker 的信号是否就绪
+bool check_worker_grid(__gm__ int32_t* signal_matrix) {
+    comm::Signal2D<4, 8> grid(signal_matrix);
+
+    // 仅当所有 32 个信号均为 1 时返回 true
+    return comm::TTEST(grid, 1, comm::WaitCmp::EQ);
+}
+```
+
+### 带超时的轮询
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+bool poll_with_timeout(__gm__ int32_t* local_signal, int max_iterations) {
+    comm::Signal sig(local_signal);
+
+    for (int i = 0; i < max_iterations; ++i) {
+        if (comm::TTEST(sig, 1, comm::WaitCmp::EQ)) {
+            return true;  // 收到信号
+        }
+        // 两次轮询之间可执行其他工作
+    }
+    return false;  // 超时
+}
+```
+
+### 基于进度的轮询
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void process_with_progress(__gm__ int32_t* local_counter, int expected_count) {
+    comm::Signal counter(local_counter);
+
+    while (!comm::TTEST(counter, expected_count, comm::WaitCmp::GE)) {
+        // 等待期间执行其他有用工作
+        // ...
+    }
+    // 所有预期信号均已收到
+}
+```
+
+### TWAIT 与 TTEST 对比
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void compare_wait_test(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+
+    // 阻塞：自旋直到 signal == 1
+    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
+
+    // 非阻塞：立即返回结果
+    bool ready = comm::TTEST(sig, 1, comm::WaitCmp::EQ);
+}
+```
+
diff --git a/designs/outerCube/PTOISA/comm/TWAIT.md b/designs/outerCube/PTOISA/comm/TWAIT.md
new file mode 100644
index 00000000..4d9af8d8
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TWAIT.md
@@ -0,0 +1,131 @@
+# TWAIT
+
+## Introduction
+
+Blocks until the signal(s) satisfy the comparison condition. Used in conjunction with `TNOTIFY` for flag-based synchronization.
+
+Supports a single signal or a multi-dimensional signal tensor (up to 5-D, shape derived from GlobalTensor).
+
+
+## Math Interpretation
+
+Wait (spin) until the following condition is satisfied:
+
+Single signal:
+
+$$ \mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
+
+Signal tensor (all elements must satisfy):
+
+$$ \forall d_0, d_1, d_2, d_3, d_4: \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue} $$
+
+where `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+
+## Assembly Syntax
+
+PTO-AS form: see [PTO-AS Specification](../../assembly/PTO-AS.md).
+
+```text
+twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
+twait %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32)
+```
+
+## C++ Intrinsic
+
+Declared in `include/pto/comm/pto_comm_inst.hpp`:
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
+```
+
+## Constraints
+
+- **Type constraints**:
+    - `GlobalSignalData::DType` must be `int32_t` (32-bit signal).
+- **Memory constraints**:
+    - `signalData` must point to a local address (on the current NPU).
+- **Shape semantics**:
+    - For a single signal: shape is `<1,1,1,1,1>`.
+    - For a signal tensor: shape determines the multi-dimensional region (up to 5-D) to wait on. All signals in the tensor must satisfy the condition.
+- **Comparison operators** (WaitCmp):
+  | Value | Condition |
+  |-------|-----------|
+  | `EQ` | `signal == cmpValue` |
+  | `NE` | `signal != cmpValue` |
+  | `GT` | `signal > cmpValue` |
+  | `GE` | `signal >= cmpValue` |
+  | `LT` | `signal < cmpValue` |
+  | `LE` | `signal <= cmpValue` |
+
+## Examples
+
+### Wait for Single Signal
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void wait_for_ready(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+    
+    // Wait until signal == 1
+    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
+}
+```
+
+### Wait for Signal Matrix
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// Wait for signals from a 4x8 dense grid of workers
+void wait_worker_grid(__gm__ int32_t* signal_matrix) {
+    comm::Signal2D<4, 8> grid(signal_matrix);
+    
+    // Wait until all 32 signals == 1
+    comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
+}
+```
+
+### Wait for Counter Threshold
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
+    comm::Signal counter(local_counter);
+    
+    // Wait until counter >= expected_count
+    comm::TWAIT(counter, expected_count, comm::WaitCmp::GE);
+}
+```
+
+### Producer-Consumer Pattern
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// Producer: notify when data is ready
+void producer(__gm__ int32_t* remote_flag) {
+    // ... produce data ...
+    
+    comm::Signal flag(remote_flag);
+    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
+}
+
+// Consumer: wait for data
+void consumer(__gm__ int32_t* local_flag) {
+    comm::Signal flag(local_flag);
+    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+    
+    // ... consume data ...
+}
+```
\ No newline at end of file
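Note on `TWAIT.md`: the producer-consumer pattern relies on ordering, where the producer writes its data before it sets the flag and the consumer reads the data only after the wait returns. The following is a minimal host-side Python sketch of that ordering using threads, not the PTO implementation: `twait` and `run_producer_consumer` are hypothetical names, a one-element list stands in for the `int32_t` signal word, and the timeout is an addition for this sketch (the documented `TWAIT` takes no timeout).

```python
import threading
import time
from typing import List


def twait(signal: List[int], cmp_value: int, timeout_s: float = 5.0) -> None:
    """Spin until signal[0] == cmp_value (models TWAIT with WaitCmp::EQ).

    The timeout exists only in this sketch so that a missing producer fails
    loudly instead of hanging.
    """
    deadline = time.monotonic() + timeout_s
    while signal[0] != cmp_value:
        if time.monotonic() > deadline:
            raise TimeoutError("signal never reached the expected value")
        time.sleep(0)  # yield so the producer thread can run


def run_producer_consumer() -> List[str]:
    flag = [0]  # stands in for the int32 signal word
    data: List[int] = []
    log: List[str] = []

    def producer() -> None:
        data.append(42)  # produce the data first ...
        flag[0] = 1      # ... then notify (models TNOTIFY with NotifyOp::Set)

    worker = threading.Thread(target=producer)
    worker.start()
    twait(flag, 1)  # the consumer blocks here until the flag is set
    log.append(f"consumed {data[0]}")
    worker.join()
    return log
```

If the producer set the flag before appending the data, the consumer could observe the flag and read an empty buffer; the write order in `producer` is what makes the wait a valid synchronization point.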
diff --git a/designs/outerCube/PTOISA/comm/TWAIT_zh.md b/designs/outerCube/PTOISA/comm/TWAIT_zh.md
new file mode 100644
index 00000000..0ed76431
--- /dev/null
+++ b/designs/outerCube/PTOISA/comm/TWAIT_zh.md
@@ -0,0 +1,131 @@
+# TWAIT
+
+## 简介
+
+阻塞等待，直到信号满足比较条件。与 `TNOTIFY` 配合使用，实现基于标志的同步。
+
+支持单个信号或多维信号 tensor（最高 5 维，形状由 GlobalTensor 决定）。
+
+## 数学语义
+
+自旋等待，直到以下条件满足：
+
+单个信号：
+
+$$\mathrm{signal} \;\mathtt{cmp}\; \mathrm{cmpValue}$$
+
+信号 tensor（所有元素均须满足）：
+
+$$\forall d_0, d_1, d_2, d_3, d_4: \mathrm{signal}_{d_0, d_1, d_2, d_3, d_4} \;\mathtt{cmp}\; \mathrm{cmpValue}$$
+
+其中 `cmp` ∈ {`EQ`, `NE`, `GT`, `GE`, `LT`, `LE`}
+
+## 汇编语法
+
+PTO-AS 形式：参见 [PTO-AS 规范](../../assembly/PTO-AS_zh.md)。
+
+```text
+twait %signal, %cmp_value {cmp = #pto.cmp<EQ>} : (!pto.memref<i32>, i32)
+twait %signal_matrix, %cmp_value {cmp = #pto.cmp<GE>} : (!pto.memref<i32, MxN>, i32)
+```
+
+## C++ 内建接口
+
+声明于 `include/pto/comm/pto_comm_inst.hpp`：
+
+```cpp
+template <typename GlobalSignalData, typename... WaitEvents>
+PTO_INST void TWAIT(GlobalSignalData &signalData, int32_t cmpValue, WaitCmp cmp, WaitEvents&... events);
+```
+
+## 约束
+
+- **类型约束**：
+    - `GlobalSignalData::DType` 必须为 `int32_t`（32 位信号）。
+- **内存约束**：
+    - `signalData` 必须指向本地地址（当前 NPU）。
+- **形状语义**：
+    - 单个信号：形状为 `<1,1,1,1,1>`。
+    - 信号 tensor：形状决定要等待的多维区域（最高 5 维）。tensor 中所有信号必须满足条件。
+- **比较运算符**（WaitCmp）：
+  | 值 | 条件 |
+  |-------|--------|
+  | `EQ` | `signal == cmpValue` |
+  | `NE` | `signal != cmpValue` |
+  | `GT` | `signal > cmpValue` |
+  | `GE` | `signal >= cmpValue` |
+  | `LT` | `signal < cmpValue` |
+  | `LE` | `signal <= cmpValue` |
+
+## 示例
+
+### 等待单个信号
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void wait_for_ready(__gm__ int32_t* local_signal) {
+    comm::Signal sig(local_signal);
+
+    // 等待 signal == 1
+    comm::TWAIT(sig, 1, comm::WaitCmp::EQ);
+}
+```
+
+### 等待信号矩阵
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// 等待 4x8 网格中所有 worker 的信号就绪
+void wait_worker_grid(__gm__ int32_t* signal_matrix) {
+    comm::Signal2D<4, 8> grid(signal_matrix);
+
+    // 等待所有 32 个信号均为 1
+    comm::TWAIT(grid, 1, comm::WaitCmp::EQ);
+}
+```
+
+### 等待计数器阈值
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+void wait_for_count(__gm__ int32_t* local_counter, int expected_count) {
+    comm::Signal counter(local_counter);
+
+    // 等待 counter >= expected_count
+    comm::TWAIT(counter, expected_count, comm::WaitCmp::GE);
+}
+```
+
+### 生产者-消费者模式
+
+```cpp
+#include <pto/comm/pto_comm_inst.hpp>
+
+using namespace pto;
+
+// 生产者：数据就绪后发送通知
+void producer(__gm__ int32_t* remote_flag) {
+    // ... 生产数据 ...
+
+    comm::Signal flag(remote_flag);
+    comm::TNOTIFY(flag, 1, comm::NotifyOp::Set);
+}
+
+// 消费者：等待数据就绪
+void consumer(__gm__ int32_t* local_flag) {
+    comm::Signal flag(local_flag);
+    comm::TWAIT(flag, 1, comm::WaitCmp::EQ);
+
+    // ... 消费数据 ...
+}
+```
+
diff --git a/designs/outerCube/PTOISA/conventions.md b/designs/outerCube/PTOISA/conventions.md
new file mode 100644
index 00000000..c1d9dd82
--- /dev/null
+++ b/designs/outerCube/PTOISA/conventions.md
@@ -0,0 +1,41 @@
+# PTO ISA Conventions
+
+This page defines shared conventions used by the per-instruction ISA reference pages in `docs/isa/` and the corresponding C++ intrinsics in `include/pto/common/pto_instr.hpp`.
+
+## Notation
+
+- **Tile**: A fixed-size on-chip tile object (e.g., `pto::Tile<...>`). Many instructions operate on tiles and use the tile’s valid region (`GetValidRow()`, `GetValidCol()`).
+- **GM (global memory)**: Off-chip memory accessed via `pto::GlobalTensor<...>`.
+- **Scalar / immediate**: A host-side scalar value or an encoded immediate used by `*S` / `*C` variants.
+
+For the detailed C++ programming model behind these terms, see:
+
+- Tiles: `docs/coding/Tile.md`
+- GlobalTensor: `docs/coding/GlobalTensor.md`
+- Scalars and enums: `docs/coding/Scalar.md`
+
+## Shapes and layouts
+
+- **Row-major vs. column-major**: Unless stated otherwise, CPU simulator kernels assume row-major tiles. Instructions that support multiple layouts will state supported layouts explicitly.
+- **Valid region**: The runtime compute region of a tile, expressed as `(valid_row, valid_col)` and queried via `GetValidRow()` / `GetValidCol()`.
+
+### Valid Region Semantics
+
+For instruction pages, when we say “for each element `(i, j)` in the valid region”, we mean:
+
+- `valid_row = dst.GetValidRow()` and `valid_col = dst.GetValidCol()` unless the instruction explicitly defines a different domain (e.g., some ops may use the source tile’s valid region).
+- The math interpretation defines `dst[i, j]` only for indices where `0 <= i < valid_row` and `0 <= j < valid_col`.
+- Elements outside the valid region are **unspecified** unless the instruction explicitly states otherwise (do not assume they are zeroed or preserved).
+
+For multi-operand instructions (e.g., `src0`, `src1`), the docs assume the input tiles are compatible with the iteration domain unless the constraints section states stricter requirements.
+
+## Types
+
+- The instruction page lists supported data types (e.g., `fp16`, `fp32`, `int8`, `int16`, `int32`, `uint8`, `uint16`, `uint32`). CPU simulator support may be a subset and is documented in `include/README.md`.
+
+## Events and synchronization
+
+- Instructions may require ordering between memory and vector pipelines. When examples show events (e.g., `set_flag(...)` / `wait_flag(...)`), they indicate the required ordering constraints on the target backend.
+- `TSYNC` is used for explicit synchronization when needed by a sequence of instructions.
+
+See `docs/coding/Event.md` for the event model used by PTO Tile Lib.
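Note on `conventions.md`: the valid-region rule (results are defined only for `0 <= i < valid_row` and `0 <= j < valid_col`, everything else is unspecified) can be illustrated with a small host-side model. This is a Python sketch under stated assumptions, not the Tile library: `add_valid_region` is a hypothetical name, tiles are nested lists, and `None` is used purely as a marker for "unspecified" so that accidental reliance on those elements is visible.

```python
from typing import List, Optional

Tile = List[List[Optional[float]]]


def add_valid_region(
    src0: Tile,
    src1: Tile,
    rows: int,
    cols: int,
    valid_row: int,
    valid_col: int,
) -> Tile:
    """Element-wise add defined only for 0 <= i < valid_row, 0 <= j < valid_col."""
    # None marks elements whose value the convention leaves unspecified.
    dst: Tile = [[None] * cols for _ in range(rows)]
    for i in range(valid_row):
        for j in range(valid_col):
            dst[i][j] = src0[i][j] + src1[i][j]
    return dst
```

A real backend may leave stale data, zeros, or anything else outside the valid region; the `None` marker only models the rule that callers must not depend on those values.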
diff --git a/designs/outerCube/PTOISA/conventions_zh.md b/designs/outerCube/PTOISA/conventions_zh.md
new file mode 100644
index 00000000..e7a292e4
--- /dev/null
+++ b/designs/outerCube/PTOISA/conventions_zh.md
@@ -0,0 +1,42 @@
+# PTO ISA 通用约定
+
+本页定义 `docs/isa/` 指令参考文档中使用的通用术语与写法，并与 `include/pto/common/pto_instr.hpp` 中的 C++ 内建接口保持一致。
+
+## 记号
+
+- **Tile**：片上二维操作数对象（例如 `pto::Tile<...>`）。大量指令以 Tile 作为输入/输出，并通过 `GetValidRow()` / `GetValidCol()` 使用 Tile 的有效区域（valid region）。
+- **GM（全局内存）**：通过 `pto::GlobalTensor<...>` 访问的片外内存视图。
+- **标量 / 立即数**：主机侧标量值，或在 `*S` / `*C` 等变体中编码的立即数参数。
+
+关于这些对象的 C++ 编程模型（类型、布局、枚举、约束等），可参考：
+
+- Tile：`docs/coding/Tile_zh.md`
+- GlobalTensor：`docs/coding/GlobalTensor_zh.md`
+- 标量与枚举：`docs/coding/Scalar_zh.md`
+
+## 形状与布局
+
+- **行主序 / 列主序**：除非指令页明确声明支持多种布局，否则示例与参考实现默认假设为行主序 Tile。支持多布局的指令会在约束小节中列出具体要求。
+- **有效区域（valid region）**：Tile 运行时计算域，通常写作 `(valid_row, valid_col)`，并通过 `GetValidRow()` / `GetValidCol()` 查询。
+
+### 有效区域语义
+
+在指令页中，当我们写“对有效区域内的每个元素 `(i, j)`”，含义为：
+
+- 除非指令显式定义不同的迭代域，否则默认使用 `valid_row = dst.GetValidRow()`、`valid_col = dst.GetValidCol()`。
+- 数学语义仅对 `0 <= i < valid_row` 且 `0 <= j < valid_col` 的 `dst[i, j]` 做出定义。
+- 有效区域之外元素的值为**未指定**，除非指令页明确说明（不要假设一定清零或保持不变）。
+
+对多输入指令（例如 `src0`、`src1`），除非约束小节有更严格的要求，文档默认输入 Tile 与迭代域在形状/有效区域上是兼容的。
+
+## 数据类型
+
+每条指令页会列出支持的数据类型（例如 `fp16`、`fp32`、`int8`、`int16`、`int32`、`uint8`、`uint16`、`uint32` 等）。
+不同后端/目标对数据类型与布局支持可能不同，具体以对应实现与编译期检查为准。
+
+## 事件与同步
+
+- 某些指令序列需要建立内存与向量流水线之间的顺序关系。示例中出现的事件（例如 `set_flag(...)` / `wait_flag(...)`）用于表达后端需要满足的顺序约束。
+- 在需要显式同步的场景，使用 `TSYNC` 建立阶段间的顺序关系。
+
+事件模型可参考：`docs/coding/Event_zh.md`。
diff --git a/designs/outerCube/nv_shuffle.py b/designs/outerCube/nv_shuffle.py
new file mode 100644
index 00000000..319f29be
--- /dev/null
+++ b/designs/outerCube/nv_shuffle.py
@@ -0,0 +1,304 @@
+#!/usr/bin/env python3
+"""
+NVIDIA warp shuffle schematic (16 B register tile).
+
+One PNG montage (4×2): rows = __shfl_sync (per-lane indexed read; here a fixed
+permutation), __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync; columns =
+**FP32 / FP16 (BF16)** (E = 4 / 2 B). Smaller-precision formats (FP8, FP4) are
+intentionally excluded — the two remaining columns are the only storage widths
+supported by the current VEC-4K / VEC-512 subset.
+
+Each pane: input lanes (▭ width ∝ E) on top, output lanes on bottom, curved wires
+lane i ← lane src(i). Not cycle-accurate; illustration only.
+"""
+
+from __future__ import annotations
+
+import argparse
+import math
+import os
+from dataclasses import dataclass
+from typing import List, Sequence, Tuple
+
+from matplotlib.axes import Axes
+from matplotlib.patches import Circle, FancyArrowPatch, Rectangle
+
+import matplotlib.pyplot as plt
+
+TILE_BYTES = 16
+E_REF = 4.0
+
+
+@dataclass(frozen=True)
+class FloatForm:
+    E: float  # bytes per element
+    label: str
+
+    @property
+    def n(self) -> int:
+        return int(round(TILE_BYTES / self.E))
+
+
+def formats() -> List[FloatForm]:
+    # Only FP32 and FP16/BF16 are supported. BF16 shares E = 2 B with FP16, so
+    # one column covers both; the label mentions both for clarity.
+    return [
+        FloatForm(4.0, "FP32"),
+        FloatForm(2.0, "FP16 / BF16"),
+    ]
+
+
+def _rect(ax: Axes, cx: float, cy: float, w: float, h: float, **kwargs) -> Rectangle:
+    kwargs.setdefault("zorder", 2)
+    r = Rectangle((cx - w / 2, cy - h / 2), w, h, fill=True, **kwargs)
+    ax.add_patch(r)
+    return r
+
+
+def _rect_top(cy: float, h: float) -> float:
+    return cy + h / 2
+
+
+def _rect_bottom(cy: float, h: float) -> float:
+    return cy - h / 2
+
+
+def _lane_layout(E: float, n: int, x0: float, x1: float) -> Tuple[float, float, List[float]]:
+    """Single contiguous row of n elements; rectangle width ∝ E (same spirit as tile16 script)."""
+    e = max(float(E), 0.5)
+    span = x1 - x0
+    g_in = min(0.028, 0.11 * span / max(n, 2))
+    gaps_inside = max(n - 1, 0) * g_in
+    slack = span - gaps_inside
+    if slack <= 0:
+        g_in *= 0.5
+        gaps_inside = max(n - 1, 0) * g_in
+        slack = span - gaps_inside
+    K = slack / (n * E_REF)
+    w = max(0.0028, K * e)
+    if n * w + gaps_inside > span:
+        w = max(0.0028, (span - gaps_inside) / max(n, 1))
+    h = min(0.038, 0.48 / max(n, 1))
+    xs: List[float] = []
+    x = x0
+    for i in range(n):
+        xs.append(x + w / 2)
+        x += w
+        if i < n - 1:
+            x += g_in
+    return w, h, xs
+
+
+def _shfl_linear_permutation_params(n: int) -> Tuple[int, int]:
+    """
+    Coefficients (a, b) for src(dst) = (a * dst + b) % n with gcd(a, n) == 1 so each
+    destination reads from a distinct source (full permutation on lanes).
+    """
+    if n <= 1:
+        return 1, 0
+    a = 1
+    for cand in range(2, n):
+        if math.gcd(cand, n) == 1:
+            a = cand
+            break
+    b = 1 if n > 1 else 0
+    return a, b
+
+
+def _shuffle_src(
+    op: str, n: int, dst: int, *, delta: int = 1, xor_mask: int = 4
+) -> int:
+    """Map destination lane -> source lane for schematic (CUDA-ish)."""
+    if n <= 0:
+        return 0
+    if op == "shfl":
+        a, b = _shfl_linear_permutation_params(n)
+        return (a * dst + b) % n
+    if op == "shfl_up":
+        j = dst - delta
+        return j if j >= 0 else dst  # out-of-range lanes keep their own value
+    if op == "shfl_down":
+        j = dst + delta
+        return j if j < n else dst  # out-of-range lanes keep their own value
+    if op == "shfl_xor":
+        m = xor_mask % n
+        if m == 0:
+            m = 1 if n > 1 else 0
+        return dst ^ m
+    return dst
+
+
+def render_shuffle_pane(
+    ax: Axes,
+    op: str,
+    op_title: str,
+    cuda_name: str,
+    spec: FloatForm,
+    *,
+    title_fs: float = 8.0,
+    lane_fs: float = 5.0,
+) -> None:
+    n = spec.n
+    ax.set_xlim(0, 1)
+    ax.set_ylim(0, 1)
+    ax.axis("off")
+
+    E = float(spec.E)
+    x0, x1 = 0.05, 0.95
+    w, h, xs = _lane_layout(E, n, x0, x1)
+
+    y_in = 0.74
+    y_out = 0.26
+    ax.set_title(
+        f"{op_title}\n{cuda_name} — {spec.label} — E={E:g} B, N={n} ({TILE_BYTES} B tile)",
+        fontsize=title_fs,
+        pad=4,
+    )
+
+    for i in range(n):
+        _rect(ax, xs[i], y_in, w, h, facecolor="#cfe8ff", edgecolor="#1a4f8c", linewidth=0.9)
+        if n <= 16:
+            ax.text(
+                xs[i],
+                y_in + h / 2 + 0.022,
+                f"L{i}",
+                ha="center",
+                va="bottom",
+                fontsize=max(4.0, lane_fs),
+                color="#123",
+            )
+
+    for i in range(n):
+        _rect(ax, xs[i], y_out, w, h, facecolor="#dff5df", edgecolor="#1a6c2e", linewidth=0.9)
+        if n <= 16:
+            ax.text(
+                xs[i],
+                y_out - h / 2 - 0.018,
+                f"O{i}",
+                ha="center",
+                va="top",
+                fontsize=max(4.0, lane_fs),
+                color="#123",
+            )
+
+    # Legend for op-specific params (tiny)
+    if op == "shfl":
+        pa, pb = _shfl_linear_permutation_params(n)
+        note = f"per-lane srcLane: ( {pa}·i + {pb} ) mod {n} (each Oi reads a different Li)"
+    elif op == "shfl_up":
+        note = "delta=1 (from lower lane id)"
+    elif op == "shfl_down":
+        note = "delta=1 (from higher lane id)"
+    else:
+        m = 4 % n if n else 0
+        if m == 0:
+            m = 1 if n > 1 else 0
+        note = f"maskLane={m} (XOR index)"
+
+    ax.text(0.5, 0.06, note, ha="center", va="center", fontsize=max(3.8, lane_fs - 0.8), color="#444")
+
+    xor_mask = 4
+    for dst in range(n):
+        src = _shuffle_src(op, n, dst, delta=1, xor_mask=xor_mask)
+        x0a, y0a = xs[src], _rect_bottom(y_in, h)
+        x1a, y1a = xs[dst], _rect_top(y_out, h)
+        dx = x1a - x0a
+        rad = 0.22 * (1.0 if dx >= 0 else -1.0) * (0.35 + min(abs(dx) * 1.8, 0.9))
+        style = "arc3,rad=" + str(rad)
+        a = FancyArrowPatch(
+            (x0a, y0a),
+            (x1a, y1a),
+            arrowstyle="-|>",
+            linestyle="-",
+            linewidth=0.75,
+            color="#333",
+            connectionstyle=style,
+            mutation_scale=7,
+            zorder=3,
+            clip_on=False,
+        )
+        ax.add_patch(a)
+
+    # Optional small node at each wire midpoint (purely decorative; drawn only for small N)
+    if n <= 8:
+        cr = min(0.014, 0.5 * h)
+        for dst in range(n):
+            xm = 0.5 * (xs[_shuffle_src(op, n, dst, xor_mask=xor_mask)] + xs[dst])
+            ym = 0.5 * (y_in + y_out)
+            c = Circle((xm, ym), cr, facecolor="#f6f0ff", edgecolor="#555", linewidth=0.6, zorder=1)
+            ax.add_patch(c)
+
+
+def _save_montage_4x2(out_path: str, *, dpi: int = 140) -> None:
+    forms = formats()
+    ops: List[Tuple[str, str, str]] = [
+        ("shfl", "shfl (per-lane indexed)", "__shfl_sync"),
+        ("shfl_up", "shfl_up", "__shfl_up_sync"),
+        ("shfl_down", "shfl_down", "__shfl_down_sync"),
+        ("shfl_xor", "shfl_xor", "__shfl_xor_sync"),
+    ]
+
+    fig, axes = plt.subplots(4, 2, figsize=(14, 22), dpi=dpi)
+    fig.suptitle(
+        f"NVIDIA warp shuffle — {TILE_BYTES} B tile, ▭ width ∝ element size (FP32 / FP16–BF16)",
+        fontsize=13,
+        y=0.995,
+    )
+
+    for r, (op, short, cuda) in enumerate(ops):
+        for c, spec in enumerate(forms):
+            render_shuffle_pane(axes[r, c], op, short, cuda, spec, title_fs=7.6, lane_fs=5.0)
+
+    for c, spec in enumerate(forms):
+        axes[0, c].text(
+            0.5,
+            1.02,
+            spec.label,
+            transform=axes[0, c].transAxes,
+            ha="center",
+            va="bottom",
+            fontsize=10,
+            fontweight="600",
+            color="#111",
+        )
+
+    fig.text(
+        0.5,
+        0.012,
+        (
+            "Each column fixes E (bytes/element) with N×E = 16 B; rows are shuffle kinds. "
+            "Only FP32 (E=4) and FP16/BF16 (E=2) are shown; smaller-precision formats are "
+            "out of scope. Top row: each Oi uses its own srcLane(i) (here a linear permutation "
+            "on lanes; broadcast is the special case srcLane(i)=const). "
+            "Arrows: value at Oi from Lsrc (schematic; mask/warp participation omitted)."
+        ),
+        ha="center",
+        fontsize=8.5,
+        color="#222",
+    )
+    fig.subplots_adjust(left=0.04, right=0.96, top=0.94, bottom=0.04, hspace=0.42, wspace=0.20)
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+
+
+def main(argv: Sequence[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description="Plot NVIDIA warp shuffle 4×2 montage (one PNG).")
+    p.add_argument(
+        "-o",
+        "--out",
+        default=os.path.join(os.path.dirname(__file__), "tile16_figures", "nv_shuffle_all.png"),
+        help="Output PNG path (directories created if missing).",
+    )
+    p.add_argument("--dpi", type=int, default=140, help="Figure DPI.")
+    args = p.parse_args(list(argv) if argv is not None else None)
+
+    out_dir = os.path.dirname(os.path.abspath(args.out))
+    if out_dir:
+        os.makedirs(out_dir, exist_ok=True)
+    _save_montage_4x2(args.out, dpi=args.dpi)
+    print(f"Wrote {args.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
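Note on `nv_shuffle.py`: the lane mappings the figure draws can be checked independently of matplotlib. The following is a minimal Python sketch under stated assumptions, not part of the script: the function names are hypothetical, each function returns the list of source lanes indexed by destination lane, and the up/down variants follow the CUDA rule that a lane whose source index falls outside the range keeps its own value. For `delta=1`, which is the only case the figure uses, that rule and clamping to the end lane give the same mapping.

```python
import math
from typing import List


def linear_permutation(n: int) -> List[int]:
    """src(i) = (a*i + 1) % n with the smallest a >= 2 coprime to n (a = 1 if none)."""
    if n <= 1:
        return [0] * n
    a = next((c for c in range(2, n) if math.gcd(c, n) == 1), 1)
    return [(a * i + 1) % n for i in range(n)]


def shfl_up(n: int, delta: int) -> List[int]:
    """Lane i reads lane i - delta; the lowest delta lanes keep their own value."""
    return [i - delta if i - delta >= 0 else i for i in range(n)]


def shfl_down(n: int, delta: int) -> List[int]:
    """Lane i reads lane i + delta; the highest delta lanes keep their own value."""
    return [i + delta if i + delta < n else i for i in range(n)]


def shfl_xor(n: int, mask: int) -> List[int]:
    """Butterfly exchange; assumes n is a power of two and 0 <= mask < n."""
    return [i ^ mask for i in range(n)]
```

Because `gcd(a, n) == 1`, the linear map is a bijection on lane indices, which is why every output lane in the top row of the figure reads a distinct input lane.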
diff --git a/designs/outerCube/plot_tile16_vector_datapath.py b/designs/outerCube/plot_tile16_vector_datapath.py
new file mode 100644
index 00000000..7d21b991
--- /dev/null
+++ b/designs/outerCube/plot_tile16_vector_datapath.py
@@ -0,0 +1,770 @@
+#!/usr/bin/env python3
+"""
+Generate small-tile (16 B) vector datapath illustrations inspired by
+`vector4k.md`: A and B holding registers as contiguous blocks (all A, then all
+B), ▭ width ∝ E, circles = 2-input execution units, crossed wires.
+
+Only **FP32** (E = 4 B/elem) and **FP16 / BF16** (E = 2 B/elem) are supported;
+smaller-precision formats (FP8 / MXFP4 / HiFP4) are intentionally excluded.
+
+Writes 4 PNGs (one montage per operation type: 8 subplots each) to an output
+directory (default: ./tile16_figures next to this script). Types 1–4: elementwise,
+reduce, expand, mergesort (compare–swap levels).
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+from dataclasses import dataclass
+from typing import Callable, List, Sequence, Tuple
+
+from matplotlib.axes import Axes
+
+import matplotlib.pyplot as plt
+from matplotlib.patches import Circle, FancyArrowPatch, Rectangle
+
+TILE_BYTES = 16
+# Reference element size (bytes) for rectangle width scaling in figures.
+E_REF = 4.0
+
+
+@dataclass(frozen=True)
+class TileShape:
+    idx: int
+    R: int
+    C: int
+    E: float  # bytes per logical element — only 4.0 (FP32) or 2.0 (FP16/BF16) supported
+    label: str
+
+    def __post_init__(self) -> None:
+        if self.E not in (4.0, 2.0):
+            raise ValueError(
+                f"Unsupported element size E={self.E}; only FP32 (4 B) and "
+                f"FP16/BF16 (2 B) are allowed."
+            )
+
+    @property
+    def N(self) -> int:
+        assert abs(self.R * self.C * self.E - TILE_BYTES) < 1e-6, self
+        return int(round(TILE_BYTES / self.E))
+
+    def fmt_desc(self) -> str:
+        return f"R×C = {self.R}×{self.C}, E = {self.E} B/elem → N = {self.N}"
+
+
+def canonical_shapes() -> List[TileShape]:
+    # Eight distinct (R, C, E) with R*C*E = 16 B; only FP32 (E=4) and FP16/BF16 (E=2).
+    # FP32 has 3 legal shapes, FP16 has 4; one shape appears with both FP16 and BF16
+    # labels to emphasise that BF16 shares E=2 B/elem storage with FP16.
+    return [
+        TileShape(1, 1, 4, 4.0, "1×4 FP32 (wide row)"),
+        TileShape(2, 2, 2, 4.0, "2×2 FP32"),
+        TileShape(3, 4, 1, 4.0, "4×1 FP32 (tall col)"),
+        TileShape(4, 1, 8, 2.0, "1×8 FP16 (wide row)"),
+        TileShape(5, 2, 4, 2.0, "2×4 FP16"),
+        TileShape(6, 4, 2, 2.0, "4×2 FP16"),
+        TileShape(7, 8, 1, 2.0, "8×1 FP16 (tall col)"),
+        TileShape(8, 2, 4, 2.0, "2×4 BF16 (same E as FP16)"),
+    ]
+
+
+def _rect(ax: Axes, cx: float, cy: float, w: float, h: float, **kwargs) -> Rectangle:
+    """Axis-aligned rectangle centered at (cx, cy)."""
+    kwargs.setdefault("zorder", 2)
+    r = Rectangle((cx - w / 2, cy - h / 2), w, h, fill=True, **kwargs)
+    ax.add_patch(r)
+    return r
+
+
+def _rect_top(cy: float, h: float) -> float:
+    return cy + h / 2
+
+
+def _rect_bottom(cy: float, h: float) -> float:
+    return cy - h / 2
+
+
+def _circ_top(cy: float, r: float) -> float:
+    return cy + r
+
+
+def _circ_bottom(cy: float, r: float) -> float:
+    return cy - r
+
+
+def _ab_block_layout(
+    E: float,
+    n_a: int,
+    n_b: int,
+    x0: float,
+    x1: float,
+) -> Tuple[float, float, float, float, List[float], List[float]]:
+    """
+    All A operands contiguous, then a gap, then all B operands contiguous.
+    Same width w for every element rectangle (w ∝ E). Returns:
+    (w, h, gap_inside_block, gap_between_ab, xa_centers, xb_centers).
+    """
+    # E ∈ {2.0, 4.0} post-simplification (FP16/BF16 or FP32). Keep a small floor
+    # for numerical safety if a future caller passes an unexpected E.
+    e = max(float(E), 1.0)
+    span = x1 - x0
+    gap_block = min(0.032, 0.09 * span)
+    g_in = min(0.009, 0.14 * span / max(n_a + n_b, 2))
+    gaps_inside = max(n_a - 1, 0) * g_in + max(n_b - 1, 0) * g_in
+    coef = n_a + n_b
+    slack = span - gap_block - gaps_inside
+    if slack <= 0:
+        # Not enough room: halve both gap sizes, then recompute the slack.
+        gap_block *= 0.5
+        g_in *= 0.5
+        gaps_inside = max(n_a - 1, 0) * g_in + max(n_b - 1, 0) * g_in
+        slack = span - gap_block - gaps_inside
+    K = slack / (coef * E_REF)
+    w = max(0.0025, K * e)
+    if coef * w + gaps_inside + gap_block > span:
+        w = max(0.0025, (span - gap_block - gaps_inside) / max(coef, 1))
+    h = min(0.032, 0.52 / max(n_a + n_b, 1))
+
+    xa: List[float] = []
+    x = x0
+    for i in range(n_a):
+        xa.append(x + w / 2)
+        x += w
+        if i < n_a - 1:
+            x += g_in
+    x += gap_block
+    xb: List[float] = []
+    for i in range(n_b):
+        xb.append(x + w / 2)
+        x += w
+        if i < n_b - 1:
+            x += g_in
+    return w, h, g_in, gap_block, xa, xb
+
+
+def _circle(ax: Axes, cx: float, cy: float, r: float, **kwargs) -> Circle:
+    kwargs.setdefault("zorder", 2)
+    c = Circle((cx, cy), r, fill=True, **kwargs)
+    ax.add_patch(c)
+    return c
+
+
+def _wire_straight(
+    ax: Axes,
+    p0: Tuple[float, float],
+    p1: Tuple[float, float],
+    *,
+    color: str = "#444",
+    lw: float = 0.9,
+    zorder: float = 4,
+) -> None:
+    """Straight segment (exact endpoints; draws above operand patches)."""
+    ax.plot(
+        [p0[0], p1[0]],
+        [p0[1], p1[1]],
+        color=color,
+        linewidth=lw,
+        solid_capstyle="round",
+        zorder=zorder,
+        clip_on=False,
+    )
+
+
+def _wire(
+    ax: Axes,
+    p0: Tuple[float, float],
+    p1: Tuple[float, float],
+    rad: float = 0.0,
+    color: str = "#444",
+    lw: float = 0.9,
+    zorder: float = 4,
+) -> None:
+    if abs(rad) < 1e-9:
+        _wire_straight(ax, p0, p1, color=color, lw=lw, zorder=zorder)
+        return
+    style = "arc3,rad=" + str(rad)
+    a = FancyArrowPatch(
+        p0,
+        p1,
+        arrowstyle="-",
+        linestyle="-",
+        linewidth=lw,
+        color=color,
+        connectionstyle=style,
+        mutation_scale=1,
+        zorder=zorder,
+        clip_on=False,
+    )
+    ax.add_patch(a)
+
+
+def _odd_even_transposition_stages(n: int, max_stages: int) -> List[List[Tuple[int, int]]]:
+    """Disjoint compare–swap pairs per stage (odd / even offset), straight multistage network."""
+    out: List[List[Tuple[int, int]]] = []
+    for p in range(max_stages):
+        pairs: List[Tuple[int, int]] = []
+        start = p % 2
+        for i in range(start, n - 1, 2):
+            pairs.append((i, i + 1))
+        out.append(pairs)
+    return out
+
+
+def _title_block(ax: Axes, kind: str, spec: TileShape, *, fontsize: float = 11) -> None:
+    ax.set_title(
+        f"{kind} — {TILE_BYTES} B tile\n{spec.label} — {spec.fmt_desc()}",
+        fontsize=fontsize,
+        pad=6,
+    )
+
+
+def render_elementwise(
+    ax: Axes,
+    spec: TileShape,
+    *,
+    title_fontsize: float = 9,
+    label_fs: float = 6,
+    show_footer: bool = False,
+) -> None:
+    N = spec.N
+    ax.set_xlim(0, 1)
+    ax.set_ylim(0, 1)
+    ax.axis("off")
+    _title_block(ax, "Element-wise (2-input / lane)", spec, fontsize=title_fontsize)
+
+    # One row: all A contiguous, gap, all B contiguous; width ∝ E; ALU below each (A[i],B[i]) pair.
+    x0, x1 = 0.04, 0.96
+    y_reg, y_u, y_o = 0.86, 0.50, 0.30
+    w, h, _g_in, _gap_ab, xa_list, xb_list = _ab_block_layout(float(spec.E), N, N, x0, x1)
+    cr = min(0.022, 0.04 / max(N / 8, 1), 0.55 * h)
+
+    ax.text(0.02, y_reg + h / 2 + 0.045, "A block | B block (width ∝ E)", fontsize=max(5.0, label_fs))
+
+    for i in range(N):
+        xa_c, xb_c = xa_list[i], xb_list[i]
+        x_mid = 0.5 * (xa_c + xb_c)
+        _rect(ax, xa_c, y_reg, w, h, facecolor="#cfe8ff", edgecolor="#1a4f8c", linewidth=1.0)
+        ax.text(xa_c, y_reg + h / 2 + 0.028, f"A{i}", ha="center", va="bottom", fontsize=label_fs)
+        _rect(ax, xb_c, y_reg, w, h, facecolor="#ffe8cf", edgecolor="#8c4f1a", linewidth=1.0)
+        ax.text(xb_c, y_reg + h / 2 + 0.028, f"B{i}", ha="center", va="bottom", fontsize=label_fs)
+
+        _circle(ax, x_mid, y_u, cr, facecolor="#e8e8ff", edgecolor="#333", linewidth=1.0)
+        ax.text(x_mid, y_u, "⊕", ha="center", va="center", fontsize=max(5.0, label_fs), color="#222")
+
+        rad_a = 0.18 * ((-1) ** i)
+        rad_b = -rad_a
+        _wire(
+            ax,
+            (xa_c, _rect_bottom(y_reg, h)),
+            (x_mid, _circ_top(y_u, cr)),
+            rad=rad_a,
+        )
+        _wire(
+            ax,
+            (xb_c, _rect_bottom(y_reg, h)),
+            (x_mid, _circ_top(y_u, cr)),
+            rad=rad_b,
+        )
+
+        wo, ho = w * 0.92, h * 0.92
+        _rect(ax, x_mid, y_o, wo, ho, facecolor="#dff5df", edgecolor="#1a6c2e", linewidth=0.9)
+        _wire_straight(
+            ax,
+            (x_mid, _circ_bottom(y_u, cr)),
+            (x_mid, _rect_top(y_o, ho)),
+            lw=0.7,
+            color="#555",
+        )
+
+    if show_footer:
+        ax.text(
+            0.5,
+            0.02,
+            "▭ = element (width ∝ E); ○ = 2-input ALU; opposite bends = crossed operand paths.",
+            ha="center",
+            va="bottom",
+            fontsize=6,
+            color="#333",
+        )
+
+
+def render_reduce(
+    ax: Axes,
+    spec: TileShape,
+    *,
+    title_fontsize: float = 9,
+    label_fs: float = 5.5,
+    show_footer: bool = False,
+) -> None:
+    """Row-style reduce along C for fiber row 0; then binary reduction tree."""
+    C = spec.C
+    ax.set_xlim(0, 1)
+    ax.set_ylim(0, 1)
+    ax.axis("off")
+    _title_block(ax, "Reduce (row 0, axis C)", spec, fontsize=title_fontsize)
+
+    N = C
+    x0, x1 = 0.06, 0.94
+    y_reg = 0.12
+    y_leaf_circ = 0.30
+    w, h, _g_in, _gap_ab, xa_list, xb_list = _ab_block_layout(float(spec.E), N, N, x0, x1)
+    cr = min(0.024, 0.05 / max(N / 8, 1), 0.55 * h)
+
+    ax.text(0.02, 0.94, f"fiber row 0; R={spec.R}", fontsize=max(5.0, label_fs))
+    ax.text(0.02, y_reg + h / 2 + 0.05, "A block | B block (width ∝ E)", fontsize=max(5.0, label_fs))
+
+    for i in range(N):
+        xa_c, xb_c = xa_list[i], xb_list[i]
+        x_mid = 0.5 * (xa_c + xb_c)
+        _rect(ax, xa_c, y_reg, w, h, facecolor="#cfe8ff", edgecolor="#1a4f8c", linewidth=1.0)
+        ax.text(xa_c, y_reg + h / 2 + 0.028, f"A0,{i}", ha="center", va="bottom", fontsize=label_fs)
+        _rect(ax, xb_c, y_reg, w, h, facecolor="#ffe8cf", edgecolor="#8c4f1a", linewidth=1.0)
+        ax.text(xb_c, y_reg + h / 2 + 0.028, f"B0,{i}", ha="center", va="bottom", fontsize=label_fs)
+
+    leaf_out_x: List[float] = []
+    for i in range(N):
+        xa_c, xb_c = xa_list[i], xb_list[i]
+        x_mid = 0.5 * (xa_c + xb_c)
+        _circle(ax, x_mid, y_leaf_circ, cr, facecolor="#e8e8ff", edgecolor="#333", linewidth=1.0)
+        rad_a = 0.12 * ((-1) ** i)
+        _wire(
+            ax,
+            (xa_c, _rect_top(y_reg, h)),
+            (x_mid, _circ_bottom(y_leaf_circ, cr)),
+            rad=rad_a,
+        )
+        _wire(
+            ax,
+            (xb_c, _rect_top(y_reg, h)),
+            (x_mid, _circ_bottom(y_leaf_circ, cr)),
+            rad=-rad_a,
+        )
+        leaf_out_x.append(x_mid)
+
+    leaves = [(x, y_leaf_circ) for x in leaf_out_x]
+    cur = leaves
+    dy = 0.09
+    r_tree = cr * 1.05
+    y = y_leaf_circ + cr + dy * 0.55
+    depth = 0
+    while len(cur) > 1:
+        r_child = cr if depth == 0 else r_tree
+        nxt: List[Tuple[float, float]] = []
+        for j in range(0, len(cur), 2):
+            if j + 1 < len(cur):
+                xm = 0.5 * (cur[j][0] + cur[j + 1][0])
+                xL, xR = cur[j][0], cur[j + 1][0]
+            else:
+                xm = cur[j][0]
+                xL = xR = cur[j][0]
+            _circle(ax, xm, y, r_tree, facecolor="#f3e8ff", edgecolor="#333", linewidth=1.0)
+            _wire(
+                ax,
+                (xL, cur[j][1] + r_child),
+                (xm, _circ_bottom(y, r_tree)),
+                rad=0.05,
+            )
+            if j + 1 < len(cur):
+                _wire(
+                    ax,
+                    (xR, cur[j + 1][1] + r_child),
+                    (xm, _circ_bottom(y, r_tree)),
+                    rad=-0.05,
+                )
+            nxt.append((xm, y))
+        cur = nxt
+        y += dy
+        depth += 1
+
+    root = cur[0]
+    w_acc = max(w * 1.25, 0.014)
+    y_acc_c = root[1] + 0.08
+    _rect(
+        ax,
+        root[0],
+        y_acc_c,
+        w_acc,
+        h,
+        facecolor="#ffd7d7",
+        edgecolor="#8c1a1a",
+        linewidth=1.1,
+    )
+    _wire_straight(
+        ax,
+        (root[0], _circ_top(root[1], r_tree)),
+        (root[0], _rect_bottom(y_acc_c, h)),
+        lw=0.85,
+        color="#333",
+    )
+    ax.text(root[0], y_acc_c + h / 2 + 0.045, "Acc", ha="center", va="bottom", fontsize=max(5.0, label_fs + 1))
+
+    if show_footer:
+        ax.text(
+            0.5,
+            0.01,
+            "○ = 2-input combine; upper = reduction tree.",
+            ha="center",
+            fontsize=6,
+        )
+
+
+def _fanout_levels_top_down(
+    C: int, x0: float, x1: float, y_top: float, y_bottom: float
+) -> List[List[Tuple[float, float]]]:
+    """Balanced pairing tree: index 0 is root, last level has C leaves (left-to-right x)."""
+    xs = [x0 + (x1 - x0) * i / max(C - 1, 1) for i in range(C)]
+    layers_bottom_up: List[List[Tuple[float, float]]] = [[(x, 0.0) for x in xs]]
+    while len(layers_bottom_up[-1]) > 1:
+        prev = layers_bottom_up[-1]
+        nxt: List[Tuple[float, float]] = []
+        for j in range(0, len(prev), 2):
+            if j + 1 < len(prev):
+                xm = 0.5 * (prev[j][0] + prev[j + 1][0])
+            else:
+                xm = prev[j][0]
+            nxt.append((xm, 0.0))
+        layers_bottom_up.append(nxt)
+    layers = list(reversed(layers_bottom_up))
+    nh = len(layers)
+    for i, lev in enumerate(layers):
+        if nh == 1:
+            yi = y_top
+        else:
+            yi = y_top - i * ((y_top - y_bottom) / (nh - 1))
+        layers[i] = [(x, yi) for (x, _) in lev]
+    return layers
+
+
+def _fanout_levels_from_leaf_xs(
+    leaf_xs: List[float], y_top: float, y_bottom: float
+) -> List[List[Tuple[float, float]]]:
+    """Same pairing tree as `_fanout_levels_top_down`, but leaf x positions are explicit."""
+    if not leaf_xs:
+        return []
+    layers_bottom_up: List[List[Tuple[float, float]]] = [[(x, 0.0) for x in leaf_xs]]
+    while len(layers_bottom_up[-1]) > 1:
+        prev = layers_bottom_up[-1]
+        nxt: List[Tuple[float, float]] = []
+        for j in range(0, len(prev), 2):
+            if j + 1 < len(prev):
+                xm = 0.5 * (prev[j][0] + prev[j + 1][0])
+            else:
+                xm = prev[j][0]
+            nxt.append((xm, 0.0))
+        layers_bottom_up.append(nxt)
+    layers = list(reversed(layers_bottom_up))
+    nh = len(layers)
+    for i, lev in enumerate(layers):
+        if nh == 1:
+            yi = y_top
+        else:
+            yi = y_top - i * ((y_top - y_bottom) / (nh - 1))
+        layers[i] = [(x, yi) for (x, _) in lev]
+    return layers
+
+
+def render_expand(
+    ax: Axes,
+    spec: TileShape,
+    *,
+    title_fontsize: float = 9,
+    label_fs: float = 5.5,
+    show_footer: bool = False,
+) -> None:
+    """Expand along C: v[r] in A fans out; B holds src row; leaf 2-input combines."""
+    C = spec.C
+    ax.set_xlim(0, 1)
+    ax.set_ylim(0, 1)
+    ax.axis("off")
+    _title_block(ax, "Expand (row 0, fanout C)", spec, fontsize=title_fontsize)
+
+    y_fan_top = 0.82
+    y_fan_bot = 0.52
+    y_leaf_circ = 0.34
+    y_row = 0.12
+
+    x0, x1 = 0.05, 0.95
+    w, h, _g_in, _gap_ab, xa_list, xb_list = _ab_block_layout(float(spec.E), 1, C, x0, x1)
+    cr = min(0.026, 0.05 / max(C / 8, 1), 0.55 * h)
+    xv = xa_list[0]
+    leaves_x = list(xb_list)
+
+    levels = _fanout_levels_from_leaf_xs(leaves_x, y_top=y_fan_top, y_bottom=y_fan_bot)
+    root_fan_x, root_fan_y = levels[0][0]
+
+    ax.text(0.02, 0.94, f"row 0; R={spec.R}", fontsize=max(5.0, label_fs))
+    ax.text(0.02, y_row + h / 2 + 0.05, "A block (v) | B block (width ∝ E)", fontsize=max(5.0, label_fs))
+
+    _rect(ax, xv, y_row, w, h, facecolor="#cfe8ff", edgecolor="#1a4f8c", linewidth=1.0)
+    ax.text(xv, y_row + h / 2 + 0.03, "v[r]", ha="center", va="bottom", fontsize=max(5.0, label_fs + 0.5))
+    for i, xb in enumerate(xb_list):
+        _rect(ax, xb, y_row, w, h, facecolor="#ffe8cf", edgecolor="#8c4f1a", linewidth=1.0)
+        ax.text(xb, y_row + h / 2 + 0.03, f"B0,{i}", ha="center", va="bottom", fontsize=label_fs)
+
+    for lev in levels:
+        for (x, y) in lev:
+            _circle(ax, x, y, cr, facecolor="#eef8ff", edgecolor="#333", linewidth=1.0)
+
+    _wire_straight(
+        ax,
+        (xv, _rect_top(y_row, h)),
+        (root_fan_x, _circ_bottom(root_fan_y, cr)),
+        lw=0.85,
+        color="#333",
+    )
+
+    for li in range(len(levels) - 1):
+        parents = levels[li]
+        children = levels[li + 1]
+        for pi, (xp, yp) in enumerate(parents):
+            for ci in (2 * pi, 2 * pi + 1):
+                if ci < len(children):
+                    xc, yc = children[ci]
+                    _wire(
+                        ax,
+                        (xp, _circ_bottom(yp, cr)),
+                        (xc, _circ_top(yc, cr)),
+                        rad=0.035 * ((-1) ** (ci + pi)),
+                    )
+
+    y_fan_leaves = levels[-1][0][1]
+    for i, x_lane in enumerate(leaves_x):
+        xb = xb_list[i]
+        y_out_c = y_leaf_circ + 0.09
+        wo, ho = w * 0.9, h * 0.9
+        _circle(ax, x_lane, y_leaf_circ, cr, facecolor="#e8ffe8", edgecolor="#333", linewidth=1.0)
+        _wire_straight(
+            ax,
+            (x_lane, _circ_bottom(y_fan_leaves, cr)),
+            (x_lane, _circ_top(y_leaf_circ, cr)),
+            lw=0.9,
+            color="#444",
+        )
+        _wire(
+            ax,
+            (xb, _rect_top(y_row, h)),
+            (x_lane, _circ_bottom(y_leaf_circ, cr)),
+            rad=0.14 * ((-1) ** i),
+            color="#8c4f1a",
+            lw=1.0,
+        )
+        _rect(ax, x_lane, y_out_c, wo, ho, facecolor="#dff5df", edgecolor="#1a6c2e", linewidth=0.9)
+        _wire_straight(
+            ax,
+            (x_lane, _circ_top(y_leaf_circ, cr)),
+            (x_lane, _rect_bottom(y_out_c, ho)),
+            lw=0.75,
+            color="#555",
+        )
+
+    if show_footer:
+        ax.text(
+            0.5,
+            0.01,
+            "Fanout tree + leaf ○ with B lanes.",
+            ha="center",
+            fontsize=6,
+        )
+
+
+def render_mergesort(
+    ax: Axes,
+    spec: TileShape,
+    *,
+    title_fontsize: float = 9,
+    label_fs: float = 5.5,
+    show_footer: bool = False,
+) -> None:
+    """
+    Multi-level compare–swap (2-in / 2-out) only: straight wires from data to each ○,
+    straight wires between levels on fixed tracks; no shuffle bus / no horizontal rails.
+    """
+    ax.set_xlim(0, 1)
+    ax.set_ylim(0, 1)
+    ax.axis("off")
+    _title_block(ax, "Mergesort (compare–swap)", spec, fontsize=title_fontsize)
+
+    Ntile = spec.N
+    n_vis = min(Ntile, 8)
+    if n_vis < 2:
+        n_vis = 2
+    if n_vis % 2 == 1:
+        n_vis -= 1
+    m = n_vis // 2
+    E = float(spec.E)
+    x0, x1 = 0.06, 0.94
+    w, h, _g, _ggap, xa_list, xb_list = _ab_block_layout(E, m, m, x0, x1)
+
+    y_reg = 0.84
+    for i in range(m):
+        _rect(ax, xa_list[i], y_reg, w, h, facecolor="#cfe8ff", edgecolor="#1a4f8c", linewidth=0.9)
+        ax.text(
+            xa_list[i],
+            y_reg + h / 2 + 0.02,
+            f"A{i}",
+            ha="center",
+            va="bottom",
+            fontsize=max(4.5, label_fs - 0.5),
+        )
+    for i in range(m):
+        _rect(ax, xb_list[i], y_reg, w, h, facecolor="#ffe8cf", edgecolor="#8c4f1a", linewidth=0.9)
+        ax.text(
+            xb_list[i],
+            y_reg + h / 2 + 0.02,
+            f"B{i}",
+            ha="center",
+            va="bottom",
+            fontsize=max(4.5, label_fs - 0.5),
+        )
+
+    note = f"n={n_vis} lanes" + (f" (tile N={Ntile})" if n_vis < Ntile else "")
+    ax.text(0.02, 0.97, note, fontsize=max(4.5, label_fs - 0.5))
+
+    n = 2 * m
+    xs_lane: List[float] = []
+    for i in range(m):
+        xs_lane.append(xa_list[i])
+        xs_lane.append(xb_list[i])
+    y_rect_bot = _rect_bottom(y_reg, h)
+    cr = min(0.017, 0.32 / n, 0.45 * h)
+    leg = max(0.028, 0.05 - 0.002 * n)
+    max_stages = max(3, min(2 * n - 1, 10))
+    stages = _odd_even_transposition_stages(n, max_stages)
+
+    y_track = [y_rect_bot] * n
+    for pairs in stages:
+        y_in = min(y_track)
+        y_cs = y_in - leg - cr
+        y_next = y_cs - cr - 0.012
+        paired: set[int] = set()
+        for (i, j) in pairs:
+            paired.add(i)
+            paired.add(j)
+            xm = 0.5 * (xs_lane[i] + xs_lane[j])
+            _circle(ax, xm, y_cs, cr, facecolor="#f0e8ff", edgecolor="#333", linewidth=0.85)
+            ax.text(
+                xm,
+                y_cs,
+                "C&S",
+                ha="center",
+                va="center",
+                fontsize=max(4.0, label_fs - 1.0),
+                color="#222",
+            )
+            _wire_straight(ax, (xs_lane[i], y_track[i]), (xm, _circ_top(y_cs, cr)), lw=0.75, color="#444")
+            _wire_straight(ax, (xs_lane[j], y_track[j]), (xm, _circ_top(y_cs, cr)), lw=0.75, color="#444")
+            dx = cr * 0.55
+            _wire_straight(ax, (xm - dx, _circ_bottom(y_cs, cr)), (xs_lane[i], y_next), lw=0.75, color="#444")
+            _wire_straight(ax, (xm + dx, _circ_bottom(y_cs, cr)), (xs_lane[j], y_next), lw=0.75, color="#444")
+            y_track[i] = y_next
+            y_track[j] = y_next
+        for k in range(n):
+            if k not in paired:
+                _wire_straight(ax, (xs_lane[k], y_track[k]), (xs_lane[k], y_next), lw=0.65, color="#888")
+                y_track[k] = y_next
+
+    y_cur = min(y_track)
+    y_out = max(0.07, y_cur - 0.025)
+    wo, ho = w * 0.82, h * 0.82
+    for k in range(n):
+        _wire_straight(
+            ax,
+            (xs_lane[k], y_track[k]),
+            (xs_lane[k], _rect_top(y_out, ho)),
+            lw=0.55,
+            color="#666",
+        )
+    for k in range(n):
+        _rect(ax, xs_lane[k], y_out, wo, ho, facecolor="#dff5df", edgecolor="#1a6c2e", linewidth=0.8)
+    ax.text(0.5, y_out - ho / 2 - 0.018, "merged lane values", ha="center", va="top", fontsize=5)
+
+    if show_footer:
+        ax.text(
+            0.5,
+            0.015,
+            "▭ = key; ○ = compare–swap (2 in / 2 out); only straight segments, no shuffle / bus lines.",
+            ha="center",
+            fontsize=5,
+            color="#333",
+        )
+
+
+def _save_montage(
+    out_path: str,
+    shapes: List[TileShape],
+    render: Callable[..., None],
+    suptitle: str,
+    footnote: str,
+    *,
+    figsize: Tuple[float, float] = (28, 16),
+    dpi: int = 140,
+) -> None:
+    fig, axes = plt.subplots(2, 4, figsize=figsize, dpi=dpi)
+    fig.suptitle(suptitle, fontsize=14, y=0.995)
+    for ax, spec in zip(axes.ravel(), shapes):
+        render(ax, spec, title_fontsize=8.5, label_fs=5.5, show_footer=False)
+    fig.text(0.5, 0.012, footnote, ha="center", fontsize=9, color="#222")
+    fig.subplots_adjust(left=0.03, right=0.97, top=0.93, bottom=0.06, hspace=0.38, wspace=0.22)
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+
+
+def main(argv: Sequence[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description="Plot 16 B tile vector datapath montages (4 PNGs).")
+    p.add_argument(
+        "-o",
+        "--out-dir",
+        default=os.path.join(os.path.dirname(__file__), "tile16_figures"),
+        help="Directory for PNG output (created if missing).",
+    )
+    args = p.parse_args(list(argv) if argv is not None else None)
+
+    os.makedirs(args.out_dir, exist_ok=True)
+    shapes = canonical_shapes()
+
+    _save_montage(
+        os.path.join(args.out_dir, "elementwise_all.png"),
+        shapes,
+        render_elementwise,
+        suptitle=f"Element-wise — 8 tile shapes ({TILE_BYTES} B holding register each)",
+        footnote=(
+            "▭ = element (width ∝ E); all A contiguous, then all B; ○ = 2-input unit; "
+            "crossed bends on operand paths (cf. vector4k dual-port strip buffers)."
+        ),
+    )
+    _save_montage(
+        os.path.join(args.out_dir, "reduce_all.png"),
+        shapes,
+        render_reduce,
+        suptitle=f"Reduce (representative row / fiber) — 8 tile shapes ({TILE_BYTES} B tile)",
+        footnote=(
+            "Per subplot: A block then B block (▭ width ∝ E); lane i pairing into ○, then reduction tree to Acc."
+        ),
+    )
+    _save_montage(
+        os.path.join(args.out_dir, "expand_all.png"),
+        shapes,
+        render_expand,
+        suptitle=f"Expand / fanout — 8 tile shapes ({TILE_BYTES} B tile)",
+        footnote=(
+            "A block (v) then B block (▭ width ∝ E); v fans out; leaf ○ combines fanout with each B lane."
+        ),
+    )
+    _save_montage(
+        os.path.join(args.out_dir, "mergesort_all.png"),
+        shapes,
+        render_mergesort,
+        suptitle=f"Mergesort (compare–swap) — 8 tile shapes ({TILE_BYTES} B tile)",
+        footnote=(
+            "Type 4: A|B keys (▭ ∝ E); multi-stage compare–swap (2-in/2-out ○); straight wires only, "
+            "odd–even transposition layers (schematic multistage C&S, not a full minimal merge depth proof)."
+        ),
+    )
+
+    print(f"Wrote 4 PNG montages to {args.out_dir}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
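
The mergesort montage footnote in the script above describes its odd-even transposition layers as schematic rather than a depth proof. As a rough sanity check, the sketch below rebuilds the same pairing rule as `_odd_even_transposition_stages` and uses the zero-one principle (a compare-swap network that sorts every 0/1 input sorts every input) to test whether the number of stages actually drawn is enough. The names `apply_network`, `sorts_all_binary_inputs`, and `drawn_stage_count` are hypothetical helpers introduced here for illustration; `drawn_stage_count` only mirrors the `max(3, min(2 * n - 1, 10))` expression in `render_mergesort`.

```python
from itertools import product
from typing import List, Sequence, Tuple

Stage = List[Tuple[int, int]]


def odd_even_transposition_stages(n: int, max_stages: int) -> List[Stage]:
    """Same pairing rule as the script's helper: stage p starts at lane p % 2."""
    return [[(i, i + 1) for i in range(p % 2, n - 1, 2)] for p in range(max_stages)]


def apply_network(values: Sequence[int], stages: List[Stage]) -> List[int]:
    """Run every compare-swap of every stage; the smaller value goes to the lower lane."""
    out = list(values)
    for pairs in stages:
        for i, j in pairs:
            if out[i] > out[j]:
                out[i], out[j] = out[j], out[i]
    return out


def sorts_all_binary_inputs(n: int, stages: List[Stage]) -> bool:
    """Zero-one principle: checking all 2**n bit patterns is sufficient."""
    return all(
        apply_network(bits, stages) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


def drawn_stage_count(n: int) -> int:
    """Stage cap copied from render_mergesort (assumed unchanged)."""
    return max(3, min(2 * n - 1, 10))
```

Under these assumptions the drawn network is a complete sorter for the lane counts the figure can show (n = 2, 4, 6, 8), because the cap never drops below n stages. Fewer stages are not enough in general: with n = 4 and only three stages, the reversed input `[3, 2, 1, 0]` ends as `[0, 2, 1, 3]`.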
diff --git a/designs/outerCube/tile16_figures/elementwise_all.png b/designs/outerCube/tile16_figures/elementwise_all.png
new file mode 100644
index 00000000..9926134d
Binary files /dev/null and b/designs/outerCube/tile16_figures/elementwise_all.png differ
diff --git a/designs/outerCube/tile16_figures/expand_all.png b/designs/outerCube/tile16_figures/expand_all.png
new file mode 100644
index 00000000..fb9cbe5f
Binary files /dev/null and b/designs/outerCube/tile16_figures/expand_all.png differ
diff --git a/designs/outerCube/tile16_figures/mergesort_all.png b/designs/outerCube/tile16_figures/mergesort_all.png
new file mode 100644
index 00000000..3c09cf00
Binary files /dev/null and b/designs/outerCube/tile16_figures/mergesort_all.png differ
diff --git a/designs/outerCube/tile16_figures/nv_shuffle_all.png b/designs/outerCube/tile16_figures/nv_shuffle_all.png
new file mode 100644
index 00000000..5c26a9d5
Binary files /dev/null and b/designs/outerCube/tile16_figures/nv_shuffle_all.png differ
diff --git a/designs/outerCube/tile16_figures/reduce_all.png b/designs/outerCube/tile16_figures/reduce_all.png
new file mode 100644
index 00000000..5e4a2f4f
Binary files /dev/null and b/designs/outerCube/tile16_figures/reduce_all.png differ
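
The `tregfile4k.md` change that follows replaces the rectangular bank decode with a skewed map, `bank_local = (l + g) mod 8`, and states that both row-mode and col-mode sweeps are bank-conflict-free. The sketch below is a small model of that map, written only from the formulas in the change; it is not the RTL. The function names are hypothetical, and it assumes the calendar phase `p` equals the port index and that all eight ports use the same mode in a given epoch (mixed-mode epochs are not modeled here).

```python
from typing import List

GROUPS = 8   # bank groups G0..G7
LOCALS = 8   # banks per group


def bank_id(g: int, l: int) -> int:
    """Physical bank of logical chunk (g, l): group g, local index rotated by g."""
    return LOCALS * g + (l + g) % LOCALS


def row_mode_banks(p: int, cy: int) -> List[int]:
    """Banks read by port phase p at cycle cy when is_transpose = 0 (one chunk-grid row)."""
    g = (p + cy) % GROUPS
    return [bank_id(g, l) for l in range(LOCALS)]


def col_mode_banks(p: int, cy: int) -> List[int]:
    """Banks read by port phase p at cycle cy when is_transpose = 1 (one chunk-grid column)."""
    l_active = (p + cy) % LOCALS
    return [bank_id(i, l_active) for i in range(GROUPS)]


def sweep(mode, p: int) -> List[int]:
    """All banks one port touches over a full 8-cycle epoch."""
    return [b for cy in range(8) for b in mode(p, cy)]


def ports_at_cycle(mode, cy: int) -> List[int]:
    """All banks touched in one cycle by ports 0..7 running the same mode."""
    return [b for p in range(8) for b in mode(p, cy)]
```

For example, `col_mode_banks(0, 0)` gives `[0, 9, 18, 27, 36, 45, 54, 63]`, the main diagonal of the placement table, and `row_mode_banks(0, 1)` gives `[9, 10, 11, 12, 13, 14, 15, 8]`, group G1 with its local order rotated by one. Checking that `sweep` and `ports_at_cycle` each return 64 distinct banks for either mode is a quick way to confirm the one-read-per-bank-per-cycle property under the assumptions above.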
diff --git a/designs/outerCube/tregfile4k.md b/designs/outerCube/tregfile4k.md
index 8ec4f874..f1c0dbd2 100644
--- a/designs/outerCube/tregfile4k.md
+++ b/designs/outerCube/tregfile4k.md
@@ -1,6 +1,6 @@
 ### Tile Register File (TRegFile-4K)
 
-The TRegFile-4K is an **8-read / 8-write tile register file** built from **64 physical 1R1W SRAM banks** at **1× core clock**. Storage is organized into **4 KB tiles**, each interleaved across all 64 banks (64 B per bank). An **8-cycle synchronized calendar** rotates port-to-group assignments so that every bank sees exactly **1R + 1W per cycle**. Each port accepts one `reg_idx` which is latched and drives the next **8-cycle epoch**; a new address is accepted every **8 cycles** (one per epoch boundary), enabling zero-bubble back-to-back tile accesses.
+The TRegFile-4K is an **8-read / 8-write tile register file** built from **64 physical 1R1W SRAM banks** at **1× core clock**. Storage is organized into **4 KB tiles**, each striped across all 64 banks (64 B per bank) using a **diagonal (skewed) bank map** so that *both* a row-wise sweep and a column-wise sweep of the tile's logical 8×8 chunk grid are bank-conflict-free. Each read port carries an **`is_transpose` bit** that selects between the two delivery orders at run time (§3, §4.2, §7). An **8-cycle synchronized calendar** rotates port-to-group assignments so that every bank sees exactly **1R + 1W per cycle**. Each port accepts one `reg_idx` (plus `is_transpose` on reads) which is latched and drives the next **8-cycle epoch**; a new address is accepted every **8 cycles** (one per epoch boundary), enabling zero-bubble back-to-back tile accesses.
 
 #### 1. Core Parameters
 
@@ -16,27 +16,61 @@ The TRegFile-4K is an **8-read / 8-write tile register file** built from **64 ph
 
 #### 2. Tile Layout & Physical Organization
 
-Each 4 KB tile is striped across all 64 banks. Bank select is pure wiring (zero decode logic):
+Each 4 KB tile is striped across all 64 banks via a **diagonal (skewed) bank map**. Viewing the 4 KB tile as an **8 × 8 chunk grid** of 64 B chunks, let:
 
-```
-  bank[5:0]    = chunk_offset[5:0]       ← pure wiring
-  SRAM addr    = tile_idx[7:0]           ← 8 bits → 256 rows
+- `g = chunk_offset[5:3]` — **group row** of the chunk grid (0..7).
+- `l = chunk_offset[2:0]` — **local col** of the chunk grid within row g (0..7).
+
+The bank select now includes a **3-bit rotator** (not pure wiring); the SRAM address is unchanged:
 
-  Bank groups (8 banks each):
-    G0 = banks  0– 7    (c[5:3] = 000)
-    G1 = banks  8–15    (c[5:3] = 001)
-    G2 = banks 16–23    (c[5:3] = 010)
-    G3 = banks 24–31    (c[5:3] = 011)
-    G4 = banks 32–39    (c[5:3] = 100)
-    G5 = banks 40–47    (c[5:3] = 101)
-    G6 = banks 48–55    (c[5:3] = 110)
-    G7 = banks 56–63    (c[5:3] = 111)
+```
+  Skewed bank map (replaces the naive rectangular decode):
+    g          = chunk_offset[5:3]         ← group index (chunk-grid row)
+    l          = chunk_offset[2:0]         ← local index (chunk-grid col)
+    bank_group = g                          ← pure wiring
+    bank_local = (l + g) mod 8              ← 3-bit rotator controlled by g
+    bank_id    = 8·g + bank_local           ← 0..63
+    SRAM_addr  = tile_idx[7:0]              ← unchanged (256 rows / bank)
+
+  Bank groups (8 banks each, unchanged):
+    G0 = banks  0– 7    (chunk-grid row 0)
+    G1 = banks  8–15    (chunk-grid row 1)
+    G2 = banks 16–23    (chunk-grid row 2)
+    G3 = banks 24–31    (chunk-grid row 3)
+    G4 = banks 32–39    (chunk-grid row 4)
+    G5 = banks 40–47    (chunk-grid row 5)
+    G6 = banks 48–55    (chunk-grid row 6)
+    G7 = banks 56–63    (chunk-grid row 7)
 
   1 bank  → 64 B   (one chunk)
-  1 group → 512 B  (8 banks, one cycle per port)
+  1 group → 512 B  (8 banks; one cycle per port in row-mode)
   8 groups → 4 KB  (full tile, 8 cycles)
 ```
 
+**Diagonal (skewed) placement of a single tile.** Under this skew, the 64 chunks of tile T occupy all 64 banks, and each logical column lands on **one distinct bank in each group, along a wrapped diagonal**, rather than in a column-aligned rectangular pattern:
+
+```
+  (cells list the LOGICAL chunk (g, l) of tile T at each physical bank slot)
+
+                 bank_local = 0      1      2      3      4      5      6      7
+    G0 (g=0):              (0,0)  (0,1)  (0,2)  (0,3)  (0,4)  (0,5)  (0,6)  (0,7)
+    G1 (g=1):              (1,7)  (1,0)  (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
+    G2 (g=2):              (2,6)  (2,7)  (2,0)  (2,1)  (2,2)  (2,3)  (2,4)  (2,5)
+    G3 (g=3):              (3,5)  (3,6)  (3,7)  (3,0)  (3,1)  (3,2)  (3,3)  (3,4)
+    G4 (g=4):              (4,4)  (4,5)  (4,6)  (4,7)  (4,0)  (4,1)  (4,2)  (4,3)
+    G5 (g=5):              (5,3)  (5,4)  (5,5)  (5,6)  (5,7)  (5,0)  (5,1)  (5,2)
+    G6 (g=6):              (6,2)  (6,3)  (6,4)  (6,5)  (6,6)  (6,7)  (6,0)  (6,1)
+    G7 (g=7):              (7,1)  (7,2)  (7,3)  (7,4)  (7,5)  (7,6)  (7,7)  (7,0)
+
+  Logical row g = {(g, 0), …, (g, 7)}  — entirely inside group G_g
+                                         (local order rotated by g).
+  Logical col l = {(0, l), (1, l), …, (7, l)} — one bank per group,
+                                                along a wrapped diagonal
+                                                (physical locals = (l+0, l+1, …, l+7) mod 8).
+```
+
+Every logical row (chunk-grid row) therefore touches **all 8 banks of one group**, and every logical column (chunk-grid col) touches **exactly one bank in each group**. Either way a single cycle reads 8 distinct banks, and an 8-cycle sweep of the tile in either order covers all 64 banks with no bank visited twice. This is the essential property that enables the transposed-read capability in §7.
+
 ```
  ┌────────────────────────────────────────────────────────────────────────────────────────┐
  │  TRegFile-4K:  256 tiles × 4 KB = 1 MB                                               │
@@ -89,7 +123,12 @@ Each 4 KB tile is striped across all 64 banks. Bank select is pure wiring (zero
 
 #### 3. Port Interface
 
-Each port presents **512 B per cycle** (one group of 8 banks × 64 B). A port accepts one `reg_idx[7:0]` which is **latched** internally at the epoch boundary. The latched address then drives data delivery (read) or acceptance (write) for the addressed tile over the next **8 consecutive cycles** — one bank-group per cycle per the calendar rotation. Since a 4 KB tile requires 8 × 512 B reads, the port is occupied for the full epoch and can only accept a **new `reg_idx` every 8 cycles**.
+Each port presents **512 B per cycle** (8 banks × 64 B; in row-mode the 8 banks of one group, in col-mode one bank per group; see §4.2). A port accepts one `reg_idx[7:0]` (read ports additionally accept **`is_transpose[0]`**), which is **latched** internally at the epoch boundary. The latched address then drives data delivery (read) or acceptance (write) for the addressed tile over the next **8 consecutive cycles**, one 8-bank access per cycle per the calendar rotation. Since a 4 KB tile requires 8 × 512 B transfers, the port is occupied for the full epoch and can only accept a **new `reg_idx` every 8 cycles**.
+
+- **`is_transpose = 0` (ROW mode)** — default. Cycle-by-cycle the port delivers the chunk-grid rows in order, i.e. a linearly-addressed 512 B strip per cycle. Physical banks visited per cycle = **all 8 banks of one group** (same access pattern as a naive rectangular layout); the 3-bit rotator reorders the 8 lanes so that logical chunk `(g, l)` appears at output lane `l`.
+- **`is_transpose = 1` (COL mode)** — delivers the chunk-grid columns in order. Physical banks visited per cycle = **one bank per group** (wrapped diagonal selection). The output lane `i` carries logical chunk `(i, l_active)` where `l_active = (p + cy) mod 8` is the column being scanned this cycle.
+
+Both modes deliver the full 4 KB tile in exactly **8 cycles**; the only difference is the traversal order of the 8 × 8 chunk grid. **Write ports are always row-oriented** — the producer presents 8 chunks of one chunk-grid row per cycle and the write-side rotator places them at the skewed physical banks.
 
 **Epoch-aligned address acceptance:** The port contains a **pending** address register and an **active** address register. A client can write a new `reg_idx` into the pending register at any time during the current epoch. At the next epoch boundary (`cy[2:0]=0`), pending promotes to active and the port begins serving the new tile with **zero bubble**:
 
@@ -110,48 +149,96 @@ Each port presents **512 B per cycle** (one group of 8 banks × 64 B). A port ac
 
 | Ports | Direction | Data Width | Address | Addr Rate |
 |-------|-----------|------------|---------|-----------|
-| **R0–R7** | Read | 512 B (4096 bits) / cy | `reg_idx[7:0]` | 1 addr / 8 cy |
+| **R0–R7** | Read | 512 B (4096 bits) / cy | `reg_idx[7:0]` + **`is_transpose[0]`** | 1 addr / 8 cy |
 | **W0–W7** | Write | 512 B (4096 bits) / cy | `reg_idx[7:0]` + `w_en` | 1 addr / 8 cy |
 
-**Per-port sustained throughput:** 1 tile (4 KB) every 8 cycles = 512 B/cy.
-**Address registers:** 1 pending + 1 active (double-register for zero-bubble epoch chaining).
+**Per-port sustained throughput:** 1 tile (4 KB) every 8 cycles = 512 B/cy (row-mode and col-mode both).
+**Address registers:** 1 pending + 1 active (double-register for zero-bubble epoch chaining). On read ports `is_transpose` is double-registered together with `reg_idx` and is held constant for the full epoch.
 
 **Port microarchitecture (read port Rp):**
 
 ```
-                 reg_idx[7:0]
-                      │
-                      ▼
-              ┌───────────────┐
-              │  Addr Latch   │◄── write any time during epoch
-              │  (pending)    │
-              └──────┬────────┘
+             reg_idx[7:0]   is_transpose
+                  │               │
+                  ▼               ▼
+          ┌───────────────────────────┐
+          │  Addr+Mode Latch          │◄── written any time during epoch
+          │  (pending)                │
+          └──────────┬────────────────┘
                      │ epoch boundary: pending → active
                      ▼
-              ┌───────────────┐     ┌──────────────────────────────┐
-              │  Addr Active  │────▶│  Bank-Group Mux (calendar)   │
-              │  (current)    │     │  cy[2:0] selects G(p+cy)%8   │
-              └───────────────┘     └──────────┬───────────────────┘
-                                               │ 8 banks × 64 B
-                                               ▼
-                                       ┌──────────────┐
-                                       │  512 B / cy   │──▶ data out
-                                       │  read data    │
-                                       └──────────────┘
+          ┌───────────────────────────┐
+          │  Addr+Mode Active         │
+          │  (current epoch)          │
+          └──────────┬────────────────┘
+                     │ tile_idx, is_transpose, phase p, cy[2:0]
+                     ▼
+            ┌──────────────────────────────────────────┐
+            │  Bank-Select Calendar (§4)               │
+            │    is_transpose = 0 : row-mode           │
+            │         → 8 banks of G_{(p+cy) mod 8}    │
+            │    is_transpose = 1 : col-mode           │
+            │         → bank_i = 8·i + (p+cy+i) mod 8  │
+            │           for i ∈ {0..7}                 │
+            └──────────┬───────────────────────────────┘
+                       │ 8 bank reads × 64 B
+                       ▼
+            ┌──────────────────────────────────────────┐
+            │  Output Rotator / Lane Permute           │
+            │    row-mode: rotate-left by g = (p+cy)%8 │
+            │              (restore logical l-order)   │
+            │    col-mode: identity on group axis      │
+            │              (lane i = chunk (i, l_act)) │
+            └──────────┬───────────────────────────────┘
+                       │ 512 B / cy
+                       ▼
+                  ┌───────────┐
+                  │ data out  │──▶ VEC
+                  └───────────┘
 
   Timing:
-    Cycle c     : client writes reg_idx → Addr Latch (pending)
+    Cycle c     : client writes {reg_idx, is_transpose} → pending latch
     Cycle c'    : next epoch boundary (cy[2:0]=0) → pending promotes to active
-    Cycle c'..c'+7 : active addr drives 8 bank-group reads (OF0–OF7)
-    Cycle c'..c'+7 : client may write next reg_idx → Addr Latch (next pending)
+    Cycle c'..c'+7 : active tile_idx + mode drive 8 consecutive bank reads
+    Cycle c'..c'+7 : client may write next {reg_idx, is_transpose} → new pending
     Cycle c'+8  : next epoch boundary → new pending promotes to active
 ```
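
The pending/active handoff in the timing list above can be modeled in a few lines. This is a minimal behavioral sketch, not the RTL: the class name `ReadPortLatch`, the "last write wins" behavior, and the choice to let the port go idle (`None`) when nothing is pending at a boundary are all assumptions made for illustration.

```python
class ReadPortLatch:
    """Hypothetical model of one read port's double-registered {reg_idx, is_transpose}."""

    def __init__(self):
        self.pending = None  # written by the client at any cycle of the epoch
        self.active = None   # drives all 8 cycles of the current epoch

    def client_write(self, reg_idx, is_transpose):
        # Assumption: a later write in the same epoch overwrites an earlier one.
        self.pending = (reg_idx, is_transpose)

    def tick(self, cy):
        # cy is the global 3-bit epoch counter; promotion happens only at cy == 0.
        if cy == 0:
            self.active = self.pending
            self.pending = None
        return self.active


port = ReadPortLatch()
trace = []
for cycle in range(24):
    trace.append(port.tick(cycle % 8))
    if cycle == 3:
        port.client_write(0x2A, False)  # row-mode read of tile 0x2A
    if cycle == 10:
        port.client_write(0x2B, True)   # col-mode read of tile 0x2B
```

In this trace the tile written at cycle 3 becomes active at cycle 8 and the one written at cycle 10 becomes active at cycle 16, so the two tiles are read back to back with no idle cycle between them.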
 
-Write port Wp is identical except data flows inward (client → bank-group mux → SRAM write).
+**Write port Wp** is identical except data flows inward and there is no `is_transpose` input. The write-side lane permute is a fixed **rotate-right by g = (p+cy) mod 8** so that the 8 lanes of the chunk-grid row (logical `l = 0..7`) land at physical `bank_local = (l + g) mod 8` in group G_g:
+
+```
+                 reg_idx[7:0]              w_en
+                      │                     │
+                      ▼                     │
+              ┌───────────────┐             │
+              │  Addr Latch   │◄── client   │
+              │  (pending)    │             │
+              └──────┬────────┘             │
+                     │ epoch boundary       │
+                     ▼                      ▼
+              ┌───────────────┐     ┌────────────────────────────┐
+              │  Addr Active  │────▶│  Bank-Group Sel (cy)       │
+              │  (current)    │     │  → group G_{(p+cy) mod 8}  │
+              └───────────────┘     └──────────┬─────────────────┘
+                                               │ target group g
+                                               ▼
+                                      ┌────────────────────────┐
+           512 B in                   │  Write Lane Rotator    │
+         (8 lanes, l=0..7) ──────────▶│  rotate-right by g     │
+                                      │  l_phys = (l + g) % 8  │
+                                      └──────────┬─────────────┘
+                                                 │
+                                                 ▼  8 bank writes × 64 B
+                                         [SRAM write]
+```
 
 #### 4. 8-Cycle Synchronized Calendar
 
-All 16 ports share a global 3-bit **epoch counter** (`cy[2:0]`). Read and write ports follow the **same** rotation pattern — port *p* (phase offset *p*) accesses group `(p + cy) % 8`:
+All 16 ports share a global 3-bit **epoch counter** (`cy[2:0]`). Read and write ports follow the **same** base rotation pattern — port *p* (phase offset *p*) is associated with group `(p + cy) % 8` every cycle. Within an epoch a read port then applies one of two bank-select patterns depending on its latched `is_transpose` bit (writes always use the row-mode pattern).
+
+##### 4.1 Row-mode calendar (`is_transpose = 0`, and all writes)
+
+Port *p* at cycle `cy` accesses all 8 banks of group `G_{(p + cy) % 8}`:
 
 | Cycle | Phase 0 (R0/W0) | Phase 1 (R1/W1) | Phase 2 (R2/W2) | Phase 3 (R3/W3) | Phase 4 (R4/W4) | Phase 5 (R5/W5) | Phase 6 (R6/W6) | Phase 7 (R7/W7) |
 |-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
@@ -164,7 +251,29 @@ All 16 ports share a global 3-bit **epoch counter** (`cy[2:0]`). Read and write
 | 6 | **G6** | **G7** | **G0** | **G1** | **G2** | **G3** | **G4** | **G5** |
 | 7 | **G7** | **G0** | **G1** | **G2** | **G3** | **G4** | **G5** | **G6** |
 
-Over 8 cycles each port visits all 8 groups exactly once → reads/writes one complete 4 KB tile.
+Over 8 cycles each port visits all 8 groups exactly once → reads/writes one complete 4 KB tile in chunk-grid row order.
+
+##### 4.2 Col-mode calendar (`is_transpose = 1`, reads only)
+
+Port *p* at cycle `cy` delivers **chunk-grid column** `l_active = (p + cy) % 8`. The 8 banks accessed are *one per group*, with
+
+```
+  for each i ∈ {0..7}:
+      bank_i = 8·i + ((p + cy + i) mod 8)       (group G_i, local (l_active + i) mod 8)
+```
+
+i.e. every cycle the port fetches a **wrapped-diagonal set of 8 banks**, one from each group. Equivalently, the per-group local-bank table (entries = `bank_local`) for each (phase, cycle) is:
+
+| Cycle | Phase 0 (R0) | Phase 1 (R1) | Phase 2 (R2) | Phase 3 (R3) | Phase 4 (R4) | Phase 5 (R5) | Phase 6 (R6) | Phase 7 (R7) |
+|-------|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
+| 0 | col 0 → locals (0,1,2,3,4,5,6,7) | col 1 → (1,2,3,4,5,6,7,0) | col 2 → (2,3,4,5,6,7,0,1) | col 3 → (3,4,5,6,7,0,1,2) | col 4 → (4,5,6,7,0,1,2,3) | col 5 → (5,6,7,0,1,2,3,4) | col 6 → (6,7,0,1,2,3,4,5) | col 7 → (7,0,1,2,3,4,5,6) |
+| 1 | col 1 | col 2 | col 3 | col 4 | col 5 | col 6 | col 7 | col 0 |
+| … | … | … | … | … | … | … | … | … |
+| 7 | col 7 | col 0 | col 1 | col 2 | col 3 | col 4 | col 5 | col 6 |
+
+(tuples list `bank_local` for groups `G_0 .. G_7` in order; the cycle-1..7 rows follow the same wrapped-diagonal pattern with `col = (p + cy) mod 8`).
+
+Over 8 cycles each col-mode port visits all 8 columns of the chunk grid exactly once → reads one complete 4 KB tile in chunk-grid column order (i.e. the transpose of the chunk grid).
 
 **Epoch chaining (pipelined address):** The epoch counter is free-running and global. A port's active address drives all 8 cycles of the current epoch. At the next `cy[2:0]=0` boundary, the pending address (latched at any point during the previous epoch) automatically promotes to active. This produces **zero-bubble back-to-back tile accesses** — the port never idles between consecutive tiles:
 
@@ -176,7 +285,7 @@ Over 8 cycles each port visits all 8 groups exactly once → reads/writes one co
                            T1 promotes                T2 promotes
 ```
 
-**Conflict-free proof:** At every cycle, the 8 read ports cover {G0..G7} and the 8 write ports independently cover {G0..G7}. Each group sees exactly 1R + 1W. The reader and writer assigned to the same group are always the **same-phase** pair (R0/W0, R1/W1, ..., R7/W7).
+**Conflict-free proof (row-mode, `is_transpose = 0` for all reads):** At every cycle, the 8 read ports cover {G0..G7} and the 8 write ports independently cover {G0..G7}. Each group sees exactly 1R + 1W. The reader and writer assigned to the same group are always the **same-phase** pair (R0/W0, R1/W1, ..., R7/W7).
 
 ```
   Cy 0: R = G0(R0) G1(R1) G2(R2) G3(R3) G4(R4) G5(R5) G6(R6) G7(R7)
@@ -190,18 +299,38 @@ Over 8 cycles each port visits all 8 groups exactly once → reads/writes one co
   Per bank: ≤ 1R + 1W per cycle.  Two-port SRAM satisfied.  ✓
 ```
 
+**Conflict-free proof (col-mode, `is_transpose = 1` for all reads):** At cycle `cy`, read port R_p accesses bank `(G_i, (p + cy + i) mod 8)` for each `i ∈ {0..7}`. At a given group `G_i` and cycle `cy`, as `p` ranges over `{0..7}`, the local index `(p + cy + i) mod 8` is a **bijection** onto `{0..7}`. Hence the 8 read ports together cover every bank of every group exactly once — 64 reads, all distinct. Writes (always row-mode) still cover {G0..G7} once each; each bank sees ≤ 1R + ≤ 1W per cycle.
+
+```
+  Cy 0 col-mode reads (bank_id of each port = 8·i + local):
+      R0 → 0, 9,18,27,36,45,54,63       (col 0: locals 0,1,…,7)
+      R1 → 1,10,19,28,37,46,55,56       (col 1: locals 1,2,…,7,0)
+      R2 → 2,11,20,29,38,47,48,57       (col 2: locals 2,3,…,7,0,1)
+      R3 → 3,12,21,30,39,40,49,58       (col 3)
+      R4 → 4,13,22,31,32,41,50,59       (col 4)
+      R5 → 5,14,23,24,33,42,51,60       (col 5)
+      R6 → 6,15,16,25,34,43,52,61       (col 6)
+      R7 → 7, 8,17,26,35,44,53,62       (col 7)
+
+  Each physical bank 0..63 appears in exactly one port list ⇒ no conflict. ✓
+  Writes (all row-mode) land on the same 8 groups with same-phase bypass. ✓
+```
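
The same bijection argument can be checked exhaustively for all 8 cycles rather than only `cy = 0`. The sketch below uses the bank-select formulas from §4.1 and §4.2; the function names are illustrative and not part of the design.

```python
def row_mode_banks(p, cy):
    """All 8 banks of group G_{(p+cy) mod 8}."""
    g = (p + cy) % 8
    return [8 * g + local for local in range(8)]


def col_mode_banks(p, cy):
    """One bank per group: bank_i = 8*i + (p+cy+i) mod 8."""
    return [8 * i + (p + cy + i) % 8 for i in range(8)]


def is_conflict_free(bank_fn):
    """True if, at every cycle, the 8 ports touch each of the 64 banks exactly once."""
    for cy in range(8):
        banks = [b for p in range(8) for b in bank_fn(p, cy)]
        if sorted(banks) != list(range(64)):
            return False
    return True


row_ok = is_conflict_free(row_mode_banks)
col_ok = is_conflict_free(col_mode_banks)
```

Both checks come out true, and `col_mode_banks(1, 0)` reproduces the R1 row of the listing above.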
+
+**Mixed-mode across read ports is forbidden**; see scheduling rule (R2) in §6 and the collision sketch that accompanies it.
+
 #### 5. Throughput
 
 | Metric | Value |
 |--------|-------|
-| Per port data BW | 8 banks × 64 B = **512 B/cy** |
-| Per port per epoch (8 cy) | 8 groups × 512 B = **4 KB** (1 tile) |
-| Addr acceptance rate | **1 `reg_idx` / port / 8 cycles** (epoch-aligned) |
-| Addr-to-data latency | 0–7 cy (depends on when within epoch the pending addr is written) |
+| Per port data BW | 8 banks × 64 B = **512 B/cy** (row-mode and col-mode) |
+| Per port per epoch (8 cy) | 8 chunk-grid rows **or** 8 chunk-grid cols × 512 B = **4 KB** (1 tile) |
+| Addr acceptance rate | **1 `reg_idx` (+ `is_transpose` on reads) / port / 8 cycles** (epoch-aligned) |
+| Addr-to-data latency | 0–7 cy (depends on when within epoch the pending addr/mode is written) |
 | Sustained tile rate | 1 tile / 8 cy / port (zero-bubble epoch chaining) |
-| Aggregate read BW | 8 ports × 512 B/cy = **4 KB/cy** |
+| Aggregate read BW | 8 ports × 512 B/cy = **4 KB/cy** (either all row-mode or all col-mode, §6) |
 | Aggregate write BW | 8 ports × 512 B/cy = **4 KB/cy** |
 | Total per epoch | **16 tile ops** (8R + 8W), zero bank conflicts |
+| Transpose cost | **0 cycles** — a col-mode read delivers the chunk-grid transpose at full 512 B/cy without any extra latency, storage, or copy |
 
 #### 6. Write-to-Read Bypass & Scheduling Constraint
 
@@ -228,6 +357,223 @@ Example — R0 (phase 0) and W1 (phase 1) on the same tile:
   G7     │  cycle 7   │  cycle 6    │  Write 1 cy before read → SRAM OK ✓
 ```
 
-**Scheduling rule (enforced by upstream scheduler):**
+**Scheduling rules (enforced by upstream scheduler):**
+
+> **(R1)** Within the same 8-cycle epoch, no two different-phase read/write ports shall operate on the same `tile_idx`. Same-phase pairs (R0/W0, R1/W1, ..., R7/W7) are always safe and fully bypassed. Cross-phase pairs on the same tile must be separated by at least one full epoch (8 cycles).
+>
+> **(R2) Uniform transpose mode per epoch.** Within the same 8-cycle epoch, **all 8 active read ports must share the same `is_transpose` value**. Row-mode and col-mode reads cannot coexist in the same epoch because a row-mode reader occupies all 8 banks of its group G_a while every col-mode reader simultaneously wants exactly one bank inside G_a — the two patterns collide on the 1R SRAM port of that bank.
+>
+> Because `is_transpose` is double-registered on a per-port basis, each port *can* switch between row and col across successive epochs, but the scheduler must ensure the 8 active reads of any given epoch agree. Writes are always row-mode and impose no new constraint.
+
+**Why mixed-mode is disallowed (sketch).** At cycle `cy`, a row-mode port R_p occupies *all* 8 banks of group `G_{(p+cy) mod 8}`. A col-mode port R_q at the same cycle needs the bank at `(G_i, local = (q + cy + i) mod 8)` for every `i`, including `i = (p + cy) mod 8`, which collides with the group R_p has fully claimed. Since each SRAM bank has only 1R port, the collision is unresolvable by rotation or reorder. The uniform-mode rule sidesteps this cleanly. For row-mode + row-mode or col-mode + col-mode, the bijection arguments in the §4 proofs guarantee zero overlap.
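
The collision can be made concrete by intersecting the two bank sets. This is an illustrative check under the formulas of §4.1 and §4.2; `mixed_mode_collisions` is a hypothetical helper name.

```python
def row_mode_banks(p, cy):
    g = (p + cy) % 8
    return [8 * g + local for local in range(8)]


def col_mode_banks(q, cy):
    return [8 * i + (q + cy + i) % 8 for i in range(8)]


def mixed_mode_collisions(p_row, q_col):
    """Per cycle, the banks wanted by both a row-mode port and a col-mode port."""
    return {
        cy: sorted(set(row_mode_banks(p_row, cy)) & set(col_mode_banks(q_col, cy)))
        for cy in range(8)
    }


collisions = mixed_mode_collisions(p_row=0, q_col=1)
```

For R0 in row-mode and R1 in col-mode, every cycle of the epoch has exactly one doubly requested bank (bank 1 at `cy = 0`), which a 1R bank cannot serve.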
+
+#### 7. Transposed Read — Diagonal Skew, Datapath, and Semantics
+
+This section consolidates the **transposed-read enhancement**: how the bank-skew of §2 together with the `is_transpose` bit on the read port (§3) turns the TRegFile into a bank-conflict-free *row-or-column* tile fetcher, at a small fixed datapath cost and without any extra SRAM storage, redundancy, or latency.
+
+##### 7.1 Motivation
+
+Many tile operations — matrix transpose, GEMM lhs/rhs reshape, strided reductions across "columns", butterfly shuffles, etc. — need to consume the **columns** of an R × C tile at the same bandwidth they consume its rows. A naive rectangular bank decode (`bank = chunk_offset[5:0]`) is ideal for row-major sweeps but forces all 8 chunks of a logical column into the *same bank group*, causing a 1× → 1/8× bandwidth collapse plus a hard bank conflict.
+
+The diagonal skew solves this with one small hardware change (a per-group 3-bit rotator on writes, a matching inverse rotator plus a bank-select mux on reads) and **zero cost** in throughput, latency, storage, or port count.
+
+##### 7.2 Recap of the skewed bank map
+
+From §2:
+
+```
+  For chunk (g, l) of tile T, with g = chunk_offset[5:3], l = chunk_offset[2:0]:
+      bank_id    = 8·g + ((l + g) mod 8)
+      SRAM_addr  = tile_idx[7:0]
+```
+
+The mapping is:
+
+- **Injective within a tile** — the 64 chunks of one tile occupy 64 distinct banks (one per physical bank).
+- **Row-closed** — all 8 chunks of chunk-grid row `g` lie in the same group `G_g`, just rotated in local order.
+- **Column-spread** — the 8 chunks of chunk-grid column `l` lie one-per-group, along a wrapped diagonal (`bank_local = (l + g) mod 8` for `g = 0..7`).
+
+The row-closed and column-spread properties together give the **row × col conflict-free** guarantee.
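
The three properties listed above follow directly from the bank-map formula and can be verified by enumeration. A minimal sketch, with `bank_id` as an illustrative helper name:

```python
def bank_id(g, l):
    """Skewed bank map from §2: bank_id = 8*g + ((l + g) mod 8)."""
    return 8 * g + (l + g) % 8


# Injective within a tile: 64 chunks land on 64 distinct banks.
injective = len({bank_id(g, l) for g in range(8) for l in range(8)}) == 64

# Row-closed: every chunk of chunk-grid row g stays inside group G_g.
row_closed = all(bank_id(g, l) // 8 == g for g in range(8) for l in range(8))

# Column-spread: the 8 chunks of column l fall into 8 different groups,
# at local index (l + g) mod 8 (a wrapped diagonal).
column_spread = all(
    [bank_id(g, l) % 8 for g in range(8)] == [(l + g) % 8 for g in range(8)]
    and {bank_id(g, l) // 8 for g in range(8)} == set(range(8))
    for l in range(8)
)
```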
+
+##### 7.3 Write datapath (always row-oriented)
+
+The producer presents each 512 B strip as 8 logical lanes (`l = 0..7`) of chunk-grid row `g` (chosen by the calendar from §4.1). A fixed **3-bit write-lane rotator** steered by `g` places logical lane `l` at physical `bank_local = (l + g) mod 8`:
+
+```
+    logical lanes  l=0 1 2 3 4 5 6 7   (512 B in)
+                       │ │ │ │ │ │ │ │
+                       ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
+                  ╔════════════════════╗
+                  ║  Rotate-right by g ║  g = (p + cy) mod 8  (from calendar)
+                  ╚════════════════════╝
+                       │ │ │ │ │ │ │ │
+                       ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
+    physical       bank_local = 0 1 2 3 4 5 6 7  of group G_g
+    banks          (SRAM_addr = tile_idx)
+```
+
+The rotator is 8-way, 64 B wide, with a 3-bit select — one instance per write port (8 total). No extra storage, no per-bank decode.
+
+##### 7.4 Read datapath (row-mode *or* col-mode)
+
+Read ports implement both traversal orders over a **shared bank-select mux** steered by `{phase p, cy[2:0], is_transpose}`. An 8-to-8 lane permute (8 lanes in, 8 lanes out) on the output side restores logical order.
+
+```
+                 ┌──────── row-mode ────────┐    ┌──────── col-mode ────────┐
+                 │  fetch 8 banks of G_g    │    │  fetch 1 bank per group  │
+    Bank sel:    │   g = (p + cy) mod 8     │    │   bank_i = 8·i           │
+                 │   bank_i = 8·g + i       │    │          + (p+cy+i) mod 8│
+                 │   (i = local 0..7)       │    │   (i = group 0..7)       │
+                 └──────────────────────────┘    └──────────────────────────┘
+                                │                            │
+                                └─────────┬──────────────────┘
+                                          ▼
+                           ┌──────────────────────────────┐
+                           │   Output Lane Permute        │
+                           │   row-mode: rotate-left  g   │
+                           │   col-mode: identity         │
+                           └──────────────┬───────────────┘
+                                          ▼
+                                   512 B to VEC
+```
+
+- **Row-mode output meaning.** Output lane `l` carries logical chunk `(g_active, l)`; strips arrive in chunk-grid-row order over the 8 cycles of the epoch (bytes `0..4095` of the tile in linear address order, modulo the phase offset).
+- **Col-mode output meaning.** Output lane `i` carries logical chunk `(i, l_active)` with `l_active = (p + cy) mod 8`; strips arrive in chunk-grid-*column* order — i.e., **the transpose of the 8 × 8 chunk grid**, still delivered at full 512 B/cy.
+
+The row-mode output rotator is the inverse of the write-side rotator (rotate-left by `g` vs. rotate-right by `g`); col-mode needs no lane permute because the fetched 8-tuple is already `chunk (0, l), chunk (1, l), …, chunk (7, l)` in group order.
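
The write and read datapaths of §7.3 and §7.4 can be exercised end to end on a toy tile whose "data" is just the chunk coordinate `(g, l)`. This is a behavioral sketch under the stated formulas; the function names and the dictionary used as bank storage are assumptions for illustration only.

```python
def rotate_right(lanes, g):
    """Write side: logical lane l lands at physical index (l + g) mod 8."""
    out = [None] * 8
    for l, value in enumerate(lanes):
        out[(l + g) % 8] = value
    return out


def rotate_left(lanes, g):
    """Read side (row-mode): inverse of rotate_right."""
    return [lanes[(l + g) % 8] for l in range(8)]


def write_tile(p=0):
    """Write one tile through port phase p; returns {bank_id: chunk (g, l)}."""
    banks = {}
    for cy in range(8):
        g = (p + cy) % 8
        physical = rotate_right([(g, l) for l in range(8)], g)
        for local, chunk in enumerate(physical):
            banks[8 * g + local] = chunk
    return banks


def read_row(banks, p, cy):
    g = (p + cy) % 8
    return rotate_left([banks[8 * g + local] for local in range(8)], g)


def read_col(banks, p, cy):
    # No lane permute: group order is already chunk (0, l) .. (7, l).
    return [banks[8 * i + (p + cy + i) % 8] for i in range(8)]


banks = write_tile()
```

With this model, a row-mode read at `cy = 1` returns chunk-grid row 1 in logical lane order, and a col-mode read at `cy = 1` returns chunk-grid column 1 in group order.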
+
+##### 7.5 Semantics: what "transpose" means at the TRegFile level
+
+The skew provides a **chunk-granular transpose** of the tile, not a byte- or element-granular one. Specifically:
+
+> Col-mode reads deliver the 4 KB tile as the **transpose of its 8 × 8 chunk-grid partition**. The 64 B inside each chunk are *not* transposed — they come out of the SRAM in their stored byte order.
+
+The TRegFile deliberately does **not** attempt an element-level transpose; finer permutation belongs to the VEC front-end, which already has a 512 B Align/Unpack/Permute stage.
+
+**Strip-fill invariant.** Both row-mode and col-mode always deliver the full **512 B** per cycle. Col-mode has a clean physical description (derived directly from the bank map):
+
+> **Col-mode strip `s` = the eight 64 B windows at tile byte positions `{ s·64 + k·512 : k = 0..7 }`.**
+
+Equivalently, col-mode picks *the same 64 B sub-chunk (offset `s·64`) from each of the 8 row-mode strips*, and delivers them concatenated as one 512 B strip.
+
+**Full-tile coverage invariant (independent of shape).** Over an 8-cycle epoch `s = 0..7`, col-mode covers chunk-id sets
+`{ 8·g + s : g = 0..7 }` which are disjoint for distinct `s` and whose union is exactly `{0, 1, …, 63}`. The map `(s, g) ↦ 8·g + s` is a **bijection** onto the 64 chunks of the tile; therefore *every* col-mode read delivers the **complete 4096 B of the tile in 8 cycles — no duplicate, no omission — for every one of the 23 legal `(E, R, C)` shapes**. The ✅ markings below only classify whether the delivery is semantically a clean whole-row transpose view (✅) or an interleaved row-segment view that VEC's Align/Unpack/Permute front-end must recombine (no ✅); the *data completeness* is identical in both cases.
+
+With the coverage and strip-fill formulas, for any legal `(R, C, E)` with row width `W = C·E` bytes, the set of tile rows touched by col-mode strip `s` is
+
+```
+  touched-rows(s) = { ( s·64 + k·512 ) div W :  k = 0..7 }     (first row of each 64 B window; a multiset if W > 512)
+  byte-offset-in-row(s, k) = ( s·64 + k·512 ) mod W
+```
+
+and the 512 B payload is exactly `512 / W` rows' worth of data (whole rows when `W ≤ 64`, row-segments otherwise). The following regime table and the shape-by-shape enumeration are direct corollaries.
+
+| `W` range | Row-mode strip contents | Col-mode strip contents |
+|-----------|-------------------------|-------------------------|
+| `W ≤ 64` (row fits in 1 chunk, `64/W` rows per chunk) | `512/W` contiguous rows | `512/W` **whole** rows, arranged as `8` groups of `64/W` consecutive rows, with group `k` starting at base row `k·(R/8) + s·(64/W)` |
+| `W = 128, 256, 512` (row = 2, 4, 8 chunks) | `512/W` contiguous rows | 8 × 64 B *row-segments* picked from 8 distinct rows, stride `R/8`; each segment covers one of the `W/64` sub-chunks of its row, cycled by `s` |
+| `W = 1024, 2048, 4096` (one row spans `W/512` row-mode strips) | `1 / (W/512)` row (a half / quarter / eighth) | 8 × 64 B row-segments drawn from the `R ≤ 4` rows of the tile, at stride-512 byte positions inside each row |
+
+**Per-shape coverage across all 23 legal `(E, R, C)` geometries** of [`vector4k.md`](vector4k.md) §9.7 (FP16 and BF16 share shapes):
+
+| `R × C` (FP32, `E=4`) | `W=C·E` | Row-mode rows / strip | Col-mode strip `s` delivers | `R × C` (FP16 / BF16, `E=2`) | `W` | Row-mode rows / strip | Col-mode strip `s` delivers |
+|-----------------------|---------|-----------------------|-----------------------------|------------------------------|-----|-----------------------|-----------------------------|
+| 1 × 1024 | 4096 B | `1/8` row | 8 × 64 B of the sole row, stride 512 B (= 128 FP32 stride-128) | 1 × 2048 | 4096 B | `1/8` row | 8 × 64 B of the sole row, stride 512 B (= 256 FP16 stride-256) |
+| 2 × 512  | 2048 B | `1/4` row | 4 × 64 B of row 0 + 4 × 64 B of row 1, stride 512 B inside each row | 2 × 1024 | 2048 B | `1/4` row | same, scaled by `E` |
+| 4 × 256  | 1024 B | `1/2` row | 2 × 64 B of each of 4 rows, stride 512 B inside each row | 4 × 512  | 1024 B | `1/2` row | same, scaled by `E` |
+| **8 × 128**  | **512 B** | **1 whole row** | **1 × 64 B of each of 8 rows** ⇒ **clean 8 × 8 chunk-grid transpose = 8 rows × 16 FP32 col-band** ✅ | **8 × 256**  | **512 B** | **1 whole row** | **8 rows × 32 FP16 col-band** ✅ |
+| 16 × 64  | 256 B  | 2 rows | 8 rows (stride 2: rows {0,2,…,14} for s ∈ {0..3}; rows {1,3,…,15} for s ∈ {4..7}), each at one of 4 col-bands of 16 FP32 | 16 × 128 | 256 B  | 2 rows | same pattern over 16 rows × 128 FP16 |
+| 32 × 32  | 128 B  | 4 rows | 8 rows (stride 4), each at one of 2 half-row col-bands of 16 FP32 | 32 × 64  | 128 B  | 4 rows | same over 32 rows × 64 FP16 |
+| **64 × 16** | **64 B**  | **8 whole rows** | **8 complete rows (stride 8 over 64 rows)** ✅ | **64 × 32** | **64 B**  | **8 whole rows** | **8 complete rows (stride 8 over 64 rows)** ✅ |
+| 128 × 8  | 32 B   | 16 whole rows | 8 × (2 consecutive rows) with stride 16 over 128 rows | 128 × 16 | 32 B   | 16 whole rows | 8 × (2 consecutive rows), stride 16 over 128 rows |
+| 256 × 4  | 16 B   | 32 whole rows | 8 × (4 consecutive rows) with stride 32 over 256 rows | 256 × 8  | 16 B   | 32 whole rows | 8 × (4 consecutive rows), stride 32 over 256 rows |
+| 512 × 2  | 8 B    | 64 whole rows | 8 × (8 consecutive rows) with stride 64 over 512 rows | 512 × 4  | 8 B    | 64 whole rows | 8 × (8 consecutive rows), stride 64 over 512 rows |
+| 1024 × 1 | 4 B    | 128 whole rows | 8 × (16 consecutive rows) with stride 128 over 1024 rows | 1024 × 2 | 4 B    | 128 whole rows | 8 × (16 consecutive rows), stride 128 over 1024 rows |
+| — | | | | 2048 × 1 | 2 B    | 256 whole rows | 8 × (32 consecutive rows), stride 256 over 2048 rows |
+
+(✅ marks the two regimes in which col-mode produces a clean *whole-row* delivery: `W = 512 B` → every strip covers all eight rows, one 64 B col-band each, laid out as an 8 × 8 col-band transpose, and `W ≤ 64 B` → `512/W` full rows chosen as eight stride-`R/8` row-blocks. In every other regime col-mode still fills the 512 B port, but with row-segments instead of whole rows; VEC's Align/Unpack/Permute recombines them.)
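
The per-shape entries in the table can be reproduced from the two formulas `touched-rows(s)` and `byte-offset-in-row(s, k)`. The sketch below is an illustration only; `col_mode_windows` is a hypothetical helper, and it reports the row containing the first byte of each 64 B window (for `W < 64` each window goes on to span `64/W` consecutive rows).

```python
def col_mode_windows(s, W):
    """(row, byte offset in row) of the first byte of each 64 B window of col-mode strip s."""
    starts = [s * 64 + k * 512 for k in range(8)]
    return [(b // W, b % W) for b in starts]


# 16 x 64 FP32 tile: W = 64 elements * 4 B = 256 B per row.
rows_s0 = [row for row, _ in col_mode_windows(0, 256)]
rows_s4 = [row for row, _ in col_mode_windows(4, 256)]

# Full-tile coverage: the 64 window start addresses over one epoch.
all_starts = sorted(s * 64 + k * 512 for s in range(8) for k in range(8))
```

For the 16 × 64 FP32 shape this gives the even rows for `s = 0` and the odd rows for `s = 4`, matching the stride-2 pattern in the table, and the 64 window start addresses tile the 4096 B exactly once.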
+
+**Universal bank-conflict-freeness (proof).** The per-cycle physical bank set accessed by the 8 read ports is
+
+```
+  row-mode (∀ p ∈ {0..7}):  { 8·g + ((l+g) mod 8) : l = 0..7 }  with g = (p+cy) mod 8
+  col-mode (∀ p ∈ {0..7}):  { 8·i + ((p+cy+i) mod 8) : i = 0..7 }
+```
+
+Neither expression depends on `R`, `C`, `E`, `tile_idx`, or any tile content; both are functions only of `(p, cy)`. The two §4 bijection arguments therefore hold unchanged for **every** one of the 23 legal `(E, R, C)` shapes. Writes (always row-mode) are likewise shape-agnostic. The **"uniform transpose mode per epoch"** rule (§6 R2) is the only scheduling constraint that transposed reads add on top of the existing rule R1; no shape introduces any extra hazard.
+
+Hence:
+
+> **Transposed read is bank-conflict-free for all 23 legal `(E, R, C)` geometries and at every valid `W = C·E` from 2 B to 4096 B.** The 512 B read port is fully filled on every cycle; for `W ≤ 64` it is filled with `512/W` whole rows (stride-`R/8` gather), for `W ∈ (64, 512]` with `512/W` rows' worth of 64 B row-segments, and for `W > 512` with the same fractional-row budget as row-mode.
+
+VEC's existing Align/Unpack/Permute front-end (`vector4k.md` §4) is the sole agent that lifts these strip-level deliveries to the element-level transpose required by `TCOL*` / `TTRANS` class instructions — the TRegFile guarantees only that the raw 4 KB of data arrives, bank-conflict-free, in the chunk-grid-transposed order.
+
+##### 7.6 Datapath cost
+
+| Block | Before | After | Δ |
+|-------|--------|-------|---|
+| Bank decode | pure wiring | 3-bit rotator controlled by `g` | 8 × 64 B 8-way rotators per port (8 read + 8 write) |
+| Bank-select mux | 1 option (row-mode) | 2 options (row or col) steered by `is_transpose` | small 2-to-1 mux per port, plus col-mode address generator (`bank_i = 8·i + (p+cy+i) mod 8`) |
+| Output lane permute (read) | none | rotate-left by `g` (row-mode) / identity (col-mode) | 8 × 64 B 8-way rotators per read port |
+| Storage | — | **no change** | 0 extra SRAM, 0 extra rows |
+| Ports | 8R + 8W | **no change** | 0 extra ports |
+| Latency | read latency of underlying SRAM | **no change** | rotator + mux are combinational and fit inside the existing pipeline stage |
+| `is_transpose` registers | — | 1 bit of pending + 1 bit of active per read port | 16 FFs total |
+
+Net area overhead is dominated by the 24 × 8-way 64 B-wide rotators (8 write + 8 read input-side + 8 read output-side), which at a 64 B (512-bit) granularity are standard building blocks at negligible cost compared to the 64 × (256 × 512 b) SRAM macros.
+
+##### 7.7 Worked example — chunks of one tile
+
+Assume `tile_idx = 0x2A`, producer writes the 4 KB tile with the natural row-major linear address ordering (chunk 0 at bytes 0..63, chunk 63 at bytes 4032..4095). After the write-side rotator, physical SRAM contents at SRAM row `0x2A` are:
+
+```
+  bank  0  (G0,l=0)  ← chunk (0,0)      bank  8  (G1,l=0)  ← chunk (1,7)
+  bank  1  (G0,l=1)  ← chunk (0,1)      bank  9  (G1,l=1)  ← chunk (1,0)
+  bank  2  (G0,l=2)  ← chunk (0,2)      bank 10  (G1,l=2)  ← chunk (1,1)
+  …                                     …
+  bank  7  (G0,l=7)  ← chunk (0,7)      bank 15  (G1,l=7)  ← chunk (1,6)
+
+  bank 16  (G2,l=0)  ← chunk (2,6)      bank 24  (G3,l=0)  ← chunk (3,5)
+  bank 17  (G2,l=1)  ← chunk (2,7)      bank 25  (G3,l=1)  ← chunk (3,6)
+  …                                     …
+```
+
+A subsequent **row-mode** read on port R0 (phase 0) delivers, cycle-by-cycle:
+
+```
+  cy=0: chunk (0,0) (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7)   ← group G0, rotator = 0
+  cy=1: chunk (1,0) (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7)   ← group G1, rotator = 1
+  …
+  cy=7: chunk (7,0) (7,1) (7,2) (7,3) (7,4) (7,5) (7,6) (7,7)   ← group G7, rotator = 7
+```
+
+i.e. the tile in natural row-major order (bytes 0..4095).
+
+A **col-mode** read on port R0 delivers:
+
+```
+  cy=0: chunk (0,0) (1,0) (2,0) (3,0) (4,0) (5,0) (6,0) (7,0)   ← col 0
+  cy=1: chunk (0,1) (1,1) (2,1) (3,1) (4,1) (5,1) (6,1) (7,1)   ← col 1
+  …
+  cy=7: chunk (0,7) (1,7) (2,7) (3,7) (4,7) (5,7) (6,7) (7,7)   ← col 7
+```
+
+i.e. the **transpose of the 8 × 8 chunk grid**, still 4 KB in 8 cycles and still conflict-free; each cycle's physical access is a single wrapped-diagonal bank set, as tabulated in §4.2.
+
+##### 7.8 Summary of the enhancement
+
+1. **Storage layout** is a diagonal/diamond skew (`bank = 8·g + (l + g) mod 8`); the SRAM address and bank count are unchanged.
+2. **Read ports** gain an `is_transpose` input (double-registered with `reg_idx`); a port can pick row or col delivery order at each epoch boundary.
+3. **Calendar** is unchanged in its port-to-phase rotation; per-cycle bank-select is either *whole group* (row-mode) or *one bank per group* (col-mode).
+4. **Write ports** are unchanged externally; a fixed 3-bit rotator places written lanes at the skewed physical bank positions.
+5. **Throughput, latency, storage, SRAM port count**: all **unchanged**.
+6. **Scheduling** gains one rule: all 8 active reads of any epoch share the same `is_transpose`. This pairs naturally with the existing "1 address per port per 8 cycles" cadence.
+7. **Gain**: true bank-conflict-free transpose of the 8 × 8 chunk partition of any tile, delivered at full 512 B/cy — no spare cycles, no duplicate storage, no side buffer.
+
+#### 8. Revision History
+
+| Version | Notes |
+|---------|-------|
+| 0.1 | Initial design: rectangular bank decode, row-mode only |
+| **0.2** | **Diagonal skew + `is_transpose` read input: bank-conflict-free row-*or*-col tile delivery at full 512 B/cy. Added §7 detailing storage map, read/write datapath, scheduling rule (R2), cost, and worked examples.** |
 
-> Within the same 8-cycle epoch, no two different-phase read/write ports shall operate on the same `tile_idx`. Same-phase pairs (R0/W0, R1/W1, ..., R7/W7) are always safe and fully bypassed. Cross-phase pairs on the same tile must be separated by at least one full epoch (8 cycles).
diff --git a/designs/outerCube/tregfile4k_v2.md b/designs/outerCube/tregfile4k_v2.md
new file mode 100644
index 00000000..654d084d
--- /dev/null
+++ b/designs/outerCube/tregfile4k_v2.md
@@ -0,0 +1,618 @@
+### Tile Register File (TRegFile-4K-v2)
+
+> **Version & Versioning Convention.** This document is the canonical **v2** specification of the TRegFile-4K register file used by the Davinci-v2 core. It supersedes the v1 baseline `tregfile4k.md` (revision 0.1). Each section below carries a marker indicating which content is **v1 baseline (unchanged)**, which is a **v2 addition** (diagonal skew + `is_transpose`), and which sections **mix v1 baseline with v2 additions inline**. v1 software and v1 datapath connectivity are preserved unchanged: `tregfile4k_v2.md` is a superset of `tregfile4k.md` (rev. 0.1) and structurally equivalent to `tregfile4k.md` (rev. 0.2, the working file), with explicit version annotations added so that a reader of v2 alone obtains the complete and current specification.
+>
+> **Capability summary (v1 → v2):**
+> - **v1 (rev. 0.1):** rectangular bank decode (`bank = 8·g + l`), row-mode reads only, 8R+8W ports, 8-cycle synchronized calendar, 4 KB tiles × 256 phys-tiles = 1 MB.
+> - **v2 (rev. 0.2):** **diagonal (skewed) bank decode** (`bank = 8·g + (l+g) mod 8`) + per-port **`is_transpose` bit** on reads (double-registered with `reg_idx`); col-mode reads at full **512 B/cy** with **zero extra storage, latency, or port count**. Same SRAM count, same calendar, same address acceptance cadence. One new scheduling rule (§6 R2: uniform transpose mode per epoch).
+
+The TRegFile-4K is an **8-read / 8-write tile register file** built from **64 physical 1R1W SRAM banks** at **1× core clock**. Storage is organized into **4 KB tiles**, each striped across all 64 banks (64 B per bank) using a **diagonal (skewed) bank map** so that *both* a row-wise sweep and a column-wise sweep of the tile's logical 8×8 chunk grid are bank-conflict-free. Each read port carries an **`is_transpose` bit** that selects between the two delivery orders at run time (§3, §4.2, §7). An **8-cycle synchronized calendar** rotates port-to-group assignments so that every bank sees exactly **1R + 1W per cycle**. Each port accepts one `reg_idx` (plus `is_transpose` on reads) which is latched and drives the next **8-cycle epoch**; a new address is accepted every **8 cycles** (one per epoch boundary), enabling zero-bubble back-to-back tile accesses.
+
+#### 1. Core Parameters
+
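+> **(v1 → v2: all numeric parameters in the table are inherited from the v1 baseline and are unchanged. The `is_transpose` field in the Read-ports row is a v2 addition that does not exist in v1; see §3 / §7.)**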
+
+| Parameter | Value | Status |
+|-----------|-------|--------|
+| SRAM instance | **256 × 512 bits** (64 B wide, depth 256, 1R1W) | v1 baseline |
+| Banks | **64** (1 SRAM per bank), 8 groups × 8 banks | v1 baseline |
+| Total size | 64 × 16 KB = **1 MB** | v1 baseline |
+| Tile size / count | **4 KB** (4096 B) / **256** tiles (tile\_idx 0..255) | v1 baseline |
+| Read ports | **8** (R0–R7), 512 B/cy each, **+ `is_transpose` (v2)** | v1 baseline + v2 addition |
+| Write ports | **8** (W0–W7), 512 B/cy each | v1 baseline (always row-mode) |
+| Calendar | **8 cycles**, synchronized; 1 new `reg_idx` / port / 8 cycles (epoch-aligned) | v1 baseline |
+| Bank decode | **diagonal skew** `bank_id = 8·g + ((l+g) mod 8)` | **v2 addition** (v1 used `bank_id = 8·g + l`) |
+| Col-mode read throughput | **512 B/cy** (same as row-mode) | **v2 addition** (no analog in v1) |
+
+#### 2. Tile Layout & Physical Organization
+
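+> **(v1 → v2: this entire section is rewritten for v2.)** v1 (rev. 0.1) used the *rectangular* decode `bank_id = 8·g + l`, which maps the 64 chunks one-to-one onto the 8 banks of each group but supports row-mode only. v2 (rev. 0.2) uses the *diagonal skew* decode described in this section, `bank_id = 8·g + (l+g) mod 8`, which makes both row sweeps and column sweeps bank-conflict-free. The SRAM count, bank count, tile count, and tile size are identical to v1; only the chunk-to-bank mapping function differs. The physical layout diagram below (1 MB / 64 banks / 8 groups) is the same as in v1.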
+
+Each 4 KB tile is striped across all 64 banks via a **diagonal (skewed) bank map**. Viewing the 4 KB tile as an **8 × 8 chunk grid** of 64 B chunks, let:
+
+- `g = chunk_offset[5:3]` — **group row** of the chunk grid (0..7).
+- `l = chunk_offset[2:0]` — **local col** of the chunk grid within row g (0..7).
+
+The bank select now includes a **3-bit rotator** (not pure wiring); the SRAM address is unchanged:
+
+```
+  Skewed bank map (replaces the naive rectangular decode):
+    g          = chunk_offset[5:3]         ← group index (chunk-grid row)
+    l          = chunk_offset[2:0]         ← local index (chunk-grid col)
+    bank_group = g                          ← pure wiring
+    bank_local = (l + g) mod 8              ← 3-bit rotator controlled by g
+    bank_id    = 8·g + bank_local           ← 0..63
+    SRAM_addr  = tile_idx[7:0]              ← unchanged (256 rows / bank)
+
+  Bank groups (8 banks each, unchanged):
+    G0 = banks  0– 7    (chunk-grid row 0)
+    G1 = banks  8–15    (chunk-grid row 1)
+    G2 = banks 16–23    (chunk-grid row 2)
+    G3 = banks 24–31    (chunk-grid row 3)
+    G4 = banks 32–39    (chunk-grid row 4)
+    G5 = banks 40–47    (chunk-grid row 5)
+    G6 = banks 48–55    (chunk-grid row 6)
+    G7 = banks 56–63    (chunk-grid row 7)
+
+  1 bank  → 64 B   (one chunk)
+  1 group → 512 B  (8 banks; one cycle per port in row-mode)
+  8 groups → 4 KB  (full tile, 8 cycles)
+```
+
+**Diagonal placement of a single tile.** Under this skew, the 64 chunks of tile T still occupy all 64 banks, but the 8 chunks of each chunk-grid column land on **one distinct bank in each group, along a wrapped diagonal**, rather than in a rectangular column-aligned pattern:
+
+```
+  (cells list the LOGICAL chunk (g, l) of tile T at each physical bank slot)
+
+                 bank_local = 0      1      2      3      4      5      6      7
+    G0 (g=0):              (0,0)  (0,1)  (0,2)  (0,3)  (0,4)  (0,5)  (0,6)  (0,7)
+    G1 (g=1):              (1,7)  (1,0)  (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
+    G2 (g=2):              (2,6)  (2,7)  (2,0)  (2,1)  (2,2)  (2,3)  (2,4)  (2,5)
+    G3 (g=3):              (3,5)  (3,6)  (3,7)  (3,0)  (3,1)  (3,2)  (3,3)  (3,4)
+    G4 (g=4):              (4,4)  (4,5)  (4,6)  (4,7)  (4,0)  (4,1)  (4,2)  (4,3)
+    G5 (g=5):              (5,3)  (5,4)  (5,5)  (5,6)  (5,7)  (5,0)  (5,1)  (5,2)
+    G6 (g=6):              (6,2)  (6,3)  (6,4)  (6,5)  (6,6)  (6,7)  (6,0)  (6,1)
+    G7 (g=7):              (7,1)  (7,2)  (7,3)  (7,4)  (7,5)  (7,6)  (7,7)  (7,0)
+
+  Logical row g = {(g, 0), …, (g, 7)}  — entirely inside group G_g
+                                         (local order rotated by g).
+  Logical col l = {(0, l), (1, l), …, (7, l)} — one bank per group,
+                                                along a wrapped diagonal
+                                                (physical locals = (l+0, l+1, …, l+7) mod 8).
+```
+
+Every logical row (chunk-grid row) therefore touches the **8 distinct banks of a single group**, and every logical column (chunk-grid col) touches **exactly one bank in each group**. Sweeping either all 8 rows or all 8 columns covers all 64 banks with no bank visited twice. This is the essential property that enables the transposed-read capability in §7.
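+
+The placement properties above can be checked by brute force. The following is a minimal Python sketch that assumes only the skewed decode from this section; the helper name `bank_id` and the derived lists are illustrative and are not taken from any RTL.
+
+```python
+def bank_id(g, l):
+    """Skewed decode from this section: bank_id = 8*g + ((l + g) mod 8)."""
+    return 8 * g + ((l + g) % 8)
+
+
+# All 64 chunks of one tile should land on 64 distinct banks.
+all_banks = {bank_id(g, l) for g in range(8) for l in range(8)}
+
+# Groups touched by each logical row g (group index = bank_id // 8).
+row_groups = [{bank_id(g, l) // 8 for l in range(8)} for g in range(8)]
+
+# Groups and local bank indices touched by each logical column l.
+col_groups = [sorted(bank_id(g, l) // 8 for g in range(8)) for l in range(8)]
+col_locals = [[bank_id(g, l) % 8 for g in range(8)] for l in range(8)]
+
+print(len(all_banks))   # number of distinct banks used by one tile
+print(row_groups[5])    # groups touched by logical row 5
+print(col_locals[3])    # per-group local index for logical column 3
+```
+
+Each entry of `col_locals` is the wrapped diagonal listed in the placement table, read down one logical column.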
+
+```
+ ┌────────────────────────────────────────────────────────────────────────────────────────┐
+ │  TRegFile-4K:  256 tiles × 4 KB = 1 MB                                               │
+ │                                                                                       │
+ │  ┌──────────────────────────┐  ┌──────────────────────────┐                           │
+ │  │  Group G0 (banks 0–7)    │  │  Group G1 (banks 8–15)   │                           │
+ │  │  ┌────┐┌────┐ ... ┌────┐ │  │  ┌────┐┌────┐ ... ┌────┐ │                           │
+ │  │  │Bk0 ││Bk1 │     │Bk7 │ │  │  │Bk8 ││Bk9 │     │Bk15│ │                           │
+ │  │  │64B ││64B │     │64B │ │  │  │64B ││64B │     │64B │ │                           │
+ │  │  │×256││×256│     │×256│ │  │  │×256││×256│     │×256│ │                           │
+ │  │  └────┘└────┘     └────┘ │  │  └────┘└────┘     └────┘ │                           │
+ │  └──────────────────────────┘  └──────────────────────────┘                           │
+ │  ┌──────────────────────────┐  ┌──────────────────────────┐                           │
+ │  │  Group G2 (banks 16–23)  │  │  Group G3 (banks 24–31)  │                           │
+ │  │  ┌────┐┌────┐ ... ┌────┐ │  │  ┌────┐┌────┐ ... ┌────┐ │                           │
+ │  │  │Bk16││Bk17│     │Bk23│ │  │  │Bk24││Bk25│     │Bk31│ │                           │
+ │  │  │64B ││64B │     │64B │ │  │  │64B ││64B │     │64B │ │                           │
+ │  │  │×256││×256│     │×256│ │  │  │×256││×256│     │×256│ │                           │
+ │  │  └────┘└────┘     └────┘ │  │  └────┘└────┘     └────┘ │                           │
+ │  └──────────────────────────┘  └──────────────────────────┘                           │
+ │  ┌──────────────────────────┐  ┌──────────────────────────┐                           │
+ │  │  Group G4 (banks 32–39)  │  │  Group G5 (banks 40–47)  │                           │
+ │  │  ┌────┐┌────┐ ... ┌────┐ │  │  ┌────┐┌────┐ ... ┌────┐ │                           │
+ │  │  │Bk32││Bk33│     │Bk39│ │  │  │Bk40││Bk41│     │Bk47│ │                           │
+ │  │  │64B ││64B │     │64B │ │  │  │64B ││64B │     │64B │ │                           │
+ │  │  │×256││×256│     │×256│ │  │  │×256││×256│     │×256│ │                           │
+ │  │  └────┘└────┘     └────┘ │  │  └────┘└────┘     └────┘ │                           │
+ │  └──────────────────────────┘  └──────────────────────────┘                           │
+ │  ┌──────────────────────────┐  ┌──────────────────────────┐                           │
+ │  │  Group G6 (banks 48–55)  │  │  Group G7 (banks 56–63)  │                           │
+ │  │  ┌────┐┌────┐ ... ┌────┐ │  │  ┌────┐┌────┐ ... ┌────┐ │                           │
+ │  │  │Bk48││Bk49│     │Bk55│ │  │  │Bk56││Bk57│     │Bk63│ │                           │
+ │  │  │64B ││64B │     │64B │ │  │  │64B ││64B │     │64B │ │                           │
+ │  │  │×256││×256│     │×256│ │  │  │×256││×256│     │×256│ │                           │
+ │  │  └────┘└────┘     └────┘ │  │  └────┘└────┘     └────┘ │                           │
+ │  └──────────────────────────┘  └──────────────────────────┘                           │
+ │                                                                                       │
+ │  ════════════════════════════════════════════════════════════════════════════════════   │
+ │    Rotating group mux: each port gets 1 group per cycle                               │
+ │  ════════════════════════════════════════════════════════════════════════════════════   │
+ │  ▼(8bk) ▼(8bk) ▼(8bk) ▼(8bk) ▼(8bk) ▼(8bk) ▼(8bk) ▼(8bk)                         │
+ │  R0     R1     R2     R3     R4     R5     R6     R7                                  │
+ │  512B   512B   512B   512B   512B   512B   512B   512B                                │
+ │                                                                                       │
+ │  ▲(8bk) ▲(8bk) ▲(8bk) ▲(8bk) ▲(8bk) ▲(8bk) ▲(8bk) ▲(8bk)                         │
+ │  W0     W1     W2     W3     W4     W5     W6     W7                                  │
+ │  512B   512B   512B   512B   512B   512B   512B   512B                                │
+ └────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+#### 3. Port Interface
+
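+> **(v1 → v2: port count, data width, and address acceptance rate all remain at the v1 baseline. The v2 additions are confined to two places: the `is_transpose` input and the col-mode data delivery path.)** A v1 read port accepts only `reg_idx[7:0]` and supports row-mode only; v2 adds a 1-bit `is_transpose` that is double-registered together with `reg_idx` and latched with it at the epoch boundary. In the port microarchitecture diagram (under "Port microarchitecture" below), the *Bank-Select Calendar* and the *Output Rotator* are v2 additions, but the upstream interface (the pending/active double register) and the downstream interface (`data out → VEC`) remain fully compatible with v1.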
+
+Each port presents **512 B per cycle** (8 banks × 64 B; in row-mode these are the 8 banks of one group, in col-mode one bank per group, §4.2). A port accepts one `reg_idx[7:0]` (read ports additionally accept **`is_transpose[0]`**), which is **latched** internally at the epoch boundary. The latched address then drives data delivery (read) or acceptance (write) for the addressed tile over the next **8 consecutive cycles**: one chunk-grid row (one bank group) per cycle in row-mode, or one chunk-grid column per cycle in col-mode, following the calendar rotation. Since a 4 KB tile requires 8 × 512 B transfers, the port is occupied for the full epoch and can accept a **new `reg_idx` only every 8 cycles**.
+
+- **`is_transpose = 0` (ROW mode)** — default. Cycle-by-cycle the port delivers the chunk-grid rows in order, i.e. a linearly-addressed 512 B strip per cycle. Physical banks visited per cycle = **all 8 banks of one group** (same access pattern as a naive rectangular layout); the 3-bit rotator reorders the 8 lanes so that logical chunk `(g, l)` appears at output lane `l`.
+- **`is_transpose = 1` (COL mode)** — delivers the chunk-grid columns in order. Physical banks visited per cycle = **one bank per group** (wrapped diagonal selection). The output lane `i` carries logical chunk `(i, l_active)` where `l_active = (p + cy) mod 8` is the column being scanned this cycle.
+
+Both modes deliver the full 4 KB tile in exactly **8 cycles**; the only difference is the traversal order of the 8 × 8 chunk grid. **Write ports are always row-oriented** — the producer presents 8 chunks of one chunk-grid row per cycle and the write-side rotator places them at the skewed physical banks.
+
+**Epoch-aligned address acceptance:** The port contains a **pending** address register and an **active** address register. A client can write a new `reg_idx` into the pending register at any time during the current epoch. At the next epoch boundary (`cy[2:0]=0`), pending promotes to active and the port begins serving the new tile with **zero bubble**:
+
+```
+  Port R0 (phase 0) back-to-back tile reads, zero gap (for Rp the group order is rotated by p):
+
+  Cycle:  0    1    2    3    4    5    6    7    8    9   10  ...  15   16  ...
+  Addr:  [T0 latched at boundary]              [T1 latched at boundary]  [T2 ...]
+  Data:  T0   T0   T0   T0   T0   T0   T0   T0   T1   T1   T1  ...  T1   T2  ...
+         .G0  .G1  .G2  .G3  .G4  .G5  .G6  .G7  .G0  .G1  .G2      .G7  .G0
+         └──── epoch 0 (tile T0) ────┘  └──── epoch 1 (tile T1) ────┘  └── ...
+                                     ↑ zero bubble: T1 starts immediately
+```
+
+- T0 address is written to pending before epoch 0; it promotes to active at the boundary.
+- T1 address can be written to pending at any point during epoch 0; it takes effect at cycle 8.
+- **One new tile address per port every 8 cycles** — the port is fully occupied delivering 512 B/cy × 8 cy = 4 KB for the current tile.
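+
+The pending/active handoff described above can be sketched as a small behavioural model. This is an illustrative Python sketch, not the RTL: the class name `PortAddrRegs` and its methods are hypothetical, the epoch boundary is taken to be `cycle % 8 == 0`, and the choice to keep the old active address when nothing is pending is an assumption that this section does not specify.
+
+```python
+class PortAddrRegs:
+    """Hypothetical model of one port's pending/active address pair."""
+
+    def __init__(self):
+        self.pending = None  # next tile index, written by the client
+        self.active = None   # tile index driving the current epoch
+
+    def write_pending(self, reg_idx):
+        """Client writes a new address at any time during an epoch."""
+        self.pending = reg_idx
+
+    def tick(self, cycle):
+        """Promote pending to active on an epoch boundary; return active."""
+        if cycle % 8 == 0 and self.pending is not None:
+            self.active, self.pending = self.pending, None
+        return self.active
+
+
+port = PortAddrRegs()
+port.write_pending(0)          # T0 written before epoch 0
+trace = []
+for cycle in range(24):
+    trace.append(port.tick(cycle))
+    if cycle == 3:
+        port.write_pending(1)  # T1 written part-way through epoch 0
+    if cycle == 12:
+        port.write_pending(2)  # T2 written part-way through epoch 1
+
+print(trace)
+```
+
+In this model the tile index changes only at cycles 0, 8, and 16, regardless of when the client wrote the pending register within the preceding epoch.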
+
+| Ports | Direction | Data Width | Address | Addr Rate |
+|-------|-----------|------------|---------|-----------|
+| **R0–R7** | Read | 512 B (4096 bits) / cy | `reg_idx[7:0]` + **`is_transpose[0]`** | 1 addr / 8 cy |
+| **W0–W7** | Write | 512 B (4096 bits) / cy | `reg_idx[7:0]` + `w_en` | 1 addr / 8 cy |
+
+**Per-port sustained throughput:** 1 tile (4 KB) every 8 cycles = 512 B/cy (row-mode and col-mode both).
+
+**Address registers:** 1 pending + 1 active (double-register for zero-bubble epoch chaining). On read ports `is_transpose` is double-registered together with `reg_idx` and is held constant for the full epoch.
+
+**Port microarchitecture (read port Rp):**
+
+```
+             reg_idx[7:0]   is_transpose
+                  │               │
+                  ▼               ▼
+          ┌───────────────────────────┐
+          │  Addr+Mode Latch          │◄── written any time during epoch
+          │  (pending)                │
+          └──────────┬────────────────┘
+                     │ epoch boundary: pending → active
+                     ▼
+          ┌───────────────────────────┐
+          │  Addr+Mode Active         │
+          │  (current epoch)          │
+          └──────────┬────────────────┘
+                     │ tile_idx, is_transpose, phase p, cy[2:0]
+                     ▼
+            ┌──────────────────────────────────────────┐
+            │  Bank-Select Calendar (§4)               │
+            │    is_transpose = 0 : row-mode           │
+            │         → 8 banks of G_{(p+cy) mod 8}    │
+            │    is_transpose = 1 : col-mode           │
+            │         → bank_i = 8·i + (p+cy+i) mod 8  │
+            │           for i ∈ {0..7}                 │
+            └──────────┬───────────────────────────────┘
+                       │ 8 bank reads × 64 B
+                       ▼
+            ┌──────────────────────────────────────────┐
+            │  Output Rotator / Lane Permute           │
+            │    row-mode: rotate-left by g = (p+cy)%8 │
+            │              (restore logical l-order)   │
+            │    col-mode: identity on group axis      │
+            │              (lane i = chunk (i, l_act)) │
+            └──────────┬───────────────────────────────┘
+                       │ 512 B / cy
+                       ▼
+                  ┌───────────┐
+                  │ data out  │──▶ VEC
+                  └───────────┘
+
+  Timing:
+    Cycle c     : client writes {reg_idx, is_transpose} → pending latch
+    Cycle c'    : next epoch boundary (cy[2:0]=0) → pending promotes to active
+    Cycle c'..c'+7 : active tile_idx + mode drive 8 consecutive bank reads
+    Cycle c'..c'+7 : client may write next {reg_idx, is_transpose} → new pending
+    Cycle c'+8  : next epoch boundary → new pending promotes to active
+```
+
+**Write port Wp** is identical except data flows inward and there is no `is_transpose` input. The write-side lane permute is a fixed **rotate-right by g = (p+cy) mod 8** so that the 8 lanes of the chunk-grid row (logical `l = 0..7`) land at physical `bank_local = (l + g) mod 8` in group G_g:
+
+```
+                 reg_idx[7:0]              w_en
+                      │                     │
+                      ▼                     │
+              ┌───────────────┐             │
+              │  Addr Latch   │◄── client   │
+              │  (pending)    │             │
+              └──────┬────────┘             │
+                     │ epoch boundary       │
+                     ▼                      ▼
+              ┌───────────────┐     ┌────────────────────────────┐
+              │  Addr Active  │────▶│  Bank-Group Sel (cy)       │
+              │  (current)    │     │  → group G_{(p+cy) mod 8}  │
+              └───────────────┘     └──────────┬─────────────────┘
+                                               │ target group g
+                                               ▼
+                                      ┌────────────────────────┐
+           512 B in                   │  Write Lane Rotator    │
+         (8 lanes, l=0..7) ──────────▶│  rotate-right by g     │
+                                      │  l_phys = (l + g) % 8  │
+                                      └──────────┬─────────────┘
+                                                 │
+                                                 ▼  8 bank writes × 64 B
+                                         [SRAM write]
+```
+
+#### 4. 8-Cycle Synchronized Calendar
+
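+> **(v1 → v2: the global epoch counter and the port-phase rotation table (the §4.1 row-mode table) are inherited in full from v1. The §4.2 col-mode calendar is a v2 addition. Of the conflict-free proofs, the row-mode part (§4.1) is inherited from v1; the col-mode bijection argument (§4.2) is a v2 addition.)**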
+
+All 16 ports share a global 3-bit **epoch counter** (`cy[2:0]`). Read and write ports follow the **same** base rotation pattern — port *p* (phase offset *p*) is associated with group `(p + cy) % 8` every cycle. Within an epoch a read port then applies one of two bank-select patterns depending on its latched `is_transpose` bit (writes always use the row-mode pattern).
+
+##### 4.1 Row-mode calendar (`is_transpose = 0`, and all writes) **(v1 baseline, unchanged)**
+
+Port *p* at cycle `cy` accesses all 8 banks of group `G_{(p + cy) % 8}`:
+
+| Cycle | Phase 0 (R0/W0) | Phase 1 (R1/W1) | Phase 2 (R2/W2) | Phase 3 (R3/W3) | Phase 4 (R4/W4) | Phase 5 (R5/W5) | Phase 6 (R6/W6) | Phase 7 (R7/W7) |
+|-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
+| 0 | **G0** | **G1** | **G2** | **G3** | **G4** | **G5** | **G6** | **G7** |
+| 1 | **G1** | **G2** | **G3** | **G4** | **G5** | **G6** | **G7** | **G0** |
+| 2 | **G2** | **G3** | **G4** | **G5** | **G6** | **G7** | **G0** | **G1** |
+| 3 | **G3** | **G4** | **G5** | **G6** | **G7** | **G0** | **G1** | **G2** |
+| 4 | **G4** | **G5** | **G6** | **G7** | **G0** | **G1** | **G2** | **G3** |
+| 5 | **G5** | **G6** | **G7** | **G0** | **G1** | **G2** | **G3** | **G4** |
+| 6 | **G6** | **G7** | **G0** | **G1** | **G2** | **G3** | **G4** | **G5** |
+| 7 | **G7** | **G0** | **G1** | **G2** | **G3** | **G4** | **G5** | **G6** |
+
+Over 8 cycles each port visits all 8 groups exactly once → reads/writes one complete 4 KB tile in chunk-grid row order.
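+
+The rotation table above follows directly from `(p + cy) % 8`. A minimal Python sketch that regenerates it and checks both the per-cycle and per-port coverage; the names `row_group` and `calendar` are illustrative only.
+
+```python
+def row_group(p, cy):
+    """Group served by the phase-p port at epoch cycle cy in row-mode."""
+    return (p + cy) % 8
+
+
+# calendar[cy][p] mirrors the table: rows are cycles, columns are phases.
+calendar = [[row_group(p, cy) for p in range(8)] for cy in range(8)]
+
+# In every cycle the 8 phases cover all 8 groups, so no group is shared.
+per_cycle_ok = all(sorted(row) == list(range(8)) for row in calendar)
+
+# Over one epoch every port visits each group exactly once.
+per_port_ok = all(
+    sorted(calendar[cy][p] for cy in range(8)) == list(range(8))
+    for p in range(8)
+)
+
+for cy, row in enumerate(calendar):
+    print(cy, " ".join(f"G{g}" for g in row))
+```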
+
+##### 4.2 Col-mode calendar (`is_transpose = 1`, reads only) **(v2 addition, no counterpart in v1)**
+
+Port *p* at cycle `cy` delivers **chunk-grid column** `l_active = (p + cy) % 8`. The 8 banks accessed are *one per group*, with
+
+```
+  for each i ∈ {0..7}:
+      bank_i = 8·i + ((p + cy + i) mod 8)       (group G_i, local (l_active + i) mod 8)
+```
+
+i.e. every cycle the port fetches a **wrapped-diagonal set of 8 banks**, one from each group. Equivalently, the per-group local-bank table (entries = `bank_local`) for each (phase, cycle) is:
+
+| Cycle | Phase 0 (R0) | Phase 1 (R1) | Phase 2 (R2) | Phase 3 (R3) | Phase 4 (R4) | Phase 5 (R5) | Phase 6 (R6) | Phase 7 (R7) |
+|-------|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
+| 0 | col 0 → locals (0,1,2,3,4,5,6,7) | col 1 → (1,2,3,4,5,6,7,0) | col 2 → (2,3,4,5,6,7,0,1) | col 3 → (3,4,5,6,7,0,1,2) | col 4 → (4,5,6,7,0,1,2,3) | col 5 → (5,6,7,0,1,2,3,4) | col 6 → (6,7,0,1,2,3,4,5) | col 7 → (7,0,1,2,3,4,5,6) |
+| 1 | col 1 | col 2 | col 3 | col 4 | col 5 | col 6 | col 7 | col 0 |
+| … | … | … | … | … | … | … | … | … |
+| 7 | col 7 | col 0 | col 1 | col 2 | col 3 | col 4 | col 5 | col 6 |
+
+(tuples list `bank_local` for groups `G_0 .. G_7` in order; the cycle-1..7 rows follow the same wrapped-diagonal pattern with `col = (p + cy) mod 8`).
+
+Over 8 cycles each col-mode port visits all 8 columns of the chunk grid exactly once → reads one complete 4 KB tile in chunk-grid column order (i.e. the transpose of the chunk grid).
+
+**Epoch chaining (pipelined address):** The epoch counter is free-running and global. A port's active address drives all 8 cycles of the current epoch. At the next `cy[2:0]=0` boundary, the pending address (latched at any point during the previous epoch) automatically promotes to active. This produces **zero-bubble back-to-back tile accesses** — the port never idles between consecutive tiles:
+
+```
+  cy[2:0]: 0  1  2  3  4  5  6  7  0  1  2  3  4  5  6  7  0  1 ...
+  Active:  ──── tile T0 ─────────  ──── tile T1 ─────────  ── T2 ...
+  Pending:       [T1 latched]            [T2 latched]
+                                  ↑                        ↑
+                           T1 promotes                T2 promotes
+```
+
+**Conflict-free proof (row-mode, `is_transpose = 0` for all reads):** At every cycle, the 8 read ports cover {G0..G7} and the 8 write ports independently cover {G0..G7}. Each group sees exactly 1R + 1W. The reader and writer assigned to the same group are always the **same-phase** pair (R0/W0, R1/W1, ..., R7/W7).
+
+```
+  Cy 0: R = G0(R0) G1(R1) G2(R2) G3(R3) G4(R4) G5(R5) G6(R6) G7(R7)
+         W = G0(W0) G1(W1) G2(W2) G3(W3) G4(W4) G5(W5) G6(W6) G7(W7)
+  Cy 1: R = G1(R0) G2(R1) G3(R2) G4(R3) G5(R4) G6(R5) G7(R6) G0(R7)
+         W = G1(W0) G2(W1) G3(W2) G4(W3) G5(W4) G6(W5) G7(W6) G0(W7)
+  ...
+  Cy 7: R = G7(R0) G0(R1) G1(R2) G2(R3) G3(R4) G4(R5) G5(R6) G6(R7)
+         W = G7(W0) G0(W1) G1(W2) G2(W3) G3(W4) G4(W5) G5(W6) G6(W7)
+
+  Per bank: ≤ 1R + 1W per cycle.  Two-port SRAM satisfied.  ✓
+```
+
+**Conflict-free proof (col-mode, `is_transpose = 1` for all reads):** At cycle `cy`, read port R_p accesses bank `(G_i, (p + cy + i) mod 8)` for each `i ∈ {0..7}`. At a given group `G_i` and cycle `cy`, as `p` ranges over `{0..7}`, the local index `(p + cy + i) mod 8` is a **bijection** onto `{0..7}`. Hence the 8 read ports together cover every bank of every group exactly once — 64 reads, all distinct. Writes (always row-mode) still cover {G0..G7} once each; each bank sees ≤ 1R + ≤ 1W per cycle.
+
+```
+  Cy 0 col-mode reads (bank_id of each port = 8·i + local):
+      R0 → 0, 9,18,27,36,45,54,63       (col 0: locals 0,1,…,7)
+      R1 → 1,10,19,28,37,46,55,56       (col 1: locals 1,2,…,7,0)
+      R2 → 2,11,20,29,38,47,48,57       (col 2: locals 2,3,…,7,0,1)
+      R3 → 3,12,21,30,39,40,49,58       (col 3)
+      R4 → 4,13,22,31,32,41,50,59       (col 4)
+      R5 → 5,14,23,24,33,42,51,60       (col 5)
+      R6 → 6,15,16,25,34,43,52,61       (col 6)
+      R7 → 7, 8,17,26,35,44,53,62       (col 7)
+
+  Each physical bank 0..63 appears in exactly one port list ⇒ no conflict. ✓
+  Writes (all row-mode) land on the same 8 groups with same-phase bypass. ✓
+```
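+
+The col-mode bijection argument can also be confirmed exhaustively for every cycle of the epoch. A minimal Python sketch, assuming all 8 read ports are in col-mode; `col_banks` and `reads_conflict_free` are illustrative helper names.
+
+```python
+def col_banks(p, cy):
+    """Banks read by col-mode port R_p at cycle cy: one bank per group G_i."""
+    return [8 * i + ((p + cy + i) % 8) for i in range(8)]
+
+
+def reads_conflict_free(cy):
+    """True if the 8 col-mode read ports touch 64 distinct banks at cycle cy."""
+    banks = [b for p in range(8) for b in col_banks(p, cy)]
+    return len(set(banks)) == 64
+
+
+all_cycles_ok = all(reads_conflict_free(cy) for cy in range(8))
+
+print(col_banks(1, 0))  # bank list of R1 at cycle 0
+print(all_cycles_ok)
+```
+
+The list printed for `col_banks(1, 0)` can be compared against the R1 row of the cycle-0 listing above.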
+
+**Mixed-mode across read ports is forbidden** — see §6 scheduling rule and §7 proof of the collision.
+
+#### 5. Throughput
+
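+> **(v1 → v2: every row other than "Transpose cost" is inherited from the v1 baseline; only the col-mode qualifiers on those rows are new. The "Transpose cost" row is a v2 addition: v1 had no col-mode, so a transpose had to be performed separately through an external MTE buffer in 16 cy. The qualifier "(either all row-mode or all col-mode, §6)" on the "Aggregate read BW" row is also a v2 addition; v1 was row-mode only and needed no such qualifier.)**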
+
+| Metric | Value | Status |
+|--------|-------|--------|
+| Per port data BW | 8 banks × 64 B = **512 B/cy** (row-mode and col-mode) | v1 baseline (col-mode qualifier is a v2 addition) |
+| Per port per epoch (8 cy) | 8 chunk-grid rows **or** 8 chunk-grid cols × 512 B = **4 KB** (1 tile) | v1 baseline (col-mode option is v2) |
+| Addr acceptance rate | **1 `reg_idx` (+ `is_transpose` on reads) / port / 8 cycles** (epoch-aligned) | v1 baseline (`is_transpose` field is v2) |
+| Addr-to-data latency | 0–7 cy (depends on when within epoch the pending addr/mode is written) | v1 baseline |
+| Sustained tile rate | 1 tile / 8 cy / port (zero-bubble epoch chaining) | v1 baseline |
+| Aggregate read BW | 8 ports × 512 B/cy = **4 KB/cy** (either all row-mode or all col-mode, §6) | v1 baseline (uniform-mode qualifier is v2) |
+| Aggregate write BW | 8 ports × 512 B/cy = **4 KB/cy** | v1 baseline |
+| Total per epoch | **16 tile ops** (8R + 8W), zero bank conflicts | v1 baseline |
+| Transpose cost | **0 cycles**: a col-mode read delivers the chunk-grid transpose at full 512 B/cy without any extra latency, storage, or copy | **v2 addition** (v1: the TRegFile had no transpose capability; a transpose had to go through a 4 KB MTE buffer and took 16 cy) |
+
+#### 6. Write-to-Read Bypass & Scheduling Constraint
+
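+> **(v1 → v2: the same-phase bypass (zero-latency hardware) and the cross-phase RAW hazard analysis (including the table below) are inherited in full from v1, as is scheduling rule (R1). Scheduling rule (R2), uniform transpose mode per epoch, is a v2 addition that does not exist in v1; v1 was row-mode only and never needed this constraint. The "Why mixed-mode is disallowed" paragraph is also a v2 addition.)**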
+
+**Same-phase bypass (hardware, zero-latency):**
+
+The calendar guarantees that each group's reader and writer in any given cycle are always a same-phase port pair. When a same-phase read and write target the same `tile_idx`, the read would otherwise return stale data, because the SRAM write takes 1 cycle to become visible. A combinational bypass mux therefore forwards the write data directly to the read output.
+
+**Cross-phase RAW hazard (not resolved in hardware):**
+
+For different-phase port pairs (e.g. R0 reading tile T while W1 writes tile T in the same epoch), the phase offset causes one or more groups per pair to be **read before they are written**. The write data does not exist at the time of the read, so no combinational bypass can resolve it.
+
+Example — R0 (phase 0) and W1 (phase 1) on the same tile:
+
+```
+  Group  │  R0 reads  │  W1 writes  │  Result
+  ───────┼────────────┼─────────────┼──────────────────────────
+  G0     │  cycle 0   │  cycle 7    │  Read 7 cy before write → STALE ✗
+  G1     │  cycle 1   │  cycle 0    │  Write 1 cy before read → SRAM OK ✓
+  G2     │  cycle 2   │  cycle 1    │  Write 1 cy before read → SRAM OK ✓
+  G3     │  cycle 3   │  cycle 2    │  Write 1 cy before read → SRAM OK ✓
+  G4     │  cycle 4   │  cycle 3    │  Write 1 cy before read → SRAM OK ✓
+  G5     │  cycle 5   │  cycle 4    │  Write 1 cy before read → SRAM OK ✓
+  G6     │  cycle 6   │  cycle 5    │  Write 1 cy before read → SRAM OK ✓
+  G7     │  cycle 7   │  cycle 6    │  Write 1 cy before read → SRAM OK ✓
+```
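+
+The same table can be derived for any read/write phase pair. The following Python sketch is a hedged illustration: `stale_groups` is a hypothetical helper, both ports are assumed to use the row-mode calendar within one epoch, and a write that lands in an earlier cycle is assumed to be visible to a later read, as the "SRAM OK" rows state.
+
+```python
+def stale_groups(read_phase, write_phase):
+    """Groups that the reader visits before the writer within one epoch."""
+    stale = []
+    for g in range(8):
+        read_cy = (g - read_phase) % 8    # cycle at which the reader is on G_g
+        write_cy = (g - write_phase) % 8  # cycle at which the writer is on G_g
+        if read_cy < write_cy:            # read happens first: data not yet written
+            stale.append(g)
+        # read_cy == write_cy is the same-phase case handled by the bypass mux
+    return stale
+
+
+print(stale_groups(0, 1))  # R0 with W1, the example in the table above
+print(stale_groups(1, 0))  # R1 with W0
+print(stale_groups(2, 2))  # same-phase pair
+```
+
+Under these assumptions every cross-phase pair has at least one stale group, which is why rule (R1) below separates such pairs by a full epoch.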
+
+**Scheduling rules (enforced by upstream scheduler):**
+
+> **(R1) (v1 baseline, unchanged)** Within the same 8-cycle epoch, a read port and a write port of different phases shall not operate on the same `tile_idx`. Same-phase pairs (R0/W0, R1/W1, ..., R7/W7) are always safe and fully bypassed. Cross-phase pairs on the same tile must be separated by at least one full epoch (8 cycles).
+>
+> **(R2) (v2 addition) Uniform transpose mode per epoch.** Within the same 8-cycle epoch, **all 8 active read ports must share the same `is_transpose` value**. Row-mode and col-mode reads cannot coexist in the same epoch because a row-mode reader occupies all 8 banks of its group G_a while every col-mode reader simultaneously wants exactly one bank inside G_a; the two patterns collide on the 1R SRAM port of that bank.
+>
+> Because `is_transpose` is double-registered on a per-port basis, each port *can* switch between row and col across successive epochs, but the scheduler must ensure the 8 active reads of any given epoch agree. Writes are always row-mode and impose no new constraint.
+
+**Why mixed-mode is disallowed (sketch).** At cycle `cy`, a row-mode port R_p occupies *all* 8 banks of group `G_{(p+cy) mod 8}`. A col-mode port R_q at the same cycle needs the bank at `(G_i, local = (q + cy + i) mod 8)` for every `i`, including `i = (p + cy) mod 8`, which collides with the group R_p has fully claimed. Since each SRAM bank has only 1R port, the collision is unresolvable by rotation or reorder. The uniform-mode rule sidesteps this cleanly. For row-mode + row-mode or col-mode + col-mode, the bijection arguments in the §4 proofs guarantee zero overlap.
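+
+The collision in this sketch can be enumerated directly. A minimal Python illustration, assuming the row-mode and col-mode bank-select formulas from §4; `row_banks`, `col_banks`, and `collisions` are hypothetical helper names.
+
+```python
+def row_banks(p, cy):
+    """Banks claimed by a row-mode port R_p: all 8 banks of one group."""
+    g = (p + cy) % 8
+    return {8 * g + i for i in range(8)}
+
+
+def col_banks(q, cy):
+    """Banks claimed by a col-mode port R_q: one bank in each group."""
+    return {8 * i + ((q + cy + i) % 8) for i in range(8)}
+
+
+def collisions(p, q, cy):
+    """Banks that a row-mode R_p and a col-mode R_q both need at cycle cy."""
+    return sorted(row_banks(p, cy) & col_banks(q, cy))
+
+
+print(collisions(0, 1, 0))  # row-mode R0 against col-mode R1 at cycle 0
+
+# Every mixed pair of distinct ports collides on exactly one bank, every cycle.
+always_one = all(
+    len(collisions(p, q, cy)) == 1
+    for p in range(8) for q in range(8) for cy in range(8) if p != q
+)
+print(always_one)
+```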
+
+#### 7. Transposed Read — Diagonal Skew, Datapath, and Semantics
+
+> **(v1 → v2: this entire section is a v2 addition with no counterpart in v1. It is the core new feature of TRegFile-4K v2: the complete specification of bank-conflict-free transposed reads.)**
+
+This section consolidates the **transposed-read enhancement**: how the bank-skew of §2 together with the `is_transpose` bit on the read port (§3) turns the TRegFile into a bank-conflict-free *row-or-column* tile fetcher, at a small fixed datapath cost and without any extra SRAM storage, redundancy, or latency.
+
+##### 7.1 Motivation
+
+Many tile operations — matrix transpose, GEMM lhs/rhs reshape, strided reductions across "columns", butterfly shuffles, etc. — need to consume the **columns** of an R × C tile at the same bandwidth they consume its rows. A naive rectangular bank decode (`bank = chunk_offset[5:0]`) is ideal for row-major sweeps but forces all 8 chunks of a logical column into the *same bank group*, causing a 1× → 1/8× bandwidth collapse plus a hard bank conflict.
+
+The diagonal skew solves this with one small hardware change (a per-group 3-bit rotator on writes, a matching inverse rotator plus a bank-select mux on reads) and **zero cost** in throughput, latency, storage, or port count.
+
+##### 7.2 Recap of the skewed bank map
+
+From §2:
+
+```
+  For chunk (g, l) of tile T, with g = chunk_offset[5:3], l = chunk_offset[2:0]:
+      bank_id    = 8·g + ((l + g) mod 8)
+      SRAM_addr  = tile_idx[7:0]
+```
+
+The mapping is:
+
+- **Injective within a tile** — the 64 chunks of one tile occupy 64 distinct banks (one per physical bank).
+- **Row-closed** — all 8 chunks of chunk-grid row `g` lie in the same group `G_g`, just rotated in local order.
+- **Column-spread** — the 8 chunks of chunk-grid column `l` lie one-per-group, along a wrapped diagonal (`bank_local = (l + g) mod 8` for `g = 0..7`).
+
+Together, the row-closed and column-spread properties give the **row × col conflict-free** guarantee.
+
+##### 7.3 Write datapath (always row-oriented)
+
+The producer presents each 512 B strip as 8 logical lanes (`l = 0..7`) of chunk-grid row `g` (chosen by the calendar from §4.1). A fixed **3-bit write-lane rotator** steered by `g` places logical lane `l` at physical `bank_local = (l + g) mod 8`:
+
+```
+    logical lanes  l=0 1 2 3 4 5 6 7   (512 B in)
+                       │ │ │ │ │ │ │ │
+                       ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
+                  ╔════════════════════╗
+                  ║  Rotate-right by g ║  g = (p + cy) mod 8  (from calendar)
+                  ╚════════════════════╝
+                       │ │ │ │ │ │ │ │
+                       ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
+    physical       bank_local = 0 1 2 3 4 5 6 7  of group G_g
+    banks          (SRAM_addr = tile_idx)
+```
+
+The rotator is 8-way, 64 B wide, with a 3-bit select — one instance per write port (8 total). No extra storage, no per-bank decode.
+
+##### 7.4 Read datapath (row-mode *or* col-mode)
+
+Read ports implement both traversal orders over a **shared bank-select mux** steered by `{phase p, cy[2:0], is_transpose}`. An 8-to-8 lane permute on the output side restores logical order.
+
+```
+                 ┌──────── row-mode ────────┐    ┌──────── col-mode ────────┐
+                 │  fetch 8 banks of G_g    │    │  fetch 1 bank per group  │
+    Bank sel:    │   g = (p + cy) mod 8     │    │   bank_i = 8·i           │
+                 │   bank_i = 8·g + i       │    │          + (p+cy+i) mod 8│
+                 │   (i = local 0..7)       │    │   (i = group 0..7)       │
+                 └──────────────────────────┘    └──────────────────────────┘
+                                │                            │
+                                └─────────┬──────────────────┘
+                                          ▼
+                           ┌──────────────────────────────┐
+                           │   Output Lane Permute        │
+                           │   row-mode: rotate-left  g   │
+                           │   col-mode: identity         │
+                           └──────────────┬───────────────┘
+                                          ▼
+                                   512 B to VEC
+```
+
+- **Row-mode output meaning.** Output lane `l` carries logical chunk `(g_active, l)`; strips arrive in chunk-grid-row order over the 8 cycles of the epoch (bytes `0..4095` of the tile in linear address order, modulo the phase offset).
+- **Col-mode output meaning.** Output lane `i` carries logical chunk `(i, l_active)` with `l_active = (p + cy) mod 8`; strips arrive in chunk-grid-*column* order — i.e., **the transpose of the 8 × 8 chunk grid**, still delivered at full 512 B/cy.
+
+The row-mode output rotator is the inverse of the write-side rotator (rotate-left by `g` vs. rotate-right by `g`); col-mode needs no lane permute because the fetched 8-tuple is already `chunk (0, l), chunk (1, l), …, chunk (7, l)` in group order.
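+
+The write and read datapaths of §7.3 and §7.4 can be exercised end to end in a few lines. This is a behavioural Python sketch under stated assumptions, not the hardware: one tile is modelled as an 8 × 8 grid of chunk labels, the 64 banks as a flat list, and the helper names `write_tile` and `read_strip` are hypothetical.
+
+```python
+def write_tile(tile, p=0):
+    """Write tile[g][l] into 64 banks, rotating each row strip right by g."""
+    banks = [None] * 64
+    for cy in range(8):
+        g = (p + cy) % 8                      # group chosen by the calendar
+        for l in range(8):
+            banks[8 * g + (l + g) % 8] = tile[g][l]
+    return banks
+
+
+def read_strip(banks, p, cy, is_transpose):
+    """Return the 8 output lanes of read port R_p at epoch cycle cy."""
+    if not is_transpose:
+        g = (p + cy) % 8
+        raw = [banks[8 * g + i] for i in range(8)]
+        return [raw[(l + g) % 8] for l in range(8)]   # rotate-left by g
+    # col-mode: one bank per group, identity lane permute
+    return [banks[8 * i + (p + cy + i) % 8] for i in range(8)]
+
+
+tile = [[(g, l) for l in range(8)] for g in range(8)]   # chunk labels (g, l)
+banks = write_tile(tile)
+row_strips = [read_strip(banks, 0, cy, False) for cy in range(8)]
+col_strips = [read_strip(banks, 0, cy, True) for cy in range(8)]
+
+print(row_strips[2])  # chunk-grid row 2 in logical lane order
+print(col_strips[2])  # chunk-grid column 2, one chunk per group
+```
+
+With phase 0 the row-mode strips reproduce `tile` and the col-mode strips reproduce its chunk-grid transpose; a nonzero phase only rotates the order in which strips arrive.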
+
+##### 7.5 Semantics: what "transpose" means at the TRegFile level
+
+The skew provides a **chunk-granular transpose** of the tile, not a byte- or element-granular one. Specifically:
+
+> Col-mode reads deliver the 4 KB tile as the **transpose of its 8 × 8 chunk-grid partition**. The 64 B inside each chunk are *not* transposed — they come out of the SRAM in their stored byte order.
+
+The TRegFile deliberately does **not** attempt an element-level transpose; finer permutation belongs to the VEC front-end, which already has a 512 B Align/Unpack/Permute stage.
+
+**Strip-fill invariant.** Both row-mode and col-mode always deliver the full **512 B** per cycle. Col-mode has a clean physical description (derived directly from the bank map):
+
+> **Col-mode strip `s` = the eight 64 B windows at tile byte positions `{ s·64 + k·512 : k = 0..7 }`.**
+
+Equivalently, col-mode picks *the same 64 B sub-chunk (offset `s·64`) from each of the 8 row-mode strips*, and delivers them concatenated as one 512 B strip.
+
+**Full-tile coverage invariant (independent of shape).** Over an 8-cycle epoch `s = 0..7`, col-mode covers chunk-id sets
+`{ 8·g + s : g = 0..7 }` which are disjoint for distinct `s` and whose union is exactly `{0, 1, …, 63}`. The map `(s, g) ↦ 8·g + s` is a **bijection** onto the 64 chunks of the tile; therefore *every* col-mode read delivers the **complete 4096 B of the tile in 8 cycles — no duplicate, no omission — for every one of the 23 legal `(E, R, C)` shapes**. The ✅ markings below only classify whether the delivery is semantically a clean whole-row transpose view (✅) or an interleaved row-segment view that VEC's Align/Unpack/Permute front-end must recombine (no ✅); the *data completeness* is identical in both cases.
+
+Given the coverage and strip-fill formulas above, for any legal `(E, R, C)` with row width `W = C·E` bytes, the set of tile rows touched by col-mode strip `s` is
+
+```
+  touched-rows(s) = { ( s·64 + k·512 ) div W :  k = 0..7 }     (as a multiset if W < 64)
+  byte-offset-in-row(s, k) = ( s·64 + k·512 ) mod W
+```
+
+and the 512 B payload is exactly `512 / W` rows' worth of data (whole rows when `W ≤ 64`, row-segments otherwise). The following regime table and the shape-by-shape enumeration are direct corollaries.
+
+| `W` range | Row-mode strip contents | Col-mode strip contents |
+|-----------|-------------------------|-------------------------|
+| `W ≤ 64` (row fits in 1 chunk, `64/W` rows per chunk) | `512/W` contiguous rows | `512/W` **whole** rows, arranged as `8` groups of `64/W` consecutive rows, with group `k` starting at base row `k·(R/8) + s·(64/W)` |
+| `W = 128, 256, 512` (row = 2, 4, 8 chunks) | `512/W` contiguous rows | 8 × 64 B *row-segments* picked from 8 distinct rows, stride `R/8`; each segment covers one of the `W/64` sub-chunks of its row, cycled by `s` |
+| `W = 1024, 2048, 4096` (one row spans `W/512` row-mode strips) | `1 / (W/512)` row (a half / quarter / eighth) | 8 × 64 B row-segments drawn from the `R ≤ 4` rows of the tile, at stride-512 byte positions inside each row |
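+
+The strip-fill and coverage formulas above can be checked numerically. The following is a minimal Python sketch, not part of the design: the helper names (`col_strip_windows`, `touched_rows`, `covers_tile_once`) are hypothetical, and it assumes only the window positions `s·64 + k·512` and the row width `W` defined in this section.
+
+```python
+TILE_BYTES, CHUNK, STRIP = 4096, 64, 512
+
+def col_strip_windows(s):
+    """Tile byte offsets of the eight 64 B windows in col-mode strip s."""
+    return [s * CHUNK + k * STRIP for k in range(8)]
+
+def touched_rows(s, w):
+    """(row, byte offset in row) of each window start, for row width w bytes."""
+    return [(off // w, off % w) for off in col_strip_windows(s)]
+
+def covers_tile_once():
+    """True if the 8 strips of an epoch hit every chunk id exactly once."""
+    ids = sorted(off // CHUNK for s in range(8) for off in col_strip_windows(s))
+    return ids == list(range(TILE_BYTES // CHUNK))
+```
+
+With `W = 256` (the 16 × 64 FP32 shape), `touched_rows(5, 256)` yields rows 1, 3, …, 15, each at byte offset 64, which agrees with the odd-row / second col-band case listed for that shape in the per-shape table.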
+
+**Per-shape coverage across all 23 legal `(E, R, C)` geometries** of [`vector4k.md`](vector4k.md) §9.7 (FP16 and BF16 share shapes):
+
+| `R × C` (FP32, `E=4`) | `W=C·E` | Row-mode rows / strip | Col-mode strip `s` delivers | `R × C` (FP16 / BF16, `E=2`) | `W` | Row-mode rows / strip | Col-mode strip `s` delivers |
+|-----------------------|---------|-----------------------|-----------------------------|------------------------------|-----|-----------------------|-----------------------------|
+| 1 × 1024 | 4096 B | `1/8` row | 8 × 64 B of the sole row, stride 512 B (= a stride of 128 FP32 elements) | 1 × 2048 | 4096 B | `1/8` row | 8 × 64 B of the sole row, stride 512 B (= a stride of 256 FP16 elements) |
+| 2 × 512  | 2048 B | `1/4` row | 4 × 64 B of row 0 + 4 × 64 B of row 1, stride 512 B inside each row | 2 × 1024 | 2048 B | `1/4` row | same, scaled by `E` |
+| 4 × 256  | 1024 B | `1/2` row | 2 × 64 B of each of 4 rows, stride 512 B inside each row | 4 × 512  | 1024 B | `1/2` row | same, scaled by `E` |
+| **8 × 128**  | **512 B** | **1 whole row** | **1 × 64 B of each of 8 rows** ⇒ **clean 8 × 8 chunk-grid transpose = 8 rows × 16 FP32 col-band** ✅ | **8 × 256**  | **512 B** | **1 whole row** | **8 rows × 32 FP16 col-band** ✅ |
+| 16 × 64  | 256 B  | 2 rows | 8 rows (stride 2: rows {0,2,…,14} for s ∈ {0..3}; rows {1,3,…,15} for s ∈ {4..7}), each at one of 4 col-bands of 16 FP32 | 16 × 128 | 256 B  | 2 rows | same pattern over 16 rows × 128 FP16 |
+| 32 × 32  | 128 B  | 4 rows | 8 rows (stride 4), each at one of 2 half-row col-bands of 16 FP32 | 32 × 64  | 128 B  | 4 rows | same over 32 rows × 64 FP16 |
+| **64 × 16** | **64 B**  | **8 whole rows** | **8 complete rows (stride 8 over 64 rows)** ✅ | **64 × 32** | **64 B**  | **8 whole rows** | **8 complete rows (stride 8 over 64 rows)** ✅ |
+| 128 × 8  | 32 B   | 16 whole rows | 8 × (2 consecutive rows) with stride 16 over 128 rows | 128 × 16 | 32 B   | 16 whole rows | 8 × (2 consecutive rows), stride 16 over 128 rows |
+| 256 × 4  | 16 B   | 32 whole rows | 8 × (4 consecutive rows) with stride 32 over 256 rows | 256 × 8  | 16 B   | 32 whole rows | 8 × (4 consecutive rows), stride 32 over 256 rows |
+| 512 × 2  | 8 B    | 64 whole rows | 8 × (8 consecutive rows) with stride 64 over 512 rows | 512 × 4  | 8 B    | 64 whole rows | 8 × (8 consecutive rows), stride 64 over 512 rows |
+| 1024 × 1 | 4 B    | 128 whole rows | 8 × (16 consecutive rows) with stride 128 over 1024 rows | 1024 × 2 | 4 B    | 128 whole rows | 8 × (16 consecutive rows), stride 128 over 1024 rows |
+| — | | | | 2048 × 1 | 2 B    | 256 whole rows | 8 × (32 consecutive rows), stride 256 over 2048 rows |
+
+(✅ marks the two regimes in which col-mode produces a clean delivery: `W = 512 B` → each strip carries one 64 B col-band from each of the eight rows, so the epoch delivers the eight full rows as an 8 × 8 col-band transpose; and `W ≤ 64 B` → `512/W` full rows chosen as eight stride-`R/8` row-blocks. In every other regime col-mode still fills the 512 B port, but with interleaved row-segments instead of whole rows; VEC's Align/Unpack/Permute recombines them.)
+
+**Universal bank-conflict-freeness (proof).** The per-cycle physical bank set accessed by the 8 read ports is
+
+```
+  row-mode (∀ p ∈ {0..7}):  { 8·g + ((l+g) mod 8) : l = 0..7 }  with g = (p+cy) mod 8
+  col-mode (∀ p ∈ {0..7}):  { 8·i + ((p+cy+i) mod 8) : i = 0..7 }
+```
+
+Neither expression depends on `R`, `C`, `E`, `tile_idx`, or any tile content; both are functions only of `(p, cy)`. The two §4 bijection arguments therefore hold for **every** `(E, R, C)` in the 23-row legal-shape table unchanged. Writes (always row-mode) are likewise shape-agnostic. The **"uniform transpose mode per epoch"** rule (§6 R2) is the single necessary and sufficient scheduling constraint; no shape introduces any extra hazard.
+
+Hence:
+
+> **Transposed read is bank-conflict-free for all 23 legal `(E, R, C)` geometries and at every valid `W = C·E` from 2 B to 4096 B.** The 512 B read port is fully filled on every cycle; for `W ≤ 64` it is filled with `512/W` whole rows (stride-`R/8` gather), for `W ∈ (64, 512]` with `512/W` rows' worth of 64 B row-segments, and for `W > 512` with the same fractional-row budget as row-mode.
+
+VEC's existing Align/Unpack/Permute front-end (`vector4k.md` §4) is the sole agent that lifts these strip-level deliveries to the element-level transpose required by `TCOL*` / `TTRANS` class instructions — the TRegFile guarantees only that the raw 4 KB of data arrives, bank-conflict-free, in the chunk-grid-transposed order.
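+
+The two bank-set expressions in the proof can be exercised with a short script. This is an illustrative sketch under the assumption that the eight read ports have phases `p = 0..7` and share the cycle counter `cy`; the function names are hypothetical and do not refer to RTL.
+
+```python
+def row_banks(p, cy):
+    """Physical banks touched by read port p in row-mode at cycle cy."""
+    g = (p + cy) % 8
+    return {8 * g + (l + g) % 8 for l in range(8)}
+
+def col_banks(p, cy):
+    """Physical banks touched by read port p in col-mode at cycle cy."""
+    return {8 * i + (p + cy + i) % 8 for i in range(8)}
+
+def conflict_free(bank_fns, cy):
+    """bank_fns[p] is the bank-set function used by read port p this cycle."""
+    seen = set()
+    for p, fn in enumerate(bank_fns):
+        banks = fn(p, cy)
+        if seen & banks:          # two ports want the same bank
+            return False
+        seen |= banks
+    return len(seen) == 64        # all 64 banks used exactly once
+```
+
+Calling `conflict_free([row_banks] * 8, cy)` or `conflict_free([col_banks] * 8, cy)` returns `True` for every `cy` in `0..7`, while a mixed list such as `[row_banks] * 7 + [col_banks]` returns `False`. This is why rule R2 requires a uniform `is_transpose` within an epoch.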
+
+##### 7.6 Datapath cost
+
+| Block | Before | After | Δ |
+|-------|--------|-------|---|
+| Bank decode | pure wiring | 3-bit rotator controlled by `g` | 8 × 64 B 8-way rotators per port (8 read + 8 write) |
+| Bank-select mux | 1 option (row-mode) | 2 options (row or col) steered by `is_transpose` | small 2-to-1 mux per port, plus col-mode address generator (`bank_i = 8·i + (p+cy+i) mod 8`) |
+| Output lane permute (read) | none | rotate-left by `g` (row-mode) / identity (col-mode) | 8 × 64 B 8-way rotators per read port |
+| Storage | — | **no change** | 0 extra SRAM, 0 extra rows |
+| Ports | 8R + 8W | **no change** | 0 extra ports |
+| Latency | read latency of underlying SRAM | **no change** | rotator + mux are combinational and fit inside the existing pipeline stage |
+| `is_transpose` registers | — | 1 bit of pending + 1 bit of active per read port | 16 FFs total |
+
+Net area overhead is dominated by the 24 × 8-way 64 B-wide rotators (8 write + 8 read input-side + 8 read output-side), which at a 64 B (512-bit) granularity are standard building blocks at negligible cost compared to the 64 × (256 × 512 b) SRAM macros.
+
+##### 7.7 Worked example — chunks of one tile
+
+Assume `tile_idx = 0x2A` and that the producer writes the 4 KB tile in natural row-major linear address order (chunk 0 at bytes 0..63, chunk 63 at bytes 4032..4095). After the write-side rotator, the physical SRAM contents at SRAM row `0x2A` are:
+
+```
+  bank  0  (G0,l=0)  ← chunk (0,0)      bank  8  (G1,l=0)  ← chunk (1,7)
+  bank  1  (G0,l=1)  ← chunk (0,1)      bank  9  (G1,l=1)  ← chunk (1,0)
+  bank  2  (G0,l=2)  ← chunk (0,2)      bank 10  (G1,l=2)  ← chunk (1,1)
+  …                                     …
+  bank  7  (G0,l=7)  ← chunk (0,7)      bank 15  (G1,l=7)  ← chunk (1,6)
+
+  bank 16  (G2,l=0)  ← chunk (2,6)      bank 24  (G3,l=0)  ← chunk (3,5)
+  bank 17  (G2,l=1)  ← chunk (2,7)      bank 25  (G3,l=1)  ← chunk (3,6)
+  …                                     …
+```
+
+A subsequent **row-mode** read on port R0 (phase 0) delivers, cycle-by-cycle:
+
+```
+  cy=0: chunk (0,0) (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7)   ← group G0, rotator = 0
+  cy=1: chunk (1,0) (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7)   ← group G1, rotator = 1
+  …
+  cy=7: chunk (7,0) (7,1) (7,2) (7,3) (7,4) (7,5) (7,6) (7,7)   ← group G7, rotator = 7
+```
+
+i.e. the tile in natural row-major order (bytes 0..4095).
+
+A **col-mode** read on port R0 delivers:
+
+```
+  cy=0: chunk (0,0) (1,0) (2,0) (3,0) (4,0) (5,0) (6,0) (7,0)   ← col 0
+  cy=1: chunk (0,1) (1,1) (2,1) (3,1) (4,1) (5,1) (6,1) (7,1)   ← col 1
+  …
+  cy=7: chunk (0,7) (1,7) (2,7) (3,7) (4,7) (5,7) (6,7) (7,7)   ← col 7
+```
+
+i.e. the **transpose of the 8 × 8 chunk grid**, still 4 KB in 8 cycles, still conflict-free, with the only per-cycle physical access being a wrapped-diagonal bank set as listed in the §4.2 proof.
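+
+The worked example can be reproduced with a small behavioural model. This is a sketch only, assuming the skew `bank = 8·g + (l + g) mod 8` and the read-side rules of §7.4; `write_tile` and `read_strip` are hypothetical names, and the SRAM row index (`0x2A`) is omitted because it does not affect bank selection.
+
+```python
+def write_tile():
+    """Return {physical bank: logical chunk (g, l)} after the write-side skew."""
+    return {8 * g + (l + g) % 8: (g, l) for g in range(8) for l in range(8)}
+
+def read_strip(banks, p, cy, is_transpose):
+    """Chunks on output lanes 0..7 of read port p at epoch cycle cy."""
+    a = (p + cy) % 8
+    if is_transpose:
+        # col-mode: one bank per group i, identity output permute
+        return [banks[8 * i + (a + i) % 8] for i in range(8)]
+    # row-mode: fetch all banks of group a, then rotate-left by a
+    fetched = [banks[8 * a + j] for j in range(8)]
+    return fetched[a:] + fetched[:a]
+```
+
+With `banks = write_tile()`, bank 8 holds chunk `(1, 7)` and bank 16 holds chunk `(2, 6)`, as in the listing above. `read_strip(banks, 0, 1, False)` returns chunks `(1, 0)` through `(1, 7)`, and `read_strip(banks, 0, 7, True)` returns chunks `(0, 7)` through `(7, 7)`.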
+
+##### 7.8 Summary of the enhancement
+
+1. **Storage layout** is a diagonal/diamond skew (`bank = 8·g + (l + g) mod 8`); the SRAM address and bank count are unchanged.
+2. **Read ports** gain an `is_transpose` input (double-registered with `reg_idx`); a port can pick row or col delivery order at each epoch boundary.
+3. **Calendar** is unchanged in its port-to-phase rotation; per-cycle bank-select is either *whole group* (row-mode) or *one bank per group* (col-mode).
+4. **Write ports** are unchanged externally; a fixed 3-bit rotator places written lanes at the skewed physical bank positions.
+5. **Throughput, latency, storage, SRAM port count**: all **unchanged**.
+6. **Scheduling** gains one rule: all 8 active reads of any epoch share the same `is_transpose`. This pairs naturally with the existing "1 address per port per 8 cycles" cadence.
+7. **Gain**: true bank-conflict-free transpose of the 8 × 8 chunk partition of any tile, delivered at full 512 B/cy — no spare cycles, no duplicate storage, no side buffer.
+
+#### 8. Revision History
+
+| Version | Notes |
+|---------|-------|
+| 0.1 (= **TRegFile-4K v1**) | Initial design: rectangular bank decode (`bank_id = 8·g + l`), row-mode reads only, 8R+8W ports, 8-cycle synchronized calendar, 4 KB tiles × 256 phys-tiles = 1 MB. Single scheduling rule (R1: cross-phase same-tile RAW hazard). |
+| **0.2 (= TRegFile-4K v2, this document)** | **Diagonal skew + `is_transpose` read input**: bank-conflict-free row-*or*-col tile delivery at full **512 B/cy**. Added §7 detailing storage map, read/write datapath, scheduling rule (R2 uniform transpose mode per epoch), cost, and worked examples. Storage size, port count, calendar cadence, address-acceptance rate **all unchanged from v1**. |
+| 0.3 (planned, not in this document) | Per-port quality-of-service throttling for cube-bound traffic (out of scope). |
+
+#### 9. v1 → v2 Migration Quick Reference
+
+> **For implementors and clients of TRegFile-4K:** the v1 → v2 transition is intentionally a **drop-in replacement**. v1 clients that never assert `is_transpose=1` see the v2 register file as functionally and timing-identical to v1.
+
+| Concern | v1 (rev. 0.1) | v2 (rev. 0.2) | Migration |
+|---------|---------------|---------------|-----------|
+| Bank decode formula | `bank_id = 8·g + l` (rectangular) | `bank_id = 8·g + ((l+g) mod 8)` (diagonal skew) | Internal to TRegFile; client-invisible. The on-chip 3-bit write-rotator (§7.3) handles the placement. |
+| Read port input | `reg_idx[7:0]` only | `reg_idx[7:0]` + `is_transpose[0]` | Tie `is_transpose` to 0 → identical to v1 behaviour. |
+| Read traversal modes | row-mode only | row-mode (`is_transpose=0`) **OR** col-mode (`is_transpose=1`) | New capability; v1 clients ignore. |
+| Write port | row-mode only (no `is_transpose` input) | identical to v1 | None. |
+| Output rotator (read) | none required (rectangular decode aligns naturally) | `rotate-left by g` for row-mode, identity for col-mode (§7.4) | Client-invisible. |
+| Scheduling rule R1 (cross-phase RAW) | enforced | enforced (same as v1) | None. |
+| Scheduling rule R2 (uniform transpose mode per epoch) | not applicable | **enforced** | New constraint; only matters if `is_transpose=1` is ever used. v1 schedulers automatically satisfy R2 (all `is_transpose=0`). |
+| Storage / latency / port count | baseline | **unchanged** | None. |
+| Software model | TRegFile read = "tile contents in row-major order" | TRegFile read = "tile contents in row-major order **OR** transpose of 8 × 8 chunk grid" | New software-visible mode bit. v1 software ignores. |
+
diff --git a/designs/outerCube/vector4k.md b/designs/outerCube/vector4k.md
new file mode 100644
index 00000000..fafbca93
--- /dev/null
+++ b/designs/outerCube/vector4k.md
@@ -0,0 +1,871 @@
+# VEC-4K: Vector Unit for 4 KB PTO Tiles (PTO ISA Subset)
+
+## 1. Purpose and Scope
+
+This document specifies a **vector execution unit (VEC-4K)** that implements a **software-visible subset** of the PTO Tile Lib ISA ([`PTOISA/README.md`](PTOISA/README.md)): elementwise tile–tile ops, tile–scalar ops, axis reduce/expand, and selected **complex** instructions (e.g. **TMRGSORT**, **TSORT32**, **TGATHER**, **TCI**). The unit is paired with a **tile register file (TRegFile)** holding **4 KB** tiles.
+
+**Non-goals (this document):** matrix multiply (**TMATMUL** / **TGEMV** family), global-memory **TLOAD/TSTORE** (treated as separate DMA-like paths), and **comm** collective ISA—only the on-tile vector datapath is analyzed here.
+
+---
+
+## 2. Tile and Format Model
+
+### 2.1 Storage Invariant
+
+Each logical tile occupies exactly **4096 bytes** in the TRegFile. The logical shape is **R × C** with:
+
+- **R** and **C** are powers of two (implementation may also require R·C to match the element count implied by the format).
+- **Row-major** layout: address increases along columns within a row, then along rows.
+
+Let **E** be the **storage bytes per logical element** in the chosen encoding. **Only two storage widths are supported** — FP32 and FP16 / BF16. Smaller-precision formats (FP8, MXFP4, HiFP4) are explicitly **out of scope**:
+
+| Logical format | Storage `E` (bytes / element) | Elements per 4 KB tile (N = 4096 / E) |
+|----------------|-------------------------------|----------------------------------------|
+| FP32           | 4                             | 1024                                   |
+| FP16 / BF16    | 2                             | 2048                                   |
+
+Internal ALU / reducer operands are widened to FP32 where required by ISA numerics (e.g. associative-reduce accumulation); pack / unpack is narrow-to-narrow only (FP16 ↔ FP32 cast per `TCVT`). There are **no packed-nibble lanes** anywhere in the datapath.
+
+**Valid shape examples** (illustrative):
+
+- FP32: 32×32, 16×64, 64×16, … (R·C = 1024).
+- FP16 / BF16: 64×32, 32×64, 128×16, … (R·C = 2048).
+
+### 2.2 Metadata
+
+Each issued vector op carries **format**, **R**, **C**, and **opcode**. Microcode (or a small FSM per op) derives:
+
+- `strip_count = 4096 / 512 = 8` **physical strips** per tile (fixed by §3).
+- `elem_per_strip = 512 / E` (must be integer; hardware asserts this from format).
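+
+As an illustration of the storage invariant and the derived metadata, the sketch below enumerates the power-of-two shapes for each format and computes `strip_count` and `elem_per_strip`. It assumes that every power-of-two split with `R·C·E = 4096` is legal; the names `legal_shapes` and `strip_metadata` are hypothetical.
+
+```python
+TILE_BYTES, STRIP_BYTES = 4096, 512
+ELEM_BYTES = {"FP32": 4, "FP16": 2, "BF16": 2}
+
+def legal_shapes(fmt):
+    """All power-of-two (R, C) with R * C * E == 4096 for the given format."""
+    n = TILE_BYTES // ELEM_BYTES[fmt]            # elements per tile
+    return [(1 << k, n >> k) for k in range(n.bit_length())]
+
+def strip_metadata(fmt):
+    """Per-op constants derived from the format alone."""
+    e = ELEM_BYTES[fmt]
+    assert STRIP_BYTES % e == 0, "elem_per_strip must be an integer"
+    return {"strip_count": TILE_BYTES // STRIP_BYTES,
+            "elem_per_strip": STRIP_BYTES // e}
+```
+
+Under that assumption the enumeration gives 11 shapes for FP32 and 12 for FP16 / BF16.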
+
+---
+
+## 3. TRegFile Interface and Striping
+
+### 3.1 Ports (Design Assumption)
+
+| Direction | Width | Count | Aggregate |
+|-----------|-------|-------|-----------|
+| Read      | 512 B | 2     | **1024 B/cycle** |
+| Write     | 512 B | 2     | **1024 B/cycle** |
+
+Ports may target **independent tile bases** (e.g. `src0` and `src1`) or **the same tile** at different offsets (double-pump one operand into two buffers).
+
+When VEC is attached to [`tregfile4k.md`](tregfile4k.md), a concrete dual-read binding is **Rd0 → R0** (**Port A**, phase **0**) and **Rd1 → R4** (**Port B**, phase **4**) so **same `tile_idx`** yields strip pair **`(G,e)`** and **`(G+4,e)`** at epoch phase **e** (**§4.4**).
+
+**Read semantics:** TRegFile read ports present **only** full **512 B** bank-group strips per [`tregfile4k.md`](tregfile4k.md) — **no gather** (no sub-strip element indexing inside the RF). **`TCOL*`** therefore **replays** the same `reg_idx` over **multiple cycles** / **epochs** as needed; **column extraction** is done in **VEC** from **strip buffers A/B** after each read (**§5.3.2**, **§4.4 Example E**).
+
+### 3.2 Physical Strip
+
+A **strip** is a contiguous **512-byte** chunk at offset `s·512` for `s ∈ {0,…,7}` within the 4 KB tile. This matches one port transaction per strip.
+
+**Minimum streaming latency (full tile), ignoring bank conflicts:**
+
+- **Unary** (both read ports read **the same logical tile** at consecutive strip offsets): **2 strips/cycle** × **512 B** = **1024 B/cycle** → **4096 / 1024 = 4 cycles** to read one full tile.
+- **Binary elementwise** (typical: **Rd0 → src0** strip `s`, **Rd1 → src1** strip `s`): **1 strip per operand per cycle** → **8 strips/operand** → **8 cycles** to ingest **both** full tiles. Software/hardware can **reuse** a buffered operand (e.g. **op reuse buffer**) to hide half of the reads on back-to-back dependent ops.
+
+The micro-architecture below assumes **8 strip indices** per 4 KB tile and schedules **cross-strip** work where reductions, expands, or gathers require it.
+
+### 3.3 On-Chip Buffers
+
+- **A strip buffer** and **B strip buffer** (512 B each, **Rd0/Rd1**), optionally **double-buffered** to overlap TRegFile read with compute; **§4.1** **crossbar** ingests **1024 B/cycle** from the two ports.
+- **Acc** — **256 × 32 b × 2** ping-pong (**`N_run = 512`**, **§4.1**, **§9.3.2**); optional **strip / staging** for multi-pass sorts; **§4.3** schedules **Rd0/Rd1** vs **`fiber_id`** / **Acc** updates.
+- **Scalar broadcast register** (tile–scalar immediates or a single-element tile).
+
+---
+
+## 4. Vector Datapath Overview
+
+### 4.1 Block Diagram (dataflow level)
+
+**Reference micro-architecture** (counts for **scheduling** / **§5.3.2** **`N_tree`**):
+
+```text
+  ┌────────────────────────────────────────────────────────────────────┐
+  │                     TRegFile (4 KB tiles)                            │
+  │              Rd0 (512 B)              Rd1 (512 B)                  │
+  └──────────────────┬──────────────────────────┬──────────────────────┘
+                     │                          │
+                     └──────────┬───────────────┘
+                                │ 1024 B aggregate / cycle (2×512 B)
+                                ▼
+  ┌────────────────────────────────────────────────────────────────────┐
+  │  Instruction opcode + shape (format, R, C, …)  ──►  CONTROL         │
+  └────────────────────────────────────────────────────────────────────┘
+                                │
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  CROSSBAR                           │
+              │  1024 B in → distribute to compute  │
+              └─────────────────┬───────────────────┘
+                                │
+              ┌─────────────────▼───────────────────────────────────────────┐
+              │  (A) ALIGN / UNPACK / PERMUTE  (control-selected)          │
+              │  IN:  1024 B / cycle from crossbar                          │
+              │  OUT: 128 slices  (slice i width = W_prep,i bits)           │
+              └─────────────────┬───────────────────────────────────────────┘
+                                │ 128 parallel slice buses
+                                ▼
+              ┌─────────────────▼───────────────────────────────────────────┐
+              │  (B) N_group = 128  INDEPENDENT COMPUTE GROUPS  i = 0…127   │
+              │  ┌─────────────────────────────────────────────────────┐   │
+              │  │  Group i (representative):                          │   │
+              │  │    IN:   W_prep,i  (from A)                         │   │
+              │  │    ┌──────────────────┐      ┌────────────────────┐  │   │
+              │  │    │ Elementwise ALU  │ ───► │ Reduction tree     │  │   │
+              │  │    │ OUT:  W_ALU,i    │      │ OUT:  W_tree,i     │  │   │
+              │  │    └──────────────────┘      └─────────┬──────────┘  │   │
+              │  │    Constraint: W_ALU,i ≥ W_tree,i  (allowed; tree   │   │
+              │  │    narrows / retires partials; ALU may be wide SIMD) │   │
+              │  │    Elementwise bypass: ALU OUT can skip deep tree    │   │
+              │  │    (W_tree,i = W_ALU,i, or tree depth 0) when opcode │   │
+              │  │    does not need cross-lane combine on that group.   │   │
+              │  └─────────────────────────────────────────────────────┘   │
+              │  Typical: W_tree,i = 32 b (FP32-shaped partial → Acc)        │
+              └─────────────────┬───────────────────────────────────────────┘
+                                │ 128 × W_tree,i (subset to Acc per beat)
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  ACCUMULATOR (DFF, ping-pong)       │
+              │  256 × 32 b × 2 halves  = 2048 B    │
+              │  per slot: DFF + optional combine    │
+              │    • RMW: adder (new ⊕ feedback DFF)  │
+              │    • BYPASS: new → DFF (no combine) │
+              │  N_run = 512 logical slots (§9.3.2) │
+              └─────────────────┬───────────────────┘
+                                │ mux: one 256-word half
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  Pack (FP32 → FP16 / BF16 cast, per TCVT) │
+              └─────────────────┬───────────────────┘
+                                │
+                    ┌───────────▼───────────┐
+                    │ Wr0 (512 B)  Wr1 (512 B) │
+                    │  = 1024 B retire / phase   │
+                    └───────────────────────────┘
+```
+
+**Flow:** **Rd0 + Rd1** → **crossbar** (**1024 B/cycle**). **Control** drives crossbar routing, the **(A)** unpack/permute masks, the **(B)** per-group ALU opcode and tree depth (or bypass), and **Acc** addressing. **(A)** outputs **128** parallel slices of width **`W_prep,i`** (which depends on format, opcode, and implementation). **(B)** is **`N_group = 128`** identical structures: each slice feeds an **elementwise ALU** whose output has width **`W_ALU,i`**, then a **reduction tree** that may narrow it to **`W_tree,i`**. **`W_ALU,i ≥ W_tree,i`** is allowed (a wide SIMD ALU in front of a smaller tree footprint). The typical partial sent to **Acc** has **`W_tree,i = 32` b** (FP32-shaped). Pure elementwise ops may use **tree depth 0** or **bypass**, so that **`W_tree,i = W_ALU,i`** on the path toward **Pack**. **Accumulator:** besides **RMW** (partial ⊕ stored value via adder + DFF feedback), control may **bypass** the combine adder and write tree/ALU data straight into the Acc DFF (load / overwrite / non-accumulating retire). For Acc ping-pong, Wr half-select, and **`fiber_id` / `ρ` / Acc waves**, see **§5.3.2** and **§9.3.2**; for calendars, see **§4.3**.
+
+### 4.2 “Lanes” vs “Strips”
+
+- **SIMD lane**: one parallel datapath processing **one logical element** after unpack (width depends on op; internal **FP32** is a reasonable unified width for expensive ops).
+- **Strip**: 512 B of **spatially contiguous** storage; SIMD width = `elem_per_strip`.
+- **Cross-lane** (within strip): reductions along a dimension that fits in one strip (partial row/col).
+- **Cross-strip**: **control-programmed** **crossbar** + **trees** (**§4.1**) combine strip `s` and `s′` contributions, or **multi-cycle accumulation** into **Acc** ping-pong (**§9.3.2**) / **staging**.
+
+### 4.3 Fiber ID and strip read calendar
+
+**`fiber_id`** is the **logical index along the axis** that **reduce** and **expand/broadcast** class ops treat as a **fiber**—one **output slot** per fiber after a reduce, or one **scalar source** per fiber when expanding along that axis:
+
+| Opcode family | `fiber_id` | Range |
+|---------------|------------|--------|
+| **`TROW*`** (row reduce) | row index **`r`** | `0 … R−1` |
+| **`TCOL*`** (column reduce) | column index **`c`** | `0 … C−1` |
+| **`TROWEXPAND*`** | **`r`** (splat target row) | `0 … R−1` |
+| **`TCOLEXPAND*`** | **`c`** (splat target column) | `0 … C−1` |
+
+**Elementwise** tile–tile ops do **not** use a single global **`fiber_id`**; they are scheduled **strip-by-strip** only. **Gather/sort/merge** use their own index streams; where they write **per-row/col** state, that state can still be keyed like **`fiber_id`** for buffer allocation.
+
+**From strip bytes to `(r, c)` and `fiber_id`:** For strip index **`s ∈ {0,…,7}`** and lane/byte offset inside the **512 B** chunk, **row-major** layout fixes a linear element order; decode **`(r, c)`** from **`(R, C, E)`**. Then **`fiber_id = r`** or **`c`** according to the opcode’s **axis**. **Control** drives the **crossbar** (**§4.1**) so **unpack**, **permutation**, and **tree/ALU** inputs see the correct segment for each **`fiber_id`** touched in that strip.
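+
+The decode described above can be sketched as follows. This is a hedged illustration of the row-major map only, not the control logic: `decode_lane` and its `axis` argument are hypothetical names, and `lane` counts elements (not bytes) from the start of the 512 B strip.
+
+```python
+def decode_lane(s, lane, R, C, E, axis):
+    """Map (strip s, element lane within the strip) to (r, c, fiber_id)."""
+    elem_per_strip = 512 // E
+    assert R * C * E == 4096, "tile must be exactly 4 KB"
+    assert 0 <= s < 8 and 0 <= lane < elem_per_strip
+    linear = s * elem_per_strip + lane      # row-major element index
+    r, c = divmod(linear, C)
+    return r, c, (r if axis == "row" else c)
+```
+
+For the FP32 8 × 128 shape, `decode_lane(4, 0, 8, 128, 4, "row")` gives `(4, 0, 4)`: strip 4 starts at row 4, so a `TROW*` op sees `fiber_id = 4` there.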
+
+**Strip read calendar:** A **calendar** is the **cycle-by-cycle** schedule that binds **what arrives on Rd0/Rd1** to **what the datapath does**—in particular, **which operands feed each lane** and **which `fiber_id`(s)** touch **Acc** or **per-fiber `v` buffers** that cycle.
+
+- **Port row (per cycle `t`):** specifies **`s(t)`** (which **512 B** chunk), **which logical tile** each port reads (**`src0`**, **`src1`**, narrow **`v`** tile, scalar tile, **ping-pong scratch** for merges/sorts, or **idle**), and optional **second-pass** phases. **`TCOL*`** does **not** use a **transpose scratchpad**; it reads only **normal** `reg_idx` tiles in **row-major** strip order (**§5.3.2**). TRegFile ports **cannot gather** (**§3.1**); **`TCOL*`** may **repeat** the **same** `reg_idx` over **multiple TRegFile epochs** when **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉) > 1`** (**§5.3.2**, **§4.4 Example E**; **`#W`** reduces to the first two terms when **`N_acc ≤ N_run`**).
+- **Operand sources (`TROW*` reduce):** **Tile elements** arrive strip-serially from **read ports**; after **unpack → within-strip tree → cross-strip combine**, the reducer performs **RMW** on **Acc** at **`fiber_id = r`**. Physical slot **`ρ`** is **`fiber_id` remapped** into **`[0, N_run)`** for the current **Acc wave** (§9.3.2: **`bank = ρ mod 8`**, **`word = ρ >> 3`**, **0…63**). One strip can touch **many** distinct **`fiber_id`**s when **`row_B < 512`** or **`col_B < 512`** (many thin fibers per strip).
+- **Operand sources (`TCOL*` reduce):** Ports still deliver **full rows** inside each **512 B** strip; **VEC** **selects** **`(r,c)`** for the scheduled **column band** from **strip buffers** (**no RF gather**). **Acc[`c`]** **+=** partial sums across strip-beats and, if needed, across **re-scans** of the tile.
+- **Operand sources (expand):** **`v[fiber_id]`** is supplied from a **narrow per-fiber vector** streamed on a read port, from **Acc / staging** after an in-place reduce, or from a **small buffer** filled in a **prefetch** phase; **`src`** elements still arrive **strip-major** like §5.1. The calendar interleaves **`v`** strip reads with **`src`** strips so each cycle’s SIMD sees a consistent **`(fiber_id, lane)`** map.
+
+**Templates:** Opcode decode picks a **calendar template** from **`(format, R, C, opcode)`**. The §9 metrics **`rS`, `rW`, `rK`, …** / **`cS`, `cW`, …** fix how many **within-strip** rounds and how **8-strip** walks align with **cross-strip** merge. The **47** distinct scheduling recipes in **§9.5.1** are **calendar families** over the same datapath, not separate RTL blocks. **Epoch-aligned worked tables** vs [`tregfile4k.md`](tregfile4k.md): **§4.4**.
+
+**Illustrative calendar row** (one cycle of a row-reduce pass; details vary by shape):
+
+| Field | Content |
+|-------|---------|
+| **`t`** | Cycle index in the micro-sequence |
+| **Rd0 / Rd1** | e.g. **`src` strip `s`**, second operand or **`idle`** |
+| **`s`** | Current **strip index** `0…7` (or remapped pass) |
+| **Lane → `(r,c)`** | From **§2.1** row-major map |
+| **`fiber_id`s updated** | Subset of **`r`** appearing in this strip’s segment |
+| **Acc** | **RMW** at decoded **bank/word** for each retiring partial |
+
+### 4.4 Epoch-aligned fiber calendars vs `tregfile4k.md` (five worked examples)
+
+Full **(format, shape)** enumeration would need **35** row-axis templates alone (§9); this subsection fixes **VEC ↔ TRegFile-4K** timing and shows **five** representative **`fiber_id`** calendars (**Examples A–E**; FP32 and FP16 / BF16 only). See [`tregfile4k.md`](tregfile4k.md): global **`e = cy[2:0]`** (phase within an **8-cycle** epoch); read port **Rp** presents bank-group **`G = (p + e) mod 8`** (**512 B** = one **strip** **Gs**).
+
+**Port binding for the tables:**
+
+| Logical name | TRegFile read port | Phase `p` | Strip delivered at phase `e` |
+|--------------|-------------------|-----------|--------------------------------|
+| **Port A** | **R0** | 0 | **`G_A = e`** |
+| **Port B** | **R4** | 4 | **`G_B = (4 + e) mod 8`** |
+
+**Epoch start:** **`t = 0`** is an **epoch boundary** (**e = 0**): the **`reg_idx`** for each port is **active** for the next 8 cycles (pending→active promotion, `tregfile4k.md` §4).
+
+**Dual-port same `tile_idx` on A+B:** in cycles **`t = 0…3`** (**e = 0…3**), the pair **`(G_A, G_B)`** visits **`(0,4), (1,5), (2,6), (3,7)`** — every strip **Gs ∈ {0,…,7}** appears **exactly once** as **one** of the two 512 B beats. Cycles **`t = 4…7`** repeat the **same** strip schedule (second lap with the **same** latched tile on both ports); microcode **suppresses duplicate Acc** or reuses beats for **another operand** / **write path**. **Benefit:** **4 cycles** to see all **8** strips **once** with **two** 512 B reads/cycle vs **8 cycles** with **Port A only** (Port B **idle**).
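+
+The dual-port strip schedule follows directly from `G = (p + e) mod 8`. The sketch below is illustrative only; it assumes the Port A → R0 (phase 0) and Port B → R4 (phase 4) binding from the table above, and `strip_at` / `dual_port_schedule` are hypothetical helper names.
+
+```python
+def strip_at(p, t):
+    """Bank-group strip presented by read port phase p at cycle t."""
+    return (p + t % 8) % 8
+
+def dual_port_schedule(cycles=8):
+    """(G_A, G_B) per cycle for Port A (phase 0) and Port B (phase 4)."""
+    return [(strip_at(0, t), strip_at(4, t)) for t in range(cycles)]
+```
+
+`dual_port_schedule(4)` returns `[(0, 4), (1, 5), (2, 6), (3, 7)]`, so all eight strips are seen once in four cycles. In cycles 4 to 7 the same strips recur with the two ports swapped, which is the duplicate lap that microcode masks or reuses.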
+
+**Row-major element index** in a strip: byte offset **`512·Gs + δ`** with **`δ`** increasing along **columns** within the row segment; **`fiber_id = r`** for **`TROW*`** labels which **row**’s **C** elements are being reduced or expanded.
+
+**Column shorthand in tables:**
+
+| Column | Meaning |
+|--------|---------|
+| **`t`**, **`e`** | Core cycle from op **epoch start**; **`e = t mod 8`**. |
+| **Port A / B** | **`T@Gs`** = tile **`reg_idx` T** and physical strip **Gs** (512 B). **`—`** = port unused this cycle. |
+| **Fibers (this beat)** | **`fiber_id`** values whose **row data** (reduce) or **row op** (expand) is anchored on **that port’s** strip in this cycle. |
+| **First elem @ port** | Logical **element** at **byte 0** of that port’s 512 B chunk for the listed fiber (start of the row segment in that strip). |
+| **`#elem`** | **Logical elements along the row** (along **C**) taken from **that port** this cycle toward **`TROW*`** / expand **`src`**. |
+| **Reduce / expand** | **`TROWSUM`:** cross-lane tree over **`#elem`** → **one** partial per **`fiber_id = r`** → **Acc RMW**. **`TCOL*`** (**§5.3.2**): **`fiber_id = c`**; **strip replay** + **VEC column mux** + **`Acc[c]`** **RMW**; **`P_beat = min(N_tree, N_acc)`**; **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)`** (**`f`**, **`N_run`**, **§5.3.2**; **`#W`** ≡ **`#waves`** **when** **`⌈C/N_run⌉ ≤ ⌈C/N_acc⌉`**) — **§4.4 Example E**. **`TROWEXPANDADD`:** for each **`fiber_id`**, combine **`#elem`** **`src`** lanes with **`v[fiber_id]`** (from **`v`** tile or latch). |
+
+*End-to-end latency* may add **tree pipeline stages** after the last ingest cycle; tables list **operand arrival** and **per-fiber arithmetic scope** per cycle.
+
+---
+
+#### Example A — `TROWSUM`, **FP32**, **8×128** (`C = 128`, **one row = one strip**)
+
+**Geometry:** **`row_B = 512 B`**, strip **Gs** holds **exactly row `r = Gs`**. **Dual-port** same **`src`** tile on **A+B** gives **two rows/cycle** for **`t = 0…3`**; **`t = 4…7`** are duplicate strip delivery (masked).
+
+| `t` | `e` | Port A | Port B | Fibers (A) | First elem @ A | `#elem` | Fibers (B) | First elem @ B | `#elem` | Reduce note |
+|----:|----:|--------|--------|------------|----------------|--------:|------------|----------------|--------:|-------------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0` | elem `(0,0)` | 128 | `r=4` | elem `(4,0)` | 128 | 2× `TROWSUM` lane-tree → **Acc** `r=0`, `r=4` |
+| 1 | 1 | `src@G1` | `src@G5` | `r=1` | `(1,0)` | 128 | `r=5` | `(5,0)` | 128 | **Acc** `r=1`, `r=5` |
+| 2 | 2 | `src@G2` | `src@G6` | `r=2` | `(2,0)` | 128 | `r=6` | `(6,0)` | 128 | **Acc** `r=2`, `r=6` |
+| 3 | 3 | `src@G3` | `src@G7` | `r=3` | `(3,0)` | 128 | `r=7` | `(7,0)` | 128 | **Acc** `r=3`, `r=7` |
+
+**Unique ingest complete at `t = 3`** (8 fibers, each **128** elements). **Single-port (A only):** stretch to **`t = 0…7`**, one row/cycle, **no** dual-port gain.
+
+---
+
+#### Example B — `TROWSUM`, **FP32**, **32×32** (`C = 32`, **4 rows / strip**)
+
+**Geometry:** **`row_B = 128 B`**, **`512 / 128 = 4`** rows per strip. **Single Port A** is enough (dual port does **not** shorten **unique** row coverage unless compute is strip-bound and overlapped differently); **Port B idle**.
+
+| `t` | `e` | Port A | Port B | Fibers (A) | First elem @ A (each fiber) | `#elem` each | Reduce note |
+|----:|----:|--------|--------|------------|-----------------------------|----------------|-------------|
+| 0 | 0 | `src@G0` | — | `0,1,2,3` | `(0,0)`, `(1,0)`, `(2,0)`, `(3,0)` | 32 | 4× lane-tree (`K=32`) → **Acc** `r=0…3` |
+| 1 | 1 | `src@G1` | — | `4,5,6,7` | `(4,0)`…`(7,0)` | 32 | **Acc** `r=4…7` |
+| 2 | 2 | `src@G2` | — | `8,9,10,11` | … | 32 | **Acc** |
+| 3 | 3 | `src@G3` | — | `12…15` | … | 32 | **Acc** |
+| 4 | 4 | `src@G4` | — | `16…19` | … | 32 | **Acc** |
+| 5 | 5 | `src@G5` | — | `20…23` | … | 32 | **Acc** |
+| 6 | 6 | `src@G6` | — | `24…27` | … | 32 | **Acc** |
+| 7 | 7 | `src@G7` | — | `28…31` | … | 32 | **Acc** |
+
+**32** fibers, each **`#elem = C = 32`** from **Port A** only; **8 cycles** = one epoch, **one** `reg_idx` on **R0**.
+
+---
+
+#### Example C — `TROWEXPANDADD`, **FP32**, **8×128** (`v[r]` + `src`)
+
+**`v` tile:** **8** row scalars as **FP32** = **32 B** at **byte offset 0** of strip **G0** (tile **`v`**; remaining bytes **don’t-care**).
+
+**Why prefetch `v`:** If **`v@G0`** were on **Port B** while **`src`** streams on **Port A** only, **dual-port** cannot also deliver **`src@G4`** in the **same** cycle. **High bandwidth schedule:** (1) **Pre-epoch** or **`t_pre`:** **Port B** reads **`v@G0`** once; latch **`v[0]…v[7]`** (bytes **0–3**, **4–7**, …, **28–31**). (2) **`t = 0…3`:** **Port A** and **Port B** both carry **`src`** with **same** `reg_idx` as Example **A** — **two rows/cycle**.
+
+| `t` | `e` | Port A | Port B | Fibers (`src`) | First elem @ A | First elem @ B | `#elem` / fiber | **`v[fiber_id]`** | Expand |
+|----:|----:|--------|--------|----------------|----------------|----------------|-----------------|-------------------|--------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0`, `r=4` | `(0,0)` | `(4,0)` | 128 | **latched** **`v[0]`**, **`v[4]`** | **128** lanes/fiber: **`src` + v** |
+| 1 | 1 | `src@G1` | `src@G5` | `r=1`, `r=5` | `(1,0)` | `(5,0)` | 128 | **latched** **`v[1]`**, **`v[5]`** | … |
+| 2 | 2 | `src@G2` | `src@G6` | `r=2`, `r=6` | … | … | 128 | **latched** **`v[2]`**, **`v[6]`** | … |
+| 3 | 3 | `src@G3` | `src@G7` | `r=3`, `r=7` | … | … | 128 | **latched** **`v[3]`**, **`v[7]`** | … |
+
+**`t_pre` (one beat, e.g. previous epoch):** **Port B** = **`v@G0`**, **Port A** = **—** or next **`src`** prefetch; **first element of `v[0]`** = **byte 0** of **B**’s 512 B chunk.
+
+---
+
+#### Example D — `TROWSUM`, **FP16**, **128×16** (`C = 16`, **`E = 2`**, **16 rows / strip**)
+
+**Geometry:** **`row_B = 32 B`** = **16** FP16 elements/row; **`512 / 32 = 16`** distinct **`fiber_id`** values per strip. **Dual-port** same **`src`**: **`t = 0…3`** covers all **128** rows (**32** partial trees/cycle = 16 fibers × 2 ports). **`K = 16`**, **`D_lane = 4`**.
+
+| `t` | `e` | Port A | Port B | Fiber IDs (A) | First elem @ A | `#elem` | Fiber IDs (B) | First elem @ B | `#elem` | Reduce note |
+|----:|----:|--------|--------|---------------|----------------|--------:|---------------|----------------|--------:|-------------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0…15` | row `r` at byte **`32r`** in **G0** | 16 | `r=64…79` | byte **`32(r−64)`** in **G4** | 16 | **32** lane-trees → **32** **Acc** RMW (**watch bank** = `r mod 8`) |
+| 1 | 1 | `src@G1` | `src@G5` | `16…31` | … | 16 | `80…95` | … | 16 | **32** **Acc** |
+| 2 | 2 | `src@G2` | `src@G6` | `32…47` | … | 16 | `96…111` | … | 16 | **32** **Acc** |
+| 3 | 3 | `src@G3` | `src@G7` | `48…63` | … | 16 | `112…127` | … | 16 | **32** **Acc** |
+
+**`t = 4…7`:** duplicate **`src`** strips unless **`reg_idx`** advances; suppress the repeated **Acc** update so that no row is reduced twice. **BF16 128×16** reuses **exactly the same calendar** (same `E`, same byte layout, different payload semantics).
+
+---
+
+#### Example E — `TCOLSUM`, **FP16**, **16×128** (`R = 16`, **`C = 128`**, **native row-major**)
+
+**TRegFile:** Read ports emit **only** full **512 B** strips (**[`tregfile4k.md`](tregfile4k.md)**); **there is no gather** inside the tile RF. **`TCOLSUM`** cannot request “column **`c`** only” from the file — it must **accept whole strips** on **Port A / Port B**, then **select** the needed **`(r,c)`** in **VEC** (strip buffers → unpack → **column mux** / shifter network).
+
+**Policy:** **No transpose scratchpad** (**§5.3.2**). Operand remains **one** `reg_idx`, **§2.1** row-major. **Acc[`c`]**: **read–modify–write** associative **add** so partials from each strip-beat **accumulate** until all **R = 16** row contributions for column **`c`** are seen.
+
+**Hardware parallelism (§5.3.2):** Let **`N_tree`** = parallel **adder / reduce trees** per beat; **`N_acc`** = parallel **Acc** **RMW** slots per cycle (**`N_acc ≤ N_run`**, **§9.3.2**); **`f`** = **effective `Acc[c]` commits per tree per full tile scan** (one **dual-ingest** pass over **all** strips — includes **sub-cycles** / **pipeline**). **Same-cycle** combine+retire is capped by **`P_beat = min(N_tree, N_acc)`**.
+
+**Wave count (both limits):**
+
+**`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`** (**`#W`** is the **tile-epoch count**; **`#waves`** below is a synonym when **`N_acc ≤ N_run`**, which implies **`⌈C/N_run⌉ ≤ ⌈C/N_acc⌉`**)
+
+- **`#waves_acc = ⌈C / N_acc⌉`**: **Acc-band** partitioning (**≤ `N_acc`** columns **finished** per **wave** if **trees** keep up).  
+- **`#waves_tree = ⌈C / (N_tree · f)⌉`**: when **`N_tree ≪ N_acc`**, **tree throughput** may require **more** **full scans** than **`#waves_acc`** predicts.  
+- **`#waves_Nrun = ⌈C / N_run⌉`**: **DFF capacity** (**§5.3.2**); redundant with **`#waves_acc`** when **`N_acc ≤ N_run`**.
+
+When **`N_acc > N_tree`**, **trees** are **time-multiplexed** over **sub-cycles**; **`f`** must be **measured** from **RTL**/**micro-arch**.
+
+**Illustrative numbers:** **`C = 128`**, **`N_acc = 64`**, **`N_run = 512`** → **`#waves_acc = 2`**, **`⌈C/N_run⌉ = 1`**. With **`N_tree = 8`**, **`f = 8`**: **`N_tree · f = 64`**, **`#waves_tree = 2`**, **`#W = max(2, 2, 1) = 2`**. With **`N_tree = 4`**, **`f = 8`**: **`N_tree · f = 32`**, **`#waves_tree = 4`**, **`#W = max(2, 4, 1) = 4`** — **tree**-limited; **strip calendar** is unchanged, but **microcode** runs **more** **tile** **epochs** and may **shrink** the **column band** per wave below **`N_acc`**.
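+
+The arithmetic above can be reproduced with a minimal sketch. `wave_count` is a hypothetical helper, not part of any tool; it assumes integer inputs (including an integer `f`) and simply evaluates the three ceilings and their maximum.
+
+```python
+def ceil_div(a, b):
+    """Integer ceiling of a / b for positive integers."""
+    return -(-a // b)
+
+
+def wave_count(C, N_acc, N_tree, f, N_run):
+    """Return (#waves_acc, #waves_tree, #waves_Nrun, #W)."""
+    waves_acc = ceil_div(C, N_acc)
+    waves_tree = ceil_div(C, N_tree * f)
+    waves_nrun = ceil_div(C, N_run)
+    return waves_acc, waves_tree, waves_nrun, max(waves_acc, waves_tree, waves_nrun)
+
+
+print(wave_count(128, 64, 8, 8, 512))  # (2, 2, 1, 2)  Acc-limited
+print(wave_count(128, 64, 4, 8, 512))  # (2, 4, 1, 4)  tree-limited
+```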
+
+**Acc-limited schedule (`#W = ⌈C / N_acc⌉ = 2`):** **`c_base = N_acc · k`** as in the tables below.
+
+**Geometry:** **`row_B = 256 B`** (**128** FP16 elements/row); **2** rows per **512 B** strip. **Dual** read (**R0+R4**, **§3.1**): **`t = 0…3`** (**`e = 0…3`**) delivers **unique** strip pairs **`(0,4)…(3,7)`** and visits **all 16** rows **once per full scan**. Cycles **`t = 4…7`** of the **same** TRegFile **epoch** repeat the **same** strips (second lap with identical `reg_idx`); **disable Acc updates** on **`t = 4…7`** unless **overlapped** with another **column band** (implementation choice).
+
+**If `N_acc ≥ C`:** **one** wave (**one** scan, **`t = 0…3`**) suffices for all columns.
+
+**Per-strip-beat (one wave):** From **A**/**B** buffers, **column mux** pulls **`(r,c)`** for **`c`** in the active band only; **4** row samples per **`c`** per **`t`** (2 rows × 2 ports) → **`Acc[c] +=`** partial (after interim reduce via **`N_tree`** lanes). After **`t = 3`** for that wave, each **`c`** in the band has **16** row terms accumulated → **done** for **`TCOLSUM`** (modulo tree **pipeline**).
+
+**Strip calendar (identical each wave; `c_base = N_acc · k`):**
+
+| `t` | `e` | Port A | Port B | Rows in A / B | **`fiber_id` band** | Per-**`c`** row samples this `t` | **Acc** |
+|----:|----:|--------|--------|---------------|---------------------|----------------------------------|---------|
+| 0 | 0 | `src@G0` | `src@G4` | **0–1** / **8–9**   | **`c ∈ [c_base, c_base + N_acc − 1]`** | **4** | **`Acc[c] +=`** partial from **4** **`(r,c)`** (mux from **A/B**) |
+| 1 | 1 | `src@G1` | `src@G5` | **2–3** / **10–11** | same band | **4** | **+=** … |
+| 2 | 2 | `src@G2` | `src@G6` | **4–5** / **12–13** | same band | **4** | **+=** … |
+| 3 | 3 | `src@G3` | `src@G7` | **6–7** / **14–15** | same band | **4** | **16** terms integrated per **`c`** in band |
+
+**`(r,c)` byte** in a strip with base row **`r₀`**: **`(r − r₀)·row_B + c·E`** with **`E = 2`** (FP16). **Multi-epoch summary** (**Acc-limited** **`#W = 2`**, i.e. **`#waves_tree ≤ #waves_acc`**; **`N_acc = 64`**, **`N_run = 512`**):
+
+| Wave `k` | Tile read policy | Active columns | Strip beats used |
+|----------|------------------|----------------|------------------|
+| 0 | Same `reg_idx`, epoch **E0** | **`c = 0…63`**   | **`t = 0…3`** as above |
+| 1 | **Re-read** same tile, epoch **E1** | **`c = 64…127`** | repeat calendar |
+
+If **`#waves_tree > #waves_acc`**, add **waves** **`k = 2 … #W − 1`** with **overlapping** or **narrower** **column** **bands** — **implementation**-specific.
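+
+A behavioural sketch of the Acc-limited schedule follows. It is a software model under stated assumptions, not the microcode: the tile holds integer stand-in values instead of FP16 payloads, `read_strip` and `tcolsum` are hypothetical names, and the tree stage is collapsed into a plain Python `sum`.
+
+```python
+R, C, E = 16, 128, 2              # FP16 16x128 tile from Example E
+STRIP_B = 512
+ROWS_PER_STRIP = STRIP_B // (C * E)   # 2 rows per 512 B strip
+
+# Integer stand-in payload so the check below is exact.
+tile = [[r * C + c for c in range(C)] for r in range(R)]
+
+
+def read_strip(gs):
+    """Whole rows held by strip gs; the RF offers no column gather."""
+    r0 = gs * ROWS_PER_STRIP
+    return tile[r0:r0 + ROWS_PER_STRIP]
+
+
+def tcolsum(n_acc=64):
+    """Replay the strip calendar once per column band of n_acc columns."""
+    acc = [0] * C
+    waves = -(-C // n_acc)                      # ceil(C / N_acc)
+    for k in range(waves):
+        c_base = n_acc * k
+        band = range(c_base, min(c_base + n_acc, C))
+        for t in range(4):                      # unique strip pairs only
+            rows = read_strip(t) + read_strip(t + 4)   # Port A + Port B
+            for c in band:                      # VEC-side column mux
+                acc[c] += sum(row[c] for row in rows)  # Acc[c] RMW
+    return acc, waves
+
+
+acc, waves = tcolsum()
+print(waves, waves * 4)   # 2 8
+```
+
+With `N_acc = 64` the model runs 2 waves of 4 strip-pair beats, which is the `#W × 4` figure before any epoch turnaround is added.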
+
+**Cycle lower bound (illustrative):** **`#W × 4`** strip-pair cycles (**`#W`** from **§5.3.2** **`max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)`**), **plus** **epoch** turnarounds — **implementation-dependent**. **Single-port A only** doubles strip cycles per scan (**`t = 0…7`** unique). **BF16 16×128** behaves identically (same `E`, same `row_B`).
+
+**§9 `c*`** metrics still bound **tree / strip** complexity; they **exclude** **transpose scratch**, **`#W`** **tile replays**, **`f`**, **`N_run`**, **TRegFile** **non-gather** replay — add explicitly in schedules (**§5.3.2**).
+
+---
+
+## 5. Instruction Categories and Cycle Sketches
+
+The following uses **R**/**C** notation, **S = 8** strips, and **read_pair** = one cycle with both 512 B read ports used. **Write_pair** = both write ports used (same or different tiles depending on retire policy).
+
+### 5.1 Elementwise (Tile–Tile)
+
+**Representative:** `TADD`, `TMUL`, `TAND`, `TCMP`, `TCVT` (unary ops such as `TABS`/`TRELU` omit one read port per strip).
+
+**Dataflow (non-pipelined sketch):**
+
+1. For each strip index `s = 0…7`: `read_pair` loads `src0[s]` and `src1[s]` → unpack → SIMD op → pack into `dst[s]` staging.
+2. Retire `dst[s]` with `write_pair` (same cycle as next strip’s read if the pipeline supports **read–compute–write** overlap).
+
+**Latency:** **8 cycles** minimum to **read** both operand tiles strip-by-strip; with **pipelining** (overlap read `s+1` with write `s−1`), end-to-end often **~10–12 cycles** for a full `dst` tile (implementation-dependent buffering).
+
+**Cross-lane:** **none** for pure elementwise; SIMD lanes are independent within the strip.
+
+**Special:** `TCVT` may widen/narrow; internal path uses **wider SIMD** or **two-pass** if pack/unpack asymmetry exceeds one cycle.
+
+```mermaid
+flowchart LR
+  subgraph per_strip [Per strip s]
+    R0[Read src0 512B] --> U0[Unpack]
+    R1[Read src1 512B] --> U1[Unpack]
+    U0 --> ALU[Elementwise ALU]
+    U1 --> ALU
+    ALU --> P[Pack]
+    P --> W[Write dst 512B]
+  end
+```
+
+---
+
+### 5.2 Tile–Scalar / Tile–Immediate
+
+**Representative:** `TADDS`, `TMULS`, `TCMPS`, `TEXPANDS`, `TADDSC`.
+
+**Scalar path:**
+
+- Immediate or **single-element tile** loaded once into **scalar broadcast reg** (optional **1-cycle** read of a dedicated scalar slot).
+- Each strip: SIMD op **lane_i = f(tile[s]_i, scalar)**.
+
+**Cycles:** same order of magnitude as §5.1; **one fewer** long-latency operand read if scalar is in a control register.
+
+**Cross-lane:** none (unless scalar differs per row/col via side table—then becomes expand-like).
+
+---
+
+### 5.3 Axis Reduce
+
+**Row-reduce examples:** `TROWSUM`, `TROWMAX`, `TROWARGMAX` (reduce across **columns** within each row).  
+**Column-reduce examples:** `TCOLSUM`, `TCOLMAX`, `TCOLARGMAX`.
+
+**Key geometric fact:** a **512 B strip** spans a **contiguous run** of row-major storage; for **large C**, one row may span **multiple strips**; for **small C**, one strip may hold **multiple partial rows**. The control FSM computes `(row, col)` range per strip from `(R, C, E)`.
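+
+The per-strip `(row, col)` range follows directly from the row-major layout. The sketch below uses a hypothetical helper `strip_span`; it assumes a 4096 B tile, 512 B strips, and `E` bytes per element, and it models only the address arithmetic, not the control FSM itself.
+
+```python
+def strip_span(R, C, E, gs):
+    """First and last (row, col) stored in strip gs of an R x C tile."""
+    assert R * C * E == 4096, "tile must fill 4096 bytes"
+    elems_per_strip = 512 // E
+    first = gs * elems_per_strip
+    last = first + elems_per_strip - 1
+    return divmod(first, C), divmod(last, C)
+
+
+# One row per strip (FP32 8x128).
+print(strip_span(8, 128, 4, 3))    # ((3, 0), (3, 127))
+# Small C: four whole rows per strip (FP32 32x32).
+print(strip_span(32, 32, 4, 1))    # ((4, 0), (7, 31))
+# Large C: one row spans four strips (FP32 2x512).
+print(strip_span(2, 512, 4, 1))    # ((0, 128), (0, 255))
+```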
+
+**Scheduling:** Reduce passes are driven by a **strip read calendar** keyed by **`fiber_id = r`** or **`c`** (**§4.3**): each cycle’s **Rd0/Rd1** transactions determine **operand sources**; partials retire to **Acc** on the matching **`fiber_id`**. **`TCOL*`** obeys **§5.3.2** (**no transpose scratchpad**; TRegFile **no gather** — **strip replay** + **VEC column select** + **Acc RMW**).
+
+#### 5.3.1 Row-wise reduce (e.g. `TROWSUM`)
+
+For each **row r**, compute `acc[r] = reduce_{c} M[r,c]`.
+
+**Phase A – partial reduce within strip:**
+
+- For strips that contain **multiple columns of the same row segment**, use **horizontal SIMD tree** within the strip (cross-lane inside SIMD).
+
+**Phase B – cross-strip combine for rows spanning strips:**
+
+- Strips contributing to the same row feed a **segmented reduction network** or write **partial sums** to **Acc** (logical index `r`, physical slot **`ρ`** when **`N_run < R`**, **§9.3.2**), then **second pass** reads back when row complete. **Logical** depth (**rAccB** / **rStgUB**) is in **§9.3.1**; **legal shapes** in **§9.7**; **running** silicon = **`N_run`** DFF entries.
+
+**Cycle sketch (conceptual):**
+
+| Phase | Action |
+|-------|--------|
+| 1 | For each strip `s` in 0…7: read tile strip, compute **strip-partial** per affected row segment → **Acc** RMW (atomic add/max/…) |
+| 2 | If all columns of row seen, **finalize** row result to narrow format |
+| 3 | For `TROWEXPAND`-style output of scalar-per-row, stream writes; for true reduce producing **R×1** or packed vector tile, **write compact tile** over multiple cycles |
+
+**`TROWARGMAX` / `TROWARGMIN`:** each strip produces **(value, col_index)** pairs; cross-strip compare selects winner; **cross-lane compare tree** + **index mux**.
+
+```text
+  Strip0 partial ──┐
+  Strip1 partial ──┼──▶ Per-row combine (max/sum) ──▶ Row result tile
+  ...              │
+  Strip7 partial ──┘
+        ▲
+        └── updates keyed by row id (cross-strip)
+```
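+
+Phases A and B can be modelled in a few lines. This is a sketch under assumptions: `trowsum` is a hypothetical name, the payload is a flat row-major list of integers, and the reduction is a sum (a max variant would swap the operator). Each strip first forms one partial per row it touches, then folds that partial into `acc[r]`.
+
+```python
+def trowsum(flat, R, C, E):
+    """Row sums of a row-major tile, walked strip by strip."""
+    assert R * C * E == 4096 and len(flat) == R * C
+    elems_per_strip = 512 // E
+    acc = [0] * R
+    for gs in range(8):
+        base = gs * elems_per_strip
+        strip = flat[base:base + elems_per_strip]
+        # Phase A: one partial per row segment present in this strip.
+        partial = {}
+        for i, x in enumerate(strip):
+            r = (base + i) // C
+            partial[r] = partial.get(r, 0) + x
+        # Phase B: fold each partial into the accumulator keyed by row id.
+        for r, p in partial.items():
+            acc[r] += p
+    return acc
+
+
+flat = list(range(1024))
+print(trowsum(flat, 2, 512, 4))   # [130816, 392960]
+```
+
+The same function covers the small-`C` case (several rows per strip) and the large-`C` case (one row across several strips) because the row id is derived per element.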
+
+#### 5.3.2 Column-wise reduce (e.g. `TCOLSUM`, `TCOLMAX`)
+
+**Architectural rule (VEC-4K):** **`TCOL*`** **must** be implemented on the **operand tile’s native row-major** layout as seen through **normal TRegFile** `reg_idx` / **512 B strip** reads. **No transpose scratchpad tile** — software and microcode **shall not** rely on materializing a **C×R** row-major **copy** of the operand in a separate tile **for the sole purpose of column reduction**.
+
+**Parallelism — `N_tree` vs `N_acc` (hardware):**
+
+| Symbol | Meaning |
+|--------|---------|
+| **`N_tree`** | Parallel **adder / reduce** (or **elementwise**) **paths** after the **crossbar** (**§4.1**). **Reference implementation:** **`N_tree = 128`**. Scheduling **examples** below may use **smaller** **illustrative** values. |
+| **`N_acc`** | **Distinct** **`Acc[·]`** **RMW** (or **commit**) slots **in the same cycle** — limited by **`N_tree`**, **adder feedback** **ports**, and **`N_run`** (**§9.3.2**): **`N_acc ≤ N_run`** (**512**); **one ping-pong half** holds **256** slots (**1024 B**). |
+
+**Coupling:** On a **single** beat, **at most `min(N_tree, N_acc)`** columns can **both** **combine** fresh partials **and** **retire** to **Acc** in the **same** cycle if **tree output** must **pair** **1:1** with an **accumulator write**.
+
+**When `N_acc > N_tree`:** **Trees** are the **scarce** resource; microcode may **time-multiplex** **`N_tree`** trees across **sub-cycles** while **keeping `N_acc`** **accumulators** **live** (pipeline registers + **Acc**). Once **pipelined**, the **schedule** can still **advance up to `N_acc` column accumulations per cycle** by **feeding** accumulators from **staggered** tree outputs.
+
+**Accumulator-centric wave count** (column **bands** sized by **Acc** retire slots):
+
+**`#waves_acc = ⌈C / N_acc⌉`**
+
+(i.e. **`num_col / num_acc`**, rounded up): each **wave** is one **complete** **row-major** scan of the operand tile, updating a **disjoint band** of **at most `N_acc`** columns’ **`Acc[c]`** toward the final **R**-way reduce.
+
+**Tree-limited wave count (`N_tree ≪ N_acc`):** Define **`f`** as the **effective number of distinct `Acc[c]` commits** each **single** **adder tree** can **sustain per full operand tile scan** (one **dual-port** ingest of **all** strips, e.g. **`t = 0…3`** in Example E), **after** **pipelining** and **sub-cycle** multiplex (**`f`** counts **tree→Acc** **throughput** over the **whole** scan — product of **commits per strip-beat** × **number of beats**, **bypass**, and **stage depth**). If **trees** cannot **produce** partials fast enough to **match** **`#waves_acc`**, **more** **tile replays** are needed:
+
+**`#waves_tree = ⌈C / (N_tree · f)⌉`**
+
+**DFF capacity wave count (`N_run`):**
+
+**`#waves_Nrun = ⌈C / N_run⌉`**
+
+**Combined (conservative, full-tile replays for `TCOL*`):**
+
+**`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`**
+
+Treat **`#W`** as a **lower bound** on the **tile-scan / epoch count** for a correct **`TCOL*`** (**`#waves`** is an alias when unambiguous). **`K_outer`** (**§8**, **write staging**) counts outer **fiber-offset** campaigns; it need not equal **`#W`**. Avoid double-counting tile replays unless the micro-op nest (outer × inner strip loop) is specified.
+
+**Redundancy:** **`N_acc ≤ N_run`** (**§9.3.2**) ⇒ **`C / N_acc ≥ C / N_run`** ⇒ **`⌈C / N_acc⌉ ≥ ⌈C / N_run⌉`** for **`C > 0`**, so **`#waves_Nrun`** never exceeds **`#waves_acc`** in that regime. The third **`max`** term is kept for generality (alternate schedules, tooling, or future caps) and equals **`#waves_acc`** when **`N_acc = N_run`**.
+
+When **`N_tree · f ≥ N_acc`**, **`#waves_tree ≤ #waves_acc`** and **`#W = max(⌈C / N_acc⌉, ⌈C / N_run⌉) = ⌈C / N_acc⌉`** (still **Acc-band** **long pole** if **`N_acc ≤ N_run`**).
+
+**Fiber-capacity rounds (`N_run`):** if **more output fibers** are **live** than **`N_run`** DFF slots allow in one **in-core** pass, or an **extreme `(R,C)`** forces **narrow bands** per **outer** iteration, **opcode decode** expands **one** architectural **`TROW*` / `TCOL*`** into **`K_outer` hardware loops** (nested **outside** the **strip** walk). Each **outer** iteration carries a **fiber base / offset** (`c_base`, `r_base`, …) so **only** **`≤ N_run`** **logical** fibers map to **physical `ρ`** in **Acc** at a time; the **inner** loop is still the **usual** **row-major** strip calendar (**§4.3**). **Correctness** is preserved by **trading latency for capacity**.
+
+**Write-side staging across outer loops:** **Wr0 / Wr1** (and **Pack** ahead of them) may hold **wide staging registers** — **per-strip** or **per-output-slice** — where **partial result tiles** or **completed fiber bands** from **earlier** outer iterations **accumulate** or **merge** until the **full** **`R`×`C`** (or **per-fiber vector**) output is **complete**. For **associative** reduces (**sum** / **max** / **min**), **merge** order must match **ISA numerics** (e.g. **sum** associativity vs **rounding**); **arg** ops need **value∥index** staging consistent across rounds (**§9.3.2**). This is **invisible** at **single-instruction retire** if the **VEC** does **not** commit **`dst`** until the **last** outer loop.
+
+**When `N_acc ≤ N_tree`:** **Acc** is the **bottleneck** for **parallel column retires**; **`#waves_acc`** **still** applies; **`min(N_tree, N_acc) = N_acc`** caps **simultaneous** combine+retire per beat unless **pipelining** exposes extra **tree** throughput. **Also** evaluate **`#waves_tree`** if **`f`** is **small**.
+
+**Design shorthand:** **`P_beat = min(N_tree, N_acc)`** for **strict** same-cycle **tree→Acc** pairing; **tile-replay** uses **`#W`** above — see **§4.4 Example E**.
+
+**`TROW*` mirror (row-axis output fibers, `fiber_id = r`):** same **symbols** **`N_tree`**, **`N_acc`**, **`N_run`**, **`f`** (**`f`** = **effective `Acc[r]` commits per tree per full operand scan** on the **row-reduce** calendar). **Replace `C` → `R`:**
+
+**`#W_trow = max(⌈R / N_acc⌉, ⌈R / (N_tree · f)⌉, ⌈R / N_run⌉)`**
+
+The **§5.3.1 Phase B** strip-walk replays apply. **`⌈R / N_run⌉`** is redundant with **`⌈R / N_acc⌉`** when **`N_acc ≤ N_run`**, as for **`TCOL*`**.
+
+For each **column c**, reduce across **rows** (**R** elements). In **row-major** storage, a column is **not** one contiguous byte range unless **`C = 1`**. **Implementation:** TRegFile delivers **only** full **512 B** strips (**§3.1** — **no gather**). Hardware **re-reads** the operand tile’s strips in **Gs** order (possibly **many TRegFile epochs** with the **same** `reg_idx`); after each read, **strip buffers A/B** hold **row-contiguous** data; **unpack** + **column mux** (VEC-side, **not** in the RF) extracts **`(r,c)`** for the **column band** scheduled that beat; **Acc[`c`] read–modify–write** accumulates partial sums until all **R** rows are covered. If **`N_acc < C`**, **repeat** the **full strip walk** for the next column band (**§4.4 Example E**). **Blocked partials** and **multi-cycle** calendars (**§4.3**) apply.
+
+**Metrics parity (§9):** Closed-form **`c*`** symbols match the **mathematical** **`TROW*`-on-`C×R`** substitution (**transpose-equivalent indices only**). That **algebra** does **not** imply a **physical** transpose buffer — it sizes **trees**, **`cS`**, **`cW`**, and **SRAM**; **wall-clock** must add **strip replay** and **`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`** **full-tile** **epochs** (**`f`**, **`N_run`** per **§5.3.2** / **§9.3.2**).
+
+**Cross-strip / cross-lane:** still heavy when **`col_B`** is small (many columns’ samples packed per strip); see **`c*`** in **§9.3** and **§9.6** (§9.7 lists **shapes** only).
+
+---
+
+### 5.4 Axis Expand / Broadcast
+
+**Representative:** `TROWEXPAND*`, `TCOLEXPAND*`.
+
+**Scheduling:** Expands use the same **`fiber_id`** convention (**§4.3**): **`r`** or **`c`** selects **`v[fiber_id]`**; the **calendar** interleaves reads of the **narrow `v` tile** (or **Acc**-backed **`v`**) with **`src`** strips so each **512 B** write sees correct **splat** metadata per lane.
+
+**Row expand** (broadcast scalar along row): after computing or loading **per-row scalar** `v[r]`, for each strip determine row segments and **broadcast** `v[r]` across lane positions (SIMD **splat**).
+
+**Column expand:** same **no-transpose-scratch** rule as **`TCOL*`** (**§5.3.2**): **row-major** strip walk + **multi-pass splat** / **lane scatter** with **column address generation**; **no** **`C×R`** scratch tile **only** for expand.
+
+**Cycles:** often **1× read** of narrow **per-row/col vector tile** + **1× read** of `src` + **streamed write** of `dst`, typically **8–16** cycles in total depending on whether `src` and `v` fit the strip schedule without extra passes.
+
+```mermaid
+flowchart TB
+  subgraph row_expand [TROWEXPANDADD sketch]
+    V["Per-row scalars v[r] in buffer"]
+    S[src tile strips]
+    V --> SPLAT[Splat per row segment]
+    S --> ADD[Add/mul/max...]
+    SPLAT --> ADD
+    ADD --> D[dst strips]
+  end
+```
+
+---
+
+### 5.5 Complex Instructions
+
+#### 5.5.1 `TSORT32`
+
+Spec: sort **each 32-element block** of `src` with paired indices from `idx`.
+
+- **Within-strip:** if `32·E ≤ 512 B`, a strip holds multiple blocks; process them **in parallel SIMD sort networks** (bitonic / odd-even), each with O(log² 32) comparator stages, **pipelined**.
+- **Cross-block:** independent per block → **minimal cross-strip** except when a 32-block spans strip boundary → **microcode** stitches **tail/head** in a **staging register**.
+
+**Cycles:** **many** (tens), dominated by comparator stages × number of blocks `N/32`.
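+
+To make the stage count concrete, the sketch below runs a textbook bitonic network on one 32-element block and counts its compare stages; a bitonic network on `2^k` inputs has `k(k+1)/2` stages, so 15 for a block of 32. The function name and the tie-break rule (equal values ordered by index, via tuple comparison) are assumptions for illustration; the ISA may define ties differently.
+
+```python
+def bitonic_sort_pairs(values, idx):
+    """Sort (value, index) pairs ascending; return (pairs, stage_count)."""
+    n = len(values)
+    assert n & (n - 1) == 0, "length must be a power of two"
+    items = list(zip(values, idx))
+    stages = 0
+    k = 2
+    while k <= n:                 # size of the bitonic sequences being merged
+        j = k // 2
+        while j >= 1:             # one compare-exchange stage per j
+            for i in range(n):
+                partner = i ^ j
+                if partner > i:
+                    ascending = (i & k) == 0
+                    if (items[i] > items[partner]) == ascending:
+                        items[i], items[partner] = items[partner], items[i]
+            stages += 1
+            j //= 2
+        k *= 2
+    return items, stages
+
+
+values = [(7 * i) % 32 for i in range(32)]   # a permutation of 0..31
+pairs, stages = bitonic_sort_pairs(values, list(range(32)))
+print(stages)      # 15
+print(pairs[:3])   # [(0, 0), (1, 23), (2, 14)]
+```
+
+All compare-exchanges inside one stage touch disjoint element pairs, which is what lets hardware evaluate a stage in parallel and pipeline successive stages.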
+
+#### 5.5.2 `TMRGSORT` (merge sort of multiple sorted lists)
+
+Typically **multi-list merge** with **k-way** comparator tree:
+
+1. **Load** list headers / pointers (implementation-defined in ISA).
+2. **Stream** strips from each list into **merge front buffers** (read ports time-multiplexed across lists).
+3. **Repeat:** compare **k** front elements, pick winner, push to **output strip**, refill from corresponding list.
+4. **Write** output strips via write ports.
+
+**Cross-lane / cross-strip:** **heavy**; merge **global** across lists, not SIMD-embarrassingly parallel. Expect **O(4096 / 1024) × (merge depth)** plus **compare tree** cycles—**hundreds** of cycles acceptable for a “complex” op.
+
+```text
+  List0 strips ──▶ ┐
+  List1 strips ──▶ ├──▶ k-way merge tree ──▶ out strip buffer ──▶ Wr ports
+  ...             │
+  List(k-1) ────▶ ┘
+```
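+
+The four steps above can be sketched in software as follows. This is only a functional model: a binary heap stands in for the k-way comparator tree, `kway_merge` is a hypothetical name, and ties between lists are broken by list number as an assumption.
+
+```python
+import heapq
+
+
+def kway_merge(lists, elems_per_strip):
+    """Merge sorted lists; return the output as strip-sized chunks."""
+    # Front buffers: (front value, list id, position in that list).
+    fronts = [(lst[0], k, 0) for k, lst in enumerate(lists) if lst]
+    heapq.heapify(fronts)
+    out_strips, current = [], []
+    while fronts:
+        value, k, pos = heapq.heappop(fronts)    # pick the winner
+        current.append(value)                    # push to the output strip
+        if len(current) == elems_per_strip:
+            out_strips.append(current)           # strip full: hand to write port
+            current = []
+        if pos + 1 < len(lists[k]):              # refill from the same list
+            heapq.heappush(fronts, (lists[k][pos + 1], k, pos + 1))
+    if current:
+        out_strips.append(current)               # final partial strip
+    return out_strips
+
+
+print(kway_merge([[1, 4, 7], [2, 5, 8], [0, 3, 6, 9]], 4))
+# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
+```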
+
+#### 5.5.3 `TGATHER` / `TGATHERB` / `TSCATTER`
+
+- **Index-driven** access: per element, **`addr = base + f(index)`**; within VEC-4K, **on-tile** gather means **cross-strip byte mux** driven by **index SIMD** (indices may come from second tile).
+- Realistic schedule: **batch** indices into **coalesced** groups that fall into **same or adjacent strips** to limit mux fanout.
+
+**Cycles:** **large variance**; worst case approaches **per-element** serialization if indices are random.
+
+#### 5.5.4 `TCI`, `TTRI`, `TPART*`
+
+- **`TCI`:** **strip-parallel** index generation `base + stride` → **no cross-lane** dependency beyond broadcast of parameters.
+- **`TTRI`:** row/col counters compared to generate mask; **cross-lane** for diagonal boundary within strip only.
+- **`TPART*`:** valid-region mask intersects elementwise regions; same as §5.1 with **predicate gating**.
+
+#### 5.5.5 `TQUANT` / `TDEQUANT`
+
+Often **two-phase**: compute **scale/exp** per tile or per row (reduce), then **elementwise** scale. Combines **§5.3** + **§5.1**.
+
+---
+
+## 6. Cross-Lane and Cross-Strip Summary
+
+| Category | Cross-lane (within 512 B strip) | Cross-strip (among 8 strips) |
+|----------|----------------------------------|------------------------------|
+| Elementwise tile–tile | Independent lanes | None (strip order arbitrary) |
+| Tile–scalar | Independent | None |
+| Row reduce | Horizontal tree for row segment in strip | Combine partials for rows spanning strips |
+| Column reduce | **Strip read** → **VEC column mux** from row data in buffer (RF **no gather**) | **Heavy** strip **replay** / **`Acc` RMW** / multi-epoch (**no transpose scratch**) |
+| Row expand | Splat scalar across row segment | Repeat/broadcast metadata per strip |
+| Column expand | Partial splat (**column-major intent**, row-major storage) | **Heavy** multi-pass (**no transpose scratch**) |
+| `TSORT32` | Sort network per 32-block | Block spanning strip boundary |
+| `TMRGSORT` | Per-element compare in tree | **Global** merge across streams |
+| `TGATHER` | Mux selected elements | Arbitrary strip sources |
+
+---
+
+## 7. Datapath Diagram — Row Reduce with Cross-Strip Combine
+
+**Read sequencing** follows a **strip calendar** (**§4.3**). **§4.1** **dataflow:** **Rd0+Rd1** → **crossbar** (**1024 B**) → **(A) align/unpack/permute** → **128 slices `W_prep,i`** → **(B) 128×(ALU `W_ALU,i` → tree `W_tree,i`)** → **Acc** ping-pong → **half-select** → **Wr0+Wr1** (**`W_ALU,i` may exceed `W_tree,i`**).
+
+```mermaid
+flowchart TB
+  subgraph rf [TRegFile]
+    RD0[Rd0 512B]
+    RD1[Rd1 512B]
+  end
+  subgraph ctl [Control]
+    C[Opcode + shape]
+  end
+  subgraph xb [Crossbar]
+    XB[1024 B in]
+  end
+  subgraph modA ["(A) Align unpack permute"]
+    A["IN 1024 B OUT 128 x W_prep,i"]
+  end
+  subgraph modB ["(B) 128 groups N_group"]
+    G["Per i: slice W_prep,i to ALU to tree W_ALU,i then W_tree,i 32b typical"]
+  end
+  subgraph acc [Acc ping-pong]
+    ACCMEM["256x32bx2 RMW or bypass to DFF"]
+  end
+  subgraph out [Retire]
+    PACK[Pack]
+    WR[Wr0 Wr1 512B]
+  end
+  RD0 --> XB
+  RD1 --> XB
+  C --> XB
+  C --> A
+  C --> G
+  C --> ACCMEM
+  XB --> A
+  A --> G
+  G -->|W_tree to Acc rho| ACCMEM
+  ACCMEM -->|256 words| PACK --> WR
+```
+
+---
+
+## 8. Implementation Notes
+
+1. **Opcode decode** produces **control** for the **§4.1** **crossbar**, **(A)** **align/unpack/permute** (**per-slice** **`W_prep,i`**), **(B)** **128** **groups** (**ALU** **`W_ALU,i`**, **tree** **`W_tree,i`**, **`W_ALU,i ≥ W_tree,i`** **allowed**), **Acc** ping-pong **addresses**, **per-slot** **RMW** **vs** **bypass-to-DFF** (**§9.3.2**), **Wr half-select**, and a **strip read calendar** (**§4.3**): per-cycle **Rd0/Rd1** targets, **strip index** phase, **`fiber_id`** / **Acc** side effects. Parameters include strip loop count, **`TCOL*`** **wave** / **`N_acc`** / **`N_tree`** / **`f`** (§5.3.2), **`N_run`** / **`ρ` remap**, **`K_outer`**, **write-side staging**, **splat** / merge **k**, §9 **`r*`** / **`c*`** template id (**47** families). **`TCOL*`** **replays** over **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)`** when **`#W > 1`**; **no transpose-scratch**; **no RF gather** (**§3.1**).
+2. **Determinism:** PTO ops are expected to be **deterministic** at the tile level; multi-cycle internal scheduling is **invisible** if the instruction **retires atomically** from the programmer’s view (barriers via **`TSYNC`** as needed).
+3. **Resource conflicts:** with only **two** read ports, **TMRGSORT** and **column-reduce** should **stall** other TRegFile clients or use **dedicated tiles** for **algorithmic** ping-pong (e.g. sort lists) — **not** for a **transpose scratchpad** forbidden by **§5.3.2**.
+4. **Numerics:** FP16 / BF16 reductions are evaluated with an **FP32-widened accumulator** and rounded per `TCVT` / ISA rules on retire.
+
+---
+
+## 9. Legal `(format, R×C)` enumeration and axis-reduce complexity (`TROW*` / `TCOL*`)
+
+This section **enumerates every** combination of **logical format** and **tile shape** from §2.1 and, for each, gives **paired** metrics for **row-axis** reductions (`TROWSUM`, `TROWMAX`, `TROWARGMAX`, …) and **column-axis** reductions (`TCOLSUM`, `TCOLMAX`, `TCOLARGMAX`, …). It then explains how a **single reconfigurable reduction tree** morphs with those parameters, and counts how many distinct **control shapes** appear in the table.
+
+### 9.1 Enumeration rules
+
+- Tile storage: **4096 bytes**, row-major, **R** and **C** powers of two.
+- **N = R·C = 4096 / E** with **E** bytes per logical element (**only two supported storage widths**):
+  - **FP32:** `E = 4`, `N = 1024`, **11** shapes.
+  - **FP16** and **BF16:** `E = 2`, `N = 2048`, **12** shapes each (**24** table rows).
+
+**Master table rows:** **35**. **Unique `(E, R, C)` geometries:** **23** (FP16 and BF16 share shapes).
+
+`elem_per_strip = 512 / E` — **128** FP32 elements or **256** FP16 / BF16 elements per 512 B strip.
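+
+The enumeration rules above can be checked mechanically. The following is a minimal sketch, assuming Python as neutral notation (the document defines no reference code); `legal_shapes` is a hypothetical helper name, not part of any VEC-4K tooling.
+
+```python
+TILE_BYTES = 4096
+FORMATS = {"FP32": 4, "FP16": 2, "BF16": 2}  # E, bytes per logical element
+
+def legal_shapes(elem_bytes):
+    """Return every power-of-two (R, C) with R * C * E == 4096."""
+    n = TILE_BYTES // elem_bytes
+    shapes = []
+    r = 1
+    while r <= n:
+        shapes.append((r, n // r))
+        r *= 2
+    return shapes
+
+# One master-table row per (format, R, C); geometries ignore the format name.
+rows = [(fmt, r, c) for fmt, e in FORMATS.items() for r, c in legal_shapes(e)]
+unique = {(FORMATS[fmt], r, c) for fmt, r, c in rows}
+
+print(len(legal_shapes(4)), len(legal_shapes(2)))  # shapes per E
+print(len(rows), len(unique))                      # master rows, unique (E, R, C)
+```
+
+Under these rules the script reports 11 and 12 shapes, 35 master rows, and 23 unique geometries, matching the counts stated above.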
+
+### 9.2 Row-axis metrics (`TROW*`)
+
+For each **row** fiber, reduce **C** elements. **Bytes per row** `row_B = 4096/R`.
+
+| Sym | Definition |
+|-----|------------|
+| **rS** | Strips per row `= ⌈row_B / 512⌉`. |
+| **rK** | Elements in one cross-lane segment: `C` if `rS = 1`, else `512/E`. |
+| **rDl** | Cross-lane depth `= max(0, ⌈log₂ rK⌉)`. |
+| **rDc** | Cross-strip depth `= max(0, ⌈log₂ rS⌉)`. |
+| **rW** | Per-strip serial work: `rDl` if `row_B ≥ 512`, else `(512/row_B)·rDl`. |
+| **rLB** | `4 + rDl + rDc` (optimistic; §9.4). |
+| **rUB** | `4 + 8·rW + R·rDc` (conservative serial tree; §9.4). |
+| **rAccB** / **rStgUB** | Partial state (bytes): §9.3.1 — **`4·R`** logical running; **`4·R·rS`** staged upper bound; **physical running** **`N_run`** (**§9.3.2**). |
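+
+The row-axis symbols in the table can be evaluated directly from `(E, R, C)`. This is a sketch under the table's definitions only; the function name `row_metrics` and the dictionary layout are assumptions made for illustration.
+
+```python
+from math import ceil, log2
+
+def row_metrics(E, R, C):
+    """Evaluate the §9.2 row-axis symbols for one legal (E, R, C)."""
+    row_B = 4096 // R                       # bytes per row
+    rS = ceil(row_B / 512)                  # strips per row
+    rK = C if rS == 1 else 512 // E         # elements per cross-lane segment
+    rDl = max(0, ceil(log2(rK)))            # cross-lane depth
+    rDc = max(0, ceil(log2(rS)))            # cross-strip depth
+    rW = rDl if row_B >= 512 else (512 // row_B) * rDl
+    return {
+        "rS": rS, "rK": rK, "rDl": rDl, "rDc": rDc, "rW": rW,
+        "rLB": 4 + rDl + rDc,
+        "rUB": 4 + 8 * rW + R * rDc,
+        "rAccB": 4 * R,
+        "rStgUB": 4 * R * rS,
+    }
+
+print(row_metrics(4, 1, 1024))   # FP32 1x1024: rS = 8, rLB = 14, rUB = 63
+print(row_metrics(2, 512, 4))    # FP16 512x4:  rW = 128, rUB = 1028
+```
+
+The second call reproduces the peak `UB = 1028` that the document history attributes to FP16 `512×4`.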
+
+### 9.3 Column-axis metrics (`TCOL*`)
+
+For each **column** fiber, reduce **R** elements. **Logical bytes per column** (if packed contiguously) `col_B = 4096/C = R·E`.
+
+**Formal substitution (metrics only — not a scratch layout):** Algebraically, `TCOL*` on **R×C** matches `TROW*` on a **fictitious C×R** row-major tile with the **same** **4096 B** element multiset. **VEC-4K does not allocate a physical C×R transpose scratchpad** for **`TCOL*`** (**§5.3.2**); the **`c*`** formulas still size **trees**, strip pressure, and **SRAM** for **strip replay + VEC column mux** (**TRegFile has no gather**, **§3.1**). The symbols are the **same** as §9.2 with **`(R,C) → (C,R)`** and **`C` ↔ `R`**:
+
+| Sym | Definition |
+|-----|------------|
+| **cS** | `⌈col_B / 512⌉` (= strips per column fiber in the **substitution-equivalent** striping model). |
+| **cK** | `R` if `cS = 1`, else `512/E`. |
+| **cDl** | `max(0, ⌈log₂ cK⌉)`. |
+| **cDc** | `max(0, ⌈log₂ cS⌉)`. |
+| **cW** | `cDl` if `col_B ≥ 512`, else `(512/col_B)·cDl`. |
+| **cLB** | `4 + cDl + cDc`. |
+| **cUB** | `4 + 8·cW + C·cDc` (note **`C`** column outputs, not `R`). |
+| **cAccB** / **cStgUB** | Partial state (bytes): §9.3.1 — **`4·C`** logical running; **`4·C·cS`** staged upper bound; **physical running** **`N_run`** (**§9.3.2**). |
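+
+The formal substitution can be expressed as one axis-neutral function called with the roles of `R` and `C` swapped. This is an illustrative sketch; `axis_metrics`, `trow`, and `tcol` are hypothetical names, and the code models metrics only, not any physical transpose.
+
+```python
+from math import ceil, log2
+
+def axis_metrics(E, n_fibers, fiber_len):
+    """Metrics for reducing n_fibers fibers of fiber_len elements each."""
+    fiber_B = fiber_len * E                 # row_B (= C*E) or col_B (= R*E)
+    S = ceil(fiber_B / 512)
+    K = fiber_len if S == 1 else 512 // E
+    Dl = max(0, ceil(log2(K)))
+    Dc = max(0, ceil(log2(S)))
+    W = Dl if fiber_B >= 512 else (512 // fiber_B) * Dl
+    return {"S": S, "K": K, "Dl": Dl, "Dc": Dc, "W": W,
+            "LB": 4 + Dl + Dc, "UB": 4 + 8 * W + n_fibers * Dc}
+
+def trow(E, R, C):
+    return axis_metrics(E, n_fibers=R, fiber_len=C)
+
+def tcol(E, R, C):
+    return axis_metrics(E, n_fibers=C, fiber_len=R)   # (R, C) -> (C, R)
+
+# TCOL* on FP16 4x512 has the same metrics as TROW* on the fictitious 512x4 tile.
+print(tcol(2, 4, 512) == trow(2, 512, 4))
+# Square tiles have identical row and column metrics.
+print(trow(4, 32, 32) == tcol(4, 32, 32))
+```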
+
+**Row-major hardware path:** scheduling uses the **same numeric** `(cS, cK, cDl, cDc, cW)` for **strip-sequential reads**, **partial state**, **VEC column extraction**, and **cross-strip merge** on the **operand tile**; **cLB/cUB** **exclude** **TTRANS** / **transpose-scratch** (**§5.3.2**) and **exclude** **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)`** **tile replays**, **`f`**, and **`N_run`** (**§5.3.2**). **Multi-epoch** **`reg_idx`** replay (**§4.4 Example E**) may dominate wall-clock.
+
+#### 9.3.1 Partial accumulator state (`TROW*` / `TCOL*`)
+
+Strips that contribute to the **same** row (or column) either **update a running partial** or **buffer strip-level partials** until the cross-strip merge completes. **§9.3.1** gives **logical** byte formulas; **§9.7** lists **(format, R×C)** rows only; **§9.3.2** caps **live** running entries in silicon.
+
+**Assumption A — associative reduce (max / min / sum):** each output fiber keeps **one** **FP32-shaped** running partial (widen narrow formats in the reducer). Index **r** for rows, **c** for columns.
+
+| Symbol | Formula | Meaning |
+|--------|---------|---------|
+| **rAccB** | **`4·R`** | **Logical** per-row state (bytes): `R` rows × **4 B** (FP32 partial / compare operand width). **Physical** running file holds **`min(R, N_run)`** slots at a time when **`N_run < R`** (**wave** remap, **§9.3.2**). |
+| **cAccB** | **`4·C`** | **Logical** per-column state (bytes). **Physical** **`min(C, N_run)`** at a time when **`N_run < C`**. |
+
+**Implementation cap: `N_run = 512`.** The **VEC-4K** running-partial file is **512 × 32b DFF** (**2048 B**), i.e. **4×** fewer slots than the **§2.1** worst-case **2048** logical fibers (FP16 / BF16 `2048×1` or `1×2048`; **2048 / 512 = 4**). When **`R > N_run`** or **`C > N_run`**, decode drives **`K_outer > 1` hardware loops** (**§5.3.2**): each outer step maps **at most `N_run`** fibers to **`ρ`** and runs the full inner strip schedule (or a defined subset); completed bands retire through **Pack → Wr0/Wr1**, often into **write-path staging registers** that hold or combine partial **`dst`** slices until all offsets are done. The result is functionally equivalent to an unbounded Acc file, at higher cycle cost.
+
+**Same mechanism for `N_acc` / tree limits:** **outer loops** are **not** only for **`N_run`**; **banded `TCOL*`** (**`#W`** in **§5.3.2**) is the **same** pattern — **time** for **capacity**. **Write staging** is optional when each **wave** writes **disjoint** **`dst`** fibers directly; it is **required** when **outer** rounds must **merge** into **shared** output **strip** words or **one** **fiber** is **finished** only after **multiple** passes.
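+
+The `#W` replay count quoted in §9.3 trades time for capacity across three independent limits. The sketch below evaluates that formula; the function name is hypothetical, and the example values of `N_acc`, `N_tree`, and `f` are illustrative assumptions, not figures fixed by this section.
+
+```python
+from math import ceil
+
+def tcol_waves(C, n_acc, n_tree, f, n_run=512):
+    """#W = max(ceil(C/N_acc), ceil(C/(N_tree*f)), ceil(C/N_run))."""
+    return max(ceil(C / n_acc), ceil(C / (n_tree * f)), ceil(C / n_run))
+
+# Illustrative parameters only (assumed): N_acc = 512, N_tree = 128, f = 2.
+print(tcol_waves(2048, n_acc=512, n_tree=128, f=2))  # tree-limited: 8 replays
+print(tcol_waves(256, n_acc=512, n_tree=128, f=2))   # fits in one pass: 1
+```
+
+Whichever term is largest identifies the binding resource: Acc capacity, tree commit rate, or the running-partial file.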
+
+**Assumption B — staged strip partials (upper bound):** microarchitecture retains **up to one FP32 partial per strip slot per fiber** before the **`⌈log₂ S⌉`** cross-strip tree drains them (worst case over all fibers simultaneously).
+
+| Symbol | Formula | Meaning |
+|--------|---------|---------|
+| **rStgUB** | **`4·R·rS`** | **Row-axis staging upper bound** (bytes). Never exceeds **8192 B (8 KiB)** for any legal `(E,R,C)` in §2.1: the peak coincides with **rAccB** when `rS = 1` (FP16 / BF16 `2048×1`), while wide rows give only **4·1·8 = 32 B** (`R = 1`, `rS = 8`). |
+| **cStgUB** | **`4·C·cS`** | **Column-axis** analogue; peak **8 KiB**. |
+
+**`TROWARGMAX` / `TROWARGMIN` / `TCOLARG*`:** plan for **value + index** per fiber (e.g. **8 B** aligned entries). A simple scaling rule: **≈ `2 × rAccB`** / **`2 × cAccB`** (and **×2** on **rStgUB** / **cStgUB** if indices are kept per staged partial).
+
+**Dual-axis overlap:** a single physical **Acc** file can be **time-multiplexed** between row and column passes; **simultaneous** row+column reductions need enough **live** slots for both (**≤ `N_run`** each if **serialized** bands) **or** **serialized** op issue. **Transpose scratch** is **not** used to fold axes (**§5.3.2**).
+
+#### 9.3.2 Accumulator organization (ping-pong DFF, `N_run = 512`, §4.1)
+
+The **running partial** store matches **§4.1**: two **ping-pong halves** of **256 × 32 bit** each (**1024 B** per half, **2048 B** total). Each slot is a **DFF word** with two write modes selected by control:
+
+- **RMW accumulate:** the new partial from **(B)** feeds an adder whose second operand is feedback from the same slot's DFF output; the sum (or max/min compare result) writes back to that DFF (associative reduce path).
+- **Bypass combine (write-through):** new data from **(B)** is muxed directly into the DFF, skipping the accumulate adder. This covers overwrite / initialise / move style updates that need no old + new arithmetic.
+
+One half accepts writes while control may select the other half for **Pack → Wr0+Wr1** (**512 B + 512 B** = **1024 B** = **256** words per retire phase).
+
+**Logical indexing:** **`ρ ∈ [0, N_run)`** with **`N_run = 512`**; e.g. **`ρ = h·256 + σ`** with **half** **`h ∈ {0,1}`** and **`σ ∈ [0,255]`**. **`fiber_id`** (**§4.3**) **remaps** to **`ρ`** across **Acc waves** when **`R` or `C > 512`**.
+
+**Capacity:**
+
+- **`N_run = 512`** FP32-shaped partials across both halves. **`N_acc ≤ N_run`** (**§5.3.2**); at most **256** distinct RMW targets per half per cycle if all writes land in one active accumulation half.
+- Logical **`max(R)`** / **`max(C)`** over §2.1 remains **2048** (peak **rAccB** / **cAccB** = **8192 B**); these figures (**§9.3.1**, **§9.6**) are algorithmic, not physical storage.
+
+**Conflict decode (optional 8-way view):** **`bank_id = ρ mod 8`**, **`word = ρ >> 3`** remains useful for port scheduling (equivalent to **32** words/bank × **8** banks within each **256-word** half).
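+
+The slot indexing and the optional 8-way bank view can be sketched as follows. `decode_rho` follows the formulas above; `wave_slot` is a hypothetical helper that assumes a plain modulo remap of `fiber_id` across Acc waves, which this section does not specify.
+
+```python
+N_RUN = 512        # running-partial slots across both halves
+HALF_WORDS = 256   # 32-bit words per ping-pong half
+
+def decode_rho(rho):
+    """Split a slot index into half, in-half offset, and the 8-way bank view."""
+    if not 0 <= rho < N_RUN:
+        raise ValueError("rho must lie in [0, N_run)")
+    h, sigma = divmod(rho, HALF_WORDS)       # rho = h*256 + sigma
+    return {"half": h, "sigma": sigma,
+            "bank_id": rho % 8, "word": rho >> 3,
+            "word_in_half": sigma >> 3}      # 0..31 inside one half
+
+def wave_slot(fiber_id):
+    """ASSUMPTION: modulo remap; returns (wave index, rho)."""
+    return divmod(fiber_id, N_RUN)
+
+print(decode_rho(300))   # half 1, sigma 44, bank 4, word 37
+print(wave_slot(1300))   # (2, 276): third wave, slot 276
+```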
+
+**`TROWARG*` / `TCOLARG*` variant:**
+
+- **64-bit** value∥index per slot → double width or a sidecar index RAM (**§4.1** topology unchanged).
+
+**Staged partials (`rStgUB` / `cStgUB`):**
+
+- Separate small buffer or stretched schedule (**§9.3.1** Assumption B), orthogonal to the Acc ping-pong.
+
+**Summary:** **§4.1** **256 × 32 b × 2** ping-pong Acc; **`N_run = 512`**; per-slot **RMW** or **bypass-to-DFF**; **Wr0+Wr1** drain one half (**256** words) per selected phase. **Acc waves** remap **`fiber_id → ρ`** when logical fibers exceed **512**.
+
+### 9.4 Cycle model (both axes)
+
+Both axes assume **§3.2** unary ingest: **4 cycles** minimum to read the full tile with **two** 512 B read ports.
+
+- **Lower bound (*LB*):** ideal overlap of read, **one** wide pipelined `⌈log₂ K⌉`-stage tree, and cross-strip merge; **ignores** time-multiplexing many thin fibers on **one** physical tree.
+- **Upper bound (*UB*):** **8** strips each pay **W** tree-stage units on **one** shared tree, plus **one cross-strip phase per output fiber** (`R` outputs for `TROW*`, `C` outputs for `TCOL*`).
+
+### 9.5 Reconfigurable reduction tree — how the hardware “shape” follows the table
+
+The datapath is **one logical pipeline** reused by all table rows; its **effective shape** is selected by microcode from the **`r*`** or **`c*`** fields.
+
+1. **Unpack** maps a 512 B strip to up to **256** logical lanes (FP16 / BF16) or **128** lanes (FP32) — **physical SIMD** may be narrower; the **logical** tree depth is still **⌈log₂ K⌉**.
+
+2. **Cross-lane tree (variable fan-in K):** implement as **`D_lane = ⌈log₂ K⌉`** stages of **pairwise** reduce ops. **K** jumps with `(format, R, C)`:
+   - **`rK`** (row) depends primarily on **`C`** when `rS=1`, else fixed **`512/E`**.
+   - **`cK`** (column) depends primarily on **`R`** when `cS=1`, else **`512/E`**.
+   - For **rectangular** tiles, **`rK ≠ cK`** in general → row and column ops need **different programmed depths** for the same stored tile.
+
+3. **Cross-strip merger (variable S):** after each strip contributes a **partial**, a **balanced tree** of depth **`⌈log₂ S⌉`** combines partials for the same fiber ID (`rS` or `cS`). **S ∈ {1,2,4,8}** in this enumeration → at most **3** compare stages after lane tree.
+
+4. **Temporal “stretch” (W):** when **`row_B < 512`** (or **`col_B < 512`**), **multiple complete fibers** land in one strip. A **single** lane tree of depth **`D_lane`** must run **`512/row_B`** (or **`512/col_B`**) times **per strip** unless duplicated in silicon → **`W`** scales **linearly** with packed fiber count.
+
+```mermaid
+flowchart LR
+  subgraph strip [Per 512B strip]
+    U[Unpack] --> MUX[Mux K active lanes]
+    MUX --> LT[log2 K-stage lane tree]
+    LT --> P[Partial per fiber ID]
+  end
+  P --> CS[Depth log2 S cross-strip tree]
+  CS --> ACC[Acc ping-pong 256×32b×2 RMW §4.1]
+  ACC --> OUT[Reduced fiber value]
+```
+
+#### 9.5.1 How many distinct “shapes” are needed?
+
+| Counting notion | Value | Meaning |
+|-----------------|------:|---------|
+| **Physical datapaths** | **1** | One reducer suffices if it supports **max K = 256** (FP16 / BF16), **max `D_lane` = 8**, **max `S` = 8** (`D_cross ≤ 3`), with **per-stage bypass** and **programmable lane mask**. |
+| **Unique `(D_lane, D_cross, W_strip)` tuples** | **22** | Distinct **time-scheduling recipes** for **either** axis, over all **23** geometries (FP32 contributes 11, FP16 / BF16 contribute 12; only the trivial `(0,0,0)` tuple is shared). |
+| **Unique `(S, K, D_lane, D_cross)` quartets** | **15** | Coarser strip + tree fingerprint (per axis), after eliminating overlap between FP32 and FP16 quartet sets. |
+| **Unique paired `(row tuple, column tuple)`** | **23** | One pair per **`(E,R,C)`** (FP16 and BF16 share shapes, so BF16 adds no new pairs); **square** shapes have **identical** row and column metrics. |
+
+So: **one** parameterized **tree + cross-strip** unit covers the whole table; firmware/microcode must hold **22** **scheduling templates** (or equivalent parameterized loops) per axis, not **35** different RTL blocks.
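+
+The counts in the table above can be reproduced by brute force over the 23 geometries. This is a sketch using the §9.2 / §9.3 definitions; the helper `axis` and the variable names are assumptions for illustration.
+
+```python
+from math import ceil, log2
+
+def axis(E, fiber_len):
+    """Return (S, K, D_lane, D_cross, W) for fibers of fiber_len elements."""
+    fiber_B = fiber_len * E
+    S = ceil(fiber_B / 512)
+    K = fiber_len if S == 1 else 512 // E
+    Dl, Dc = ceil(log2(K)), ceil(log2(S))
+    W = Dl if fiber_B >= 512 else (512 // fiber_B) * Dl
+    return S, K, Dl, Dc, W
+
+# All unique (E, R, C): R = 2**p, C = N / R, for FP32 (E=4) and FP16/BF16 (E=2).
+geoms = [(E, 1 << p, (4096 // E) >> p)
+         for E in (4, 2) for p in range((4096 // E).bit_length())]
+
+recipes, quartets, pairs = set(), set(), set()
+for E, R, C in geoms:
+    row, col = axis(E, C), axis(E, R)        # row fibers have C elements
+    for S, K, Dl, Dc, W in (row, col):
+        recipes.add((Dl, Dc, W))
+        quartets.add((S, K, Dl, Dc))
+    pairs.add((row[2:], col[2:]))            # paired (D_lane, D_cross, W) tuples
+
+print(len(geoms), len(recipes), len(quartets), len(pairs))
+```
+
+Run as written, this yields 23 geometries, 22 recipes, 15 quartets, and 23 pairs, in agreement with §9.5.1.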
+
+### 9.6 Summary by format (extrema over all legal shapes)
+
+Maxima over **both** axes are **identical** for each format family (swap **R↔C** maps extreme row cases to extreme column cases).
+
+| Format | N | # shapes | max **K** (either axis) | max **D_lane** | max **S** | max **D_cross** | min *LB* | max *LB* | max *UB* (r or c) | max **rAccB** / **cAccB** | max **rStgUB** / **cStgUB** |
+|--------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| FP32 | 1024 | 11 | 128 | 7 | 8 | 3 | 4 | 14 | 516 | 4096 | 4096 |
+| FP16 / BF16 | 2048 | 12 | 256 | 8 | 8 | 3 | 4 | 15 | 1028 | 8192 | 8192 |
+
+**Logical** peak **rAccB** / **cAccB** in the table is **`4·R` / `4·C`** (up to **8 KiB** at **`R` or `C = 2048`**, FP16 / BF16). **VEC-4K silicon** (**§9.3.2**): **running partials** = **`N_run = 512`** entries × **4 B** = **2048 B DFF**; **`R` or `C > 512`** uses **Acc waves** (**§9.3.1**). **rStgUB** / **cStgUB** remain **staging upper bounds** (**≤ 8 KiB** for the supported formats); **physical staging** may be a **small separate buffer** or **longer** microcode schedule.
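+
+The gap between logical partial state and physical DFF storage can be made concrete per shape. A minimal sketch, assuming the `4·R`, `4·R·rS`, and `min(R, N_run)` rules from §9.3.1; `partial_state_bytes` is a hypothetical name, and the `arg_rAccB` entry uses the approximate ×2 scaling rule for value + index entries.
+
+```python
+N_RUN = 512
+
+def partial_state_bytes(R, n_run=N_RUN):
+    """Row-axis partial-state sizing for a tile with R rows (bytes)."""
+    row_B = 4096 // R
+    rS = -(-row_B // 512)                    # ceil(row_B / 512)
+    live = min(R, n_run)                     # slots resident at one time
+    return {
+        "rAccB": 4 * R,                      # logical running state
+        "rStgUB": 4 * R * rS,                # staged upper bound
+        "arg_rAccB": 8 * R,                  # value + index, about 2 x rAccB
+        "live_slots": live,
+        "physical_B": 4 * live,              # DFF bytes actually occupied
+    }
+
+# Peak over every legal row count for the supported formats (R up to 2048).
+peak = max(partial_state_bytes(1 << p)["rStgUB"] for p in range(12))
+print(partial_state_bytes(2048))
+print(peak)
+```
+
+For `R = 2048` the logical state is 8192 B while only 512 slots (2048 B) are live at once, which is the case handled by Acc waves.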
+
+### 9.7 Legal `(format, R×C)` enumeration
+
+**35** rows — same **master-table** row count as **§9.1** (**23** distinct **`(E, R, C)`** geometries; **FP16** vs **BF16** duplicate shapes). **Per-axis metrics** (`r*`, `c*`, **LB/UB**, **rAccB**, **cAccB**, **rStgUB**, **cStgUB**) are defined in **§9.2–9.3**; **format extrema** in **§9.6**.
+
+| Format | E (B/elem) | N | R×C |
+|--------|------------|---|-----|
+| FP32 | 4 | 1024 | 1×1024 |
+| FP32 | 4 | 1024 | 2×512 |
+| FP32 | 4 | 1024 | 4×256 |
+| FP32 | 4 | 1024 | 8×128 |
+| FP32 | 4 | 1024 | 16×64 |
+| FP32 | 4 | 1024 | 32×32 |
+| FP32 | 4 | 1024 | 64×16 |
+| FP32 | 4 | 1024 | 128×8 |
+| FP32 | 4 | 1024 | 256×4 |
+| FP32 | 4 | 1024 | 512×2 |
+| FP32 | 4 | 1024 | 1024×1 |
+| FP16 | 2 | 2048 | 1×2048 |
+| FP16 | 2 | 2048 | 2×1024 |
+| FP16 | 2 | 2048 | 4×512 |
+| FP16 | 2 | 2048 | 8×256 |
+| FP16 | 2 | 2048 | 16×128 |
+| FP16 | 2 | 2048 | 32×64 |
+| FP16 | 2 | 2048 | 64×32 |
+| FP16 | 2 | 2048 | 128×16 |
+| FP16 | 2 | 2048 | 256×8 |
+| FP16 | 2 | 2048 | 512×4 |
+| FP16 | 2 | 2048 | 1024×2 |
+| FP16 | 2 | 2048 | 2048×1 |
+| BF16 | 2 | 2048 | 1×2048 |
+| BF16 | 2 | 2048 | 2×1024 |
+| BF16 | 2 | 2048 | 4×512 |
+| BF16 | 2 | 2048 | 8×256 |
+| BF16 | 2 | 2048 | 16×128 |
+| BF16 | 2 | 2048 | 32×64 |
+| BF16 | 2 | 2048 | 64×32 |
+| BF16 | 2 | 2048 | 128×16 |
+| BF16 | 2 | 2048 | 256×8 |
+| BF16 | 2 | 2048 | 512×4 |
+| BF16 | 2 | 2048 | 1024×2 |
+| BF16 | 2 | 2048 | 2048×1 |
+
+---
+
+## 10. Related Documents
+
+- [`tregfile4k.md`](tregfile4k.md) — **8R/8W** tile RF, **8-cycle epoch**, **`G = (p+e) mod 8`** calendar; **§4.4** binds **R0/R4** to VEC **Port A/B**.
+- [`outerCube.md`](outerCube.md) — MXU / outer product engine (different port count; **not** identical to VEC-4K’s 2R+2W model).
+- [`PTOISA/README.md`](PTOISA/README.md) — authoritative ISA list and per-op references.
+
+---
+
+## Document History
+
+| Version | Date | Notes |
+|---------|------|-------|
+| 0.1 | 2026-04-07 | Initial VEC-4K architecture sketch for 4 KB tiles, 2×512B R/W ports, PTO ISA vector subset |
+| 0.2 | 2026-04-07 | §9: full `(format, R×C)` enumeration, `TROWMAX` tree complexity, `C_lb`/`C_ub` cycle models, master table |
+| 0.3 | 2026-04-07 | Remove §5.3.1a examples; §9 unified row+column table; §9.5 tree morphing + shape counts (1 datapath, 47 recipes, 23 quartets, 50 pairs) |
+| 0.4 | 2026-04-07 | §9.3.1 + table columns **rAccB**/**cAccB**/**rStgUB**/**cStgUB** (per-fiber partial SRAM sizing) |
+| 0.5 | 2026-04-07 | §9.7: Markdown pipe table → compact **HTML** table (`font-size: 0.52em`, scroll wrapper) for preview rendering |
+| 0.6 | 2026-04-07 | §9.3.2 multi-bank accumulator SRAM: **8×1024×32b** default, address map, ports, arg/staging variants |
+| 0.7 | 2026-04-07 | §4.1 / §7 / §9.5 diagrams: **Acc 8-bank** in logical block diagram, row-reduce mermaid, reconfigurable-tree flow |
+| 0.8 | 2026-04-07 | §4.3 **fiber_id** + **strip read calendar** (Rd0/Rd1 per cycle, operand sources, Acc RMW); ties to §5.3–5.4, §8, §9.3.2 / §9.5.1 templates |
+| 0.9 | 2026-04-07 | §4.4 four **epoch-aligned** fiber calendars vs `tregfile4k.md` (**R0/R4**), dual/single-port ingest, **TROWSUM**/**TROWEXPANDADD** examples |
+| 0.10 | 2026-04-07 | §4.4 **+4** examples (**E–H**): **FP8** `TROWSUM` **64×64**, **16×256**; **MXFP4** `TROWSUM` **512×16**; **MXFP4** **32×256** (superseded for **H** in **v0.11**) |
+| 0.11 | 2026-04-07 | §4.4 Example **H** → **`TCOLSUM` MXFP4 32×256** (**`fiber_id = c`**); §4.3 legend **`TCOL*`** cross-ref |
+| 0.12 | 2026-04-07 | **`TCOL*` / `TCOLEXPAND*`**: **no transpose scratchpad** — native row-major + gather/splat; §5.3.2, §6, §8–9, **Example H** rewritten; **`c*`** = metrics-only substitution |
+| 0.13 | 2026-04-07 | **§3.1** TRegFile **no gather**; **`TCOL*`** = strip **replay** + VEC **column mux** + **Acc RMW**; **Example H** multi-epoch / **`P`** bands; §5.3.2, §6, §8–9 aligned |
+| 0.14 | 2026-04-07 | **§5.3.2** **`N_tree`** / **`N_acc`**, **`P_beat=min`**, **`#waves=⌈C/N_acc⌉`**; **Example H** + §4.3/§8/§9 use **`N_acc`** (not lone **`P`**) |
+| 0.15 | 2026-04-07 | **`f`** (commits/tree/full-scan); **`#waves=max(⌈C/N_acc⌉,⌈C/(N_tree·f)⌉)`**; §5.3.2, Example H, §4.3/§8/§9 |
+| 0.16 | 2026-04-07 | §4.4 Example **H**: **illustrative** **`#waves_acc`/`#waves_tree`** split; **Acc-limited** multi-epoch table caption; **tree-limited** extra waves note |
+| 0.17 | 2026-04-07 | **`N_run = 512`**: running partials **8×64×32b DFF** (~**2 KiB**), **16×** vs **8192** peak; **Acc waves** / **`ρ`** remap; §4.1/§4.3/§5.3.2/§8/§9.3–9.7 + mermaid |
+| 0.18 | 2026-04-07 | **§5.3.2** / **§9.3.1** / **§8**: **`K_outer`** hardware loops, **fiber offset**, **Wr staging** to **merge** partial **`dst`** when Acc capacity **<** extreme shape |
+| 0.19 | 2026-04-07 | **§5.3.2**: **`#waves_Nrun`**, **`#W`** **max** **formula**, **`#W_trow`** **`TROW*`**; **`K_outer`** **vs** **`#W`**; **§4.3** / **§4.4 H** / **§8** / **§9.3** **`#W`** **cross-refs**; **§5.3.1** **Acc** **wording** |
+| 0.20 | 2026-04-07 | **§9.7**: drop wide **HTML** **r\*/c\*** **master** table → **Markdown** **4-column** **(Format, E, N, R×C)** only; **cross-refs** **§5.3.2** / **§9.3.1** |
+| 0.21 | 2026-04-07 | **§4.1** **dataflow** diagram: **opcode+shape→control→crossbar** (**1024 B**), **`N_tree=128`**, **Acc** **256×32b×2** **ping-pong** **adder+fb**, **Wr half** → **Wr0+Wr1**; **§3.3** / **§4.2–4.3** / **§7–8** / **§9.3.2** / **§9.5** mermaid |
+| 0.22 | 2026-04-07 | **§4.1** / **§7**: split **(A)** **align/unpack/permute** vs **(B)** **128** **groups**; **W_prep,i** / **W_ALU,i** / **W_tree,i**; **W_ALU ≥ W_tree** |
+| 0.23 | 2026-04-07 | **Acc** **bypass**: **write-through** **to** **DFF** **without** **combine** **adder** (**§4.1**, **§9.3.2**, **§7–8**) |
+| 0.24 | 2026-04-20 | **Format simplification**: drop **FP8 / MXFP4 / HiFP4**; supported storage widths are now **FP32 (E=4)** and **FP16 / BF16 (E=2)** only. §2.1 table, §4.4 examples (D → FP16 `TROWSUM` 128×16; old E–H removed; new **E** = FP16 `TCOLSUM` 16×128 wave illustration), §9.1 / §9.5.1 / §9.6 extrema, §9.7 master table (35 rows / 23 unique shapes); all `FP8`/`FP4` / nibble-packed cross-refs retired. Peak **K = 256**, **D_lane = 8**, **max UB = 1028** (FP16 `512×4`). |
diff --git a/designs/outerCube/vector4k_v2.md b/designs/outerCube/vector4k_v2.md
new file mode 100644
index 00000000..5dbb4782
--- /dev/null
+++ b/designs/outerCube/vector4k_v2.md
@@ -0,0 +1,2546 @@
+# VEC-4K-v2: Vector Unit for 4 KB PTO Tiles with Staging Registers and Per-Port Transpose
+
+## 1. Purpose and Scope
+
+This document specifies **VEC-4K-v2**, an evolution of [`vector4k.md`](vector4k.md) (VEC-4K v0.24) that pairs the same **4 KB tile register file** ([`tregfile4k.md`](tregfile4k.md)) with a **restructured vector execution unit**. The functional ISA subset is unchanged from VEC-4K (§1 of [`vector4k.md`](vector4k.md)): elementwise tile–tile, tile–scalar, axis reduce / expand, and selected complex instructions (**TMRGSORT**, **TSORT32**, **TGATHER**, **TCI**, …).
+
+**What is new in v2 (vs. [`vector4k.md`](vector4k.md)):**
+
+1. **Up to three tile operands per instruction** (`A`, `B` = value tiles; `C` = per-element bitmask) — fetched through the **same** two 512 B read ports in a **variable-length operand-fetch phase** (§3, §6). Full-tile (unmasked) instructions skip the `C` fetch entirely.
+2. **Up to two tile results per instruction** (`D0`, `D1`) — retired through the **same** two 512 B write ports.
+3. **Per-read-port `is_transpose` bit — forwarded to the TRegFile** (**[`tregfile4k.md`](tregfile4k.md) §7**) so that any of `A`, `B`, `C` can be fetched in **row-mode or col-mode at full 512 B/cy** with no additional VEC-side datapath logic (§3.3). The staging register consumes the delivered strips in whichever mode TRegFile produced; element-level fixup for the non-aligned `W` regimes of [`tregfile4k.md`](tregfile4k.md) §7.5 is absorbed by the stage (A) align/unpack/permute block (§5.2). [`tregfile4k.md`](tregfile4k.md) §6 **rule R2** (uniform transpose per epoch) applies as a scheduling constraint — mixed-mode 3-operand instructions pay one extra epoch (§6).
+4. **Tile-level metadata (32 b per tile register):** `shape.x`, `shape.y`, and `format ∈ {fp32, fp16, fp8, fp4}`. The metadata travels with the tile through the staging registers and drives the align/unpack/permute stage and the reduction-tree depth (§2.2, §5).
+5. **Explicit staging registers** between the TRegFile read ports and the 512 B compute datapath: **`SA`, `SB`, `SC` (operand tile staging)**, **`SOP` (opcode + control staging)**, and **`SX`, `SY` (scalar staging)** (§4). Scalar operands may be sourced from the **scalar GPR register file**, from an **immediate** field of the instruction word, from a **tile element** `[r, c]`, or from an **ACC slot**; the source is encoded in the instruction word's `sx_src` / `sy_src` selectors and the corresponding 5-bit GPR index / immediate / tile-element pointer / ACC slot id (§4.3). GPR and immediate scalars are captured at issue and do not occupy any vector-side read-port cycles.
+6. **Variable-length operand-fetch phase** — the number of cycles from "instruction issue" to "operand ready" depends on the number of tile operands, the TRegFile epoch alignment, and any port-sharing pressure (§6).
+7. **Native 3-source `TFMA` — `D = A·B + C` with three tile-register operands (§7.6).** The classic FMA pattern `y = γ·x̂ + β` (final affine in LayerNorm / RMSNorm), Welford `μ_new = δ·inv_n + μ_old`, and elementwise `gelu` / `swiglu` / sin/cos kernels need a **third value-tile operand** that is **not** the accumulator. v2 promotes operand `C` to a **dual role** (mask **or** value tile, selected by the issue-time `c_role` bit, §3.3c) and binds a third VEC-side TRegFile read port (`R1` = Port C) so all three tiles fetch **in parallel within one 8-cy epoch** — same throughput as a 2-source op. Justification: ~2× throughput vs. emulated `MUL` + `ADD`, and one final rounding instead of two (avoids precision loss in narrow-format accumulation). Hardware cost: ~0 — the stage (B) per-lane FMA already supports `A·B + Z`; the only delta is binding port `R1` and routing `SC` value-mode read into the existing `MUX_Z` (§5.8).
+
+8. **Three novel PTO instructions enabled by the v2 pipeline (§7.5):**
+   - **`TINV`** — tile matrix inverse for square matrices up to **128×128 FP32 (16 tiles)** / 64×64 FP8 (single tile); Gauss–Jordan with Acc-resident pivot row and in-ALU Newton–Raphson reciprocal refinement, all microcode-driven, ~33 K beats for 128×128 FP32. Reuses the unified ALU + Acc feedback, `tilelet_xpose`, and per-lane predicate path; only new datapath addition is a small `RECIP` / `RSQRT` LUT+refine block on stage (B) (~50 K gate system-wide).
+   - **`TROWRANGE_MUL`** — column-wise product over a dynamic row sub-range `[r_start, r_end)` (range from scalar GPRs), ~10 beats per call. Reuses the existing ACCUM / INIT / READOUT machinery with `alu_op = MUL` and a tiny (~100-gate) combinational predicate synthesiser for the row-range mask. Replaces the 3-pass `exp(sum(log(...)))` emulation with a single tile-resident pass.
+   - **`TMRGSORT`** — full-tile mergesort / bitonic sort over any `N = 2^p` up to 8192 (FP4 tile), via a single **reconfigurable 256-lane perfect-shuffle + 128-way compare-swap** primitive activated by new `SHUFFLE_CAS_{UP,DOWN}` alu_ops. A microcode layer-schedule runs `p(p+1)/2 × ⌈N/256⌉` beats total (e.g. 36 beats for N=256, 220 beats for N=1024 FP32, 2.9 K beats for N=8192 FP4). Emits value + index tiles in one instruction via dual retire. Supports partial / predicated sort for free via the per-lane mask gate. New hardware cost ≈ ~130 K gate, roughly breakeven with v1's hard-wired `TSORT32` / `TMRGSORT` networks that it replaces, but supports all power-of-2 sort sizes in one piece of silicon.
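+
+The `TMRGSORT` beat count quoted in item 8 follows from the layer-schedule formula `p(p+1)/2 × ⌈N/256⌉`. The sketch below evaluates it; `mrgsort_beats` is a hypothetical helper, and it models the formula only, not the microcode schedule itself.
+
+```python
+from math import ceil
+
+def mrgsort_beats(n, lanes=256):
+    """Beats for a bitonic sort of n = 2**p elements on a `lanes`-wide primitive."""
+    p = n.bit_length() - 1
+    if n < 2 or (1 << p) != n:
+        raise ValueError("n must be a power of two >= 2")
+    layers = p * (p + 1) // 2                # compare-exchange layers
+    return layers * ceil(n / lanes)          # passes needed per layer
+
+for n in (256, 1024, 8192):
+    print(n, mrgsort_beats(n))               # 36, 220, 2912
+```
+
+The three results match the quoted 36, 220, and roughly 2.9 K beats.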
+
+The **compute datapath width remains 512 B**; the TRegFile interface grows from v1's **2R + 2W** to **3R + 2W × 512 B** ports (§3.1). Scheduling templates, reduction-tree depth formulas, Acc ping-pong, and `#W` / `K_outer` are **all carried over** from [`vector4k.md`](vector4k.md) §4–§9; v2 changes how **operands are delivered and staged**, not how they are computed.
+
+> **Key architectural choice.** VEC-4K-v2 has **no input crossbar** between the staging registers and the compute pipeline. Instead, each compute beat is directly driven by a **microcode "beat instruction"** latched in `SOP` (§5.4) that nominates, for each of the up-to-three ALU operand slots, **which staging register** (`SA` / `SB` / `SC`), **which 512 B tilelet** (strip index `0..7` within the tile), and whether to apply a **tilelet-level transpose** (`tilelet_xpose`). The tilelet-level transpose is the same 8 × 8 chunk-grid transpose that [`tregfile4k.md`](tregfile4k.md) §7 implements at its read port — we reuse the algorithm locally inside VEC by organising each staging register with a **wrapped-diagonal bank skew** (§4.2.1) so that a tilelet can be delivered in either row-mode or col-mode per beat at full 512 B/cy, at zero scheduling cost. Element-level fixup for non-aligned `W` regimes reuses the existing stage (A) align / unpack / permute block ([`vector4k.md`](vector4k.md) §4.1 (A), [`tregfile4k.md`](tregfile4k.md) §7.5). The per-operand `is_transpose` bit on the TRegFile read port (§3.3) is retained as an optional tile-level pre-transpose (useful when one view of a tile is needed many times in a row), but the primary VEC-v2 transpose mechanism is the per-beat staging-side `tilelet_xpose`.
+>
+> **Consequence for reduction workloads.** With microcode-controlled per-beat tilelet dispatch and the accumulator RMW / ping-pong path of [`vector4k.md`](vector4k.md) §4.1, col-axis reductions (summing across the `R` axis) are expressed naturally as an accumulator-accumulate loop that processes one 512 B row-slice per beat (§5.6). Row-axis reductions (summing across the `C` axis) are rewritten as col-axis reductions on the tilelet-transposed operand — microcode flips `tilelet_xpose` on the relevant staging read and reuses the same accumulator pattern (§5.6.2). Narrow tiles (`W = C·E < 512 B`, multiple logical rows per strip) require a **final cross-slot accumulator merge** after the per-strip accumulate loop; §5.6.3 formalises this pattern. §5.7 walks through worked examples for both wide and narrow shapes for row-reduce and col-reduce, with microcode pseudo-code.
+>
+> **Operand role of C — mask / predicate only.** Unlike operands `A` and `B`, which are value tiles, operand `C` is specifically a **per-element bitmask** (1 bit per logical element of the active format) indicating which lanes of `A` / `B` participate in the ALU operation (§3.3c, §4.2.2). Instructions that do not need a mask (the common "full-tile" case) **do not fetch `C` at all** — this saves one operand-fetch epoch (§6.2) — and the stage (B) mask input defaults to `IMM_ALL_ONES` so every lane is enabled. True 3-value ternary operations (e.g. `TFMA` with an in-tile addend) use the accumulator feedback path (§5.5, §5.8) as their third value input, not the `C` tile.
+
+> VEC-4K-v2 reintroduces the **narrow storage widths** (`fp8`, `fp4`) that [`vector4k.md`](vector4k.md) v0.24 removed. The `format` field in the tile-register metadata (§2.2) is **4 bits wide** to cover all encodings; the reduction tree still widens to FP32 in internal accumulate paths.
+
+> **Self-contained reading.** §1–§10 below describe the VEC-4K-v2 architecture in full. Sections that say "unchanged from [`vector4k.md`](vector4k.md) §X" or "All op categories of `vector4k.md` §5 carry over" rely on a v1 baseline that is **also reproduced verbatim in this document**, in **Appendix V1** at the end (§A.1 inheriting v1 §3.2; §A.2 inheriting v1 §4.2; §A.3 v1 §4.3; §A.4 v1 §4.4 worked examples A–E; §A.5 v1 §5 instruction categories baseline; §A.6 v1 §6 cross-lane summary baseline; §A.7 v1 §7 row-reduce datapath diagram; §A.8 v1 §9 legal `(format, R×C)` enumeration with all formulas, tables, and reduction-tree complexity analysis; §A.9 v1 §8 implementation notes baseline). Each appendix subsection is explicitly marked `(v1 → v2: content unchanged, copied in full from vector4k.md §X)` so a reader of `vector4k_v2.md` alone obtains the complete and current design without consulting v1.
+
+---
+
+## 2. Tile, Format, and Tile-Register Model
+
+### 2.1 Storage Invariant
+
+Each logical tile occupies exactly **4096 bytes** in the TRegFile. Logical shape is **R × C**, both powers of two, row-major (same rule as [`vector4k.md`](vector4k.md) §2.1).
+
+VEC-4K-v2 supports **four** logical formats, distinguished by the **storage bytes per element `E`**:
+
+| `format` | Logical name | `E` (bytes / element) | Elements per 4 KB tile (`N = 4096 / E`) |
+|----------|--------------|-----------------------|------------------------------------------|
+| **`0b0000`** | **FP32**         | 4   | 1024 |
+| **`0b0001`** | **FP16 / BF16**  | 2   | 2048 |
+| **`0b0010`** | **FP8** (E4M3 / E5M2) | 1 | 4096 |
+| **`0b0011`** | **FP4** (MXFP4 / HiFP4, packed nibbles) | 0.5 | 8192 |
+| `0b01xx`–`0b11xx` | reserved | — | — |
+
+Internal ALU / reducer operands are widened to **FP32** where required by ISA numerics (associative reduce, accumulate); pack / unpack at the align stage (**§5.2**) handles narrow-to-wide on ingest and wide-to-narrow on retire. FP8 and FP4 paths are **lanes-only** through the compute stage (no packed arithmetic in the reduce tree).
+
+`elem_per_strip = 512 / E` — **128** FP32, **256** FP16/BF16, **512** FP8, **1024** FP4 logical elements per 512 B strip.
+
+### 2.2 Tile-register metadata (32 bits per tile register)
+
+Each entry of the TRegFile carries a **32-bit metadata word** alongside the 4096 B payload. The metadata is allocated per tile register (not per strip) and is **read together** with the first strip of the tile:
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                    tile-register metadata (32 b)                     │
+├────────────┬────────────┬───────────┬─────────┬────────────────────┤
+│ shape.x    │ shape.y    │ format    │ flags   │ reserved           │
+│ [13:0]     │ [27:14]    │ [31:28]   │ see §2.2│                    │
+│ 14 b       │ 14 b       │ 4 b       │ (in flags overlay)          │
+└────────────┴────────────┴───────────┴─────────┴────────────────────┘
+```
+
+| Field | Width | Range | Meaning |
+|-------|-------|-------|---------|
+| **`shape.x`** | 14 b | `1 … 8192` | Number of **columns `C`** (logical row length). Power-of-two values are required by §2.1; non-power-of-two encodings are reserved for future extensions. |
+| **`shape.y`** | 14 b | `1 … 8192` | Number of **rows `R`**. Same power-of-two rule. |
+| **`format`** | 4 b | §2.1 | Logical format code; selects `E` and unpack policy at stage **(A)** (§5.2). |
+| **`flags`** | — | (aliased with high bits of `reserved`) | Optional per-tile hints: `arg_tile` (tile carries value∥index pairs for `TROWARG*` / `TCOLARG*`), `scalar_tile` (tile is a packed scalar broadcast source, §4.3), `prefetch_hint` (VEC may stream this tile sooner than others). Encoded by microcode. |
+
+The metadata is written by the producing instruction (either an ISA-level `TSETMETA` or implicitly by the op that created the tile) and cannot change while the tile register is the target of any in-flight operand fetch. Consumers read the metadata **once** at operand fetch time; the decoded `E`, `R`, `C`, `format` are latched into the opcode staging register (**`SOP`**, §4.4) and remain valid for the entire compute phase of the instruction.
+
+> Note: the **32-bit** metadata is orthogonal to the **`is_transpose`** bit on the **read port** (§3.3). `is_transpose` is per-read-op, per-port, and lives in the issue packet; the metadata lives with the tile register.
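
As a rough illustration of the field positions in the table above, the following sketch packs and unpacks the three fixed fields. It assumes that `shape.x` and `shape.y` are stored as their plain integer values (the document gives the range `1 … 8192` but not the encoding), and it ignores the `flags` overlay because its bit positions are not specified here. The function names are hypothetical.

```python
def pack_meta(shape_x: int, shape_y: int, fmt: int) -> int:
    """Pack shape.x [13:0], shape.y [27:14], format [31:28] into one word.

    Assumption: shape fields hold the raw value (1..8192), not log2 or value-1.
    """
    if not (1 <= shape_x <= 8192 and 1 <= shape_y <= 8192):
        raise ValueError("shape fields must be in 1..8192")
    if not 0 <= fmt <= 0xF:
        raise ValueError("format is a 4-bit code")
    return (fmt << 28) | (shape_y << 14) | shape_x

def unpack_meta(word: int) -> tuple[int, int, int]:
    """Return (shape_x, shape_y, format) from a 32-bit metadata word."""
    return word & 0x3FFF, (word >> 14) & 0x3FFF, (word >> 28) & 0xF

word = pack_meta(shape_x=64, shape_y=16, fmt=0b0000)   # 16 x 64 FP32 tile
print(hex(word), unpack_meta(word))
```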
+
+### 2.3 Shape–format consistency
+
+A tile register is **legal** iff `shape.x · shape.y · E = 4096`. Microcode is responsible for checking this at tile allocation; the VEC compute stage assumes legality and does **not** verify it per cycle.
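
The legality rule can be sketched as a check that microcode might run at tile allocation. Working in bits per element avoids the fractional `E = 0.5` of FP4; the helper names are illustrative, and the power-of-two test reflects the §2.1 requirement rather than any hardware check.

```python
# bits per element for each supported format code (E * 8)
ELEM_BITS = {0b0000: 32, 0b0001: 16, 0b0010: 8, 0b0011: 4}
TILE_BITS = 4096 * 8

def is_pow2(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def is_legal_tile(shape_x: int, shape_y: int, fmt: int) -> bool:
    """True iff shape.x * shape.y * E == 4096 with power-of-two shape fields."""
    bits = ELEM_BITS.get(fmt)
    if bits is None:            # reserved format code
        return False
    return (is_pow2(shape_x) and is_pow2(shape_y)
            and shape_x * shape_y * bits == TILE_BITS)

print(is_legal_tile(64, 16, 0b0000))    # 16 x 64 FP32
print(is_legal_tile(64, 16, 0b0001))    # same shape as FP16 fills only 2 KB
```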
+
+---
+
+## 3. TRegFile Interface
+
+### 3.1 Ports (v2: 3 read + 2 write, expanded from v1's 2R / 2W)
+
+| Direction | Width | Count | Aggregate | v1 → v2 |
+|-----------|-------|-------|-----------|---------|
+| Read      | 512 B | **3** | **1536 B/cycle** | **New in v2** (v1: 2 read ports). The third read port supports native 3-tile ops (`TFMA` family, §7.6); when no instruction in flight needs a third value tile, the port is idle and the kernel collapses to v1-equivalent 2-port operation. |
+| Write     | 512 B | 2     | **1024 B/cycle** | unchanged from v1 |
+
+Ports attach to the **8R / 8W TRegFile-4K** ([`tregfile4k.md`](tregfile4k.md) §3). VEC-4K-v2 binds **three** of the eight physical read ports — by convention **R0 = Port A** (phase 0), **R4 = Port B** (phase 4), **R1 = Port C** (phase 1) — and **two** of the eight physical write ports (**W0** = `D0`, **W4** = `D1`). The 3rd read port (R1) is the only structural change vs. [`vector4k.md`](vector4k.md) §3.1; everything else is identical: same 8-cycle epoch cadence, same `(reg_idx, is_transpose)` per epoch, same R2 uniform-transpose rule across all 3 active reads in any epoch ([`tregfile4k.md`](tregfile4k.md) §6).
+
+> **Why a 3rd VEC read port (and not 2 epochs of fetch).** A real 3-source FMA `D = A·B + C` is the canonical kernel for LayerNorm / RMSNorm final affine (`y = γ·x̂ + β`), Welford incremental update (`μ_new = δ·inv_n + μ_old`, `M2_new = δ·δ_2 + M2_old`), Welford state merge (§7.6), `gelu` / `swiglu` activations, sin/cos polynomials, and any kernel where one operand is **not** the previous accumulator value (so v1's `TFMA_ACC D = A·B + Acc` does not apply). Without the 3rd read port, fetching three tiles takes **2 epochs = 16 cy** and halves throughput; with the 3rd port, fetch is **1 epoch = 8 cy**, matching binary ops and doubling the `TFMA` rate. The cost is one additional VEC-side read-port binding inside an already-8R TRegFile — zero new SRAM, zero new bank-conflict pressure (the diagonal skew already guarantees conflict-free 8-port read, [`tregfile4k.md`](tregfile4k.md) §4). For workloads that never issue a 3-tile op (pure elementwise / cube / reduce), R1 simply stays idle — energy is gated.
+
+Each read port consumes one **`{reg_idx, is_transpose}`** pair per epoch (8 cycles; §3.2) and delivers **512 B/cy** across the epoch. Strips may be **row-mode** (linear bytes `s·512 … s·512+511` for strip `s`) or **col-mode** (8 × 8 chunk-grid transpose, [`tregfile4k.md`](tregfile4k.md) §7). The `is_transpose` bit is produced by VEC-4K-v2 (from the issue packet) and consumed by the TRegFile read port.
+
+**Operand roles.** The three ISA-level tile operands have distinct payload semantics and distinct fetch policies. Operand `C` carries a **`c_role` selector** in the issue packet that picks between two payload interpretations (§3.3c).
+
+| Operand | Role | Payload | Fetched when | Storage |
+|---------|------|---------|--------------|---------|
+| **A** | Value tile (primary) | `R × C × E` bytes of tile data | Always | `SA` (§4.2) |
+| **B** | Value tile (secondary, optional) | `R × C × E` bytes of tile data | When the op is binary or ternary (`N_val ≥ 2`) | `SB` (§4.2) |
+| **C** | **Dual role — `c_role = MASK`**: per-element bitmask (1 bit per element of `A` / `B`, selecting which lanes participate). **`c_role = VALUE`**: a third **full value tile** for native 3-source ops such as `TFMA D = A·B + C` (§7.6). | `c_role = MASK` → `⌈R · C / 8⌉` bytes of packed mask bits; `c_role = VALUE` → `R × C × E` bytes of tile data | `c_role = MASK` and `has_mask = 1`, **OR** `c_role = VALUE` and `N_val = 3` (e.g. `TFMA`, `TFNMA`, `TLERP`); skipped for full-tile binary / unary ops | `SC` (§4.2.2) — the same 4 KB diagonal-skew flip-flop array; `c_role` selects between the 128 × 1 b lane-predicate read port (mask mode) and the 512 B/cy value-tilelet read port (value mode) |
+
+When `C` is not fetched, the stage (B) ALU's mask input defaults to `IMM_ALL_ONES` (§5.8) and every lane participates unconditionally; the value-FMA addend (`src_Z`) defaults to `IMM_ZERO` so `alu_op = FMA` collapses to `MUL` if a kernel mistakenly issues FMA without `c_role = VALUE`. Issue-time hardware checks `(c_role, has_mask, N_val)` for legality. Two value tiles plus a mask in one instruction (`A`, `B`, `C_mask`) requires a 16 cy fetch (one epoch for `A`+`B` + one piggyback strip for the mask, §6.2); three value tiles (`A`, `B`, `C_value`) fetch in parallel in **one** 8 cy epoch using a third VEC-side TRegFile read port (R1, §3.3a).
+
+### 3.2 Strip / epoch model
+
+- **Strip** = 512 B = one port beat.
+- **Epoch** = 8 cycles = one full 4 KB tile on one read port.
+- **Dual-port epoch**: two tiles × 8 strips on two read ports = **two 4 KB tiles consumed in 8 cycles**, as in [`vector4k.md`](vector4k.md) §3.2. For **masked** instructions (§3.3c, §6.3), the small mask payload (1–2 strips) piggybacks on an idle port cycle within a value-tile epoch — no extra epoch is needed in the common case.
+
+### 3.3 Two transpose points: tregfile read port and staging-read tilelet xpose
+
+VEC-4K-v2 exposes **two** orthogonal chunk-grid transpose mechanisms, both reusing the [`tregfile4k.md`](tregfile4k.md) §7 algorithm at 64 B sub-chunk granularity. Microcode chooses per-tile / per-beat which one to enable:
+
+| Transpose point | Granularity | Control | When to use |
+|------------------|-------------|---------|-------------|
+| **TRegFile read port** — per-operand `is_transpose` at fetch time (§3.3a) | Whole tile, set once at operand fetch; [`tregfile4k.md`](tregfile4k.md) §7.4 col-mode | One bit in the issue packet, forwarded to R0 / R4 | When the compute phase re-reads the same tilelet many times and would prefer a pre-transposed cached view |
+| **Staging read port** — per-beat `tilelet_xpose` at dispatch time (§3.3b, §4.2.1, §5.4) | Per 512 B tilelet, selectable every compute beat; chunk-grid (8 × 8 of 64 B chunks) | One bit per operand slot in each beat of the microcode program in `SOP` | Primary mechanism — default for reduction / reshape workloads; no scheduling cost |
+
+#### 3.3a TRegFile read-port `is_transpose` (per-operand, [`tregfile4k.md`](tregfile4k.md) §7)
+
+Each operand (`A`, `B`, `C`) carries an optional `is_transpose` bit in the issue packet. VEC-4K-v2 **forwards** this bit to the TRegFile read port that fetches the operand, reusing [`tregfile4k.md`](tregfile4k.md) §7.
+
+| Operand | Read port (TRegFile) | Stored in | `is_transpose` realized by |
+|---------|----------------------|-----------|----------------------------|
+| **A**   | Port A (R0)          | `SA` | TRegFile-4K col-mode read on R0 ([`tregfile4k.md`](tregfile4k.md) §7) |
+| **B**   | Port B (R4)          | `SB` | TRegFile-4K col-mode read on R4 |
+| **C**   | **`c_role = VALUE`** (e.g. `TFMA`): Port C (**R1**, dedicated 3rd VEC-side read port — §3.1, [`tregfile4k.md`](tregfile4k.md) §3 reserves all 8 read ports). **`c_role = MASK`**: Port A **or** Port B (whichever is free during the value-tile epoch — the small mask payload of 1–2 strips piggybacks on idle cycles). | `SC` | TRegFile-4K col-mode read on whichever port fetches C |
+
+**Mechanics.** Each read-port epoch takes 8 cycles and delivers one 4 KB tile. When `is_transpose = 0`, the 8 strips are row-mode (strip `s` = chunk-grid row `s` = bytes `s·512..s·512+511` of the source tile). When `is_transpose = 1`, the 8 strips are col-mode (strip `s` = chunk-grid column `s` = the 8 chunks `{(0,s), (1,s), …, (7,s)}` concatenated, i.e. the 8 × 8 chunk-grid transpose of the source tile; see [`tregfile4k.md`](tregfile4k.md) §7.4–§7.5). Strips are written into `SA` / `SB` / `SC` in **arrival order**: strip `s` goes to staging bytes `s·512 … s·512+511` regardless of mode, so the staging register always contains exactly what TRegFile delivered — no transpose is performed inside VEC.
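
The row-mode and col-mode strip contents described above can be modelled on a flat 4096-byte buffer. This is a behavioural sketch only: it ignores banking and timing, and the function names are made up for illustration. Chunk `(g, l)` is taken to start at byte `g·512 + l·64`, which follows from the row-major strip layout.

```python
CHUNK, STRIP, GRID = 64, 512, 8

def read_strip(tile: bytes, s: int, is_transpose: bool) -> bytes:
    """Return the 512 B strip that a read port delivers for strip index s."""
    if len(tile) != GRID * STRIP:
        raise ValueError("tile must be 4096 bytes")
    if not is_transpose:
        # row-mode: chunk-grid row s, i.e. bytes s*512 .. s*512+511
        return tile[s * STRIP:(s + 1) * STRIP]
    # col-mode: chunks (0,s), (1,s), ..., (7,s) concatenated
    return b"".join(tile[g * STRIP + s * CHUNK: g * STRIP + (s + 1) * CHUNK]
                    for g in range(GRID))

def fill_staging(tile: bytes, is_transpose: bool) -> bytes:
    """Strips land in arrival order: strip s goes to bytes s*512 .. s*512+511."""
    return b"".join(read_strip(tile, s, is_transpose) for s in range(GRID))

# Label every chunk (g, l) with the byte value g*8 + l.
tile = bytes(i // CHUNK for i in range(GRID * STRIP))
staged = fill_staging(tile, is_transpose=True)
print(staged[0], staged[CHUNK], staged[STRIP])   # chunks (0,0), (1,0), (0,1)
```

Applying the col-mode fetch twice returns the original byte order, which is a quick sanity check that the model really is a chunk-grid transpose.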
+
+**Consequences for scheduling.**
+
+1. **`tregfile4k.md` §6 rule R2 applies.** All physical read ports active in any 8-cycle TRegFile epoch (two for binary ops, three for `c_role = VALUE` ops, §3.3c.2) must use the **same** `is_transpose` value; mixing `0` and `1` within one epoch is illegal. When all operands share the mode, which is the common case, the fetch completes in one epoch. When operand modes are mismatched, microcode must split the fetch into two epochs; see §6.2 / §6.3 for the cycle accounting (`xpose_mismatch` term).
+2. **Transpose granularity = chunk-grid (64 B).** The col-mode delivery from [`tregfile4k.md`](tregfile4k.md) §7.5 is an 8 × 8 chunk-grid transpose. For shapes where a logical row fits in one chunk (`W = C·E ≤ 64 B`) or is exactly one strip (`W = 512 B`), the delivery is element-level valid and stage (A) forwards it unchanged. For the other `W ∈ {128, 256, 1024, 2048, 4096}` regimes, stage (A) runs a byte-segment recombine (§5.2) that reuses the existing [`vector4k.md`](vector4k.md) §4.1 (A) align / unpack / permute block. This is exactly the downstream responsibility [`tregfile4k.md`](tregfile4k.md) §7.5 already places on col-mode consumers.
+3. **Per-operand, not per-beat.** The TRegFile-side `is_transpose` is chosen once per operand fetch (one bit in `SOP` per operand slot). For per-beat flexibility, use the staging-side `tilelet_xpose` (§3.3b) instead — it is the primary mechanism and costs nothing at the TRegFile epoch level.
+
+#### 3.3b Staging read-port `tilelet_xpose` (per-beat, primary mechanism)
+
+Independently of the TRegFile-side bit, each compute beat's microcode (§5.4) carries a `tilelet_xpose` bit **per operand slot**. When set, the staging register (`SA` / `SB` / `SC`) delivers the **chunk-grid transpose** of the nominated tilelet (strip index `s ∈ 0..7`) for that beat. The delivery is produced by the staging register's diagonal-skew read datapath (§4.2.1), which is a local instantiation of the [`tregfile4k.md`](tregfile4k.md) §7.4 read datapath: a bank-select mux + 64 B output rotator driven by `{tilelet_xpose, s}`.
+
+| `tilelet_xpose` | Tilelet delivered | Output byte `[l·64 … l·64+63]` holds |
+|-----------------|--------------------|---------------------------------------|
+| 0 (row-mode)    | chunk-grid row `s` of the staged tile | logical chunk `(s, l)` of the cached tile |
+| 1 (col-mode)    | chunk-grid column `s` of the staged tile | logical chunk `(l, s)` of the cached tile, i.e. the 8 × 8 chunk-grid transpose |
+
+Both modes run at **512 B/cy** with no extra cycles. The bit can flip every beat and is independent across the three operand slots (e.g. a single beat can read `SA` row-mode and `SB` col-mode). Element-level fixup for the non-aligned `W` regimes is still the responsibility of stage (A) (§5.3 / [`tregfile4k.md`](tregfile4k.md) §7.5); the staging boundary only guarantees chunk-grid correctness.
+
+Because `tilelet_xpose` is realised **inside** the VEC staging register (no TRegFile round-trip), it is unaffected by [`tregfile4k.md`](tregfile4k.md) §6 rule R2 and does not interact with operand-fetch scheduling.
+
+#### 3.3c Operand `C` — dual role: per-element bitmask **or** value tile
+
+Operand `C` carries an issue-time **`c_role`** selector that picks between two payload interpretations of the same `SC` staging register. The selector is one bit in the issue packet; both modes share the same diagonal-skew flip-flop layout (§4.2.2), the same write rotator, and the same fill rate (512 B / cy through the bound read port). Only the *read-side* differs:
+
+| `c_role` | SC payload | SC read port | Used by |
+|----------|------------|--------------|---------|
+| **`MASK`** | Packed per-element bitmask (1 bit per logical element of the ALU's operating format) | **128 × 1 b lane-predicate** output → stage (B) ALU mask input | Masked elementwise (`TADD_M`, `TMUL_M`, …), masked reductions, `TSELECT` (§7.4), masked sort, masked gather |
+| **`VALUE`** | A third **full value tile** with the same row-major `(r, c)` layout as `A` / `B` | **512 B / cy value-tilelet** output → stage (B) ALU `MUX_Z` input (§5.8) | Native 3-source ops: `TFMA D = A·B + C` (§7.6), `TFNMA`, `TLERP D = A·(1-C) + B·C`, future ternary kernels |
+
+The `SC` staging register is **always 4 KB** (matching `SA` / `SB`, §4.2) regardless of mode; this avoids a second physical layout. Unused storage is don't-care — for `c_role = MASK` only the low `R · C / 8` bytes are written, and for FP32-tile masks this is 128 B (well within a single strip).
+
+##### 3.3c.1 Mask mode (`c_role = MASK`) — unchanged from v0.16
+
+When `c_role = MASK`, layout is row-major over the `R × C` element grid of the associated value tile:
+
+```
+  mask_bit(r, c) = SC[ (r · C + c) / 8 ].bit[ (r · C + c) % 8 ]
+```
+
+Total mask size is `R · C / 8` bytes. For a 4 KB FP32 tile this is 1024 elements → 128 B (well within a single strip); for FP16 it is 256 B; for FP4 it is 1024 B (2 strips).
+
+**Fetch policy.** The issue packet carries a `has_mask` bit. When `has_mask = 0` the mask tile is **not read** — no TRegFile epoch is consumed for `C`, and the stage (B) mask input is tied to `IMM_ALL_ONES` statically by the microcode assembler. When `has_mask = 1`, microcode fetches `C` from TRegFile, short-circuiting to cover only the strips actually needed (`⌈R · C / (8 · 512)⌉`), trading one full 8-cycle epoch for as few as 1–2 strip cycles that piggyback on a value-tile epoch (§6.2, §6.3). The mask read uses the existing 2-port (R0 / R4) binding — the new R1 port is **not** used for masks, since the payload is too small to amortise an extra port.
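
The mask addressing formula and the strip count from the fetch policy can be sketched as follows. The sketch assumes that `.bit[k]` numbers bits from the least-significant bit of each byte, which the document does not state explicitly; the function names are illustrative.

```python
def mask_bit(sc: bytes, r: int, c: int, C: int) -> int:
    """mask_bit(r, c) = SC[(r*C + c) / 8].bit[(r*C + c) % 8].

    Assumption: bit 0 is the least-significant bit of the byte.
    """
    idx = r * C + c
    return (sc[idx // 8] >> (idx % 8)) & 1

def mask_bytes(R: int, C: int) -> int:
    """Packed mask size in bytes for an R x C tile."""
    return (R * C + 7) // 8

def mask_strips(R: int, C: int) -> int:
    """Number of 512 B strips to fetch: ceil(R*C / (8 * 512))."""
    return -(-(R * C) // (8 * 512))

# 16 x 64 FP32 tile: enable only element (1, 2)
sc = bytearray(mask_bytes(16, 64))
sc[(1 * 64 + 2) // 8] |= 1 << ((1 * 64 + 2) % 8)
print(mask_bit(bytes(sc), 1, 2, 64), mask_bytes(16, 64), mask_strips(16, 64))
```

For an FP4 tile with 8192 elements the same helpers give 1024 B and 2 strips, matching the figures quoted above.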
+
+##### 3.3c.2 Value mode (`c_role = VALUE`) — new in v2.1, enables native 3-source TFMA
+
+When `c_role = VALUE`, `SC` is read as a 512 B / cy value-tilelet (same datapath as `SA` / `SB`, §4.2.1). The microcode beat word's `src_C` field nominates `{strip_index s, tilelet_xpose xp}` per beat just like `src_A` / `src_B` (§5.4); the staged value enters the stage (B) ALU through the per-lane `MUX_Z` (§5.8), where it serves as the FMA addend (`A·B + Z` with `Z = SC[s]`).
+
+**Fetch policy.** The issue packet's `c_role = VALUE` triggers a **3-port parallel fetch** in one epoch:
+
+- **R0** ← tile `A` (Port A, phase 0)
+- **R4** ← tile `B` (Port B, phase 4)
+- **R1** ← tile `C` (Port C, phase 1; new in v2.1, see §3.1)
+
+All three reads share the same 8-cycle epoch cadence and (per [`tregfile4k.md`](tregfile4k.md) §6 R2) the **same** `is_transpose` value. An instruction with mismatched `is_transpose_*` across the three operands splits the fetch into two epochs (16 cy); the common-case kernels for `TFMA` (LayerNorm `γ · x̂ + β`, Welford updates, etc.) all use uniform transpose, so the **typical fetch cost is 8 cy** — the same as a binary op (§6.2).
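
The value-tile fetch cost implied by rule R2 can be written as a one-line model. This is a simplified sketch: it counts only value-tile epochs, ignores mask piggyback strips and any other terms in the §6 cycle accounting, and the function name is hypothetical.

```python
EPOCH_CYCLES = 8

def value_fetch_cycles(is_transpose_bits: list[bool]) -> int:
    """Cycles to fetch 1-3 value tiles under the uniform-transpose rule.

    One epoch if every fetched operand uses the same is_transpose value,
    otherwise two epochs (one per mode).
    """
    if not 1 <= len(is_transpose_bits) <= 3:
        raise ValueError("expected one to three value operands")
    epochs = 1 if len(set(is_transpose_bits)) == 1 else 2
    return epochs * EPOCH_CYCLES

print(value_fetch_cycles([False, False, False]))   # uniform TFMA fetch
print(value_fetch_cycles([False, True, False]))    # mismatched modes
```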
+
+**Mask compatibility.** A 3-source value-FMA can still be **conditionally executed** by setting both `c_role = VALUE` and a separate `pred_src` field that points the lane-predicate input at one of `IMM_ALL_ONES` (default), `ACC_FLAG_*`, or a static immediate produced by the row-range predicate synthesiser (§7.5.2). The `SC` register cannot simultaneously carry both a mask and a value; for the rare case of a masked-3-source-FMA the kernel splits into two instructions (compute unconditional `TFMA`, then `TSELECT` with the mask).
+
+##### 3.3c.3 Transpose for both modes
+
+Conceptually, a `c_role = MASK` mask has the same logical `(r, c)` layout as its value tile. `is_transpose_C` and `tilelet_xpose` on the `SC` read port therefore transpose the mask the same way they transpose a value tile, so masked-transpose ops stay consistent without extra handling. For `c_role = VALUE`, transpose semantics are identical to `SA` / `SB` (§3.3a, §3.3b): from the staging-register perspective the value tile is just a third operand, indistinguishable from a value-tile read on `SA` / `SB`.
+
+### 3.4 Read semantics (unchanged)
+
+Read ports present **only** full 512 B bank-group strips. **No RF gather** ([`vector4k.md`](vector4k.md) §3.1, [`tregfile4k.md`](tregfile4k.md) §3). Column mux / sub-strip element extraction still happens **in VEC** after a strip lands in its staging buffer (§4, §5).
+
+### 3.5 On-chip buffers
+
+- **Staging registers `SA`, `SB`** — value-tile buffers: **diagonal-skewed flip-flop arrays**, 64 × 64 B sub-banks = 4096 B each plus metadata (§4.2, §4.2.1). Write port accepts a full 512 B strip per cycle through a 3-bit 8-way 64 B rotator; read port delivers 512 B/cy (128-lane data) in row-mode or col-mode selected by the microcode's `tilelet_xpose` bit (§3.3b).
+- **Staging register `SC`** — **mask / third-value buffer** (§3.3c): same 4 KB diagonal-skew physical layout as `SA` / `SB` (§4.2.2). With `c_role = MASK`, its read port is a **128 × 1 b** lane-predicate output fed to the stage (B) ALU mask input (§5.8); only the low `R · C / 8` bytes are written (the fetch typically takes 1–2 strip cycles, not a full 8-cycle epoch) and unused regions are don't-care. With `c_role = VALUE`, it holds a full value tile and is read at 512 B/cy like `SA` / `SB`. `SC` is **not fetched at all** for ops that need neither a mask (`has_mask = 0`) nor a third value tile, in which case the mask input is tied to `IMM_ALL_ONES`.
+- **Acc** — unchanged from [`vector4k.md`](vector4k.md) §4.1: **256 × 32 b × 2** ping-pong (**`N_run = 512`**).
+- **Scalar broadcast / staging** — **§4.3**.
+
+---
+
+## 4. Staging Register File
+
+VEC-4K-v2 introduces an explicit **staging register file** between the read ports and the 512 B compute datapath. The goals are:
+
+- **Decouple** the TRegFile epoch cadence from the compute pipeline, so that the compute stage can start as soon as each operand lands (possibly out-of-order).
+- Hold the **full 4 KB tile** for each operand while the reduction tree and `#W` waves of `TCOL*` iterate.
+- Carry the **32-bit metadata** and the **`is_transpose`** bit alongside the payload.
+
+### 4.1 Layout
+
+```text
+  ┌──────────────────────────────────────────────────────────────────────────┐
+  │                        Read ports  (R0=A, R4=B)                           │
+  │               512 B/cy                      512 B/cy                      │
+  └────────────┬────────────────────────────────┬────────────────────────────┘
+               │                                │
+               ▼                                ▼
+       ┌────────────────┐                ┌────────────────┐
+       │  SA staging    │                │  SB staging    │
+       │  • 4096 B data │                │  • 4096 B data │
+       │  • 32 b meta   │                │  • 32 b meta   │
+       │  • 1 b xpose   │                │  • 1 b xpose   │
+       │  • valid / vld │                │  • valid / vld │
+       └─────┬──────────┘                └─────┬──────────┘
+             │                                  │
+             │       ┌────────────────┐         │
+             └──────▶│  SC staging    │◀────────┘   (optional 3rd operand;
+                     │  • 4096 B data │             filled by whichever read
+                     │  • 32 b meta   │             port is free during the
+                     │  • 1 b xpose   │             second operand-fetch epoch;
+                     │  • valid / vld │             §6.3)
+                     └───────┬────────┘
+                             │
+                             ▼
+               ┌────────────────────────────────────────────┐
+               │  Scalar staging:  SX, SY                   │
+               │    • 64 b scalar value                     │
+               │    • source tag: GPR / IMM / TILE / ACC    │
+               │    • filled at issue from scalar GPR RF    │
+               │      (GPR/IMM: 0 vector-side cycles)       │
+               │    • valid / broadcast mode                │
+               └────────────────────────────────────────────┘
+                             │
+                             ▼
+               ┌────────────────────────────────────────────┐
+               │  Opcode / control staging:  SOP            │
+               │    • opcode                                │
+               │    • decoded (E, R, C) triples for A/B/C   │
+               │    • per-port is_transpose latched         │
+               │    • ucode_{base,len} (§5.4 microcode)     │
+               │    • #W / K_outer / f parameters           │
+               │    • retire mask (D0 / D1, widths)         │
+               └────────────┬───────────────────────────────┘
+                            │
+                            ▼
+                compute datapath (microcode-driven, align, 128 groups, Acc, pack)
+                            │
+                            ▼
+                 ┌────────────────────────────┐
+                 │  Write ports  (W0=D0, W4=D1)│
+                 │  512 B/cy each             │
+                 └────────────────────────────┘
+```
+
+### 4.2 Operand staging registers `SA`, `SB`, `SC`
+
+`SA` and `SB` are **value-tile** buffers; `SC` holds either a **mask** or a third **value tile**, selected by `c_role` (§3.3c). All three use the same 4 KB **diagonal-skewed flip-flop array** organisation (64 sub-banks × 64 B) for uniformity of the write datapath; in mask mode `SC` is read through a narrower **128 × 1 b lane-predicate** port (§4.2.2) rather than the 512 B value-tilelet port used for `SA` / `SB`. Each register holds **one full 4 KB tile** plus its metadata and per-operand control bits, with the **wrapped-diagonal skew** of [`tregfile4k.md`](tregfile4k.md) §2 / §7 allowing either row-mode or col-mode delivery (`tilelet_xpose ∈ {0, 1}`, §3.3b) at full port rate.
+
+| Property | Size / behaviour |
+|----------|------------------|
+| Data payload | **4096 B** (= 64 × 64 B flip-flop sub-banks) |
+| Storage layout | Logical chunk `(g, l)` at sub-bank `8·g + ((l + g) mod 8)` — same diagonal skew as [`tregfile4k.md`](tregfile4k.md) §2 |
+| Fill rate | **512 B / cy** via the attached read port, through one 8-way 64 B write rotator steered by the arriving strip's chunk-grid row `g(t)` |
+| Drain rate | **512 B / cy** to any ALU operand slot (§5.4), selected by a per-beat `{strip_index s, tilelet_xpose xp}` pair; bank-select mux + output rotator implement row/col-mode ([`tregfile4k.md`](tregfile4k.md) §7.4) |
+| Double-buffering | **optional** — implementation may add a shadow tile so that the next instruction’s operand fetch can overlap with this instruction’s compute phase |
+| Metadata | **32 b** latched on the **first** strip arrival |
+| Strip-valid bits | **8 b** (one per chunk-grid row `g`) — compute can start on a tilelet as soon as the chunk-grid row(s) it needs are written, **without** waiting for the full tile |
+
+The strip-valid bitmap is what makes the **variable-length operand-fetch phase** (§6) compatible with the microcode-driven beat schedule: the beat program (§5.4) nominates `{source, s, xp}` per operand slot, and the compute stage stalls for at most one cycle if the nominated tilelet needs sub-banks that are not yet marked valid.
+
+> **Why three staging registers?** Predicated / masked ops (`TSELECT`, masked `TADD` / `TMUL`, `TGATHER` with selection mask, `TROWEXPAND*` with enable mask) need a per-element enable signal in addition to up to two value tiles. `SC` carries that mask. Full-tile ops don't fetch `SC` at all; the mask input to stage (B) defaults to `IMM_ALL_ONES` (§5.8).
+
+#### 4.2.1 Diagonal-skew staging register (shared read datapath with [`tregfile4k.md`](tregfile4k.md) §7)
+
+Because each staging register absorbs only 512 B/cy on the write side (one TRegFile strip) and delivers only 512 B/cy on the read side, a full-generality byte crossbar is unnecessary. Instead, we apply the [`tregfile4k.md`](tregfile4k.md) §2 / §7 diagonal-skew trick locally inside VEC: store the tile diagonally, and select row-mode or col-mode at the **read** port via a small bank-select mux + 64 B rotator.
+
+**Storage layout.** The 4096 B flip-flop array is partitioned into **64 sub-banks of 64 B** each, arranged in 8 sub-bank groups `G0..G7` (8 sub-banks per group). Viewing the tile as an 8 × 8 chunk grid of 64 B chunks, logical chunk `(g, l)` is stored at
+
+```
+  bank_group = g                                   ← pure wiring
+  bank_local = (l + g) mod 8                       ← 3-bit rotator on write
+  bank_id    = 8·g + bank_local                    ← 0..63
+```
+
+This is the same wrapped-diagonal placement that makes [`tregfile4k.md`](tregfile4k.md) §7 col-mode bank-conflict-free.
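
A quick way to convince oneself of the placement rule is to enumerate it. The sketch below checks that the mapping covers all 64 sub-banks exactly once and lists which sub-banks a chunk-grid row or column touches; the helper names are illustrative.

```python
def bank_id(g: int, l: int) -> int:
    """Sub-bank holding logical chunk (g, l): 8*g + ((l + g) mod 8)."""
    return 8 * g + (l + g) % 8

def row_banks(s: int) -> list[int]:
    """Sub-banks touched by chunk-grid row s (chunks (s, 0..7))."""
    return [bank_id(s, l) for l in range(8)]

def col_banks(s: int) -> list[int]:
    """Sub-banks touched by chunk-grid column s (chunks (0..7, s))."""
    return [bank_id(i, s) for i in range(8)]

# Every sub-bank 0..63 is used by exactly one chunk.
assert sorted(bank_id(g, l) for g in range(8) for l in range(8)) == list(range(64))

for s in range(8):
    # A column touches one sub-bank in each group, at eight distinct locals.
    assert {b // 8 for b in col_banks(s)} == set(range(8))
    assert {b % 8 for b in col_banks(s)} == set(range(8))

print(row_banks(3))
print(col_banks(3))
```

A row stays inside one group while a column walks the diagonal across all eight groups, which is why either access pattern can be served with eight parallel sub-bank reads.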
+
+**Write datapath.** TRegFile read ports deliver 512 B strips in whatever mode the operand-fetch `is_transpose` bit selected (§3.3a). The strip arriving at cycle `t` carries chunk-grid row `g(t)` (row-mode delivery) or chunk-grid column `g(t)` (col-mode delivery) of the source tile. A fixed 3-bit 8-way 64 B rotator, steered by `g(t)`, places logical lane `l` (= input byte position `l·64..l·64+63`) at physical `bank_local = (l + g) mod 8`. The write rotator is the same size and steering rule as the [`tregfile4k.md`](tregfile4k.md) §3 / §7.3 write-side rotator. For the purposes of staging, the *semantics* of chunks stored at `(g, l)` follow the mode TRegFile used: after row-mode fetch the staged "chunk `(g, l)`" is logical tile chunk `(g, l)`; after col-mode fetch it is logical tile chunk `(l, g)` (pre-transposed).
+
+**Read datapath.** For each compute beat, the microcode nominates `{strip_index s, tilelet_xpose xp}` per operand slot. The staging register produces one 512 B tilelet per slot:
+
+```
+       ┌───────── xp = 0 (row-mode) ────────────────┐
+       │ fetch 8 sub-banks of G_s, locals 0..7      │
+       │ output rotator: rotate-left by s            │
+       │ output lane l carries staged chunk (s, l)   │
+       └────────────────────────────────────────────┘
+       ┌───────── xp = 1 (col-mode) ────────────────┐
+       │ fetch 1 sub-bank per group along diagonal:  │
+       │   bank_i = 8·i + (s + i) mod 8, i = 0..7    │
+       │ output rotator: identity                    │
+       │ output lane i carries staged chunk (i, s)   │
+       └────────────────────────────────────────────┘
+```
+
+Both modes fill the 512 B output bus every cycle. Read-side hardware = one bank-select mux (2-input per sub-bank) + one 64 B output rotator (active in row-mode only). `xp` may flip between beats, and may differ across the three operand slots in the same beat (SA col-mode + SB row-mode is legal).
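
The write rotator and the two read modes can be put together in a small behavioural model. This is a sketch under simplifying assumptions: it models data movement only (no strip-valid bitmap, no timing), and the class and method names are invented for illustration.

```python
class StagingRegister:
    """Behavioural model of one diagonal-skew staging register (64 x 64 B)."""

    def __init__(self) -> None:
        self.banks = [bytes(64) for _ in range(64)]

    def write_strip(self, g: int, strip: bytes) -> None:
        """Store arriving strip g; lane l goes to bank_local (l + g) mod 8."""
        if len(strip) != 512:
            raise ValueError("strip must be 512 bytes")
        for l in range(8):
            self.banks[8 * g + (l + g) % 8] = strip[l * 64:(l + 1) * 64]

    def read_tilelet(self, s: int, xp: int) -> bytes:
        """Return the 512 B tilelet for strip index s in row (0) or col (1) mode."""
        if xp == 0:
            fetched = [self.banks[8 * s + b] for b in range(8)]   # group G_s
            # rotate-left by s: output lane l takes local bank (l + s) mod 8
            lanes = [fetched[(l + s) % 8] for l in range(8)]
        else:
            # one sub-bank per group along the diagonal; no output rotation
            lanes = [self.banks[8 * i + (s + i) % 8] for i in range(8)]
        return b"".join(lanes)

# Label every chunk (g, l) with byte value g*8 + l and stage the tile row-mode.
tile = bytes(i // 64 for i in range(4096))
sr = StagingRegister()
for g in range(8):
    sr.write_strip(g, tile[g * 512:(g + 1) * 512])

row = sr.read_tilelet(2, xp=0)   # staged chunks (2, 0..7)
col = sr.read_tilelet(2, xp=1)   # staged chunks (0..7, 2)
print([row[l * 64] for l in range(8)])
print([col[i * 64] for i in range(8)])
```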
+
+**Element-level fixup (delegated to stage (A)).** Chunk-grid transpose is at 64 B granularity. For tile row widths `W = C·E`:
+
+| `W` regime | Col-mode tilelet content | Element-level fixup (stage (A), §5.3) |
+|------------|---------------------------|----------------------------------------|
+| `W ≤ 64 B` (row fits in 1 chunk) | `512/W` whole rows per tilelet, 8 groups of `64/W` consecutive rows, stride `R/8` | None — stage (A) identity |
+| `W = 512 B` (row = 1 strip) | 8 × 64 B slices from 8 distinct rows (one 16-FP32 col-band per row) | None when the ALU is per-lane element-wise; §5.6.2 walks through the accumulator pattern for reductions |
+| `W ∈ {128, 256, 1024, 2048, 4096}` | 8 × 64 B row-segments from 8 distinct rows | Stage (A) byte-segment recombine (reuses [`vector4k.md`](vector4k.md) §4.1 (A) align / unpack / permute) |
+
+**Cost (reference implementation, per staging register):**
+
+| Block | Size |
+|-------|------|
+| Sub-bank storage | 64 × 64 B flip-flops = 4096 B |
+| Write rotator | 1 × 8-way 64 B 3-bit rotator |
+| Read bank-select mux | 2-input mux per sub-bank (row-mode path vs col-mode path), driven by `{xp, s}` |
+| Read output rotator | 1 × 8-way 64 B 3-bit rotator (active in row-mode; bypassed in col-mode) |
+| Strip-valid bitmap | 8 b per register |
+
+**Per-cycle read / write conflict.** Since the 4096 B store is flip-flops (not 1R1W SRAM), a sub-bank can be written and read in the same cycle if the implementation exposes independent write and read ports per sub-bank. Early-start compute (§6) uses the strip-valid bitmap to order things; the write rotator targets the group `G_{g(t)}` currently arriving while the read side targets already-valid sub-banks.
+
+#### 4.2.2 `SC` read path — lane-predicate expansion
+
+`SC` shares the §4.2.1 write datapath with `SA` / `SB` (diagonal-skew, 64 × 64 B sub-banks, full row-mode / col-mode reads available), but its **primary output** is a **128-lane, 1-bit-per-lane predicate bus** that feeds the stage (B) ALU mask input (§5.8). A per-beat microcode field `mask_strip_index ∈ 0..7` selects which 64 B chunk (512 bits) of packed mask bits is delivered; a small unpacker fans those bits out to the 128 lane predicates according to the active format:
+
+| Format | Elements per 128-lane group | Mask bits needed per beat | SC bytes read per beat |
+|--------|------------------------------|----------------------------|--------------------------|
+| FP32 / INT32 | 128 | 128 | 16 B |
+| FP16 / BF16  | 256 (2 sub-lanes per 32-b group) | 256 | 32 B |
+| FP8          | 512 (4 sub-lanes) | 512 | 64 B |
+| FP4          | 1024 (8 sub-lanes, serialized as 2 sub-beats) | 1024 | 128 B |
+
+For all formats the per-beat mask slice is **well under one 512 B tilelet**, so the SC read port typically consumes only a 16 B–128 B slice per beat; the bank-select mux of §4.2.1 is configured to emit the right slice, and the output rotator is bypassed (mask bits don't need a chunk-grid rotate because the unpacker handles lane ordering directly).
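
The per-beat figures in the table follow from the 128-lane, 32-bit-per-lane datapath (512 B / 128 lanes). A minimal sketch of that arithmetic, with an illustrative function name:

```python
LANES = 128       # lane groups per compute beat
LANE_BITS = 32    # each lane group is 32 bits wide (512 B / 128)

def mask_bytes_per_beat(elem_bits: int) -> int:
    """SC bytes consumed per beat: one mask bit per element in the beat."""
    if LANE_BITS % elem_bits != 0:
        raise ValueError("element width must divide the 32-bit lane group")
    sub_lanes = LANE_BITS // elem_bits        # elements per lane group
    return LANES * sub_lanes // 8             # mask bits -> bytes

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name:10s} {mask_bytes_per_beat(bits):4d} B per beat")
```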
+
+**Fetch short-circuit.** Since the total mask payload for a 4 KB tile at any format is at most 1024 B (2 strips), microcode fetches `SC` in **1 or 2 strip cycles** rather than a full 8-cycle TRegFile epoch when `has_mask = 1`. When `has_mask = 0`, `SC` is not accessed at all — no read port is allocated, no storage is written, and the stage (B) mask input is steered to `IMM_ALL_ONES` (§5.8) for every beat of the compute phase.
+
+**Same-tile transpose behavior.** `is_transpose_C` and `tilelet_xpose` on an `SC` read reinterpret the packed-bitmap layout consistently with the value-tile transpose (§3.3a, §3.3b): a mask that's row-mode for its `A` partner stays row-mode; a mask paired with a col-mode `A` is fetched / read col-mode so element `(r, c)` lines up with the corresponding value lane.
+
+### 4.3 Scalar staging `SX`, `SY`
+
+Two **scalar** staging slots carry per-instruction scalar parameters:
+
+| Slot | Width | Typical content |
+|------|-------|-----------------|
+| **`SX`** | 64 b (value) + 8 b (tag) | Primary scalar: immediate (`TADDS`), threshold (`TCMPS`), scale (`TMULS`, `TDEQUANT`), base (`TCI`) |
+| **`SY`** | 64 b (value) + 8 b (tag) | Secondary scalar: stride (`TCI`), offset (`TCOLEXPAND*`), second compare operand, or FP packing metadata (`TCVT`) |
+
+The **tag** distinguishes four source kinds; the first two are the common case:
+
+- **Scalar GPR** (the PTO ISA encodes `sx_gpr`, `sy_gpr` — 5-bit register indices into the scalar general-purpose register file). At issue time the scalar GPR register file performs one 64-bit read for each used scalar slot and forwards the value directly into `SX` / `SY`. This is the normal way a PTO tile instruction takes non-immediate scalars — e.g. `TADDS Tdst, Tsrc, r5` (broadcast-add `r5`), `TMULS Tdst, Tsrc, r3` (scalar multiply by `r3`), `TDEQUANT Tdst, Tsrc, r1, r2` (scale `r1`, zero-point `r2`). The GPR file sits on the scalar-issue side of the core and is shared with the ordinary scalar pipeline; VEC-4K-v2 only sees the 64 b value that arrives with the instruction packet.
+- **Immediate** (value is a literal field of the instruction word; common for small constants, compare thresholds, shift amounts).
+- **Tile-element** (value is element `[r,c]` of some tile register, fetched as a **single-element read** — either by reading one strip and mux'ing, or by a dedicated "scalar tile" marked with the `scalar_tile` flag in metadata §2.2).
+- **ACC slot** (value is slot `ρ` of the Acc ping-pong, e.g. after an in-place reduce; enables `TROWEXPAND*` with on-chip `v[r]`).
+
+The instruction word reserves one 2-bit `sx_src ∈ {GPR, IMM, TILE, ACC}` selector and one 2-bit `sy_src` selector (plus the associated 5-bit GPR index / 16-bit immediate / tile-element pointer / ACC slot id payload). The selected value is latched into the corresponding `(value, tag)` slot **at instruction issue**, before the operand-fetch prologue (§6) begins — scalar GPR reads and immediate captures happen on the issue cycle itself and complete in 0 VEC-pipeline cycles, so **SX / SY are always valid by the time the first microcode beat fires**. Tile-element and ACC-slot sources that require a vector-side read instead piggyback on an idle read-port slot during the operand-fetch prologue (see §6.3 rule 4).
+
+Scalars are **broadcast** across all 128 groups of stage (B) (§5.2) when referenced as an ALU operand (microcode `src_A` / `src_B` / `src_Z ∈ {SX, SY}`); they do **not** consume read-port bandwidth beyond the one scalar-GPR read at issue (or zero cycles for immediate). A beat that does not name `SX` / `SY` as a source leaves them untouched — the scalars persist for the duration of the compute phase.
+
+### 4.4 Opcode / control staging `SOP`
+
+`SOP` latches the **decoded instruction word** plus the **per-operand metadata union** at operand-fetch entry:
+
+| Field | Width (approx) | Purpose |
+|-------|----------------|---------|
+| `opcode` | 8–10 b | PTO opcode id |
+| `format_A`, `R_A`, `C_A` | 4 + 14 + 14 | Copied from `SA.meta` at first strip of A |
+| `format_B`, `R_B`, `C_B` | 4 + 14 + 14 | Copied from `SB.meta` |
+| `format_C`, `R_C`, `C_C` | 4 + 14 + 14 | Copied from `SC.meta` when `has_mask = 1`; otherwise unused. `format_C` is the *value-tile* format that the mask addresses (the mask itself is always 1 bit/element); `R_C = R_A`, `C_C = C_A` is the normal case |
+| `has_mask` | 1 b | `0` = full-tile op (skip `SC` fetch, ALU mask input → `IMM_ALL_ONES`, §5.8); `1` = fetch `SC` (§3.3c, §4.2.2) and feed its lane-predicate bus to the ALU mask input |
+| `is_xpose_A`, `is_xpose_B`, `is_xpose_C` | 1 + 1 + 1 | Latched from issue packet — **tregfile-side** transpose (§3.3a); the per-beat `tilelet_xpose_*` lives in the microcode word below. `is_xpose_C` is ignored when `has_mask = 0` |
+| `retire_mask` | 2 b | Bit0 = `D0` written, bit1 = `D1` written |
+| `retire_format_0`, `retire_format_1` | 4 + 4 | Format of each result tile (may differ from operand formats, e.g. `TCVT`) |
+| `ucode_base`, `ucode_len` | 8 + 6 | Start address and length (in beats) of the microcode program driving the compute phase (§5.4); the microcode emits one **beat word** per compute cycle |
+| `#W`, `K_outer`, `f` | 6 + 6 + 6 | `TCOL*` / `TROW*` wave parameters ([`vector4k.md`](vector4k.md) §5.3.2); resolved by the selected microcode program |
+| `N_tree_en`, `N_acc_en` | 7 + 10 | Tree / Acc enable masks (bypass unused groups). **In the recommended baseline** (§5.8.1) the cross-group reducer network is physically absent, so `N_tree_en` is tied to `0` at synthesis time and the bits are reserved for an optional implementation that keeps the network. `N_acc_en` (Acc sub-bank enables) is always present and always driven live |
+
+All fields are **stable** for the entire compute phase (operand fetch may update valid bits but must not change the latched metadata).
+
+---
+
+## 5. Vector Datapath Overview
+
+### 5.1 High-level flow — no crossbar, microcode-driven tilelet dispatch
+
+```text
+  ┌────────────────────────────────────────────────────────────────────┐
+  │     TRegFile-4K (R0=A, R4=B; W0=D0, W4=D1)                          │
+  │  R0 (512B/cy, is_xpose_A)   R4 (512B/cy, is_xpose_B)               │
+  └──────────┬──────────────────────────────┬───────────────────────────┘
+             │                              │
+             ▼                              ▼
+        ┌──────────┐                   ┌──────────┐                  ┌──────────┐
+        │ SA       │                   │ SB       │                  │ SC       │
+        │ 4 KB FF  │                   │ 4 KB FF  │                  │ 4 KB FF  │
+        │ diagonal │                   │ diagonal │                  │ diagonal │
+        │ skew     │                   │ skew     │                  │ skew     │
+        │ r/c read │                   │ r/c read │                  │ r/c read │
+        └──────┬───┘                   └───┬──────┘                  └────┬─────┘
+               │ 512B/cy per beat          │ 512B/cy per beat             │ 512B/cy per beat
+               │  (s, xp) from μop         │  (s, xp) from μop            │  (s, xp) from μop
+               ▼                           ▼                              ▼
+       ┌─────────────────────────────────────────────────────────────────────────┐
+       │           3 × 512 B OPERAND BUSES   (direct — no crossbar)              │
+       │           +  SX / SY scalar broadcast (single 32/64 b, splayed to 128)   │
+       └───────────────────────────────────────────────────────────┬──────────────┘
+                                                                   │
+                                                                   ▼
+         ┌───────────────────────────────────────────────────────────┐
+         │  (A) ALIGN / UNPACK / PERMUTE   (per operand slot)         │
+         │    • format-aware unpack (fp32 / fp16 / bf16 / fp8 / fp4)  │
+         │    • col-mode byte-segment recombine for non-aligned W     │
+         │    • scalar broadcast splice                               │
+         │    OUT: up to 3 × 128-slice operand streams                │
+         └───────────────────────────┬───────────────────────────────┘
+                                     ▼
+         ┌───────────────────────────────────────────────────────────┐
+         │  (B) 128 COMPUTE GROUPS  (shared with accumulator, §5.8)   │
+         │    • per-slot input MUX: {operand_pipe, ACC_LO, ACC_HI,    │
+         │                           SX, SY}  ← src_* from μop        │
+         │    • ALU  (W_ALU,i bits) — 2 or 3 inputs                   │
+         │    • reduce tree (W_tree,i bits; bypass on elementwise)    │
+         └───────────────────────────┬───────────────────────────────┘
+                                     ▼                            ▲
+         ┌───────────────────────────────────────────────────────────┐
+         │  ACC  (256 × 32 b × 2 ping-pong, split into LO/HI 128-slot │
+         │        sub-banks, each 1R1W; no dedicated RMW adder)       │
+         │   write ← μop acc_op (INIT/ACCUM/MERGE_STAGE/READOUT/NONE) │
+         │   read  → feeds back into ALU input MUX above (two channels│
+         │           LO + HI can be read concurrently)                │
+         └───────────────────────────┬───────────────────────────────┘
+                                     ▼
+         ┌───────────────────────────────────────────────────────────┐
+         │  Pack  (retire_format_0, retire_format_1)                 │
+         └───────────┬───────────────────────────┬───────────────────┘
+                     ▼                           ▼
+              ┌──────────────┐            ┌──────────────┐
+              │ D0 → Wr0     │            │ D1 → Wr4     │
+              │ 512 B/cy     │            │ 512 B/cy     │
+              └──────────────┘            └──────────────┘
+                (retire_mask[0])            (retire_mask[1])
+```
+
+### 5.2 Why no crossbar
+
+In VEC-4K v1 ([`vector4k.md`](vector4k.md) §4.1) and in VEC-4K-v2 drafts up to v0.4, a crossbar between the staging / port outputs and stage (A) served two needs: (i) swap which read port feeds which ALU operand slot, and (ii) route a col-mode fixup input alongside the row-mode stream. In v0.5 both needs dissolve:
+
+1. **Operand-slot assignment** is now an explicit microcode field per beat (`src_{A,B,Z} ∈ {SA, SB, SX, SY, ACC_READ_LO, ACC_READ_HI, IMM_ZERO}`, and `mask_src ∈ {SC_mask, IMM_ALL_ONES, IMM_FROM_SOP}`, §5.4). Each value staging register (SA, SB) directly drives one of the two 512 B value-operand buses entering stage (A); SC drives the 1-bit-per-lane mask bus independently. The microcode has already decided which register feeds which slot for each beat, so no arbitration is needed.
+2. **Col-mode view selection** is now done **inside** each staging register (per-beat `tilelet_xpose`, §3.3b, §4.2.1). Stage (A) receives a fully-resolved 512 B tilelet in the requested mode and only has to run the element-level fixup (when `W ∈ {128, 256, 1024, 2048, 4096}`).
+
+The datapath is therefore a **straight 2-lane value pipe** from `{SA, SB}` to stage (A) plus a **1-bit-per-lane mask sideband** from `SC`, with a scalar sideband (`SX`, `SY`) broadcast to 128 lanes on demand.
+
+### 5.3 Stage (A) — format-aware unpack and element-level fixup
+
+Stage (A) has two independent per-value-slot pipes (one each for the A, B inputs to the ALU) and a parallel 128-lane mask unpack lane. Each value pipe receives a 512 B tilelet per beat and performs:
+
+1. **Format unpack** for `E ∈ {4, 2, 1, 0.5}`, widening to a 128-slot FP32-shaped operand stream. For `fp4`, a 512 B beat (**1024 logical elements**) is serialised into two back-to-back 128-slice sub-beats; microcode compensates by issuing two beats of the same `{src, s, xp}` pair with an fp4 sub-beat counter.
+2. **Col-mode element-level fixup** when the tilelet arrived with `tilelet_xpose = 1` and the operand's `W = C·E` is in the non-aligned regime `{128, 256, 1024, 2048, 4096}`. This is the same shape-parameterized byte-segment recombine [`tregfile4k.md`](tregfile4k.md) §7.5 expects of any col-mode consumer, reusing the [`vector4k.md`](vector4k.md) §4.1 (A) align / unpack / permute block.
+3. **Scalar splice.** When the beat names `SX` or `SY` as an operand slot source, stage (A) broadcasts the scalar value (after tag-driven extraction, §4.3) across all 128 lanes of that slot.
+
+For the two aligned regimes (`W ≤ 64 B` or `W = 512 B`) stage (A) is identity for both row-mode and col-mode tilelets — no fixup is required.
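+
+The fixup condition can be written as a single predicate. This is a minimal sketch under the regimes listed above; `needs_col_fixup` and `NON_ALIGNED_W` are hypothetical names, not identifiers from the design.
+
+```python
+# Non-aligned row widths in bytes, as enumerated in step 2 above.
+NON_ALIGNED_W = {128, 256, 1024, 2048, 4096}
+
+def needs_col_fixup(w_bytes, tilelet_xpose):
+    """True when stage (A) must run the byte-segment recombine for this beat."""
+    return bool(tilelet_xpose) and w_bytes in NON_ALIGNED_W
+
+print(needs_col_fixup(512, 1))    # aligned regime: identity
+print(needs_col_fixup(1024, 1))   # non-aligned col-mode: fixup required
+print(needs_col_fixup(1024, 0))   # row-mode: no fixup
+```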
+
+### 5.4 Microcode beat format
+
+`SOP.ucode_{base,len}` names a program of `ucode_len ≤ 64` **beat words**, one per compute cycle. Each beat word is ~64 bits wide and specifies:
+
+| Field | Width | Description |
+|-------|------:|-------------|
+| `src_A`, `src_B`, `src_Z` | 3 × 3 b | **Value-input** source per ALU slot — drives the per-slot input MUX at stage (B) (§5.8): `{SA, SB, SX_broadcast, SY_broadcast, ACC_READ_LO, ACC_READ_HI, IMM_ZERO, —}`. `A` / `B` are the two primary inputs; `Z` is the optional 3rd value input (FMA addend) — typically set to `ACC_READ_*` for FMA-accumulate or to `—` for non-FMA beats. **Note:** `SC` is *not* a value source — it only feeds the mask input via `mask_src` below (§3.3c). `—` = slot unused. |
+| `s_A`, `s_B`              | 2 × 3 b | Strip / tilelet index `∈ 0..7` for each value slot that sources from a staging register (`SA` / `SB`). The `Z` slot, when it sources `ACC_READ_*`, uses `acc_slot` below instead |
+| `xp_A`, `xp_B`            | 2 × 1 b | Per-slot `tilelet_xpose` (§3.3b): `1 = col-mode`, `0 = row-mode`. Ignored for `SX` / `SY` / `ACC_READ_*` / `IMM_ZERO` sources |
+| `mask_src`                | 2 b     | **Mask-input** source: `{SC_mask, IMM_ALL_ONES, IMM_FROM_SOP, —}`. Defaults to `IMM_ALL_ONES` when `SOP.has_mask = 0` (the microcode assembler sets this statically). `SC_mask` reads through the `SC` lane-predicate port (§4.2.2) |
+| `mask_strip`              | 3 b     | Strip (or sub-strip) index of `SC` to present on the mask bus this beat; ignored unless `mask_src = SC_mask` |
+| `alu_op`                  | 5 b     | ALU operation at this beat (`ADD`, `SUB`, `MUL`, `FMA`, `FNMA` *= –A·B+Z*, `MAX`, `MIN`, `CMP`, `AND`, `XOR`, `PASS_A`, `PASS_B`, `SELECT`, **`RECIP`**, **`RSQRT`**, **`SHUFFLE_CAS_UP`**, **`SHUFFLE_CAS_DOWN`**, …). Overrides `SOP.opcode` per beat so that reduction loops can mix `PASS` (load) and `ADD` (accumulate) beats. `SELECT` uses the mask bus directly as its predicate. `RECIP` / `RSQRT` return the ~11-bit approximate reciprocal / reciprocal-square-root of `src_A` lane-wise, intended as a seed for 2–3 FMA-based Newton–Raphson refinement beats (used by `TINV`, §7.5.1). The reciprocal unit is a small per-group LUT + mantissa-shift block sharing the stage-(B) FMA's mantissa path; adding it costs ~50 K gates system-wide. `SHUFFLE_CAS_{UP,DOWN}` drive the 256-lane shuffle + compare-swap primitive (§7.5.3): inputs `(src_A, src_B)` are two 128-lane strips forming a 256-lane vector, an extra 3-bit **`shuffle_stride`** field (`2^0..2^7` = `{1, 2, 4, 8, 16, 32, 64, 128}`, reusing `xp_A` + `s_A` low bits of the beat word when `alu_op = SHUFFLE_CAS_*`) selects the log-stride butterfly permutation, and the 128 per-pair comparators emit `(min, max)` for `UP` or `(max, min)` for `DOWN`. Output is two 128-lane strips written back via `wr_en_D0 / D1` or to staging (§7.5.3). |
+| `acc_op`                  | 3 b     | Accumulator behaviour: `{NONE, INIT, ACCUM, MERGE_STAGE, READOUT}` — see §5.5 |
+| `acc_slot`                | 4 b     | Dual-use field (§5.5.1). **When `acc_op ∈ {ACCUM, INIT, READOUT}`**: low 2 b select LO / HI × ping-pong plane 0 / 1 (v0.9 semantics); upper 2 b unused / tied 0. **When `acc_op = MERGE_STAGE`**: bits `[2:0]` = `merge_bit ∈ 0..7` (the fold stride = `2^merge_bit`); bit `[3]` = `merge_base_parity` for partial-plane / per-row folds (§5.7.2, §5.7.3). The assembler forces `[2:0] = 7` for non-merge beats so the read MUX degenerates to the identity `LO = Acc[0..127], HI = Acc[128..255]` view |
+| `wr_en_D0`, `wr_en_D1`    | 2 × 1 b | If set, the beat's pack / retire output is committed to `D0` / `D1` write port (targeting the next strip slot at the write port's FSM) |
+| `wr_strip_D0`, `wr_strip_D1` | 2 × 3 b | Strip index to present on each write port when the corresponding `wr_en_*` is set |
+
+The microcode program is a **ROM lookup** keyed by opcode category + shape regime (`(opcode, W-regime, R-regime, format)`); microcode can be re-generated by a software assembler per workload without RTL change. `ucode_len` typically ranges from 8 beats (plain element-wise on a single tile) to ~40 beats (narrow-tile col-reduce with accumulator merge).
+
+Each beat word traces a straight path from `{src_*, s_*, xp_*}` through the staging-register read datapath (§4.2.1) to stage (A) to stage (B) to Acc to (optional) pack → `D0` / `D1`. No control state outlives the beat itself; the only state carried across beats is the accumulator ping-pong plane and the per-operand strip-valid bitmap.
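+
+As a sanity check on the field table, the sketch below packs and unpacks a beat word. The field widths are copied from the table; the bit ordering (first field in the least-significant bits), the numeric encodings, and the names `BEAT_FIELDS`, `pack_beat` and `unpack_beat` are assumptions made for illustration.
+
+```python
+# Widths from the beat-word table; the field order within the word is assumed.
+BEAT_FIELDS = [
+    ("src_A", 3), ("src_B", 3), ("src_Z", 3),
+    ("s_A", 3), ("s_B", 3),
+    ("xp_A", 1), ("xp_B", 1),
+    ("mask_src", 2), ("mask_strip", 3),
+    ("alu_op", 5), ("acc_op", 3), ("acc_slot", 4),
+    ("wr_en_D0", 1), ("wr_en_D1", 1),
+    ("wr_strip_D0", 3), ("wr_strip_D1", 3),
+]
+
+def pack_beat(**values):
+    """Pack named fields into one integer; unspecified fields default to 0."""
+    word, shift = 0, 0
+    for name, width in BEAT_FIELDS:
+        v = values.get(name, 0)
+        if not 0 <= v < (1 << width):
+            raise ValueError(f"{name}={v} does not fit in {width} bits")
+        word |= v << shift
+        shift += width
+    return word
+
+def unpack_beat(word):
+    """Inverse of pack_beat: return a dict of field values."""
+    out, shift = {}, 0
+    for name, width in BEAT_FIELDS:
+        out[name] = (word >> shift) & ((1 << width) - 1)
+        shift += width
+    return out
+
+USED_BITS = sum(width for _, width in BEAT_FIELDS)
+print(USED_BITS)
+```
+
+The listed fields account for 42 bits, so a ~64-bit beat word leaves room for fields not shown in the table.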
+
+### 5.5 Accumulator — banked dual-read, ALU-mediated RMW
+
+The accumulator retains the 256 × 32 b × 2 ping-pong capacity of [`vector4k.md`](vector4k.md) §4.1 but reorganises its read / write ports so that the stage (B) ALU (§5.8) is the **only** arithmetic unit in the datapath — there is no separate accumulator adder.
+
+**Physical layout.** Each of the two ping-pong planes is split into two 128-slot sub-banks:
+
+```
+  Plane P (= 0 or 1):
+    sub-bank P.LO:  acc_slot  0 .. 127     ← 1R1W port (read_LO, write_LO)
+    sub-bank P.HI:  acc_slot 128 .. 255    ← 1R1W port (read_HI, write_HI)
+```
+
+The split is by slot index; each sub-bank has one read port and one write port. Two reads from different sub-banks are concurrent; two reads inside the same sub-bank still serialise, but microcode never schedules that case (§5.6.3 `MERGE_STAGE` always pairs a low-range slot with a high-range slot, and `ACCUM` only touches one sub-bank per beat).
+
+**Feedback to ALU input.** Each of the stage (B) ALU operand-input MUXes accepts `ACC_READ_LO` or `ACC_READ_HI` as a source (§5.4, §5.8); so `acc[k]` is just another operand the ALU can consume. The ALU output flows back to a write port of the same or the other sub-bank on the same beat — that is the "RMW" path.
+
+**`acc_op` field — routing only, no arithmetic.** The microcode `acc_op` field is now a pure routing selector for the Acc write port (the arithmetic lives in `src_*` + `alu_op`):
+
+| `acc_op` | Write-port behaviour (read / ALU routing is in `src_*`) |
+|----------|---------------------------------------------------------|
+| `NONE`        | Acc write ports idle; ALU output goes to Pack (bypass). Used for element-wise ops. |
+| `INIT`        | ALU output written to `acc[acc_slot]`, overwriting the old value. First beat of a reduction pass — microcode normally pairs with `alu_op = PASS_A` so the ALU acts as a load. |
+| `ACCUM`       | ALU output written to `acc[acc_slot]`. Pair with `src_B = ACC_READ_LO / HI` and `alu_op = ADD` (or `MAX`, …). For GEMM-style **FMA-accumulate**, use `src_A = SA, src_B = SB, src_Z = ACC_READ_*, alu_op = FMA`. Functionally identical to the previous "Acc RMW add" but performed by the stage (B) ALU via the feedback MUX. |
+| `MERGE_STAGE` | Same write port as `ACCUM` but `src_A = ACC_READ_LO, src_B = ACC_READ_HI, alu_op = ADD` (or `MAX` for `argmax`-style merges). Both ALU inputs come from the Acc sub-banks; the stage (A) operand pipe is idle this beat. Writes back to the low sub-bank by default. |
+| `READOUT`     | Read `acc[acc_slot]` through the ALU (`src_A = ACC_READ_{LO,HI}, alu_op = PASS_A`) and forward to Pack; Acc write ports idle. |
+
+The same reduction programs of §5.6 and §5.7 remain valid with exactly the pseudo-code already shown — the `acc_op` keywords now denote the write-port behaviour while the read-back + arithmetic flows through the stage (B) ALU automatically. The concrete microcode-assembler expansion of `ACCUM` is
+
+```
+# microcode macro: ACCUM_ADD(op_src, op_slot, op_xp, k)
+BEAT:
+    src_A   = op_src      # SA / SB (operand pipe); SC is mask-only, never a value source
+    s_A     = op_slot
+    xp_A    = op_xp
+    src_B   = (k < 128 ? ACC_READ_LO : ACC_READ_HI)   # the one holding acc[k]
+    alu_op  = ADD
+    acc_op  = ACCUM
+    acc_slot = k
+```
+
+and a `MERGE_STAGE` beat that halves the partial-sum count in narrow-tile reductions (§5.6.3) is
+
+```
+# microcode macro: MERGE_STAGE_ADD(lo_base, hi_base, width)
+BEAT:
+    src_A     = ACC_READ_LO   # slot = lo_base
+    src_B     = ACC_READ_HI   # slot = hi_base
+    alu_op    = ADD
+    acc_op    = MERGE_STAGE
+    acc_slot  = lo_base       # write destination
+```
+
+Microcode can equally issue `MAX` / `MIN` / `OR` at `MERGE_STAGE` beats to fold partial argmax / argmin / bitmask reductions — again entirely through the shared stage (B) ALU.
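+
+The split between routing (`acc_op`) and arithmetic (`alu_op`) can be illustrated with a toy single-lane model. It is a simplification under stated assumptions: one lane, integer values, and no Pack output on beats that write the accumulator. `run_beat` and the `ALU` table are hypothetical helpers.
+
+```python
+# Toy single-lane model: acc_op only routes the ALU result, alu_op does the math.
+ALU = {
+    "ADD": lambda a, b: a + b,
+    "MAX": lambda a, b: max(a, b),
+    "PASS_A": lambda a, b: a,
+}
+
+def run_beat(acc, a, b, alu_op, acc_op, slot):
+    """Return (new_acc, pack_output); pack_output is None when Pack is idle."""
+    result = ALU[alu_op](a, b)
+    acc = list(acc)
+    if acc_op in ("INIT", "ACCUM", "MERGE_STAGE"):
+        acc[slot] = result          # write port active (assumed: nothing to Pack)
+        return acc, None
+    return acc, result              # NONE / READOUT: write ports idle
+
+acc = [0, 0]
+acc, _ = run_beat(acc, 5, None, "PASS_A", "INIT", 0)             # acc[0] = 5
+acc, _ = run_beat(acc, 7, acc[0], "ADD", "ACCUM", 0)             # acc[0] = 12
+acc, _ = run_beat(acc, 3, None, "PASS_A", "INIT", 1)             # acc[1] = 3
+acc, _ = run_beat(acc, acc[0], acc[1], "ADD", "MERGE_STAGE", 0)  # acc[0] = 15
+_, out = run_beat(acc, acc[0], None, "PASS_A", "READOUT", 0)
+print(acc, out)
+```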
+
+#### 5.5.1 Accumulator feedback MUX — `MERGE_STAGE` pair-selection network
+
+The `MERGE_STAGE` beat asks the ALU, for every lane `k ∈ 0..127` in parallel, to consume the pair `(Acc[a_k], Acc[b_k])` and write the result back to `Acc[a_k]`, where the pair is chosen by a **microcode-programmable fold-bit** `merge_bit ∈ 0..7`:
+
+```
+  b_k = a_k ⊕ (1 << merge_bit)            # paired at stride 2^merge_bit
+```
+
+All nine `MERGE_STAGE` beats appearing in §5.6.3 / §5.7.2 / §5.7.3 / §5.7.4 are instances of this primitive with different `merge_bit` values (7 for LO↔HI, 6 for stride-64, 5 for stride-32, …, 0 for stride-1). This subsection specifies the physical layout and MUX network that realise it in **one cycle with no structural hazard**, and shows how the simpler `ACCUM` / `INIT` / `READOUT` access patterns fall out as a degenerate case of the same hardware.
+
+**Physical layout — parity-indexed two-bank plane.** Each ping-pong plane holds 256 × 32 b in flip-flops, partitioned into two sub-banks by the **parity of the slot index**:
+
+```
+  bank(s)  = s[7] ⊕ s[6] ⊕ s[5] ⊕ s[4] ⊕ s[3] ⊕ s[2] ⊕ s[1] ⊕ s[0]   # XOR of all 8 bits
+  intra(s) = s[6:0]                                                     # 7-bit intra-bank index
+
+  Bank 0  (128 × 32 b, 1R1W, flip-flop)  ← slots with parity(s) = 0   e.g. {0, 3, 5, 6, 9, 10, 12, 15, …}
+  Bank 1  (128 × 32 b, 1R1W, flip-flop)  ← slots with parity(s) = 1   e.g. {1, 2, 4, 7, 8, 11, 13, 14, …}
+```
+
+Parity-based partitioning is the *key* structural trick: any pair `(s, s ⊕ (1 << k))` for **any** `k ∈ 0..7` has opposite parity, hence the two pair members are **always in different banks**. This eliminates bank conflicts at every fold stride `2^k` and is what lets a single beat service all 128 lanes.
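+
+The conflict-free property is easy to check exhaustively. The sketch below is a plain software model of the `bank(s)` / `intra(s)` mapping defined above, not RTL.
+
+```python
+def parity(s):
+    """XOR of all bits of s (0 or 1)."""
+    return bin(s).count("1") & 1
+
+def bank(s):
+    return parity(s)
+
+def intra(s):
+    return s & 0x7F            # s[6:0]
+
+# Every fold partner lands in the opposite bank, for every stride 2^k.
+conflict_free = all(
+    bank(s) != bank(s ^ (1 << k))
+    for s in range(256) for k in range(8)
+)
+
+# Within one bank the 7-bit intra index is unique, so 128 slots fit exactly.
+bank0 = sorted(intra(s) for s in range(256) if bank(s) == 0)
+bank1 = sorted(intra(s) for s in range(256) if bank(s) == 1)
+print(conflict_free, bank0 == list(range(128)), bank1 == list(range(128)))
+```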
+
+> The "sub-bank split by slot index" language in the earlier §5.5 paragraph refers to *this* parity split — the names `LO`, `HI` are preserved only as shorthands for the two ALU operand buses produced by the MUX network below, not as address ranges. At `merge_bit = 7` the MUX collapses to the familiar `LO = Acc[0..127]`, `HI = Acc[128..255]` view (see the fallback table below).
+
+**Per-lane pair-address generator.** For each lane `k`, the address of the "canonical low" slot is produced by a tiny combinational block:
+
+```
+  a_k = insert_bit_zero(k, merge_bit)     # 8-bit output: bit merge_bit forced to 0, other 7 bits = k
+  b_k = a_k | (1 << merge_bit)            # same 7 remaining bits, fold-bit set to 1
+  bank_of_a_k = parity(a_k) = parity(k)   # because the forced-zero bit contributes 0 to the parity
+```
+
+where `insert_bit_zero(k, p)` is a 3-bit-controlled bit-shifter: `{k[6:p], 1'b0, k[p-1:0]}`. The total hardware per lane is one 8-mux → 3 gate levels, trivially combinational.
+
+A useful consequence of `parity(a_k) = parity(k)`: the bank routing of lane `k` is **fixed independent of `merge_bit`** — lanes with even `parity(k)` always read/write Bank 0, odd lanes Bank 1. Only the *intra-bank address* `a_k[6:0]` varies with `merge_bit`.
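+
+A software model of the pair-address generator, assuming 7-bit lane ids and 8-bit slot addresses as described above. It checks that the bank of `a_k` does not depend on `merge_bit` and that the 128 pairs cover all 256 slots exactly once; `pair` is a hypothetical helper name.
+
+```python
+def parity(x):
+    return bin(x).count("1") & 1
+
+def insert_bit_zero(k, p):
+    """{k[6:p], 1'b0, k[p-1:0]}: open a zero at bit p of the 7-bit lane id."""
+    high = (k >> p) << (p + 1)
+    low = k & ((1 << p) - 1)
+    return high | low
+
+def pair(k, merge_bit):
+    a = insert_bit_zero(k, merge_bit)
+    b = a | (1 << merge_bit)
+    return a, b
+
+for m in range(8):
+    slots = []
+    for k in range(128):
+        a, b = pair(k, m)
+        assert parity(a) == parity(k)          # bank of a_k is independent of m
+        assert parity(b) != parity(k)          # partner sits in the other bank
+        slots += [a, b]
+    assert sorted(slots) == list(range(256))   # each slot belongs to one pair
+```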
+
+**Read-side MUX network.** Each beat reads all 128 slots from each bank (the flip-flop register file exposes every slot combinationally), producing two 128 × 32 b = 512 B read planes `V_0` and `V_1`. A **3-bit-controlled 128→128 shuffle** per plane then steers the correct slot to each lane:
+
+```
+  Bank 0 output (128 × 32 b)  ──┐
+                                 ├─▶  shuffle[merge_bit]  ─▶  U_0[k] = V_0[ intra(a_k if parity(k)=0 else b_k) ]
+  Bank 1 output (128 × 32 b)  ──┤
+                                 └─▶  shuffle[merge_bit]  ─▶  U_1[k] = V_1[ intra(b_k if parity(k)=0 else a_k) ]
+```
+
+The shuffle is implemented as a **two-level MUX tree**: `merge_bit[2]` selects among {stride ≤ 8, stride ≥ 16} butterflies, `merge_bit[1:0]` selects the specific butterfly stage. Area ≈ `128 × 32 b` = 4 Kb of 8:1 MUX outputs (32 K input bits, equivalent to three 2:1 MUX levels), roughly comparable to the ALU input-select layer (§5.8). Only 8 static permutations are needed (one per `merge_bit` value), so gate sharing across stages keeps the tree small.
+
+**Swap MUX (per-lane 2:1).** The two shuffled buses `U_0`, `U_1` carry `{Acc[a_k], Acc[b_k]}` but in an order that depends on `parity(k)`. A per-lane 2:1 MUX keyed on `parity(k)` presents them to the ALU as `(LO, HI)`:
+
+```
+  if parity(k) = 0:   LO[k] = U_0[k],  HI[k] = U_1[k]
+  else:               LO[k] = U_1[k],  HI[k] = U_0[k]
+```
+
+Cost: 128 × 32 b × 2:1 = one wired-OR layer, negligible vs. the ALU.
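+
+The read shuffle and swap MUX can be modelled together to confirm that each lane sees `(Acc[a_k], Acc[b_k])` on `(LO, HI)`. This is a behavioural sketch that indexes the banks directly rather than modelling the butterfly stages; `read_pairs` is a hypothetical name.
+
+```python
+def parity(x):
+    return bin(x).count("1") & 1
+
+def insert_bit_zero(k, p):
+    return ((k >> p) << (p + 1)) | (k & ((1 << p) - 1))
+
+def read_pairs(acc, merge_bit):
+    """Return (LO, HI) where LO[k] = Acc[a_k] and HI[k] = Acc[b_k]."""
+    # Split the 256-slot plane into the two parity banks, indexed by s[6:0].
+    v = [[None] * 128, [None] * 128]
+    for s, value in enumerate(acc):
+        v[parity(s)][s & 0x7F] = value
+
+    lo, hi = [], []
+    for k in range(128):
+        a = insert_bit_zero(k, merge_bit)
+        b = a | (1 << merge_bit)
+        in_bank0, in_bank1 = (a, b) if parity(k) == 0 else (b, a)
+        u0 = v[0][in_bank0 & 0x7F]           # shuffled Bank 0 output U_0[k]
+        u1 = v[1][in_bank1 & 0x7F]           # shuffled Bank 1 output U_1[k]
+        # Swap MUX keyed on parity(k).
+        lo.append(u0 if parity(k) == 0 else u1)
+        hi.append(u1 if parity(k) == 0 else u0)
+    return lo, hi
+
+acc = list(range(1000, 1256))                # slot s holds 1000 + s
+for m in range(8):
+    lo, hi = read_pairs(acc, m)
+    for k in range(128):
+        a = insert_bit_zero(k, m)
+        assert lo[k] == acc[a]
+        assert hi[k] == acc[a | (1 << m)]
+```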
+
+**ALU input.** `LO` and `HI` feed the stage (B) value-input MUXes (§5.8) as the `ACC_READ_LO` and `ACC_READ_HI` sources. `MERGE_STAGE` beats drive `src_A = ACC_READ_LO`, `src_B = ACC_READ_HI`, `alu_op = ADD` (or `MAX` / `MIN` / `OR` / `XOR` for non-sum merges); the 128-lane ALU produces 128 results in parallel.
+
+**Write-back network.** The ALU's 128 × 32 b output `W[k]` is routed back via the **inverse** of the read-side shuffle + swap:
+
+```
+  lane k  ──▶  bank parity(k)  at intra-bank address a_k[6:0]
+```
+
+Because `parity(a_k) = parity(k)` is fixed, bank selection is a per-lane static routing (lane 0 → bank 0, lane 1 → bank 1, lane 2 → bank 1, lane 3 → bank 0, …). Only the *within-bank address* depends on `merge_bit`, via the same 3-bit-controlled permutation used on the read side. Per-bank area: one 64→64 reconfigurable permutation (since each bank receives exactly 64 writes per `MERGE_STAGE` beat, one per even-parity or odd-parity lane).
+
+The paired slot `Acc[b_k]` is **not** written — its old value remains in the opposite bank. Subsequent `MERGE_STAGE` beats with a smaller `merge_bit` only touch canonical-low slots again (fold-bit = 0 of the new merge axis), so the untouched half becomes stale-but-unreferenced and never influences downstream beats. After a complete `log₂(W/E)`-beat fold chain only the final canonical-low slot of each fold chain carries useful data; Pack / retire then reads exactly those slots.
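+
+A behavioural sketch of one `MERGE_STAGE` beat and of a full-plane fold chain, assuming integer addition as the merge operator. It shows that `Acc[b_k]` is left untouched by each beat and that slot 0 ends up holding the sum of all 256 slots; `merge_stage` is a hypothetical helper.
+
+```python
+def insert_bit_zero(k, p):
+    return ((k >> p) << (p + 1)) | (k & ((1 << p) - 1))
+
+def merge_stage(acc, merge_bit, op=lambda x, y: x + y):
+    """One MERGE_STAGE beat: all 128 lanes read a pair, write only Acc[a_k]."""
+    new = list(acc)
+    for k in range(128):
+        a = insert_bit_zero(k, merge_bit)
+        b = a | (1 << merge_bit)
+        new[a] = op(acc[a], acc[b])     # Acc[b_k] keeps its old value
+    return new
+
+start = list(range(256))
+
+# First beat (merge_bit = 7): the upper half is read but never written.
+after_first = merge_stage(start, 7)
+upper_untouched = after_first[128:] == start[128:]
+
+# Full-plane fold: merge_bit = 7, 6, ..., 0.
+acc = start
+for m in range(7, -1, -1):
+    acc = merge_stage(acc, m)
+total = acc[0]
+print(upper_untouched, total)
+```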
+
+**Fallback for `ACCUM` / `INIT` / `READOUT` — the degenerate `merge_bit = 7` case.** These non-merge `acc_op` modes use the **same** physical machinery with `merge_bit` hard-wired to 7 and `HI` used either as the accumulate read-back or as a spare FMA addend path:
+
+| `acc_op`  | `merge_bit` | `a_k` pattern | Effective mapping |
+|-----------|-------------|---------------|-------------------|
+| `ACCUM` (`acc_slot` → LO half) | 7 | `a_k = k`  (bit 7 forced 0, k∈0..127) | Read `LO[k] = Acc[k]`, HI idle or = `Acc[k+128]` (FMA addend). Write lane `k` → `Acc[k]`. |
+| `ACCUM` (`acc_slot` → HI half) | 7 | `a_k = k + 128`  (bit 7 set by `acc_slot[0]`) | Read `LO[k] = Acc[k+128]`, HI idle. Write lane `k` → `Acc[k+128]`. |
+| `INIT`    | 7 | `a_k = k` | Read suppressed; write lane `k` → `Acc[a_k]` with stage-A unpack output. |
+| `READOUT` | 7 | `a_k = k` | Read `LO[k] = Acc[k]`, forward to Pack; write ports idle. |
+
+With `merge_bit = 7` the read shuffle is the identity permutation, the swap MUX settles to `LO = Bank of parity 0 ∪ Bank of parity 1 = Acc[0..127]` automatically, and the network reproduces the v1-style `LO / HI` halves without any special casing in the microcode assembler.
+
+**Microcode encoding (extension to §5.4).** `acc_slot` is widened from 2 b (v0.9) to 4 b:
+
+- Bits `[2:0]` = `merge_bit` when `acc_op = MERGE_STAGE`; otherwise (ACCUM / INIT / READOUT) these bits are forced to 7 by the assembler and the field picks LO / HI + ping-pong plane (the original 4-state semantics of v0.9 are preserved).
+- Bit `[3]` = `merge_base_parity` for partial-plane merges (e.g. the wide-row-reduce's 16-wide per-row fold uses bit 3 to mark "only even-indexed rows participate in this beat"). For full-plane merges bit 3 = 0.
+
+**Cost summary.**
+
+| Block | Approx. size | Note |
+|-------|--------------|------|
+| Per-plane flip-flop storage | 256 × 32 b = 8 Kb | unchanged from [`vector4k.md`](vector4k.md) §4.1 |
+| Ping-pong duplication | × 2 | 16 Kb per VEC instance |
+| Read shuffle MUX (per plane, × 2 banks) | 8-way 128 × 32 b permutation | ~0.5 × ALU input-layer area |
+| Swap MUX | 128 × 32 b × 2:1 | 1 per lane, controlled by `parity(k)` |
+| Write-back MUX | 8-way 64 × 32 b permutation per bank | same configuration input as read-side |
+| **Net v2 vs. v1 (§5.8.1)** | +~1 ALU-input-layer's worth of MUX | absorbed by the simultaneous removal of the v1 dedicated Acc RMW adder; net area change ≈ 0 |
+
+**Pipeline implications.** The `MERGE_STAGE` read → shuffle → ALU → write path closes in a single beat for 1-cycle ALU ops (`ADD`, `MAX`, `AND`, …). Back-to-back `MERGE_STAGE` beats at decreasing `merge_bit` never touch the same slot twice (each beat reads from both banks but writes only half), so no RAW hazard arises and no pipeline bubbles are needed. This is what allows the 4-beat fold chain of §5.7.2 (16 within-row partials → 1 row-sum) to run at 1 beat / cycle.
+
+### 5.6 Reduction strategy
+
+The combination of microcode-controlled tilelet dispatch, per-beat `tilelet_xpose`, and accumulator RMW gives VEC-v2 a simple, uniform way to express both axis reductions (`TCOL*`, `TROW*`) without any crossbar or dedicated reduce-tree schedule. The recipes below cover all legal tile shapes.
+
+#### 5.6.1 Col-axis reduce (`TCOL*`) — primary pattern
+
+`TCOL*` sums along the `R` axis of the `R × C` tile, producing a `1 × C` output. Per-element structure: each output column `j` = `Σ_i A[i, j]`. For tiles wide enough that each logical row occupies one or more full 512 B strips (`W = C·E ≥ 512 B`), the pattern is a direct accumulator loop:
+
+```
+# pseudo-code, tile A: R × C, W = C·E bytes, assume W = k · 512 (k = 1 for W = 512, k = 2 for W = 1024, etc.)
+# output D0: 1 × C row-vector, packed into k strips
+
+FOR j_band IN 0 .. k-1:              # which 512 B slice of the logical row
+    # the first beat of each col-band uses INIT (overwrites acc); later beats use ACCUM
+    FOR row_i IN 0 .. R-1:            # every logical row
+        s = row_i * k + j_band         # which strip of A holds row row_i's j_band-th slice
+        BEAT:
+            src_A = SA, s_A = s, xp_A = 0     # row-mode tilelet
+            src_B = —                          # unary accumulate-add
+            alu_op = PASS_A
+            acc_op = (row_i == 0) ? INIT : ACCUM
+            acc_slot = j_band
+    BEAT:                             # one retire beat per col-band
+        src_A = ACC_readback, acc_slot = j_band
+        wr_en_D0 = 1, wr_strip_D0 = j_band
+```
+
+Total beats: `k · R + k` = `k · (R + 1)`. For the canonical `W = 512 B, R = 8` case this is `1 · 9 = 9` compute beats + fetch overhead, matching [`vector4k.md`](vector4k.md) §5.3's `#W = 1` template.
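+
+The loop structure and the beat count can be reproduced with a small software model. The sketch assumes FP32 (128 elements per 512 B strip) and models each beat as a whole-strip operation; `tcol_sum` is a hypothetical name.
+
+```python
+def tcol_sum(tile, k=1, lanes=128):
+    """tile[row] is a list of k*lanes values; returns (column sums, beat count)."""
+    rows = len(tile)
+    acc = {}                      # acc[j_band] -> list of `lanes` partial sums
+    out, beats = [], 0
+    for j_band in range(k):
+        for row_i in range(rows):
+            strip = tile[row_i][j_band * lanes:(j_band + 1) * lanes]
+            if row_i == 0:
+                acc[j_band] = list(strip)                                   # INIT
+            else:
+                acc[j_band] = [x + y for x, y in zip(acc[j_band], strip)]   # ACCUM
+            beats += 1
+        out += acc[j_band]        # retire beat: accumulator read-back to D0
+        beats += 1
+    return out, beats
+
+tile = [[r * 1000 + c for c in range(128)] for r in range(8)]
+sums, beats = tcol_sum(tile)
+print(beats, sums[0], sums[127])
+```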
+
+#### 5.6.2 Row-axis reduce (`TROW*`) via tilelet transpose (recommended baseline)
+
+`TROW*` sums along the `C` axis, producing an `R × 1` output. Per-element structure: each output row `i` = `Σ_j A[i, j]`. The **recommended baseline** (no cross-group reducer network, §5.8.1) expresses row-reduce as a two-phase operation: (1) a col-axis-style `ACCUM` loop on the tilelet-transposed view of the operand, which collects row-sums as multiple partials per row spread across 128 lane-slots, followed by (2) a short `MERGE_STAGE` phase that folds those partials down to one sum per row. Both phases stay strictly lane-aligned; no data crosses a group boundary.
+
+For the canonical `W = 512 B, R = 8` shape (one full row per strip), a col-mode staged read of strip `s` delivers chunk `(i, s)` of the tile on output lane `i` for `i = 0..7`, that is, the 16-FP32 col-band `[s·16 .. s·16+15]` from each of the 8 rows, one slice per lane-group of 16. After 8 `ACCUM` beats (one per col-band), the accumulator plane holds each row-sum as **16 partial sums per row**, one per within-row position `k ∈ 0..15`. A chain of `log₂(16) = 4` `MERGE_STAGE` beats then folds the 16 within-row partials down to 1:
+
+```
+# pseudo-code, tile A: R × C with W = 512 B, R = 8 (canonical wide one-row-per-strip shape)
+# output D0: 8 × 1 col-vector, packed into strip 0
+
+# ---- phase 1: xpose-accumulate (8 beats) ----
+FOR s IN 0 .. 7:                               # 8 col-band beats
+    BEAT:
+        src_A   = SA, s_A = s, xp_A = 1        # COL-MODE tilelet — chunk (i, s)
+        alu_op  = PASS_A                        # per-lane identity, sum via acc_op
+        acc_op  = (s == 0) ? INIT : ACCUM
+        acc_slot = 0
+
+# after phase 1: for g ∈ 0..7, k ∈ 0..15
+#   acc[16·g + k] = Σ_{s=0..7} A[g, s·16 + k]   # 16 within-row partials per row
+
+# ---- phase 2: within-row halving merge (log2(16) = 4 beats) ----
+FOR stride IN {8, 4, 2, 1}:                    # halve the within-row k-axis each beat
+    BEAT:
+        acc_op   = MERGE_STAGE
+        acc_slot = within-row, stride           # acc[16·g + k] += acc[16·g + k + stride]
+                                               # for k ∈ 0..stride-1, g ∈ 0..7
+
+# after phase 2: acc[16·g] = Σ_{j=0..127} A[g, j]   for g ∈ 0..7   (eight row-sums)
+
+# ---- retire ----
+BEAT:
+    src_A = ACC_readback, acc_slot = 0
+    wr_en_D0 = 1, wr_strip_D0 = 0               # pack slots {0, 16, 32, …, 112} → strip[0..7]
+```
+
+> **Why tilelet_xpose converts row-reduce into col-reduce.** In row-mode, strip `s` = logical row `s` of the tile (under `W = 512 B`), so accumulating strips across `s` would sum *different rows* — i.e. a col-axis reduction. In col-mode, strip `s` = chunk-grid column `s`, which means the same chunk-grid column (= the same col-band `[s·16 .. s·16+15]`) from every row is packed into one tilelet; accumulating across `s` now sums *different col-bands of each row into that row's own slot* — i.e. a row-axis reduction, spread over 16-FP32 sub-groups on each lane. The 16-way spread is then folded in-Acc by `MERGE_STAGE` without any cross-group wiring.
+
+Total beats: `R + log₂(64 / E) + 1` when `W = 512 B` and the format is `E`-byte; the fold factor `64 / E` is the number of elements in one 64 B row-segment (`log₂(16) = 4` for FP32, so `R = 8` gives `8 + 4 + 1 = 13` beats, matching §5.7.2). For FP16 / BF16 (`E = 2`) the within-row fold is `log₂(32) = 5` beats; for FP8 (`E = 1`) it is `log₂(64) = 6` beats. For the same `W = 512 B` shape with `R < 8`, the microcode fills the unused chunk-grid rows with implicit zeros on initialization. For wider `W ∈ {1024, 2048, 4096}`, the col-mode tilelet delivers 8 × 64 B row-segments per beat; stage (A)'s byte-segment recombine (§5.3) presents the right element-level slices to the per-lane ALU, and the same `ACCUM` + `MERGE_STAGE` pattern runs over a proportionally longer strip count.
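+
+As a quick check of the per-format counts above, the tabulation below works through the canonical `W = 512 B, R = 8` shape. It assumes each lane-group receives one 64 B row-segment per beat (as in the col-mode read described earlier in this subsection), so a row holds `64 / E` within-row partials; the column names are illustrative, not microcode fields.
+
+```
+# wide row-reduce beat count, W = 512 B, R = 8  (illustrative tabulation)
+# total = 8 ACCUM beats + log2(64 / E) MERGE_STAGE beats + 1 retire beat
+#
+# format        E   partials/row (64/E)   MERGE beats   total beats
+# FP32          4   16                    4             8 + 4 + 1 = 13
+# FP16 / BF16   2   32                    5             8 + 5 + 1 = 14
+# FP8           1   64                    6             8 + 6 + 1 = 15
+```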
+
+> **Alternative configuration (§5.8.1, *not* recommended).** An implementation that keeps a cross-group reducer network of [`vector4k.md`](vector4k.md) §4.1 / §5.3.2 can collapse the `log₂` `MERGE_STAGE` beats into the pack stage of the final retire beat, so phase 2 needs no beats of its own. The beat saving applies **only** to `TROW*` and comes at the cost of an additional 128-lane cross-group adder network in stage (B). See §5.7.2-alt for the corresponding 9-beat microcode.
+
+#### 5.6.3 Narrow-tile reductions (`W = C·E < 512 B`) — final-stage accumulator merge
+
+When a logical row is smaller than 512 B, each strip packs `512/W` full logical rows. The col-axis reduction of §5.6.1 then becomes more efficient: one strip contributes to `512/W` partial sums in a single beat, so the accumulator holds **`512/W` separate partial sums, stacked side-by-side**. Because the per-strip ALU does an element-wise add across all 128 lanes (`alu_op = PASS_A` + `acc_op = ACCUM`), these `512/W` partial sums are produced in parallel: the beat for strip `r` adds row `r·(512/W) + k` into accumulator slot-band `k` (`k = 0 .. 512/W − 1`).
+
+After the initial per-strip loop, Acc contains `512/W` independent column-vector partials that still need to be summed together into the final `1 × C` answer. This is the **final accumulator merge** pass, which uses `acc_op = MERGE_STAGE` to hierarchically halve the partial count per beat:
+
+```
+# phase 1: per-strip accumulate (R / (512/W) strips total, i.e. 8 strips for a 4 KB tile)
+acc_op = INIT
+FOR strip_s IN 0 .. 7:
+    BEAT:
+        src_A = SA, s_A = strip_s, xp_A = 0
+        alu_op = PASS_A
+        acc_op = (strip_s == 0) ? INIT : ACCUM
+        acc_slot = 0                             # all 512/W partials share the 128-slot plane
+
+# after phase 1: acc[0 .. 127] = (512/W) × (C in FP32 slots)
+#                partial_k[j]  = Σ_{row r ≡ k (mod 512/W)}  A[r, j]     for k = 0 .. 512/W − 1
+#                                                                          j = 0 .. C − 1
+
+# phase 2: hierarchical merge — log2(512/W) beats
+FOR half IN {(512/W)/2, …, 2, 1}:              # log2(512/W) beats; e.g. 512/W = 8  →  half = 4, 2, 1
+    BEAT:
+        acc_op   = MERGE_STAGE
+        acc_slot = <pair (low, high) in this halving step>
+        # computes acc[low-band] ← acc[low-band] + acc[high-band]  across 128 lanes
+
+# after phase 2: acc[0 .. C−1] = final col-axis sum (C ≤ 128 elements)
+
+BEAT:                                          # retire
+    src_A = ACC_readback, acc_slot = 0
+    wr_en_D0 = 1, wr_strip_D0 = 0
+```
+
+Total beats: `R / (512/W) + log2(512/W) + 1`, where `R / (512/W)` is the number of strips read in phase 1. For `W = 64 B` (i.e. `512/W = 8`) and a 4 KB tile (`R = 64`), that is `8 + 3 + 1 = 12` beats (§5.7.3), vs. the `R = 64` strip replays a v1 `TCOL*` would have needed (~64 beats, with multi-wave accumulator management, [`vector4k.md`](vector4k.md) §5.3.2). The narrow-tile path wins by collapsing `#W` to 1 and replacing per-wave reduction control with a flat microcode loop.
+
+Narrow-tile **row-reduce** is the dual problem — each strip already packs `512/W` full logical rows, each `W`-byte wide. The recommended baseline (no cross-group network) handles it by **per-strip accumulate + per-strip merge + drain-into-persistent-collector**:
+
+1. `ACCUM` strip `s` into a scratch plane: each beat places `512/W` logical rows × `W/E` within-row partials into the 128 Acc slots (all `R / (512/W)` strips contribute if `R > 512/W`, otherwise just one strip is needed).
+2. After all strips are accumulated, run `log₂(W/E)` `MERGE_STAGE` beats to fold the within-row partials down to one sum per row (same mechanics as the wide case above, just a smaller fold factor).
+3. When the within-row partials of the whole tile do not fit in one 128-slot plane, the microcode splits the operation into `K_outer` outer waves, each processing a row-band. This is analogous to [`vector4k.md`](vector4k.md) §5.3.2's `K_outer` loop and typically covers all practical `(R, W)` corners. The worst case (e.g. `R = 64, C = 16`: 1024 within-row partials, which overflow one plane eightfold) uses `K_outer = 8`, one strip per wave, at a total cost of ~49 beats (§5.7.4).
+
+Stage (B) stays strictly lane-aligned for every beat of the above; the only "cross-lane" work is the `MERGE_STAGE` pair-merge within a sub-bank, which uses the Acc's own LO/HI read ports (§5.5) rather than a compute-group tree.
+
+> **Alternative configuration.** An implementation carrying the optional cross-group reducer network can collapse each strip's within-row fold into the ALU beat itself (emitting `512/W` ready row-sums per beat), reducing the narrow-row-reduce cost to roughly `R / (512/W) + 1` beats — e.g. 9 beats for `R = 64, C = 16` vs. 49 beats in the recommended baseline (§5.7.4 scaling note gives the intermediate-`W` penalties, which taper rapidly to ≤2× as `W` approaches 512 B). See §5.7.4-alt.
+
+### 5.7 Worked microcode examples
+
+The four shape regimes below cover the corners of the wide / narrow × row-reduce / col-reduce space. All examples use FP32 (`E = 4`).
+
+#### 5.7.1 Wide col-reduce — `R = 8, C = 128` (`W = 512 B`)
+
+Tile layout: each logical row occupies exactly one 512 B strip. Output: `1 × 128` row-vector (one strip).
+
+```
+# 9 compute beats
+BEAT 0:  src_A=SA, s_A=0, xp_A=0;  alu=PASS_A;  acc=INIT,  slot=0
+BEAT 1:  src_A=SA, s_A=1, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 2:  src_A=SA, s_A=2, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 3:  src_A=SA, s_A=3, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 4:  src_A=SA, s_A=4, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 5:  src_A=SA, s_A=5, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 6:  src_A=SA, s_A=6, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 7:  src_A=SA, s_A=7, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 8:  src_A=ACC_readback, slot=0;  wr_en_D0=1, wr_strip_D0=0
+```
+
+After BEAT 7, `acc[0..127]` = `Σ_{i=0..7} A[i, ·]`. BEAT 8 retires it as strip 0 of `D0`. Total: 9 beats.
+
+#### 5.7.2 Wide row-reduce via tilelet_xpose — same shape, but reduce along `C` (recommended baseline)
+
+Output: `8 × 1` col-vector (packed into strip 0 of `D0` at FP32 positions 0..7; remainder padded by Pack).
+
+The recommended baseline (no cross-group reducer network) does **8 `ACCUM` beats with `xp_A = 1`, then 4 `MERGE_STAGE` beats with stride 8/4/2/1 along the within-row dimension, then 1 retire beat — total 13 compute beats**:
+
+```
+# ---- phase 1: xpose-accumulate (8 beats) ----
+BEAT 0:  src_A=SA, s_A=0, xp_A=1;  alu=PASS_A;  acc=INIT,  slot=0
+BEAT 1:  src_A=SA, s_A=1, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 2:  src_A=SA, s_A=2, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 3:  src_A=SA, s_A=3, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 4:  src_A=SA, s_A=4, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 5:  src_A=SA, s_A=5, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 6:  src_A=SA, s_A=6, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 7:  src_A=SA, s_A=7, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+
+# after phase 1: for g ∈ 0..7, k ∈ 0..15,
+#   acc[16g + k] = Σ_{s=0..7} A[g, s·16 + k]    # 8 rows × 16 within-row partials each
+
+# ---- phase 2: within-row halving merge (log2(16) = 4 beats) ----
+BEAT 8:   acc=MERGE_STAGE, acc_slot={merge_bit=3}   # stride 2^3 = 8 slots
+          # acc[16g + k] += acc[16g + k + 8]   for k ∈ 0..7, g ∈ 0..7
+BEAT 9:   acc=MERGE_STAGE, acc_slot={merge_bit=2}   # stride 4
+BEAT 10:  acc=MERGE_STAGE, acc_slot={merge_bit=1}   # stride 2
+BEAT 11:  acc=MERGE_STAGE, acc_slot={merge_bit=0}   # stride 1
+          # after beat 11: acc[16g] = Σ_{j=0..127} A[g, j]  for g = 0..7
+
+# ---- retire ----
+BEAT 12:  src_A = ACC_readback, acc_slot = 0
+          wr_en_D0 = 1, wr_strip_D0 = 0
+          # Pack gathers slots {0, 16, 32, 48, 64, 80, 96, 112} into strip[0..7]
+```
+
+Stage-by-stage data:
+
+- Beat 0: `SA.read(s=0, xp=1)` delivers chunk-grid column 0 = 8 × 64 B slices, one per chunk-grid row `g`. For `W = 512 B`, chunk-grid row `g` = logical row `g`; so slice `g` holds logical row `g`, cols 0..15 (16 FP32 = 64 B). After INIT, `acc[16g..16g+15] = A[g, 0..15]` for `g = 0..7`.
+- Beats 1–7: col-band `s` adds `acc[16g..16g+15] += A[g, s·16 .. s·16+15]`. After beat 7, each row's sum lives as **16 partials spread across 16 consecutive Acc slots**, all strictly lane-aligned.
+- Beats 8–11: in-Acc pair-merge at strides 8, 4, 2, 1 along the within-row axis (`k ∈ 0..15`). Implementation reuses the stage (B) ALU via `src_A = ACC_READ_LO, src_B = ACC_READ_HI, alu_op = ADD` (§5.5); the sub-bank LO/HI split is programmed by the microcode assembler to pair slots at the requested stride. No cross-group fold of any kind.
+- Beat 12: retire — Pack picks one slot per row (slot `16·g + 0` for `g = 0..7`) and emits 8 FP32 values into positions 0..7 of strip 0 of `D0`.
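+
+The fold in beats 8–11 can be traced symbolically for a single row. In the sketch below, `p[k]` is shorthand introduced here (not a microcode name) for the phase-1 partial `acc[16g + k]` of one fixed row `g`, and `p0 .. p15` are its values after beat 7:
+
+```
+# within-row halving merge for one row g  (symbolic trace; p[k] = acc[16g + k])
+# beat 8,  stride 8:  p[k] += p[k + 8]  for k = 0..7   → p[0] = p0 + p8
+# beat 9,  stride 4:  p[k] += p[k + 4]  for k = 0..3   → p[0] = p0 + p4 + p8 + p12
+# beat 10, stride 2:  p[k] += p[k + 2]  for k = 0..1   → p[0] = p0 + p2 + p4 + … + p14
+# beat 11, stride 1:  p[0] += p[1]                     → p[0] = p0 + p1 + … + p15 = Σ_j A[g, j]
+```
+
+Each beat halves the number of live partials, which is why 16 partials need exactly `log₂(16) = 4` beats.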
+
+**Cost summary:** 13 compute beats vs. 9 for `TCOL*` on the same shape (§5.7.1). The 4-beat overhead is the inherent cost of folding the 16 within-row partials without a cross-group adder.
+
+##### 5.7.2-alt Wide row-reduce with cross-group reducer network (alternative configuration, *not* recommended)
+
+An implementation that keeps the [`vector4k.md`](vector4k.md) §4.1 / §5.3.2 cross-group reducer network (`N_tree = 128 → 8` fold programmable, §5.8.1) can compress phase 2 into the pack step of a single beat:
+
+```
+# 9 compute beats, identical structure to 5.7.1 but with xp_A = 1 and tree enabled at retire
+BEAT 0:  src_A=SA, s_A=0, xp_A=1;  alu=PASS_A;  acc=INIT,  slot=0
+BEAT 1:  src_A=SA, s_A=1, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+ …
+BEAT 7:  src_A=SA, s_A=7, xp_A=1;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 8:  src_A=ACC_readback, slot=0;  N_tree_en={128 → 8 fold};  wr_en_D0=1, wr_strip_D0=0
+```
+
+After beat 7, each row's sum `Σ_{j=0..127} A[g, j]` is held as 16 within-row partials in `acc[16g..16g+15]` (the same state as the end of phase 1 in §5.7.2). At beat 8 the cross-group reducer tree folds the 16 partials of each lane-group into 1 FP32 per row, packing the 8 row-sums directly into strip 0 of `D0`. **Saves 4 beats for `TROW*` only**, at the cost of a 128-input programmable reducer network in stage (B). No other instruction class benefits.
+
+#### 5.7.3 Narrow col-reduce — `R = 64, C = 16` (`W = 64 B, 512/W = 8`)
+
+Each strip packs 8 logical rows × 16 FP32. Output: `1 × 16` row-vector.
+
+```
+# phase 1: 8 per-strip accumulate beats
+BEAT 0:  src_A=SA, s_A=0, xp_A=0;  alu=PASS_A;  acc=INIT,  slot=0
+BEAT 1:  src_A=SA, s_A=1, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 2:  src_A=SA, s_A=2, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 3:  src_A=SA, s_A=3, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 4:  src_A=SA, s_A=4, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 5:  src_A=SA, s_A=5, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 6:  src_A=SA, s_A=6, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+BEAT 7:  src_A=SA, s_A=7, xp_A=0;  alu=PASS_A;  acc=ACCUM, slot=0
+
+# after phase 1:
+#   for k = 0..7, j = 0..15:
+#     acc[16k + j] = Σ_{strip s = 0..7} A[8s + k, j]
+#                  = Σ_{rows r ≡ k (mod 8), r ∈ 0..63} A[r, j]
+#   i.e. acc has 8 × 16 = 128 slots, each a partial col-sum over 8 rows
+
+# phase 2: hierarchical merge (log2(8) = 3 beats)
+BEAT 8:   acc=MERGE_STAGE, acc_slot={merge_bit=6}  # acc[0..63]  += acc[64..127]   (stride 64)
+BEAT 9:   acc=MERGE_STAGE, acc_slot={merge_bit=5}  # acc[0..31]  += acc[32..63]    (stride 32)
+BEAT 10:  acc=MERGE_STAGE, acc_slot={merge_bit=4}  # acc[0..15]  += acc[16..31]    (stride 16)
+
+# after phase 2: acc[0..15] = final col-axis sum = Σ_{r=0..63} A[r, ·]
+
+BEAT 11: src_A=ACC_readback, slot=0;  wr_en_D0=1, wr_strip_D0=0   # retire
+```
+
+Total: `8 + 3 + 1 = 12` compute beats. The output populates FP32 positions 0..15 of strip 0 of `D0`; the remainder of the strip is zero-filled by Pack (determined by the retire mask that tracks the logical output shape `1 × 16`).
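+
+The same accounting extends to the other narrow widths. The tabulation below assumes a full 4 KB FP32 tile (so phase 1 always reads 8 strips) and the `log2(512/W)` merge depth of §5.6.3; it is an illustrative check, not additional microcode.
+
+```
+# narrow col-reduce beat count, 4 KB FP32 tile  (illustrative tabulation)
+# total = 8 ACCUM beats + log2(512 / W) MERGE_STAGE beats + 1 retire beat
+#
+# W (bytes)   R    C     512/W   MERGE beats   total beats
+# 64          64   16    8       3             8 + 3 + 1 = 12
+# 128         32   32    4       2             8 + 2 + 1 = 11
+# 256         16   64    2       1             8 + 1 + 1 = 10
+# 512         8    128   1       0             8 + 0 + 1 = 9    (wide case, §5.7.1)
+```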
+
+#### 5.7.4 Narrow row-reduce — same shape, reduce along `C` (recommended baseline)
+
+Each strip packs 8 logical rows × 16 cols × FP32. Output: `64 × 1` col-vector (64 FP32 = 256 B = half a strip).
+
+This is the **worst case** for the no-cross-group-tree baseline: each strip's 128 lanes fill exactly one 128-slot plane, so the 16 within-row partials for that strip's 8 rows cannot be merged with another strip's partials in parallel. The microcode therefore uses `K_outer = 8` — one strip per outer wave — and relies on a byte-granular retire mask (already present for Pack) to emit each wave's 8 row-sums to its own 32-byte region of `D0`:
+
+```
+# repeat for s = 0..7:
+FOR s IN 0..7:
+    #   phase 1: accumulate strip s into LO plane (1 beat)
+    BEAT 6·s + 0:  src_A=SA, s_A=s, xp_A=0;  alu=PASS_A;  acc=INIT, slot=LO
+                   # after this beat: acc[16·r + k] = A[8s + r, k]  for r∈0..7, k∈0..15
+
+    #   phase 2: within-row halving merge, log2(16) = 4 beats
+    BEAT 6·s + 1:  acc=MERGE_STAGE, acc_slot={merge_bit=3}   # stride 8
+    BEAT 6·s + 2:  acc=MERGE_STAGE, acc_slot={merge_bit=2}   # stride 4
+    BEAT 6·s + 3:  acc=MERGE_STAGE, acc_slot={merge_bit=1}   # stride 2
+    BEAT 6·s + 4:  acc=MERGE_STAGE, acc_slot={merge_bit=0}   # stride 1
+                   # after beat 6·s + 4: acc[16·r] = Σ_k A[8s + r, k]  for r∈0..7
+
+    #   phase 3: partial retire — pack 8 row-sums into bytes [32·s .. 32·s + 31] of strip 0 of D0
+    BEAT 6·s + 5:  src_A = ACC_readback, slot = LO
+                   gather slots {0, 16, 32, 48, 64, 80, 96, 112} → 8 FP32 = 32 B
+                   wr_en_D0 = 1, wr_strip_D0 = 0, wr_byte_mask = [32·s .. 32·s + 31]
+```
+
+Total: **49 compute beats**: the loop above issues `8 × 6 = 48` beats, and the quoted figure counts one further drain beat (`8 × 6 + 1`), which disappears if the last per-strip retire also serves as the final drain. This is ~5.4× more expensive than the 9-beat cross-group-tree version (§5.7.4-alt). The blow-up is concentrated in the `K_outer × log₂(W/E) = 8 × 4 = 32` `MERGE_STAGE` beats plus the per-wave partial-retire overhead.
+
+> **Scaling note.** The `~6` beats-per-wave factor above is roughly `1 + log₂(W/E) + 1`. For the more common narrow shapes:
+> - `W = 128 B, R = 32` (FP32): `K_outer = 4`, ~`4 × (1 + 5 + 1) = 28` beats (~3.1× cross-group-tree).
+> - `W = 256 B, R = 16` (FP32): `K_outer = 2`, ~`2 × (1 + 6 + 1) = 16` beats (~1.8× cross-group-tree).
+> - `W = 512 B, R ≤ 8`: falls back to §5.7.2, only +4 beats (≤1.45×).
+>
+> So the no-cross-group-tree penalty is largest at the narrowest extreme and tapers off rapidly for wider tiles.
+
+##### 5.7.4-alt Narrow row-reduce with cross-group reducer network (alternative configuration, *not* recommended)
+
+An implementation that keeps the cross-group reducer network (§5.8.1 alternative) can do the per-strip 16 → 1 fold inside stage (B) directly, collapsing each strip's ALU beat into an 8-row-sum-producing beat and using a trivial 8-slot collector:
+
+```
+# 8 compute beats + 1 retire beat (the cross-group reducer tree runs inside stage (B) each beat)
+FOR s IN 0..7:
+    BEAT s:
+        src_A = SA, s_A = s, xp_A = 0
+        alu_op = TREE_ROW_REDUCE(16 → 1)              # reduce each 16-lane row-group to 1
+        acc_op = (s == 0) ? INIT : ACCUM              # writes 8 row-sums per beat into 8 slots
+        acc_slot = row-band s                          # slot-band s holds rows 8s..8s+7
+
+# after beat 7: acc[0..63] = row-sums for rows 0..63
+BEAT 8: src_A=ACC_readback, slot=0;  wr_en_D0=1, wr_strip_D0=0
+```
+
+Total: 9 compute beats. **Saves ~40 beats** (49 → 9) vs. the recommended baseline at the cost of the cross-group adder network. As for §5.7.2-alt, the benefit applies only to `TROW*` family instructions.
+
+### 5.8 Stage (B) — unified ALU shared with the accumulator
+
+Stage (B) keeps the same **128 independent compute groups** (`N_group = 128`) and the same per-group SIMD width `W_ALU,i` as [`vector4k.md`](vector4k.md) §4.1 / §9.3.2 — including the per-group *intra-group tree* that folds a narrow-format SIMD lane back down to a single 32 b FP32 partial before the Acc write port. In v2 each ALU group also serves as the **accumulator ALU**: the dedicated Acc RMW adder that [`vector4k.md`](vector4k.md) §4.1 drew as a block between stage (B) and the Acc register file is **removed**; its function moves into the main stage (B) ALU via an input-side MUX and an Acc readback path.
+
+The **cross-group reducer network** of [`vector4k.md`](vector4k.md) §5.3.2 — a separate programmable fold that combines outputs of multiple groups in one beat — is **removed from the recommended v2 baseline** (§5.8.1). Row-axis reductions that the v1 tree handled in one beat are expressed in v2 as `tilelet_xpose` + `ACCUM` + a short `MERGE_STAGE` chain (§5.6.2, §5.7.2, §5.7.4); all of this stays strictly lane-aligned. §5.8.1 preserves the analysis and the alternative implementation that *does* keep the cross-group network, as a reference for implementations that prioritise `TROW*` throughput over stage (B) area.
+
+```text
+  stage (A) value pipes (2 × 128-slice)        SC (mask buffer, §4.2.2)
+      SA→  ─┐              SB→  ─┐                 │
+            │                    │                 │  128 × 1 b predicate bus
+            ▼                    ▼                 ▼
+       ┌────────┐           ┌────────┐       ┌──────────────┐
+       │ MUX_A  │           │ MUX_B  │       │  mask_src    │
+       │ value  │           │ value  │       │   MUX        │
+       │ src_A={│           │ src_B={│       │ ={SC_mask,    │
+       │  SA,   │           │  SB,   │       │   IMM_ALL_1, │
+       │  ACC_LO│           │  ACC_LO│       │   IMM_SOP}   │
+       │  ACC_HI│           │  ACC_HI│       └──────┬───────┘
+       │  SX,SY,│           │  SX,SY,│              │
+       │  IMM_0 }           │  IMM_0 }              │
+       └───┬────┘           └───┬────┘              │
+           │                    │                   │
+   (src_Z MUX — optional 3rd value input for FMA:   │
+    {ACC_LO, ACC_HI, SX, SY, IMM_0}) ──────┐        │
+                                          │        │
+           ▼                    ▼         ▼        ▼
+       ┌──────────────────────────────────────────────────┐
+       │            128-lane ALU group                     │
+       │   value inputs: A, B, (optional Z)               │
+       │   mask input:   M  (per-lane enable)             │
+       │   alu_op:  ADD / MUL / FMA / MAX / CMP /         │
+       │            AND / XOR / SELECT / PASS_A / …       │
+       │   per-lane gate: if (M[lane]=0) output[lane] ←   │
+       │                     A[lane]    (masked → identity)│
+       └──────────────────────┬───────────────────────────┘
+                              │
+             ┌────────────────┼──────────────────┐
+             │                │                  │
+             ▼                ▼                  ▼
+        ┌─────────┐     ┌─────────┐        ┌───────────┐
+        │ Pack    │     │ Acc.LO  │        │ Acc.HI     │
+        │ (bypass │     │ (1R1W,  │        │ (1R1W,     │
+        │  D0/D1) │     │ 128 slot)│        │ 128 slot)  │
+        └─────────┘     └────┬────┘        └────┬──────┘
+                             │                  │
+                             └────── read ──────┘
+                             feeds MUX_A / MUX_B / MUX_Z above
+                             (two concurrent read channels: LO + HI)
+```
+
+**Added / changed hardware vs. [`vector4k.md`](vector4k.md) §4.1 Stage (B) + Acc:**
+
+| Block | Width | Note |
+|-------|-------|------|
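+| Per-value-slot input MUX (A, B, optional Z) | 6:1 for A and B, 5:1 for Z; 128 × `W_ALU,i` | Sources are drawn from `{SA, SB, ACC_READ_LO, ACC_READ_HI, SX_bcast, SY_bcast, IMM_ZERO}`: `MUX_A` takes `SA` but not `SB`, `MUX_B` takes `SB` but not `SA`, and the optional `MUX_Z` takes neither (per the diagram above); driven by beat word's `src_*` field. **SC is not listed as a value source**: it feeds the mask input only |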
+| Mask-input MUX | 3:1, 128 × 1 b | Source `{SC_mask, IMM_ALL_ONES, IMM_FROM_SOP}`; `IMM_ALL_ONES` is the default when `SOP.has_mask = 0` (§4.4) |
+| Per-lane gate | 128 × 2:1 on the `W_ALU,i`-bit ALU output | `out[lane] = M[lane] ? alu_core_out[lane] : A[lane]`: a masked-off lane (`M[lane] = 0`) preserves the incoming `A` value (for `TSELECT`-style semantics) or keeps the old accumulator value (for masked `ACCUM`). The exact fall-through is selected by `alu_op` (`SELECT` returns `B` instead of `A` when `M = 0`, etc.) |
+| Acc sub-bank split | — | Single 256 × 32 b plane split into LO (0..127) + HI (128..255) sub-banks; each 1R1W. Same total storage, free concurrent LO+HI read |
+| Acc → ALU feedback wires | 2 × 128 × 32 b | `ACC_READ_LO` and `ACC_READ_HI` routed to all three value-input MUXes |
+| **Removed:** separate Acc RMW adder | — | Function folded into stage (B) ALU via feedback |
+
+The net area change is approximately neutral: the dedicated Acc adder disappears; the MUXes and extra Acc sub-bank read port replace it. The big win is **functional**: a single ALU now realises element-wise compute, Acc init (`INIT`), Acc accumulate (`ACCUM` = FMA-accumulate with `src_Z = ACC_READ_*`), Acc cross-slot merge (`MERGE_STAGE` with both value inputs from Acc), Acc readout (`READOUT`), and all **masked / predicated** variants of each of the above (simply by flipping `mask_src` from `IMM_ALL_ONES` to `SC_mask`) — any op the ALU supports is automatically available for every `acc_op` mode **and** every masked / unmasked mode (e.g. masked `MAX`-reduction for conditional argmax, masked `OR`-reduction for sparse-bitmask accumulation, predicated `ADD` for sparse vectors, etc.) with **no additional datapath hardware**.
+
+**Pipeline implications.** Stage (B) ALU → Acc write → Acc read feedback forms a loop at the granularity of one beat. For a 1-cycle ALU (integer add, fp16 add, simple compare) this loop closes in one cycle and back-to-back `ACCUM` beats to the same `acc_slot` chain without bubbles. For a multi-cycle ALU (fp32 FMA — typically 2–3 cycles), back-to-back `ACCUM` on the **same slot** introduces a bubble of `latency(ALU) − 1` cycles unless the compiler schedules the work across multiple `acc_slot`s (double-buffered reduction), which is already the standard `#W`-wave policy of [`vector4k.md`](vector4k.md) §5.3.2. The `MERGE_STAGE` pattern never RMWs the same slot back-to-back, so it is bubble-free.
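+
+To make the bubble cost concrete, the tabulation below counts issue cycles for `N = 8` dependent `ACCUM` beats on one `acc_slot`. It assumes exactly `latency(ALU) − 1` bubble cycles between each consecutive pair and ignores pipeline fill and drain; the latencies are the example values quoted above, not measured figures.
+
+```
+# same-slot ACCUM chain, N = 8 beats  (illustrative tabulation)
+# cycles = N + (N − 1) × (latency − 1)
+#
+# ALU latency   bubble per pair   cycles for 8 beats
+# 1 cycle       0                 8 + 7 × 0 = 8
+# 2 cycles      1                 8 + 7 × 1 = 15
+# 3 cycles      2                 8 + 7 × 2 = 22
+#
+# interleaving the beats over at least `latency` distinct acc_slots removes
+# the same-slot dependency, so the chain issues in 8 cycles again
+# (presumably at the cost of combining the per-slot partials afterwards)
+```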
+
+What is new in v2 vs. v1 at this stage is only that the **microcode beat word** of §5.4 drives the ALU op-code, input MUX selectors, Acc write routing, and (in the alternative configuration only) the reducer-tree configuration **per beat** rather than via a calendar template id, and that the accumulator ALU *is* the compute ALU.
+
+#### 5.8.1 Compute-group count, per-group width, and the two kinds of tree
+
+This subsection nails down three things that the inherited [`vector4k.md`](vector4k.md) §4.1 diagrams leave partially implicit: (a) `N_group` is a datapath constant independent of data format; (b) what *does* vary with format is each group's internal SIMD width `W_ALU,i` and the associated **intra-group** fold; (c) the **cross-group** reducer network is a **separate, optional** block — **removed in the recommended v2 baseline**, retained as an alternative configuration described at the end of this subsection.
+
+##### Recommended baseline (v0.8)
+
+- **Keep** the per-group intra-group tree (format widener, one per compute group).
+- **Remove** the cross-group reducer network.
+- All `TROW*` instructions go through the `tilelet_xpose + ACCUM + MERGE_STAGE` path (§5.6.2, §5.7.2, §5.7.4).
+- The stage (B) datapath is then **strictly lane-aligned on every beat** — no data ever crosses a compute-group boundary. The only "cross-lane" primitive is `MERGE_STAGE`'s in-Acc pair merge (§5.5), which uses the Acc LO/HI read ports rather than any inter-group wiring.
+
+##### `N_group = 128` is fixed (format-independent)
+
+The constant falls out of two alignments that are *also* format-independent: 512 B / 4 B = 128 FP32-shaped slots per beat, and the Acc sub-bank geometry of 128 slot × 32 b (LO) + 128 slot × 32 b (HI) per ping-pong plane (§5.5). Narrower formats are not modelled by "more groups" but by **wider SIMD inside each group**, with the group's intra-group tree folding that SIMD back down to a single 32 b FP32-shaped partial before the Acc write port.
+
+##### Per-group width and intra-group fold by format
+
+| Format | Elements per strip | Elements per group (= SIMD width) | `W_ALU,i` (core ALU) | Intra-group fold | Sub-beats per strip | `W_tree,i` (to Acc) |
+|--------|--------------------|-----------------------------------:|----------------------|------------------|---------------------:|---------------------|
+| FP32 / INT32 | 128 | **1** | 1 × 32 b | depth 0 (bypass) | 1 | 32 b |
+| FP16 / BF16 | 256 | **2** | 2 × 16 b | 2 → 1 (1 adder) | 1 | 32 b |
+| FP8 | 512 | **4** | 4 × 8 b | 4 → 1 (2 adder levels) | 1 | 32 b |
+| FP4 | 1024 | **4** (per sub-beat) | 4 × 4 b | 4 → 1 (2 adder levels) | **2** (see §5.3) | 32 b |
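+
+As a consistency check on the table, each row should satisfy `elements per strip = 512 B / element size = N_group × SIMD width × sub-beats`. The tabulation below works this through, assuming FP4 occupies half a byte per element:
+
+```
+# format         512 B / element size   128 groups × SIMD × sub-beats
+# FP32 / INT32   512 / 4   = 128        128 × 1 × 1 = 128
+# FP16 / BF16    512 / 2   = 256        128 × 2 × 1 = 256
+# FP8            512 / 1   = 512        128 × 4 × 1 = 512
+# FP4            512 / 0.5 = 1024       128 × 4 × 2 = 1024
+```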
+
+The intra-group fold is **not** a general reduction network: it is the format-specific widen / narrow step that lets a fixed-size Acc slot (32 b FP32) serve every format uniformly. Element-wise instructions (`TADD`, `TMUL`, `TAND`, `TCMP`, `TSELECT`, `TFMA_ACC`, masked variants of any of the above) use it in **depth-0 bypass** for FP32, and otherwise treat it as a "format widener"; no cross-group data movement occurs. **Removing it** would either fatten Acc slots (breaking `N_run = 512` and the Acc geometry above) or force serialising every narrow-format beat into `4/E` sub-beats (one per element in the group), both of which carry large penalties. It is therefore **always retained**, and is better understood as part of the unpack chain than as part of the reduction subsystem.
+
+##### Strict lane-alignment guarantee (recommended baseline)
+
+On **every** stage (B) beat in the recommended baseline, the output of lane `k` depends only on `{A[k], B[k], Z[k], M[k], ACC_READ_*[k], SX_bcast, SY_bcast}` — no data crosses a group boundary. This holds for element-wise ops, Acc `INIT`/`ACCUM`/`READOUT`, `MERGE_STAGE` (in-Acc pair merge using Acc's own LO/HI read ports, not a cross-group path), and every `TROW*` / `TCOL*` beat routed through `tilelet_xpose`.
+
+##### Alternative configuration — keep the cross-group reducer network (*not* recommended)
+
+An implementation may optionally retain the [`vector4k.md`](vector4k.md) §5.3.2 cross-group reducer network (`N_tree` parallel programmable trees) in stage (B). This network can sum across multiple groups' outputs in one beat and is driven by an `N_tree_en` field in the microcode beat word (§4.4). In this alternative configuration the network is:
+
+- **Idle** during every element-wise beat (`alu_op ∈ {ADD, MUL, FMA, MAX, CMP, AND, XOR, SELECT, PASS_*}` with `acc_op ∈ {NONE, INIT, ACCUM, READOUT}`). `N_tree_en = 0` masks all trees to bypass.
+- **Active** only for:
+  - `TROW*` beats that fold a row's columns to a single row-sum in one beat — collapsing the `log₂(W/E)` `MERGE_STAGE` beats of the recommended baseline into one pack-time fold. See §5.7.2-alt and §5.7.4-alt.
+  - `TCOL*` beats on shapes where multiple logical row-segments pack into one strip and must be combined before the Acc write (wide `W > 512 B` regimes, already describable without the tree but with more beats).
+
+**Quantitative comparison:** §5.7.2 (wide `TROW*`) goes from 9 beats (alt) to 13 beats (recommended): +44%. §5.7.4 (narrow `TROW*` at `R = 64, C = 16`, the worst case) goes from 9 beats (alt) to 49 beats (recommended): ~5.4×. For common shapes (wide row-reduce, or narrow with `W ≥ 128 B`) the penalty is about 3× or less; §5.7.4's scaling note enumerates the intermediate cases. **All non-`TROW*` op families are unaffected**: element-wise, `TCOL*`, masked ops, and Acc RMW all run at identical beat counts in both configurations. The alternative therefore buys faster row-reductions at the cost of ~1 × 128-input programmable reducer network in stage (B); the recommended baseline takes the opposite trade.
+
+### 5.9 Dual retire (`D0`, `D1`)
+
+When `retire_mask = 2'b11`, Pack produces two 512 B strips per retire cycle, one per result tile, routed to `W0` and `W4`. Typical use cases:
+
+- **`TSORT32` with value-index pair output:** `D0` = sorted values, `D1` = permutation indices.
+- **`TDIVMOD` / `TEUCLID`:** `D0` = quotient tile, `D1` = remainder tile.
+- **`TROWARGMAX`** with separate value / index tile outputs.
+- **`TFMA`** writing both the product tile and the accumulate tile (diagnostic / debug mode).
+
+Retire format of each output is independent (**`retire_format_0`**, **`retire_format_1`** in `SOP`), so `D0` can be FP32 while `D1` is FP16 or an integer index tile.
+
+The retire phase still takes **8 cycles** to drain 4 KB per output tile (one strip per cycle on each write port), so **one-result** and **two-result** instructions retire in the **same** number of cycles — the second write port is simply idle when `retire_mask[1] = 0`.
+
+---
+
+## 6. Variable-Length Operand Fetch
+
+### 6.1 Motivation
+
+With only **two** read ports and up to **three** tile operands, an instruction must be allowed to **stretch** the operand-fetch phase. VEC-4K-v2 treats operand fetch as a variable-length **prologue** to the compute phase, with **per-operand valid bits** (§4.2) gating the compute stage. Compute starts as soon as the **minimum subset of strips** required by the first calendar step is in the staging registers — **not** after all 4 KB of every operand is in.
+
+### 6.2 Fetch-phase cycle count
+
+Let:
+
+- `N_val ∈ {1, 2}` be the **value-tile** operand count (A alone, or A + B). Predicated ops that also fetch the C mask tile add a short mask fetch as described below.
+- `has_mask ∈ {0, 1}` — whether the instruction fetches `SC` (`SOP.has_mask`, §4.4). When `has_mask = 0` the mask path defaults to `IMM_ALL_ONES` at zero cost.
+- `T_ep = 8` cy (TRegFile epoch, [`tregfile4k.md`](tregfile4k.md) §3).
+- `T_mask ≤ 2` cy — cycles spent fetching `SC` when `has_mask = 1`. The mask payload is `⌈R · C / 8⌉ ≤ 1024 B`, i.e. 1 or 2 strips, and piggybacks on whichever read port is idle in the final value-operand epoch (§6.3). `T_mask` does **not** occupy a full epoch.
+- `align_penalty ∈ {0 … 7}` cy — the distance from issue to the **next** epoch boundary; depends on when the issuer writes the pending `reg_idx`.
+- `N_epoch` = number of TRegFile epochs microcode needs to commit all **value** tile operands, accounting for [`tregfile4k.md`](tregfile4k.md) §6 rule R2 (uniform `is_transpose` per epoch):
+  - `N_val = 1`, any `is_transpose`: `N_epoch = 1`.
+  - `N_val = 2` with matched `is_transpose_A = is_transpose_B`: `N_epoch = 1`.
+  - `N_val = 2` with mismatched `is_transpose_A ≠ is_transpose_B`: `N_epoch = 2`. The extra epoch is the **`xpose_mismatch` term**.
+  - **`N_val = 3` with `c_role = VALUE` (v2.1 addition: native `TFMA` family, §7.6):** all three value tiles fetch in parallel using the 3-port binding R0/R1/R4 (§3.1). With uniform `is_transpose_{A,B,C}`, `N_epoch = 1`. With mixed `is_transpose`, microcode partitions the fetch into 2 epochs (16 cy total), using the rule R2 grouping that places the largest matched subset in epoch 0.
+
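+Then the operand-fetch phase length is at most the following; when the mask fetch piggybacks on an idle read port (§6.3, the common case), the `T_mask` term contributes 0 cy: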
+
+```
+  T_fetch = align_penalty + N_epoch × T_ep + (has_mask ? T_mask : 0)
+```
+
+| `N_val` | `c_role` | `is_transpose` mix | `has_mask` | `N_epoch` | Best-case `T_fetch` (aligned) | Worst-case `T_fetch` (misaligned) |
+|--------:|----------|--------------------|-----------:|----------:|-------------------------------:|----------------------------------:|
+| 1 | n/a | any         | 0 | 1 | **8 cy**  | 15 cy |
+| 1 | n/a | any         | 1 | 1 | **8 cy** (mask rides the idle port in epoch 0) | 15 cy |
+| 2 | n/a | uniform     | 0 | 1 | **8 cy**  | 15 cy |
+| 2 | n/a | uniform     | 1 | 1 | **8 cy** (mask rides the idle port in epoch 0) | 15 cy |
+| 2 | n/a | mixed       | 0 | 2 | **16 cy** (`xpose_mismatch`) | 23 cy |
+| 2 | n/a | mixed       | 1 | 2 | **16 cy** (mask rides the idle port in epoch 0 or 1) | 23 cy |
+| **3** | **VALUE** | **uniform** | **0** | **1** | **8 cy** (R0+R4+R1 all active, 3-port parallel fetch) | **15 cy** |
+| **3** | **VALUE** | **mixed (1 odd-out)** | **0** | **2** | **16 cy** (R2-compatible pair on epoch 0, the odd-one-out on epoch 1) | **23 cy** |
+| **3** | **VALUE** | **all three different** (rare) | **0** | **3** | **24 cy** | **31 cy** |
+
+**Key observation: masked ops are usually free.** In every 2-value-operand case the mask fetch normally reuses an already-idle read port within one of the value epochs (the second port is always free for 1–2 strips of the 8-cycle epoch), so adding a mask does not extend `T_fetch`. The one exception is a highly tuned case where both value epochs happen to be fully occupied by value prefetch of a subsequent instruction (§6.3 rule 5); there the mask fetch costs an additional 1–2 cycles.
+
+**Note.** The compute phase (§5) overlaps with the tail of the fetch phase via the strip-valid bitmap (§4.2): for elementwise ops, compute starts ~1 cycle after the **first strip** of each required operand is latched, so the *end-to-end* instruction latency for `N_val = 2` typical elementwise can still be ~10–12 cycles (matching [`vector4k.md`](vector4k.md) §5.1). The `T_fetch` above is the **prologue** measured at the **SA/SB/SC fill completion** milestone, not the total instruction latency.
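+
+The rows of the table above can be reproduced with a small model. This is a sketch only: the names `n_epoch`, `t_fetch`, and `exposed_mask_cycles` are hypothetical and not part of the microcode interface. It assumes that `N_epoch` equals the number of distinct `is_transpose` values among the value operands (rule R2) and that a piggybacked mask contributes 0 cy.
+
+```python
+T_EP = 8  # TRegFile epoch length in cycles (T_ep)
+
+
+def n_epoch(is_transpose_flags):
+    """Epochs needed for the value tiles: one per distinct is_transpose value (rule R2)."""
+    if not 1 <= len(is_transpose_flags) <= 3:
+        raise ValueError("N_val must be 1, 2, or 3")
+    return len(set(is_transpose_flags))
+
+
+def t_fetch(is_transpose_flags, align_penalty=0, exposed_mask_cycles=0):
+    """Operand-fetch prologue length in cycles.
+
+    exposed_mask_cycles is the part of T_mask that could not be hidden on an
+    idle read port (assumed 0 in the common piggyback case, at most 2).
+    """
+    if not 0 <= align_penalty <= 7:
+        raise ValueError("align_penalty must be in 0..7")
+    if not 0 <= exposed_mask_cycles <= 2:
+        raise ValueError("exposed_mask_cycles must be in 0..2")
+    return align_penalty + n_epoch(is_transpose_flags) * T_EP + exposed_mask_cycles
+
+
+print(t_fetch([0, 0]))                    # binary op, uniform transpose, aligned
+print(t_fetch([1, 0], align_penalty=7))   # binary op, mixed transpose, misaligned
+```
+
+Passing one flag per value operand keeps the model independent of `N_val`; the best-case column corresponds to `align_penalty = 0` and the worst-case column to `align_penalty = 7`.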
+
+### 6.3 Operand-to-epoch assignment rules
+
+Subject to [`tregfile4k.md`](tregfile4k.md) §6 **rule R2** (the two physical read ports active in one TRegFile epoch must share the same `is_transpose`), microcode assigns operands to epochs as follows:
+
+1. **`N_val = 2`, uniform `is_transpose`.** Epoch 0 fetches `{A, B}` on `{R0, R4}`. If `has_mask = 1`, `C`'s 1–2 strip mask fetch rides on whichever port's FSM finishes its value strips first within epoch 0 (mask and value share a port but **not** the same cycle, so no port-contention). **Cost: 8 cy** total.
+2. **`N_val = 2`, mismatched `is_transpose_A ≠ is_transpose_B`.** Epoch 0 = `{A}`, epoch 1 = `{B}`, each with one idle port. If `has_mask = 1`, the mask fetch slots into either idle port. **Cost: 16 cy** (the `xpose_mismatch` term).
+3. **`N_val = 1`, with or without mask.** Epoch 0 = `{A}` on one port; the other port is idle or used by the mask fetch (if `has_mask = 1`). **Cost: 8 cy**.
+4. **Mask `is_transpose_C`.** When `has_mask = 1`, the mask's `is_transpose_C` must match the `is_transpose` of whichever read-port epoch the mask piggybacks on, by rule R2. Microcode picks the compatible epoch. If no compatible epoch exists, the mask fetch gets its own 1–2 cy slot at the end of the prologue. This case is extremely rare and typically avoidable by flipping the mask's transpose bit in the assembler, since the per-element bitmap is transpose-symmetric for most ops.
+5. **Back-to-back instructions can share an epoch** when a prior instruction's operand-fetch leaves a port idle: the idle port can pre-fetch an operand of the next instruction at **zero incremental cost**, provided R2 is satisfied for that next-instruction operand. This is invisible at the single-instruction level but shows up as a throughput gain in pipelined execution.
+6. **Choice of which port fetches each operand** is a free parameter subject to R2; microcode picks the port with the shorter first-strip-to-compute dependency for that instruction's beat program (§5.4).
+6a. **`N_val = 3, c_role = VALUE` (v2.1 addition).** Epoch 0 fetches `{A, B, C}` on `{R0, R4, R1}`, all three value tiles in parallel. Rule R2 requires `is_transpose_A = is_transpose_B = is_transpose_C` for this single-epoch path. If exactly one of the three has the opposite `is_transpose`, microcode places the **two matched** operands on `{R0, R4}` in epoch 0 and the odd-one-out on `R1` (or `R0`, whichever is free) in epoch 1, costing `N_epoch = 2` (16 cy). If all three have distinct `is_transpose` values (a degenerate corner case), microcode falls back to `N_epoch = 3` (24 cy). Common kernels (LayerNorm `γ·x̂ + β`, Welford updates) all use uniform `is_transpose` and hit the **8 cy** path. **Cost in the typical case: 8 cy**, the same as a binary op.
+7. **Scalar operands (`SX`, `SY`) are fetched off the TRegFile path entirely.** `SX` / `SY` sourced as **scalar GPR** (`sx_src = GPR`, `sy_src = GPR`, §4.3) are read from the scalar general-purpose register file at issue time and latched into the staging slots **before** the operand-fetch prologue starts — they cost **0 cy** in `T_fetch`, and they are always valid by the first microcode beat. `IMM` sources are similarly 0 cy. Only the rarer `TILE` and `ACC` scalar tags require a vector-side read: a `TILE`-sourced scalar reads one element out of a tile register via whichever R0 / R4 port is idle (typically sharing the same epoch as the value tiles) at the cost of 1 cy of port occupancy; an `ACC`-sourced scalar reads one Acc slot through the stage (B) → Acc feedback path and does not touch the TRegFile ports at all. All scalar source kinds therefore leave `T_fetch` unchanged in the common case — the dominant cost of a tile–scalar instruction is the tile prologue alone.
+
+Because the mask tile is small (≤ 2 strips) and independent of the value-tile epochs' port allocation, **adding a mask never lengthens the prologue** in the common case — full-tile and masked variants of the same op have identical fetch-phase cost. This is a direct consequence of treating `C` as a mask rather than a full 4 KB third value tile.
+
+For workloads with predominantly uniform `is_transpose` (GEMM-style kernels that pre-align tile views, or purely row-mode elementwise chains), the rule R2 penalty never materializes. For workloads that mix transpose modes (e.g. `TFMA D = Aᵀ · B + acc` via `ACC_READ_*`, where `A` is transposed but `B` is not), §10 quantifies the net effect: a 16 cy `T_fetch` in v2 vs. a 2 × 8 cy `TTRANS` + 1 × 8 cy 2-op fetch = 24 cy baseline in v1, so v2 still wins even when R2 forces the extra epoch.
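+
+The grouping in rules 1 to 4 and 6a can be sketched as a small function. This is an illustrative model under stated assumptions, not the microcode scheduler: `assign_epochs` and its arguments are hypothetical names, port selection (rule 6) and cross-instruction prefetch (rule 5) are not modelled, and the mask is assumed to fit on an idle port of any epoch with a matching `is_transpose`.
+
+```python
+def assign_epochs(value_ops, mask_xp=None):
+    """Group value operands into epochs by is_transpose (rule R2).
+
+    value_ops: dict mapping operand name -> is_transpose bit, in issue order.
+    mask_xp:   is_transpose of the mask tile, or None when has_mask = 0.
+    Returns (epochs, mask_epoch). epochs is a list of operand-name lists.
+    mask_epoch is the index of the epoch the mask piggybacks on, or None when
+    there is no mask or no compatible epoch (the mask then needs its own slot).
+    """
+    groups = {}
+    for name, xp in value_ops.items():
+        groups.setdefault(xp, []).append(name)
+    # Largest matched subset goes first; sorted() is stable, so ties keep issue order.
+    ordered = sorted(groups.items(), key=lambda kv: -len(kv[1]))
+    epochs = [names for _, names in ordered]
+    mask_epoch = None
+    if mask_xp is not None:
+        for idx, (xp, _) in enumerate(ordered):
+            if xp == mask_xp:
+                mask_epoch = idx
+                break
+    return epochs, mask_epoch
+
+
+print(assign_epochs({"A": 1, "B": 0}, mask_xp=0))
+print(assign_epochs({"A": 0, "B": 1, "C": 0}))
+```
+
+For the mixed binary case the mask lands in the epoch that carries `B`, which is the same placement the §6.5 calendar uses.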
+
+### 6.4 Example calendar — masked 2-value op, uniform `is_transpose`
+
+Assume `TADD_MASKED D = A + B` with a per-element mask in `C`, both value operands FP32, `R = 8`, `C_shape = 128` (one row per 512 B strip, [`vector4k.md`](vector4k.md) §4.4 Example A geometry). `is_transpose_A = is_transpose_B = is_transpose_C = 0`, `has_mask = 1`. Mask payload = 1024 elements / 8 = 128 B, which fits in a single 512 B strip.
+
+| Epoch | Cycle `t` | R0 (`is_transpose` this epoch) | R4 (`is_transpose` this epoch) | Staging writes | Stage (A/B) compute |
+|------:|---------:|--------------------------------|--------------------------------|----------------|---------------------|
+| 0 | 0..7 | `A` strip `s=0..7` (xp=0) | `B` strip `s=0..7` (xp=0) **plus** mask `C.strip[0]` inserted into R4's FSM at t=7 (R4 finishes `B` at t=7 and has one free cycle before R2 re-arbitration) | `SA[s] ← A strip` + `SB[s] ← B strip` each cycle; `SC[0] ← mask strip` at t=7 | starts as each strip-valid asserts; ADD overlaps SA/SB fill from t=1; mask input is `IMM_ALL_ONES` for lanes touched before `SC.valid[0]` asserts |
+| 1 | 8..15 | free | free | — | compute completes; `D0 → Wr0` drains concurrently |
+
+**End-to-end latency:** ~**10–12 cycles** — identical to an unmasked 2-value ADD, because the mask fetch piggybacks on the single value epoch and costs no extra cycles.
+
+**Unmasked counterpart.** For `TADD D = A + B` (`has_mask = 0`), the calendar is identical to the above minus the `C.strip[0]` slot in epoch 0; the ALU's mask MUX is statically wired to `IMM_ALL_ONES` by the microcode assembler, and no `SC` storage is touched.
+
+### 6.5 Example calendar — 2-value op, mixed `is_transpose` (`A_xp ≠ B_xp`)
+
+Same shape, now `is_transpose_A = 1`, `is_transpose_B = 0`, `has_mask = 1` with `is_transpose_C = 0` (matching `B`). Rule R2 forbids mixing xp=0 and xp=1 on the same epoch's two active ports, so microcode splits:
+
+| Epoch | Cycle `t` | R0 (`is_transpose` this epoch) | R4 (`is_transpose` this epoch) | Staging writes | Stage (A/B) compute |
+|------:|---------:|--------------------------------|--------------------------------|----------------|---------------------|
+| 0 | 0..7 | `A` strip `s=0..7` (xp=1, col-mode) | *(idle — would violate R2 if used with xp=0; can be used by a prior instr's xp=1 prefetch)* | `SA[s] ← col-mode strip` | idle |
+| 1 | 8..15 | `B` strip `s=0..7` (xp=0) | mask `C.strip[0]` (xp=0) at t=8, then idle t=9..15 | `SB[s]` each cycle; `SC[0] ← mask strip` at t=8 | starts once `SB.valid[0]`, `SC.valid[0]`, and `SA.valid[*]` for the first compute beat are all set |
+| 2 | 16..23 | free | free | — | compute completes |
+
+**End-to-end latency:** ~**17–19 cycles** — governed by the `xpose_mismatch` term on the value tiles, not by the mask. The only penalty is one R4-port-beat of *aggregate* read bandwidth during epoch 0, which microcode can reclaim as prefetch slack for neighbouring instructions.
+
+### 6.6 Handshake between operand fetch and compute stage
+
+The operand-fetch FSM exposes three signals to the compute scheduler:
+
+| Signal | Meaning |
+|--------|---------|
+| `SA.valid[s]`, `SB.valid[s]`, `SC.valid[s]` | Strip-level ready bits (each a vector of 8 bits per operand register). |
+| `SA.meta_valid`, `SB.meta_valid`, `SC.meta_valid` | Set on the **first** strip arrival; compute consults these before reading `SOP.(E,R,C)_X`. |
+| `fetch_done` | Asserted when **all** required strip-valid bits are set. Implies the compute stage can run to completion without stalling on operand fetch. |
+
+Compute stage (A)/(B) reads operand strips through a **barrier-free** interface: a microcode beat that reads a staging strip `S{A,B,C}[s, xp]` stalls if the sub-banks it needs are not yet marked valid and resumes on the cycle they arrive. For strip-major microcode programs (element-wise, row-mode reductions), stalls never happen beyond the first strip. For cross-chunk-grid-row programs (col-mode reductions, narrow-tile merges) the stall budget is bounded by `T_ep` cycles per operand.
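+
+A behavioural sketch of this handshake is shown below. It is a simplified model, not RTL: the class and function names are hypothetical, sub-bank granularity is collapsed to one valid bit per strip, and the required strips are assumed to be the first `required_strips` strips of the register.
+
+```python
+class StagingRegister:
+    """Strip-level valid tracking for one staging register (SA, SB, or SC)."""
+
+    def __init__(self, required_strips=8):
+        self.required = required_strips
+        self.valid = [False] * 8   # one valid bit per strip
+        self.meta_valid = False    # set on the first strip arrival
+
+    def latch(self, strip):
+        """Record the arrival of one strip from a read port."""
+        self.valid[strip] = True
+        self.meta_valid = True
+
+    def done(self):
+        return all(self.valid[: self.required])
+
+
+def fetch_done(regs):
+    """fetch_done: every required strip of every staged operand is valid."""
+    return all(r.done() for r in regs)
+
+
+def beat_can_issue(needs):
+    """A beat issues only if each (register, strip) pair it reads is valid."""
+    return all(reg.valid[strip] for reg, strip in needs)
+
+
+sa, sb = StagingRegister(), StagingRegister()
+sa.latch(0)
+sb.latch(0)
+print(beat_can_issue([(sa, 0), (sb, 0)]), fetch_done([sa, sb]))
+```
+
+The point of the model is that a beat depends only on the strips it reads, so compute can start long before `fetch_done` asserts.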
+
+---
+
+## 7. Instruction Categories and Cycle Sketches
+
+All op categories of [`vector4k.md`](vector4k.md) §5 carry over with **identical cross-lane / cross-strip behaviour**. The only v2 changes are listed below.
+
+### 7.1 Elementwise (Tile–Tile)
+
+`TADD`, `TMUL`, `TAND`, `TCMP`, `TCVT`, predicated / masked variants of each of the above (gated by `C`, §3.3c), **ternary-via-accumulator** ops (`TFMA_ACC`, `TSELECT`-over-Acc), and **native 3-source ternary** ops (`TFMA`, `TFNMA`, `TLERP`, §7.6):
+
+| Op arity | Fetch (best-case) | Compute tail | End-to-end |
+|---------:|------------------:|-------------:|-----------:|
+| Unary (full-tile) | 8 cy | ~2 cy | ~10 cy |
+| Unary, masked | 8 cy (mask piggybacks) | ~2 cy | ~10 cy |
+| Binary (full-tile) | 8 cy | ~2 cy | ~10–12 cy |
+| Binary, masked | 8 cy (mask piggybacks) | ~2 cy | ~10–12 cy |
+| Binary, mixed-`is_transpose` | 16 cy (`xpose_mismatch`) | ~2 cy | ~18 cy |
+| Ternary-via-Acc (e.g. `TFMA_ACC D = A·B + Acc`) | 8 cy | ~2–3 cy | ~11 cy |
+| **Ternary-3-tile** (`TFMA D = A·B + C`, `TFNMA`, `TLERP`), **v2.1 addition** | **8 cy** (3-port parallel, §3.1, §6.2) | **~2 cy** | **~10–12 cy**, same as binary |
+| Ternary-3-tile, mixed-`is_transpose` | 16 cy (one odd-out epoch) | ~2 cy | ~18 cy |
+
+Masked variants do **not** pay a fetch penalty in the common case — the `C` bitmask fits in 1–2 strips (§3.3c) and rides on whichever read port is idle inside the value-tile epoch (§6.2).
+
+### 7.2 Tile–Scalar / Tile–Immediate
+
+Scalar routed through `SX` / `SY` (§4.3). The scalar operand's source is captured at issue and is one of **scalar GPR** (the common case, selected by `sx_src = GPR` with a 5-bit GPR index in the instruction word), **immediate**, **tile-element**, or **ACC slot**. `GPR` and `IMM` scalars arrive with the instruction packet at 0 VEC-pipeline cycles, so they are fully broadcast and ready by the first microcode beat (§6.3 rule 7). The prologue therefore shrinks to just the 1-tile fetch (one 8 cy epoch). Cycle count is unchanged vs. [`vector4k.md`](vector4k.md) §5.2: ~10 cy end-to-end for one scalar + one tile, regardless of whether the scalar is a GPR read or an immediate.
+
+### 7.3 Axis Reduce / Expand
+
+Wave formulas (`#W`, `K_outer`, `f`) **unchanged** from [`vector4k.md`](vector4k.md) §5.3. The **TRegFile col-mode read** ([`tregfile4k.md`](tregfile4k.md) §7, activated by `is_transpose = 1` on the VEC read port §3.3) gives `TCOL*` a direct route to `TROW*`-shaped work without any RF replay: microcode sets `is_transpose = 1` on the operand of `TCOL*`, and TRegFile-4K delivers the **chunk-grid transpose** of the operand on every strip of the fetch epoch. The staged payload in `SA` / `SB` / `SC` is the transposed tile, and stage (A) consumes it just as it would for a `TROW*` on that transposed tile.
+
+- For tile shapes with `W = C·E ≤ 64 B` or `W = 512 B` (whole-row regimes in [`tregfile4k.md`](tregfile4k.md) §7.5), the col-mode delivery is **element-level clean**: stage (A) forwards it unchanged and executes the same microcode program (§5.6.1) as `TROW*` would on a `C × R` tile. The `#W` factor collapses to the smaller of the two axes (`min(⌈C/N_acc⌉, ⌈R/N_acc⌉)` etc.) whenever the transposed axis is shorter.
+- For the remaining `W` regimes, stage (A)'s byte-segment-recombine fixup (§5.3) lifts the chunk-grid transpose to element granularity in-line; the microcode program is unchanged, and the end-to-end cycle count matches the whole-row case plus at most a few stage-(A) re-timing cycles.
+- §5.7 gives four worked microcode examples (wide / narrow × row / col reduce) that demonstrate both the per-beat `tilelet_xpose` mechanism and the narrow-tile accumulator merge.
+
+This mechanism **re-uses** the col-mode support in [`tregfile4k.md`](tregfile4k.md) §7 end-to-end: same chunk granularity, same element-level fixup in stage (A), and subject to the same §6 rule R2 scheduling constraint — which only shows up when `TCOL*` is chained with another operand of different transpose mode in one instruction (§6.3).
+
+### 7.4 Complex ops
+
+- **`TSORT32`** / **`TMRGSORT`**: can now emit **value + index** tiles as `D0` / `D1` in a single instruction (vs. two issues in v1). `C` mask optionally marks sort-active lanes (padding otherwise).
+- **`TSELECT D = C ? A : B`**: the classic predicated-select — A in `SA`, B in `SB`, per-element predicate in the `C` bitmask. Uses `alu_op = SELECT`, `mask_src = SC_mask`, with the lane gate returning `B` when the predicate is 0 (§5.8).
+- **Masked reductions** (`TCOL_MASKED`, `TROW_MASKED`, masked `TROWARGMAX`): microcode sets `mask_src = SC_mask` on every `ACCUM` beat; lanes with `M = 0` bypass the ALU and leave `acc[k]` unchanged (or write the unmasked identity `0` / `-∞` / `+∞` depending on the reduction op).
+- **`TGATHER`**: the index tile travels in `SA` (as a value tile, interpreted as row-pointers); `SB` or `SC` is unused depending on the gather flavour. Purely-value-keyed gathers set `has_mask = 0`; conditional gathers with an enable mask set `has_mask = 1`.
+- **`TQUANT`** / **`TDEQUANT`**: the scalar `scale` / `zero_point` travel in `SX` / `SY` (§4.3); an optional per-tile calibration mask (e.g. "quantize only the nonzero elements") uses `C`.
+
+### 7.5 Novel ops: in-tile matrix inverse (`TINV`) and row-range product (`TROWRANGE_MUL`)
+
+Two new PTO instructions that are **natively enabled** by the v2 pipeline — specifically by the unification of the accumulator ALU with stage (B) (§5.5), the Acc feedback butterfly for MERGE (§5.5.1), the per-beat microcode (§5.4), and the predicate path for masked reductions (§5.6). Both are implementable with **no new hardware beyond the `RECIP` / `RSQRT` alu_ops** noted in §5.4 (used only by `TINV`).
+
+#### 7.5.1 `TINV` — tile matrix inverse
+
+**ISA form.** `TINV Tdst, Tsrc` (single-tile form) or `TINV {Tdst+}, {Tsrc+}, N` (multi-tile form for wide matrices). Metadata in `SOP` (§4.4) supplies `shape.x = shape.y = N` (square), `format`, and an implicit `num_tiles = ⌈N² · E / 4096⌉` for the multi-tile form. The matrix is square (non-square rectangular inverse is out of scope — it's the Moore–Penrose pseudo-inverse which is a different algorithm).
+
+**Supported shapes.**
+
+| Format (E) | Single-tile maximum `N` (N²·E ≤ 4 KB) | Multi-tile `N = 128` case | Notes |
+|------------|---------------------------------------|---------------------------|-------|
+| FP32 (4 B) | 32×32 (N² = 1024 elements = 4 KB) | **16 tiles in / 16 tiles out** | Production precision for ML kernel inverses (e.g. NeRF / 3DGS pose solvers, Kalman updates). |
+| FP16 / BF16 (2 B) | 45×45 — in practice pad to 32×32 or 64×64 (32×32 × 2 B = 2 KB, half-tile; 64×64 × 2 B = 8 KB = 2 tiles) | **8 tiles in / 8 tiles out** | Typical inference-time precision; Newton–Raphson refinement keeps the residual within ~1 ulp. |
+| FP8 (1 B) | 64×64 (N² = 4096 = exact tile) | **4 tiles in / 4 tiles out** | Quantised-model kernels; one NR refinement usually enough. |
+| FP4 (0.5 B) | 90×90 — in practice 64×64 (2 KB) | **2 tiles in / 2 tiles out** | Exploratory; numerical stability is a kernel-level concern. |
+
+The **128×128 case at FP32 is 16 tiles in / 16 tiles out**, the largest supported shape. It is expressed in the ISA with a **tile-register range** (e.g. `{T0..T15}`), encoded as `base_reg + log2(num_tiles)` in the PTO instruction word (3 bits for `log2(num_tiles) ∈ {0, 1, 2, 3, 4}`, i.e. `num_tiles ∈ {1, 2, 4, 8, 16}`).
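+
+The shape limits in the table follow directly from the 4 KB tile size. The sketch below recomputes them; the helper names are hypothetical, and element sizes are given in bits so that FP4 needs no fractional arithmetic.
+
+```python
+import math
+
+TILE_BITS = 4096 * 8  # one 4 KB tile
+
+
+def max_single_tile_n(elem_bits):
+    """Largest N such that an N x N matrix fits in one tile."""
+    return math.isqrt(TILE_BITS // elem_bits)
+
+
+def num_tiles(n, elem_bits):
+    """Tiles needed for an N x N matrix: ceil(N^2 * E / 4096) with E in bytes."""
+    return -(-(n * n * elem_bits) // TILE_BITS)
+
+
+def num_tiles_field(n, elem_bits):
+    """log2(num_tiles); defined only when the tile count is a power of two."""
+    t = num_tiles(n, elem_bits)
+    if t & (t - 1):
+        raise ValueError("tile count is not a power of two")
+    return t.bit_length() - 1
+
+
+for name, bits in [("FP32", 32), ("FP16", 16), ("FP8", 8), ("FP4", 4)]:
+    print(name, max_single_tile_n(bits), num_tiles(128, bits))
+```
+
+For the 128×128 FP32 case `num_tiles_field` evaluates to 4, the largest value the range encoding has to represent.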
+
+**Algorithm (Gauss–Jordan with row normalisation, ALU/Acc-resident).** Uses the Acc as the staging ground for the **normalised pivot row** and the parallel-tracked inverse row, and uses the shared ALU in FMA mode for the elimination step. No new datapath — only microcode.
+
+Per iteration `k ∈ 0..N-1`:
+
+```
+# Phase Pk.1 — read pivot element A[k,k] and compute 1/A[k,k]
+  beat: src_A = SA[strip = k/R_per_strip, element k%R_per_strip × N + k],
+        alu_op = RECIP                  # ~11-bit approximate 1/A[k,k]
+        → broadcast result to Acc[pivot_recip_slot] (single scalar)
+
+  beats (NR refinement, ×2):
+        src_A = SA[pivot element as above], src_Z = ACC_READ(pivot_recip_slot),
+        alu_op = FNMA                    # x -= x·(A[k,k]·x - 1)  — two FMA NR iters
+        → Acc[pivot_recip_slot] ← refined reciprocal r_k
+
+# Phase Pk.2 — normalise pivot row A[k,:]  and identity row I[k,:] (= output row)
+  for each strip s covering row k (N/C_per_strip strips, C_per_strip = 512/E):
+    beat: src_A = SA[row k, strip s], src_B = BROADCAST(Acc[pivot_recip_slot]),
+          alu_op = MUL
+          → SA write-back: row k, strip s                 # A[k,:] *= r_k
+    beat: src_A = SB[row k, strip s], src_B = BROADCAST(Acc[pivot_recip_slot]),
+          alu_op = MUL
+          → SB write-back: row k, strip s                 # I[k,:] *= r_k   (I initially = identity)
+
+# Phase Pk.3 — eliminate column k in all other rows i ≠ k
+  # Read the column-k slice (all N elements of column k) into Acc[col_slots, 0..N-1]
+  # using tilelet_xpose = 1 on SA (stage-(A) fixup, §3.3b / §5.2) to get column in row-strips
+  for col_strips c in 0..⌈N/128⌉-1:
+    beat: src_A = SA[column k, strip c, xp = 1],
+          alu_op = PASS_A, acc_op = INIT, acc_slot = col_k_base + c
+          → Acc[col_k_base..col_k_base + N - 1] ← A[0..N-1, k]
+
+  # For each i ≠ k, update row i: A[i,:] -= A[i,k] · A[k,:]
+  for i in 0..N-1, i ≠ k:
+    factor = Acc[col_k_base + i]              # scalar per row
+    for each strip s covering row i:
+      beat: src_A = SA[row i, strip s],
+            src_B = SA[row k, strip s]         # already normalised from Phase Pk.2
+            src_Z = BROADCAST(factor)
+            alu_op = FNMA                      # A[i,s] ← A[i,s] − factor · A[k,s]
+            → SA write-back: row i, strip s
+      beat: src_A = SB[row i, strip s],
+            src_B = SB[row k, strip s]
+            src_Z = BROADCAST(factor)
+            alu_op = FNMA                      # I[i,s] ← I[i,s] − factor · I[k,s]
+            → SB write-back: row i, strip s
+
+# After k = N-1: SA holds I (within numerical precision), SB holds A⁻¹.
+```
+
+On exit, the content of `SB` is retired to `Tdst` (multi-tile retire — one strip per write port per cycle, 8 strips per tile × num_tiles).
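+
+A functional reference model of the loop above is useful for checking microcode output. The sketch below follows phases Pk.1 to Pk.3 on plain Python lists. It is not cycle-accurate, and `recip_seed` is a hypothetical stand-in for the `RECIP` alu_op that simply rounds the exact reciprocal to about 11 mantissa bits.
+
+```python
+import math
+
+
+def recip_seed(a, bits=11):
+    """Stand-in for RECIP: 1/a rounded to about `bits` mantissa bits."""
+    m, e = math.frexp(1.0 / a)
+    return math.ldexp(round(m * (1 << bits)) / (1 << bits), e)
+
+
+def refined_recip(a, nr_iters=2):
+    """Phase Pk.1: seed plus Newton-Raphson steps x -= x * (a * x - 1)."""
+    x = recip_seed(a)
+    for _ in range(nr_iters):
+        x -= x * (a * x - 1.0)
+    return x
+
+
+def tinv_reference(a_in):
+    """Gauss-Jordan inverse without pivoting, mirroring phases Pk.1 to Pk.3."""
+    n = len(a_in)
+    sa = [list(map(float, row)) for row in a_in]                          # SA: copy of A
+    sb = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # SB: identity
+    for k in range(n):
+        r_k = refined_recip(sa[k][k])            # Pk.1: refined reciprocal of the pivot
+        sa[k] = [v * r_k for v in sa[k]]         # Pk.2: A[k,:] *= r_k
+        sb[k] = [v * r_k for v in sb[k]]         # Pk.2: I[k,:] *= r_k
+        col_k = [sa[i][k] for i in range(n)]     # Pk.3: column-k slice (held in Acc)
+        for i in range(n):
+            if i == k:
+                continue
+            f = col_k[i]
+            sa[i] = [x - f * p for x, p in zip(sa[i], sa[k])]
+            sb[i] = [x - f * p for x, p in zip(sb[i], sb[k])]
+    return sb
+
+
+print(tinv_reference([[4, 7], [2, 6]]))
+```
+
+Like the baseline microcode, the model does no pivoting, so a zero pivot raises `ZeroDivisionError` instead of being handled.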
+
+**Initial conditions.** `SA` is loaded from `Tsrc` (normal 2-read-port fetch prologue). `SB` is loaded with the identity matrix — this is done **implicitly at the start of the instruction** by a hardware `SB = IDENT(N)` init beat sequence that uses `alu_op = PASS_A, src_A = BROADCAST(0.0)` for off-diagonal elements and `BROADCAST(1.0)` for diagonal elements (gated by a hardware-generated identity-diagonal mask produced in `SC`). This identity-init phase is ⌈N/R_per_strip⌉ beats, amortised once per instruction.
+
+**Key hardware reuse.**
+- **Unified ALU + Acc feedback (§5.5)**: the `RECIP` + NR refinement, the row normalise, and the row eliminate are all `FMA` / `MUL` / `FNMA` beats that source from `SA` / `SB` / `Acc` via the per-lane 6:1 MUX (§5.8) and write back through the same fabric. Zero new RTL in the datapath proper.
+- **Acc as scratch (§5.5.1)**: the column-k slice (N scalars) and the scalar reciprocal `r_k` live in Acc slots. N = 128 fits in half a ping-pong plane, so no overflow; for `MERGE_STAGE`-free kernels the parity butterfly is idle and acts as a transparent pipe.
+- **`tilelet_xpose` (§3.3b)**: the column read in Phase Pk.3 uses `xp = 1` to collapse the cross-strip column gather to one strip per 128 elements, instead of 128 single-element reads.
+- **Per-lane mask gate (§5.8)**: lanes outside column range `k` (when partial strip, `k % C_per_strip ≠ 0`) are gated via `SC` predicate, no special-case microcode.
+
+**Cycle count.**
+
+Let `S_row = ⌈N · E / 512⌉` = strips per row (1 whenever N·E ≤ 512 B; exactly 1 for the 128×128 FP32 case because 128·4 = 512 B), and let `S_col = ⌈N / 128⌉` = column-gather beats per iteration (the Phase Pk.3 loop bound). Per iteration k:
+
+- Phase Pk.1: 1 + 2 = 3 beats
+- Phase Pk.2: 2 × S_row beats (SA and SB pivot-row normalise)
+- Phase Pk.3: S_col + (N−1) × 2 × S_row beats  (column gather + per-row elimination in SA and SB)
+
+Total beats ≈ `N × (3 + 2 · S_row + S_col + 2 · (N−1) · S_row) ≈ 2 · N² · S_row + N · S_col + 3 · N`.
+
+Worked cases:
+
+| Shape | `S_row` | `S_col` | Total beats | At 1 GHz |
+|-------|---------|---------|-------------|----------|
+| 32×32 FP32 (single tile) | 1 | 1 | ~2·32² + 32 + 96 = **2.2 K beats** | 2.2 µs |
+| 64×64 FP8 (single tile) | 1 | 1 | ~2·64² + 64 + 192 = **8.4 K beats** | 8.4 µs |
+| **128×128 FP32 (16 tiles)** | **1** | **1** | **~2·128² + 128 + 384 = 33 K beats** | **33 µs** |
+| 256×256 FP16 (32 tiles) — conceptual extension | 1 | 2 | ~132 K beats | 132 µs |
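+
+The beat totals can be recomputed from the per-phase counts. This is a sketch with a hypothetical function name; it assumes `S_col = ⌈N / 128⌉`, taken from the Phase Pk.3 loop bound, and one beat per cycle when converting to time.
+
+```python
+import math
+
+
+def tinv_beats(n, elem_bytes):
+    """Total microcode beats for an N x N TINV, summed over all N iterations."""
+    s_row = math.ceil(n * elem_bytes / 512)   # strips per matrix row
+    s_col = math.ceil(n / 128)                # column-gather beats (assumed loop bound)
+    per_iter = 3 + 2 * s_row + s_col + 2 * (n - 1) * s_row
+    return n * per_iter
+
+
+print(tinv_beats(32, 4))    # 32x32 FP32
+print(tinv_beats(128, 4))   # 128x128 FP32
+```
+
+`tinv_beats(128, 4)` evaluates to 33280, i.e. the ~33 K beats quoted for the 16-tile FP32 case.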
+
+For reference, a naive CPU FP32 128×128 inverse is ~4 M cycles at ~25 GFLOPs effective on a single vector lane — `TINV` gives a **~100× speed-up** on the same silicon footprint by keeping the matrix resident in staging throughout the O(N³) algorithm.
+
+**Fetch prologue.** The multi-tile `TINV` prologue extends the normal 2-read-port fetch (§6) by `num_tiles − 1` extra epochs — each 8 cy. For 16 tiles in, this is **~120 cy prologue** on top of the ~8-cy baseline, negligible next to the 33 K-beat compute phase.
+
+**Numerical considerations** (kernel-level, not hardware):
+- No partial pivoting in the baseline microcode — caller must ensure `A` is well-conditioned (e.g. scale rows before the call). An extension microcode variant with row-pivot search (adds ~`N²` CMP beats, ~+50 % latency) is programmable as a different `SOP.ucode_base`.
+- Reciprocal refinement uses 2 NR iterations; this gives ~full-precision FP32 after refinement from the 11-bit seed. One NR iteration suffices for FP16; the 11-bit seed itself suffices for FP8.
+- Ill-conditioned inputs produce large residuals — the same failure mode as any library LU solver. Caller can compute `residual = A · A_inv − I` and reject on norm.
+
+#### 7.5.2 `TROWRANGE_MUL` — column-wise product over a row sub-range
+
+**ISA form.** `TROWRANGE_MUL Tdst, Tsrc, r_start, r_end` where `r_start, r_end` come from `SX` / `SY` (§4.3; typically scalar-GPR sourced so that the range is dynamic and cheap, at 0 vector-side cycles). The output is a single row (1 × C) written to row 0 of `Tdst`, where `out[c] = ∏_{r=r_start}^{r_end−1} Tsrc[r, c]`. Optional `has_mask = 1` with the `C` operand tile as a per-element bitmask further restricts the product to the masked elements (e.g. for sparse-tensor row products).
+
+**Semantic positioning.** This is the **multiplicative analogue of `TCOLREDUCE_ADD`** restricted to a row sub-range. Useful for:
+- **Log-space softmax denominators** (running-product normalisation in attention's exp-stabilised variant);
+- **Geometric-mean reductions** along a time axis (physics / signal-processing kernels);
+- **Joint-probability aggregation** in probabilistic-graphical-model kernels;
+- **Polynomial-root / characteristic-polynomial evaluation** (Newton's identities) where the coefficients live as tile rows and a product over the live root set is needed.
+
+Prior to v2 these were typically emulated by `exp(sum(log(...)))` — 3 tile passes (log, sum, exp) with accumulated precision loss. `TROWRANGE_MUL` does it in one pass with no log/exp.
+
+**Algorithm.** A straightforward `MUL`-accumulate loop using the same hardware as the existing `TCOL*` reductions, with two twists: (1) the alu_op is `MUL` rather than `ADD`, and (2) the per-strip predicate activates only for strips that intersect `[r_start, r_end)`.
+
+```
+# Phase P0 — initialise the accumulator to 1.0 across all C columns
+  beat: src_A = IMM_1.0 (broadcast), alu_op = PASS_A,
+        acc_op = INIT, acc_slot = col_acc_base..col_acc_base + C − 1
+
+# Phase P1 — MUL-accumulate over the active row range
+  for strip s in 0..S-1:                              # S = strips per tile for Tsrc
+    strip_first_row = s · R_per_strip
+    strip_last_row  = strip_first_row + R_per_strip − 1
+    if strip_last_row < r_start or strip_first_row ≥ r_end:
+        skip strip (no beat emitted)
+    else:
+        # Build per-element row-range predicate for this strip
+        predicate[lane] = (strip_first_row + lane_row(lane) ≥ r_start) AND
+                          (strip_first_row + lane_row(lane) < r_end)
+        # Optional user mask AND into predicate if has_mask
+        if has_mask: predicate &= SC_strip[s]
+        beat: src_A = SA[strip s],
+              src_B = ACC_READ(col_acc_base, C),
+              alu_op = MUL,
+              mask_src = (synthesised row-range predicate, AND SC_mask if has_mask),
+              acc_op = ACCUM, acc_slot = col_acc_base..col_acc_base + C − 1
+              # Lanes with predicate = 0 leave acc unchanged (per-lane output gate, §5.8)
+              # Lanes with predicate = 1: acc[c] ← acc[c] · Tsrc[row_in_strip, c]
+
+# Phase P2 — retire: Acc → Tdst row 0
+  beat: src_A = ACC_READ(col_acc_base, C), alu_op = PASS_A,
+        wr_en_D0 = 1, wr_strip_D0 = 0
+```
+
+**Predicate synthesis.** The row-range predicate `(strip_first_row + lane_row(lane) ∈ [r_start, r_end))` is a per-element boolean computed from 3 scalars (`r_start`, `r_end`, `strip_first_row`) and the per-lane geometric index. This generator is a ~100-gate combinational block in the stage (B) control path — effectively free. One microcode bit (`rrange_pred_en`) gates it into the per-lane output-gate input (§5.8) in place of (or AND-combined with) `SC_mask`.
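+
+A functional reference for the three phases, including the synthesised predicate, might look like the sketch below. It is a software model under stated assumptions: `rowrange_mul` is a hypothetical name, the tile is a plain list of rows, and rows inside one strip are multiplied sequentially rather than in a single hardware beat.
+
+```python
+def rowrange_mul(tile, r_start, r_end, mask=None, rows_per_strip=1):
+    """Column-wise product of tile[r_start:r_end], processed strip by strip.
+
+    tile: list of rows, each a list of C numbers.
+    mask: optional same-shape list of 0/1 values (the user mask in SC).
+    Returns (out_row, beats), where beats counts P0 + active strips + P2.
+    """
+    n_rows, n_cols = len(tile), len(tile[0])
+    acc = [1.0] * n_cols                                  # Phase P0: INIT to 1.0
+    beats = 1
+    for first in range(0, n_rows, rows_per_strip):        # Phase P1: one strip per step
+        last = min(first + rows_per_strip, n_rows) - 1
+        if last < r_start or first >= r_end:
+            continue                                      # strip misses the range: no beat
+        beats += 1
+        for r in range(first, last + 1):
+            in_range = r_start <= r < r_end               # synthesised row-range predicate
+            for c in range(n_cols):
+                user_ok = mask is None or bool(mask[r][c])  # AND with the optional mask
+                if in_range and user_ok:
+                    acc[c] *= tile[r][c]
+    beats += 1                                            # Phase P2: retire beat
+    return acc, beats
+
+
+tile = [[1, 2], [3, 4], [5, 6], [7, 8]]
+print(rowrange_mul(tile, 1, 3))
+```
+
+Lanes that fail either the range predicate or the user mask leave the accumulator untouched, which is the skip-lane behaviour the output gate provides.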
+
+**Cycle count.**
+
+Let `S_active` be the number of strips that intersect the range. It is bounded by `min(S, ⌈(r_end − r_start) / R_per_strip⌉ + 1)`, where the `+ 1` covers a range that straddles a strip boundary and `S = 8` is the number of strips in a tile. For a full-tile range, `S_active = 8`.
+
+- P0 (accumulator init): 1 beat (C lanes wide, all columns in parallel)
+- P1 (MUL-accumulate): `S_active` beats
+- P2 (retire): 1 beat
+
+**Total beats: `1 + S_active + 1` ≤ 10 beats.**
+
+With the 8-cycle fetch prologue (single value-tile, §6.2), end-to-end latency for a typical row-range multiply is **≈ 18 cycles** (8 cy fetch + at most 10 beats), comparable to an elementwise op and far cheaper than the emulated `exp(sum(log(...)))` triple pass (~60+ cycles on v1).
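+
+The strip count and the resulting latency can be checked numerically. In this sketch the function names are hypothetical, one beat is assumed to cost one cycle, and compute is assumed not to overlap the fetch prologue.
+
+```python
+import math
+
+
+def s_active_exact(r_start, r_end, rows_per_strip, num_strips=8):
+    """Number of strips whose row span intersects [r_start, r_end)."""
+    if r_end <= r_start:
+        return 0
+    first = r_start // rows_per_strip
+    last = min((r_end - 1) // rows_per_strip, num_strips - 1)
+    return max(0, last - first + 1)
+
+
+def s_active_bound(r_start, r_end, rows_per_strip, num_strips=8):
+    """Closed-form upper bound, clamped to the strip count of a tile."""
+    span = max(0, r_end - r_start)
+    return min(num_strips, math.ceil(span / rows_per_strip) + 1)
+
+
+def rowrange_latency(r_start, r_end, rows_per_strip, fetch_cycles=8):
+    """Fetch prologue + P0 + P1 + P2, counting one cycle per beat."""
+    return fetch_cycles + 1 + s_active_exact(r_start, r_end, rows_per_strip) + 1
+
+
+print(rowrange_latency(0, 64, 8))   # full-tile range, 8 rows per strip
+print(rowrange_latency(7, 9, 8))    # two rows that straddle a strip boundary
+```
+
+A short range that straddles a strip boundary still touches two strips, which is why the closed form carries the extra `+ 1`.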
+
+**Key hardware reuse.**
+- `ACCUM` / `INIT` / `READOUT` mode machinery (§5.5) — unchanged.
+- `alu_op = MUL` is already a first-class ALU operation (§5.8) — the shared stage-(B) FMA unit does `MUL` as `FMA` with `src_Z = 0`.
+- The per-lane output-gate predicate (§5.8) already supports the "skip-lane" semantic — just takes a different predicate source.
+- Scalar GPR `r_start` / `r_end` arrive via `SX` / `SY` at issue (§4.3); 0 vector-side cycles.
+
+**Interaction with `tilelet_xpose` / column-range form.** Setting `tilelet_xpose = 1` on the SA read inside Phase P1 transposes the range interpretation — the instruction then computes a **column-range product into a single column** instead of a row-range product into a single row. This reuses the same microcode verbatim and yields a free `TCOLRANGE_MUL` as a variant selected by a flag in the instruction word.
+
+#### 7.5.3 `TMRGSORT` — full-tile mergesort via a 256-lane shuffle + compare-swap primitive
+
+**Motivation.** Sorting is a workhorse primitive for top-k, argmax-over-ranges, quantile estimation, histogram bucketing, beam-search rank, and sparse-attention pruning. v1 implements this with hard-wired sort networks: `TSORT32` sorts fixed 32-blocks in parallel across a tile, and `TMRGSORT` merges such runs up to a fixed maximum block length. Each distinct block size needs its own RTL network (32-wide, 128-wide, etc.), and unusual block sizes are not supported at all.
+
+v2 replaces the family of hard-wired networks with **one reconfigurable primitive**: a **256-lane perfect-shuffle + compare-swap (CAS) block** sitting inside stage (B), controlled per beat by microcode. An arbitrary bitonic / Batcher merge network of any power-of-2 size is then expressed as a microcode loop of `O(log² N)` beats of this primitive. The result: one piece of silicon serves every sort kernel (fixed-block, full-tile, value-plus-index, key-range restricted, narrow-format) including future variants that were not anticipated at tape-out.
+
+**Hardware: 256-lane shuffle + CAS block.**
+
+- **Location**: inside stage (B), activated when `alu_op ∈ {SHUFFLE_CAS_UP, SHUFFLE_CAS_DOWN}` (§5.4).
+- **Datapath**: 256 lanes wide (= 2 × 512 B strips at FP32, equivalent wider SIMD for narrower formats). The 256 lanes flow through (i) a **log-stride shuffle butterfly** (log₂(256) = 8 stages of 2:1 MUXes, programmable stride `∈ {1, 2, 4, 8, 16, 32, 64, 128}`), then (ii) **128 parallel compare-swap units** that take adjacent output lanes `(2k, 2k+1)` and emit `(min, max)` for `UP` direction or `(max, min)` for `DOWN` direction. Each CAS unit carries an optional **index companion lane** (default 16 b) that tracks the original position — the comparator only examines the value half, but both value and index are transferred together on swap.
+- **Inputs**: two 128-lane strips `(src_A, src_B)` from any pair of staging registers, tile-element broadcasts, or ACC slots (reusing the standard stage-(B) input MUX, §5.8). In the common sort loop they come from `SA` (values) and `SB` (indices), with write-back through the same two buses on the next cycle.
+- **Outputs**: two 128-lane output strips, routed through the normal stage-(B) output fabric back into staging (for iterative bitonic stages) or to the retire path (final result).
+- **Area cost**: log-stride shuffle butterfly (8 stages × 128 × 32 b 2:1 MUX) ≈ 50 K gate; 128 × value-comparator-plus-swap ≈ 70 K gate; direction-mux + control ≈ 10 K gate. **Total ≈ 130 K gate system-wide**, about 7 % of the stage (B) compute core and negligible next to the crossbar / cross-group-net savings of §10.1.
+
+**Single-beat primitive: one level of a sort network.** A single `SHUFFLE_CAS_*` beat consumes a 256-lane bitonic / sorted input, applies one permutation level (determined by `shuffle_stride`), and performs 128 parallel compare-swaps. This is exactly **one layer** of Batcher's bitonic sort or bitonic merge. Multiple layers chain through the staging register between beats — the staging write port absorbs the 256 lanes of output, the next beat reads them back as input.
+
+**Building full sort networks in microcode.**
+
+Batcher's bitonic sort of `N = 2^p` elements has depth
+
+$$
+D(N) = \frac{p(p+1)}{2} \qquad \text{layers (each layer = 1 CAS stage)}.
+$$
+
+Each layer performs `N/2` independent CAS operations. With a 128-CAS datapath, one layer costs `⌈(N/2) / 128⌉ = ⌈N / 256⌉` beats. Total beats for an in-tile `N`-element sort:
+
+$$
+B_{\text{sort}}(N) = D(N) \cdot \lceil N / 256 \rceil = \frac{p(p+1)}{2} \cdot \lceil 2^p / 256 \rceil.
+$$
+
+For `N ≤ 256` the ceiling is 1, so the entire sort fits in `D(N)` beats, one beat per layer: `log₂ N` beats for the final merge, and `p(p+1)/2 ≈ log² N / 2` beats for the full sort. For `N > 256` each layer spreads over multiple beats, one strip-pair per beat.
+
+| Sort size `N` | p = log₂N | Depth `D(N)` | Beats per layer | **Total beats** | @ 1 GHz |
+|---------------|-----------|---------------|------------------|-----------------|---------|
+| 32  | 5  | 15 | 1 (32 < 256, datapath idle on unused lanes) | **15**  | 15 ns |
+| 64  | 6  | 21 | 1 | **21**  | 21 ns |
+| 128 | 7  | 28 | 1 | **28**  | 28 ns |
+| **256** | **8** | **36** | **1** | **36**  | **36 ns** |
+| 512 | 9  | 45 | 2 | **90**  | 90 ns |
+| 1024 (1 tile, FP32) | 10 | 55 | 4 | **220** | 220 ns |
+| 2048 (1 tile, FP16) | 11 | 66 | 8 | **528** | 528 ns |
+| 4096 (1 tile, FP8)  | 12 | 78 | 16 | **1 248** | 1.25 µs |
+| 8192 (1 tile, FP4)  | 13 | 91 | 32 | **2 912** | 2.9 µs |
+
+For reference, a CPU comparison-sort of 1024 FP32 elements runs in ~15 µs on a single core — `TMRGSORT` at 220 ns is ~**70× faster** on the same silicon, with no cache pressure because the entire sort stays in staging.
+
+**Mergesort decomposition (optional optimisation).** For large `N` the same hardware can run a merge-sort rather than a single monolithic bitonic sort, which keeps the initial in-block sorts free of cross-block shuffle strides:
+
+```
+# Level A: sort each 256-block with a bitonic network (36 layers, 1 beat each)
+for block in 0..N/256 − 1:
+    for layer in bitonic_sort_layers(256):        # 36 layers
+        SHUFFLE_CAS_UP  src_A = SA[2·block],   src_B = SA[2·block + 1],
+                        shuffle_stride = layer.stride
+                     → writeback to the same two strips
+
+# Level B..K: iteratively double the merged-block length
+#   each merge of two sorted runs of length 2^k into a run of length 2^{k+1}
+#   takes log₂(2·2^k) = k+1 bitonic-merge layers, N/2^{k+1} pairs per level,
+#   and 2^{k+1}/256 beats per layer (one strip-pair per beat)
+for k in 8..p−1:
+    run_len = 2^{k+1}
+    for pair in 0..N/run_len − 1:
+        for layer in 0..k:                        # bitonic-merge, k+1 layers
+            stride = 2^(k − layer)
+            for beat in 0..run_len/256 − 1:       # one strip-pair per beat
+                SHUFFLE_CAS_UP  src_A, src_B = beat-th strip-pair of the pair's two sub-runs,
+                                shuffle_stride = stride
+                             → writeback
+```
+
+Total beats under mergesort decomposition equal the monolithic bitonic bound above (e.g. 4 × 36 + 2 × 9 × 2 + 1 × 10 × 4 = 220 for `N = 1024`), because a bitonic sort is itself a set of block sorts followed by bitonic merges; both follow the same `O(N log² N)` asymptotic. The document quotes the monolithic bound; either decomposition uses the same hardware and the same `SHUFFLE_CAS_*` beats, just a different microcode program.
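+
+The loop above calls `bitonic_sort_layers(256)` without spelling out the schedule. A minimal sketch of one possible expansion, in the same pseudocode notation, is given below. It follows the textbook Batcher construction; the field name `block_len` and the function body are illustrative assumptions rather than the contents of the microcode ROM, and the per-block `UP` / `DOWN` direction choice is left out.
+
+```
+# Hypothetical layer schedule for a bitonic sort of n = 2^p keys (sketch only)
+def bitonic_sort_layers(n):
+    for stage in 1..log₂(n):                  # stage m builds sorted runs of length 2^m
+        for step in 0..stage − 1:             # stage m contributes m layers
+            yield layer(block_len = 2^stage,
+                        stride    = 2^(stage − 1 − step))
+
+# strides for n = 256:  1 | 2 1 | 4 2 1 | … | 128 64 32 16 8 4 2 1
+# layer count = 1 + 2 + … + 8 = 36, all strides within {1, 2, 4, …, 128}
+```
+
+Every stride produced for `n = 256` falls inside the butterfly's programmable set, which is consistent with the one-beat-per-layer count of 36.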
+
+**Value + index tiles (`D0` / `D1`).** Each CAS unit treats the `(value, index)` pair as a single 32 b + 16 b compound lane. The shuffle butterfly carries both halves in lockstep. Comparator logic only examines the value half. When `TMRGSORT` is issued with `retire_mask = 2'b11` the staging write-back also emits the index strip to `SB`, and the final retire writes `SA` to `D0` (sorted values) and `SB` to `D1` (permutation indices).
+
+**Key-range / partial sort variant.** Setting `mask_src = SC_mask` on the shuffle+CAS beat applies a per-lane predicate — lanes with `M = 0` are pinned to `+∞` (for `UP`) or `−∞` (for `DOWN`) before the comparator, so they sink to the appropriate end of the sorted output. This converts `TMRGSORT` into a **partial sort** (only masked-in elements participate in the ordering) with zero extra beats — another example of the per-lane predicate (§5.8) multiplying into an unrelated instruction for free.
+
+**Hardware reuse.**
+
+| v2 mechanism | How `TMRGSORT` uses it |
+|--------------|------------------------|
+| Stage (B) input MUX (§5.8) | Selects `SA` / `SB` staging reads as 256-lane shuffle+CAS input |
+| Staging write path | Output of the shuffle+CAS writes back to staging for the next layer; no new write fabric |
+| Per-lane mask gate (§5.8) | `SC_mask` gates lanes into `±∞` for partial sort (free variant) |
+| `tilelet_xpose` (§3.3b) | Used in the cross-strip merge levels (k ≥ 8) to bring lanes from non-adjacent strips into `src_A` / `src_B` of one beat without additional staging copies |
+| Microcode ROM (§5.4, §9 item 8) | Stores the depth-`D(N)` layer-schedule table for each supported `N`; ~8 KB of extra ROM (`8 × p × p` entries × 16 b/entry) |
+| Dual retire (§3.1) | `D0 = values`, `D1 = indices` in one instruction, no second issue |
+
+**New hardware** (vs. v1's hard-wired `TSORT32` / `TMRGSORT` networks):
+
+- 256-lane log-stride shuffle butterfly (~50 K gate)
+- 128 × value-comparator-and-swap units (~70 K gate)
+- Direction mux + control decode (~10 K gate)
+- **Net**: ~130 K gate **added**, but v1's family of hard-wired sort networks (32-wide, 128-wide, merge-stages) is **removed** — those networks cost roughly the same (~100–150 K gate combined) but only support the exact sizes they were synthesised for. v2 gets the same silicon budget but supports **all** power-of-2 sizes from 32 to 8192 (FP4 tile) in one piece of hardware.
+
+**Cycle count — worked example for 1024 FP32 sort**:
+
+- 4 × Level-A block sorts: 4 × 36 = **144 beats**
+- 2 × 512-block merges (Level B): 2 × 9 × 2 = **36 beats**
+- 1 × 1024-block merge (Level C): 10 × 4 = **40 beats**
+- **Total: 220 beats** = 220 ns @ 1 GHz.
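+
+The same bookkeeping can be written for general `N = 2^p ≥ 512`. The sketch below assumes only the schedule described above (36-layer block sorts, then merge level `k` with `N / 2^{k+1}` pairs, `k + 1` layers, and `2^{k+1} / 256` beats per layer); the symbol $B_{\text{merge}}$ is introduced here for illustration.
+
+$$
+B_{\text{merge}}(N) = \frac{N}{256} \cdot 36 + \sum_{k=8}^{p-1} \frac{N}{2^{k+1}} \cdot (k+1) \cdot \frac{2^{k+1}}{256} = \frac{N}{256} \left( 36 + \sum_{m=9}^{p} m \right) = \frac{N}{256} \cdot \frac{p(p+1)}{2}.
+$$
+
+This coincides with $B_{\text{sort}}(N)$. For `p = 10` it gives 4 × (36 + 9 + 10) = 220, matching the three bullet totals 144 + 36 + 40.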
+
+**Comparison with v1.**
+
+| Aspect | v1 (`TSORT32` + `TMRGSORT`, hard-wired) | v2 (`TMRGSORT`, microcode + reconfigurable CAS) |
+|--------|------------------------------------------|----------------------------------------------|
+| Block sizes supported | 32 (and fixed MERGE stages up to tile) | **Any power-of-2** from 32 to 8192 (full FP4 tile) |
+| Per-block-size silicon | Separate network per size | One primitive, schedule picked by microcode |
+| 1024-element FP32 sort | ~15 beats × 32 blocks ≈ 480 beats when retired sequentially, and longer if merging is not fully pipelined | **220 beats** |
+| Value + index output | Two instructions in v1 (values first, then indices on a second issue with a different op) | **One instruction** — dual retire (§3.1) |
+| Partial sort (predicated) | Not supported — required caller to pre-mask the input tile | **Free variant** via `mask_src = SC_mask` (§5.8) |
+| Narrow-format sort (FP8 / FP4) | Not supported in v1 after format removal | Supported — CAS comparator runs native width |
+| Key + payload (e.g. sort by value, carry pointer) | Indices only (16 b) | Indices (16 b) or user-defined payload via an extra staging register |
+
+**Summary.** The entire sort family (fixed-block sort, full-tile sort, partial / predicated sort, multi-format sort, value + index) collapses into one **reconfigurable 256-lane shuffle+CAS primitive** driven by a microcode layer-schedule. The key insight is **"log-depth per merge, log-many merge levels, one beat per layer"**: formally `B = p(p+1)/2 × ⌈N/256⌉` beats for an `N = 2^p` sort, giving `8 × 9 / 2 = 36` beats for `N = 256` (the fit-in-one-datapath regime) and scaling linearly in `⌈N/256⌉` for larger `N`. Hardware cost is ~130 K gate, approximately breakeven with v1's hard-wired networks while supporting every power-of-2 sort size from 32 to 8192.
+
+#### 7.5.4 Summary table for all §7.5 novel ops
+
+| New ISA op | Reuses | New hardware | Cycle budget | What v1 couldn't do |
+|------------|--------|--------------|--------------|---------------------|
+| `TINV` (matrix inverse, up to 128×128 FP32 / 64×64 FP8, §7.5.1) | Unified ALU + Acc feedback (§5.5); `tilelet_xpose` (§3.3b); per-lane mask gate (§5.8); ACCUM / INIT / READOUT (§5.5) | Only `RECIP` / `RSQRT` alu_ops (~50 K gate); ~1 K bits extra microcode ROM | ~2 K beats (32×32 FP32) … ~33 K beats (128×128 FP32) | v1 has no Acc feedback, no FMA-with-Acc, no `tilelet_xpose` — tile-resident Gauss–Jordan needs DRAM traffic |
+| `TROWRANGE_MUL` (row-range product, §7.5.2) | ACCUM machinery; per-lane mask gate; `MUL` as alu_op | ~100-gate predicate synthesiser | ~10 beats | v1 has no per-element mask and no `MUL`-accumulate beat — caller emulates via log/exp triple-pass |
+| `TMRGSORT` (full-tile mergesort / bitonic sort, §7.5.3) | Stage (B) input MUX; per-lane mask gate; staging write-back; dual retire; microcode ROM | 256-lane shuffle+CAS primitive (~130 K gate, breakeven with v1's hard-wired networks) | `p(p+1)/2 × ⌈N/256⌉` beats: 36 (N=256), 220 (N=1024), 1 248 (N=4096 FP8), 2 912 (N=8192 FP4) | v1's hard-wired networks support only 32-block sorts plus a fixed merge hierarchy; v2 supports arbitrary `N = 2^p` including partial / value+index / narrow-format variants |
+
+All three instructions fall out of the same architectural properties that motivated v2 (unified ALU, Acc feedback, microcode per beat, per-element masking, dual retire), confirming the generality of the design.
+
+---
+
+### 7.6 Native 3-source ternary FMA — `TFMA D = A·B + C` (v2.1 increment)
+
+> **Motivation.** v0.16 of this document only supported ternary FMA via the **accumulator feedback path** — `TFMA_ACC D = A·B + Acc` (§5.8) — where the addend was implicitly the previous `Acc[*]` content. This works for kernels where the third operand is a running partial (GEMM epilogue, attention softmax staging, FMA-accumulate reductions) but **fails** for the **classic FMA pattern** `D = A·B + C` where all three operands are independent tile registers. The canonical real-world cases are:
+>
+> | Kernel | FMA form | Operands |
+> |--------|----------|----------|
+> | **LayerNorm / RMSNorm final affine** | `y_i = γ_i · x̂_i + β_i` | `γ`, `x̂`, `β` are **three independent tiles** (`γ`, `β` are learned parameters; `x̂` is the normalised activation). This is the **most frequent FMA in modern transformers** — every LayerNorm in BERT / GPT / Llama / Qwen executes one per token. |
+> | **Welford incremental update (mean)** | `μ_new = δ · inv_n + μ_old` | `δ = x_i - μ_old`, `inv_n` precomputed scalar broadcast, `μ_old` running tile. |
+> | **Welford incremental update (M2)** | `M2_new = δ · δ_2 + M2_old` | `δ_2 = x_i - μ_new`. Avoids catastrophic cancellation on small variance terms — single-rounding FMA preserves precision. |
+> | **Welford state merge** | `μ = δ · factor + μ_A`; `M2 = M2_A + δ · (δ · factor_m2) + M2_B` | Cross-thread / cross-block merges in distributed normalisation. |
+> | **Activation polynomials** | `gelu(x) ≈ 0.5·x·(1+tanh(√(2/π)·(x+0.044715·x³)))` | Multiple FMAs over different tile inputs. |
+> | **Trigonometric polynomials** | `sin(x) ≈ x·(c₁ + x²·(c₃ + x²·c₅))` | Horner-form FMAs. |
+> | **Two-tile blend (`TLERP`)** | `D = A·(1−t) + B·t = (A−B)·(1−t) + B` | One FMA after one `TSUB`. |
+>
+> Without a native 3-source `TFMA`, every one of the above kernels must execute **two instructions** (`TMUL D' = A·B` followed by `TADD D = D' + C`), at **half the throughput** and with **two roundings** instead of one. For LayerNorm this halves the achievable FP32 throughput on the affine step (the bottleneck of the entire op when operands are tile-resident). The precision penalty is most acute for the narrow formats (FP16 / BF16 / FP8): a two-rounding `mul + add` accumulates up to ~1 ulp of error per pair, which matters for normalisation kernels that re-feed the result into subsequent reductions.
+
+#### 7.6.1 ISA semantics
+
+```
+  TFMA   Td0,                Ta, Tb, Tc       # Td0 = Ta · Tb + Tc          (3-source FMA)
+  TFNMA  Td0,                Ta, Tb, Tc       # Td0 = -(Ta · Tb) + Tc       (negated 3-source FMA)
+  TLERP  Td0, Td1 [optional], Ta, Tb, Tc      # Td0 = Ta · (1 − Tc) + Tb · Tc  (linear interpolation)
+```
+
+All three instructions use the same operand binding:
+
+- **A, B**: value tiles, fetched on R0 / R4 → `SA` / `SB`
+- **C**: value tile (third addend), `c_role = VALUE` (§3.3c.2), fetched on **R1** → `SC` (the staging register's value-mode read port; not the mask port)
+- **D0**: result tile, retired on `W0` (`TLERP` may also retire to `W4` if a derivative tile is requested by `D1`)
+
+ISA encoding: same 32-bit R-type word as elementwise binary ops (§2.2.3 of [`Davinci_superscalar_v2.md`](Davinci_superscalar_v2.md)) plus a **`c_role = VALUE`** bit in `funct6`. v1 software cannot generate `c_role = VALUE` (the bit defaults to `MASK`), so v1 binaries see only the existing 1- and 2-source ops — full backward compatibility.
+
+#### 7.6.2 Microcode example — full-tile FP32 `TFMA D = A·B + C`, `R = 8, C = 128` (`W = 512 B`)
+
+The simplest possible microcode: 8 beats (one per strip) of `alu_op = FMA` with `src_Z = SC[s]` instead of the 2-source binary path's `src_Z = IMM_ZERO`.
+
+```
+# Prologue — 3-port parallel fetch in 1 epoch (8 cy, §6.2 row N_val=3 c_role=VALUE)
+#   R0 ← A   strip s = 0..7
+#   R4 ← B   strip s = 0..7
+#   R1 ← C   strip s = 0..7
+# All three strips of index s land in SA[s], SB[s], SC[s] in lockstep.
+
+# Compute — 8 beats, 1 per strip (compute starts as soon as SA.valid[0], SB.valid[0], SC.valid[0])
+for s in 0..7:
+    beat: src_A = SA[s, xp = is_xpose_A],
+          src_B = SB[s, xp = is_xpose_B],
+          src_Z = SC[s, xp = is_xpose_C]            # value-mode read, §3.3c.2
+          alu_op = FMA, mask_src = IMM_ALL_ONES,
+          acc_op = NONE,
+          wr_en_D0 = 1, wr_strip_D0 = s             # retire D0[s] = SA[s] · SB[s] + SC[s]
+```
+
+End-to-end: the 8 compute beats overlap the 8 cy fetch (compute starts as soon as strip 0 is valid in `SA`, `SB`, and `SC`), so with 1 cy of first-beat compute fall-through the total is **~10–12 cy**, not the 17 cy serial sum, and matches a binary `TADD` or `TMUL`. In steady state the 8-beat compute phase fully overlaps the next instruction's fetch, so throughput is **1 tile per 8 cy**.
+
+#### 7.6.3 LayerNorm worked example — `y = γ·x̂ + β`
+
+Assume `R = 8, C = 128` FP32 (one row per strip). After the upstream Welford reduction has produced `x̂` in tile `T_x_hat`, the affine step is one `TFMA`:
+
+```
+  TFMA   T_y,   T_gamma, T_x_hat, T_beta     # T_y = T_gamma · T_x_hat + T_beta
+```
+
+**Cycles**:
+
+| Step | v1 baseline (no native TFMA) | v2.0 (Acc-feedback only) | **v2.1 (native 3-source TFMA)** |
+|------|------------------------------|---------------------------|------------------------------------|
+| Compute `T_temp = γ · x̂` | `TMUL T_temp, T_gamma, T_x_hat` — ~10 cy end-to-end, 8 cy fetch | same as v1 (Acc feedback can't hold an arbitrary 3rd tile) — ~10 cy | folded into the FMA — **0 separate cy** |
+| Compute `T_y = T_temp + β` | `TADD T_y, T_temp, T_beta` — ~10 cy (8 cy fetch incl. mask-rule R2) | same — ~10 cy | folded — **0 separate cy** |
+| **Single fused TFMA** | n/a | n/a | `TFMA T_y, T_gamma, T_x_hat, T_beta` — **~10–12 cy** |
+| **Total** | **~20 cy** | **~20 cy** | **~10–12 cy** |
+| **Speed-up vs. v1** | 1× | 1× | **~2×** |
+| **Roundings** | 2 (one each for `TMUL` and `TADD`) | 2 | **1** (single rounding, IEEE-754 FMA semantics) |
+
+The same ~2× speed-up applies to every Welford update beat in the LayerNorm reduction, to RMSNorm variants that follow `y = γ · x̂` (a degenerate 2-source op) with a bias add `y += β`, and to any kernel where the third operand is **not** the running accumulator.
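+
+The "Roundings" row can be made concrete with a small FP32 case. The operand values below are chosen for illustration only (they do not come from a measured kernel) and assume round-to-nearest-even with a 24-bit significand; $\operatorname{rnd}$ denotes rounding to FP32.
+
+$$
+a = 1 + 2^{-13}, \qquad b = 1 - 2^{-13}, \qquad c = -1, \qquad a \cdot b = 1 - 2^{-26}.
+$$
+
+$$
+\text{TMUL then TADD:}\quad \operatorname{rnd}\big(\operatorname{rnd}(a \cdot b) + c\big) = \operatorname{rnd}(1 + c) = 0, \qquad \text{TFMA:}\quad \operatorname{rnd}(a \cdot b + c) = -2^{-26}.
+$$
+
+The product `1 − 2^-26` lies between the FP32 neighbours `1 − 2^-24` and `1` and rounds to `1`, so the two-instruction sequence loses the entire result, while the fused form returns it exactly. This is the kind of cancellation that the Welford `M2` row in the motivation table refers to.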
+
+#### 7.6.4 Hardware cost vs. baseline v0.16
+
+| Block | v0.16 (`TFMA_ACC` only) | **v2.1 native `TFMA`** | Δ |
+|-------|--------------------------|------------------------|---|
+| Stage (B) per-lane FMA core | 128 × FP32 FMA (~10 K/lane) | **same** — already supports `A·B + Z` | 0 |
+| `MUX_Z` per-lane input MUX | 6:1 sources `{ACC_LO, ACC_HI, SX, SY, IMM_ZERO, *unused*}` (§5.8) | **6:1 sources `{ACC_LO, ACC_HI, SX, SY, IMM_ZERO, SC_value}`** — one MUX input retargeted | ~0 (one source bit changes; same gate count) |
+| `SC` staging payload | 4 KB diagonal-skew array, 128 × 1 b mask read port only | **same 4 KB array** + **value-mode 512 B/cy read port** (existing 64 × 64 B sub-bank tree, output rotator, bank-select MUX from §4.2.1 reused) | ~5 K gate — small bank-MUX widening to drive the 512 B value bus alongside the 1 b predicate bus |
+| TRegFile read port (VEC binding) | 2 active (R0, R4) | **3 active (R0, R4, R1)** — TRegFile-4K-v2 already has 8R | 0 (allocation only) |
+| Microcode beat field | `src_Z` already in beat word (§5.4) for ternary-via-Acc | **same**: `src_Z = SC` value-mode now valid | 0 |
+| Issue-time decode | `has_mask` 1 b | **`has_mask` 1 b + `c_role` 1 b** — one extra bit in the issue packet | ~1 K gate (control-path widening through Tile RAT, RS, dispatch) |
+| **Total v2.1 increment vs. v0.16** | — | — | **~6 K gate** (~0.2 % of VEC-4K-v2 area) |
+
+**Summary**: native 3-source `TFMA` is **essentially free** in silicon — the FMA core, staging payload, microcode beat machinery, and 8-port TRegFile are all already in place. The only structural changes are:
+
+1. binding R1 (a previously idle TRegFile read port) to VEC,
+2. adding a 512 B value-mode read path on `SC` (in addition to the existing 1-bit mask path),
+3. adding the `c_role` bit to the issue packet and steering `MUX_Z`.
+
+The throughput gain is **2×** on the LayerNorm / RMSNorm affine step (the dominant cost in transformer normalisation), and the precision gain is **half the rounding error** for narrow-format kernels.
+
+#### 7.6.5 Interaction with masking, transpose, and dual retire
+
+- **Masking + 3-source FMA**: `c_role = VALUE` precludes simultaneously using `C` as a mask. For the rare case of a masked 3-source FMA (`if mask: D = A·B + C else: D = D_old`), the kernel splits into two instructions: unconditional `TFMA D' = A·B + C`, then `TSELECT D = mask ? D' : D_old`. The cost is one extra ~10 cy `TSELECT` issue.
+- **Transpose**: `is_xpose_{A,B,C}` independently selectable per operand. Uniform transpose → single 8 cy fetch epoch (§6.2 row `N_val=3, c_role=VALUE, uniform`). Mixed transpose → 16 cy or 24 cy depending on which subset of the three operands shares a mode.
+- **Dual retire**: a 3-source op may legitimately want to emit two results — `TLERP` could emit both `D0 = (1-t)·A + t·B` and `D1 = ∂D0/∂t = B - A`. The standard `D0` / `D1` pair (§3.1) handles this with `retire_mask = 2'b11` and a 2-beat retire schedule.
+- **Per-beat `tilelet_xpose`**: works on `SC` value-mode reads exactly as on `SA` / `SB` (§3.3b) — microcode can flip `xp_C` per beat to e.g. transpose a `β` tile mid-instruction. Useful if the stored layout of `β` differs from `γ` and `x̂`.
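+
+For the `TLERP` dual-retire case, the two outputs are related by a one-line identity. The sketch below uses `t` for the `Tc` operand, as in the bullet above; reading `D1` as the intermediate difference tile is an observation about the algebra, not a statement about how the microcode is scheduled.
+
+$$
+D_0 = (1 - t) \cdot A + t \cdot B = (B - A) \cdot t + A, \qquad D_1 = \frac{\partial D_0}{\partial t} = B - A.
+$$
+
+So `D0` can be formed as one subtract followed by one 3-source FMA with addend `A`, and `D1` is exactly the subtract result.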
+
+#### 7.6.6 Summary table
+
+| Aspect | Native 3-source `TFMA` (§7.6), v2.1 increment |
+|--------|--------------------------------------------|
+| ISA form | `TFMA Td0, Ta, Tb, Tc` (also `TFNMA`, `TLERP`) |
+| Operand binding | A → R0/SA, B → R4/SB, **C → R1/SC value-mode (new)** |
+| Encoding cost | 1 bit `c_role ∈ {MASK, VALUE}` in `funct6` |
+| Best-case fetch | **8 cy** (3-port parallel, uniform `is_transpose`) |
+| Worst-case fetch | 16 cy (one `is_transpose` mismatch) / 24 cy (all three different — degenerate) |
+| Compute beats | **8** (one per strip, full-tile FP32 `R=8,C=128`) |
+| End-to-end (typical) | **~10–12 cy** — same as binary `TADD` |
+| Throughput gain vs. emulated `TMUL + TADD` | **~2×** |
+| Precision gain (narrow formats) | **2 rounding errors → 1** (IEEE-754 single-rounding FMA) |
+| Hardware delta vs. v0.16 | **~6 K gate** (~0.2 % of VEC-4K-v2) |
+| Real workloads enabled | LayerNorm / RMSNorm affine; Welford updates; activation polynomials (gelu, swiglu, …); trig polynomials (sin, cos, …); `TLERP` blend |
+
+---
+
+## 8. Cross-Lane / Cross-Strip Summary
+
+Unchanged from [`vector4k.md`](vector4k.md) §6, with the following additions:
+
+| Category | Cross-lane (within 512 B strip) | Cross-strip (among 8 strips) | New in v2 |
+|----------|-----------------------------------|-------------------------------|-----------|
+| Masked elementwise (`TADD_M`, `TMUL_M`, `TSELECT`) | Independent lanes; 2 value slices + 1-bit predicate per lane from `SC` | None | Per-element bitmask `C` (§3.3c); default `IMM_ALL_ONES` for unmasked |
+| `TFMA_ACC` (ternary via accumulator) | Independent lanes; `src_Z = ACC_READ_*` feeds the addend | None | 3-value-input ALU with Acc feedback (§5.8) |
+| `TCOL*` with `is_transpose = 1` on the operand | TRegFile §7 col-mode delivers chunk-grid transpose; stage (A) fixup reaches element-level `C × R` tile; `TROW*` template runs as-is | Strip replay collapses to the transposed axis | VEC-v2 forwards `is_transpose` to TRegFile-4K (§3.3); no new VEC hardware |
+| `TGATHER` with index tile | Index SIMD from `SA` (or `SB`), data mux from the other value tile; optional enable mask from `C` | Arbitrary strip sources | Indices live in a tile register; optional mask |
+| `TSORT32` with dual output | Sort network per 32-block | Block spanning strip boundary | `D1` carries indices |
+| **`TMRGSORT`** (§7.5.3) | 256-lane log-stride shuffle butterfly + 128 parallel value-compare-swap units; one beat = one layer of a Batcher bitonic network (128 CAS per beat) | Cross-strip merges at layers `k ≥ 8` pair `src_A` from one strip with `src_B` from a non-adjacent strip — the staging read + `tilelet_xpose` picks the right strips each beat | Single reconfigurable sort primitive replaces v1's family of hard-wired per-block-size networks; value+index retired together via `D0`/`D1`; `mask_src = SC_mask` converts the op into a free partial-sort variant |
+| **`TINV`** (§7.5.1) | Pivot row + `r_k = 1/A[k,k]` live in Acc; per-strip `FMA / FNMA` elimination runs lane-parallel; `tilelet_xpose` collapses the column-k gather to 1 strip / 128 lanes | Multi-tile for `N > N_max(format)` (e.g. 16 tiles for 128×128 FP32); operand fetch extends the prologue but the compute loop is tile-serial | First PTO op that reuses the **full** Acc-feedback + FMA-accumulate + `RECIP` + per-lane predicate combo inside one instruction; no new ALU core, just the reciprocal-seed block |
+| **`TROWRANGE_MUL`** (§7.5.2) | 128-lane `MUL`-accumulate with per-lane row-range predicate | Skip strips entirely outside `[r_start, r_end)`; 1 beat per active strip, max 8 strips | Row-range predicate is combinationally synthesised from `SX` / `SY` scalars — no `SC` payload required for the range, `SC` remains available for an optional user mask |
+| **`TFMA` / `TFNMA` / `TLERP`** (native 3-source ternary, §7.6), **v2.1 increment** | 128-lane FMA (existing FP32 FMA core; `MUX_Z` retargeted to `SC` value-mode read instead of `IMM_ZERO`) | None: pure elementwise, all three operands read in lockstep on R0/R4/R1 in one 8 cy epoch | New `c_role = VALUE` selector promotes `SC` to a 3rd value tile; +1 VEC-side TRegFile read port (R1, allocation only); ~2× throughput on LayerNorm `γ·x̂ + β`, Welford updates, activation polynomials |
+
+---
+
+## 9. Implementation Notes
+
+1. **Opcode decode** produces `SOP.(opcode, format, R, C, has_mask, is_xpose_{A,B,C}, ucode_{base,len}, retire_mask, #W / K_outer / f)`. The microcode ROM (§5.4) expands one beat word per compute cycle, driving the staging-register read ports (`{src, s, xp}` per slot + `mask_src`), stage (A) / (B), Acc (`acc_op`), and Pack / retire enables directly — **there is no crossbar or calendar-template resolution in between**. The R2-aware epoch packing policy for `N_val ≥ 2` with mixed tregfile-side `is_xpose` is decided at operand-fetch entry (§6.3) and is orthogonal to the microcode program. When `has_mask = 0`, the mask fetch is elided and the mask MUX is statically tied to `IMM_ALL_ONES`.
+2. **Staging register file area budget** (reference): `SA` + `SB` + `SC` = 3 × 4 KB = **12 KB** of payload storage. The **recommended production baseline** implements the payload as 1R1W SRAM (24 × 4 Kb × 1R1W macros, 8 per staging register, one per row-group — see §9.1 for the functional-equivalence analysis, area comparison, and migration caveats). An **alternative flip-flop implementation** is available for early-architecture / FPGA-prototype variants, at roughly 2× the staging-payload gate count but with combinational read access. `SC` is oversized for its typical payload (the packed mask is ≤ 1 KB), but keeping it the same physical size as `SA` / `SB` avoids a second diagonal-skew variant and leaves headroom for a future interpretation of `C` as a full-tile value operand if needed. Double-buffering (optional, §9.2) doubles staging macro / FF count. Metadata registers, the strip-valid bitmap, scalar staging (`SX`, `SY`), and `SOP` are negligible (a few hundred bits) and stay flip-flop regardless of the payload choice.
+3. **Staging transpose hardware.** Each staging register is a local instantiation of the [`tregfile4k.md`](tregfile4k.md) §7 bank-conflict-free diagonal skew: 64 × 64 B flip-flop sub-banks + 1 write-side 64 B rotator + 1 read-side bank-select mux + 1 read-side 64 B output rotator. Per-beat `tilelet_xpose` steers the read-side mux / rotator. All components are 64 B granularity and negligible vs. the 4 KB flip-flop store. The [`tregfile4k.md`](tregfile4k.md) §7 read port itself is reused unchanged for the operand-fetch-time `is_xpose_*` bits (§3.3a).
+4. **Metadata write discipline** (§2.2): a tile register’s metadata is updated **only** by the instruction that produces the tile (implicit `TSETMETA` at retire time). An operand-fetch path that observes inconsistent `shape.x · shape.y · E ≠ 4096` at the decoded `E` is a microcode-reportable error.
+5. **Determinism**: PTO ops still retire atomically; the variable fetch phase is internal scheduling and does not expose out-of-order effects at the ISA boundary.
+6. **Resource conflicts**: with only two read ports, mixed-`is_transpose` 2-value-operand instructions pay one extra epoch under rule R2 (§6.3). The mask fetch (`C`, §3.3c) is small enough to ride on an idle port cycle within a value epoch and never needs its own epoch in the common case, so adding a mask is free in the prologue. A narrow-issue core that avoids mixed-mode within a single instruction (either by chaining two kernels or by pre-transposing inputs as a producer-side decision) sees full overlap.
+7. **TRegFile-4K binding**: VEC-4K-v2 can use either row-mode or col-mode reads on R0 / R4 (per-operand `is_xpose_*`, §3.3a). An existing [`tregfile4k.md`](tregfile4k.md) instance plugs in unchanged. In most workloads the staging-side per-beat `tilelet_xpose` (§3.3b) covers all transpose needs, so the tregfile-side `is_xpose_*` bits default to 0 and [`tregfile4k.md`](tregfile4k.md) §6 rule R2 is trivially satisfied; when a tile is re-read many times in the transposed view, microcode can set tregfile-side `is_xpose_*` to cache the transposed form and save a chunk-grid transpose per reuse (§6.3).
+8. **Microcode ROM area**: the beat-word ROM is ~64 b × `max_ucode_len` × `N_programs`. Even at `max_ucode_len = 64` beats and `N_programs = 256`, the ROM is ~128 KB — a few percent of a 4 MB L1 data array; in practice it is SRAM or ROM-style logic close to `SOP`. Microcode can be regenerated off-chip per ISA version (no RTL change).
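+
+The ROM figure in item 8 follows from the stated parameters; the worked arithmetic below uses only the numbers in that item.
+
+$$
+64\ \text{b} \times 64\ \text{beats} \times 256\ \text{programs} = 2^{20}\ \text{b} = 128\ \text{KB}, \qquad \frac{128\ \text{KB}}{4\ \text{MB}} = \frac{1}{32} \approx 3.1\,\%.
+$$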
+
+### 9.1 Staging store: SRAM baseline vs. flip-flop alternative
+
+The **production baseline** for `SA` / `SB` / `SC` is **1R1W SRAM** (24 macros × 4 Kb × 1R1W, 8 macros per staging register, one per row-group). An earlier draft used 12 KB of flip-flop storage (~100 K FF, ~1.0 M gate) for all three staging registers — that FF version is still a valid **alternative** for early-architecture / FPGA-prototype builds where combinational read access or trivial scan-chain coverage matters more than area, but at roughly 2× the payload-store gate count. This subsection works through the functional requirements that any staging-payload implementation must satisfy, demonstrates that both options meet them exactly, and documents the migration caveats that led to the SRAM choice as the production baseline.
+
+#### 9.1.1 Functional requirements that any staging-store implementation must preserve
+
+Pulling requirements from §3.5, §4.2, §4.2.1, §4.2.2, §5.4, §5.6, §6:
+
+- **R1. 64 B-granularity sub-bank independence (col-mode read).** Per-beat `tilelet_xpose = 1` reads 8 × 64 B sub-chunks, one from each of the 8 row-groups, at **independently-chosen entries** (the diagonal skew `bank_id = 8·g + ((l + g) mod 8)`). Equivalently: the 8 row-groups must each be addressable with a distinct read index in the same cycle. Row-mode (`tilelet_xpose = 0`) drives the same 8 row-groups with a common entry.
+- **R2. Three concurrent full-width reads per cycle.** `SA`, `SB`, `SC` all emit 512 B in the same beat (§5.4 `src_A` / `src_B` / `src_Z` / `mask_src` can each name a different staging register). Each staging register is a physically separate sub-system, so its port budget is independent.
+- **R3. One strip write per cycle, overlapping read during double-buffering / fetch-overlap (§6, §9.2).** During operand fetch of instruction N+1 one 512 B strip arrives per TRegFile read-port cycle (8 × 64 B sub-chunks after the write rotator) and is deposited at 8 entries in the 8 row-groups of one staging register. In the same cycle the *compute* phase of instruction N may still be reading the same staging register. Worst case: **1 strip-write and 1 strip-read per staging register per cycle**, at possibly-different entries in each row-group.
+- **R4. Arbitrary strip arrival order, tracked by strip-valid bitmap.** Fetch scheduling (§6.3) can deliver strips out of tile-linear order; the strip-valid bitmap determines when the instruction is ready to enter compute. The payload store itself just needs random-entry write access.
+- **R5. Lane-predicate unpack port on `SC` (§4.2.2).** `SC` additionally emits a 128 × 1 b predicate bus each compute beat. Since `SC` is a mask-only buffer (§4.2), the mask-strip read and the payload read for `SC` are the **same read** downstream; no extra read port on the storage is required.
+- **R6. Metadata and strip-valid flags (§2.2, §4.2) can live outside the payload store.** They are small (tens of bits per staging register), routinely updated per instruction, and best kept in flip-flops regardless of payload implementation.
+
+The diagonal skew itself is an *addressing* property, not a storage-cell property — it stays identical under any payload implementation.
+
+#### 9.1.2 Mapping the requirements to SRAM macros
+
+Pick the macro granularity that exactly satisfies R1:
+
+- **One 1R1W SRAM per row-group**, 8 entries × 512 b = **4 Kb × 1R1W** each.
+- **8 row-groups per staging register** × **3 staging registers** = **24 such macros** in total.
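+
+As a quick capacity check (a worked sketch, assuming the 4 KB-per-staging-register size quoted in §10), this macro choice holds exactly one staging set:
+
+$$
+\begin{aligned}
+\text{per macro} &= 8 \times 512\ \text{b} = 4096\ \text{b} = 4\ \text{Kb} \\
+\text{per staging register} &= 8 \times 4\ \text{Kb} = 32\ \text{Kb} = 4\ \text{KB} \\
+\text{all of } SA, SB, SC &= 24 \times 4\ \text{Kb} = 96\ \text{Kb} = 12\ \text{KB}
+\end{aligned}
+$$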
+
+Check each requirement against this choice:
+
+| Req | Satisfied by 4 Kb × 1R1W, 8 macros per staging |
+|-----|------------------------------------------------|
+| R1  | ✅ Each row-group is its own macro → 8 independent read addresses per cycle (col-mode), or one shared address (row-mode). |
+| R2  | ✅ `SA` / `SB` / `SC` are physically separate banks, so 3-way concurrent reads cost nothing extra. |
+| R3  | ✅ 1R1W port provides same-cycle read + write to different entries. Strip write fans out 8 × 64 B across the 8 macros; compute read takes 1 entry per macro; the two ports are independent. |
+| R4  | ✅ Random entry addressability; the strip-valid bitmap (FF, external) gates readiness. |
+| R5  | ✅ `SC` needs only the same 1R1W as `SA` / `SB`; the mask unpack is a downstream transformation of the 64 B sub-chunk read. |
+| R6  | ✅ Metadata and strip-valid FF arrays are untouched. |
+
+All functional properties of the v2 flip-flop staging — per-beat `tilelet_xpose`, three-way concurrent read, strip-valid tracking, mask-port compatibility, out-of-order strip write — are preserved. **SRAM is a functional equivalent at this macro granularity.**
+
+#### 9.1.3 The SRAM-read latency pipeline stage
+
+One subtle behavioural difference: a flip-flop read is combinational from address to data (one MUX tree, ~0.2–0.4 ns at a modern node), whereas a 1R1W SRAM macro typically latches its output at the end of the access cycle (sense-amp / output-register combined path, ~0.3–0.5 ns but usually *ending* in a register).
+
+Concretely this inserts one pipeline stage between "beat word latched in `SOP.ucode_*`" and "staging read data available at stage (A) input". Two coping strategies, both functionally transparent to microcode:
+
+- **Option (a) — issue beat word one cycle earlier**: the microcode counter pre-increments, so cycle *t* issues address for cycle *t+1*'s compute. Microcode pseudo-code (§5.7) is unaffected; it's a pure scheduler shift.
+- **Option (b) — add one pipeline register after the SRAM output**, before stage (A). Adds 1 cycle to the end-to-end fetch-to-retire latency but does not change steady-state throughput. This is the approach most SRAM-based designs use.
+
+Option (b) costs 1 cycle of first-beat latency — negligible given the fetch-phase dominant cost (~8–16 cy before compute starts anyway, §6.2). **Throughput is unchanged in both options.**
+
+#### 9.1.4 Macro aspect-ratio and foundry compiler caveats
+
+`4 Kb × 1R1W, 512 b × 8 depth` is a **wide-and-shallow** macro shape: 512 b per word but only 8 words deep. Foundry SRAM compilers typically support dual-port macros only above some minimum depth (often 16 or 32) and minimum aspect ratio. If the target PDK does not expose `512 b × 8 × 1R1W`, there are three practical fall-backs:
+
+- **(i) Over-depth macro — `512 b × 16 × 1R1W`**, half empty. Uses ~2× the nominal bit count but still beats FF on absolute area because the dual-port bit cell (~8–10 T) is still much denser than a FF cell (~20–24 T cell + MUX peripherals). **Common choice** when PDK minimum depth is 16.
+- **(ii) Fold depth across two staging registers into one macro — `512 b × 16 × 1R1W` shared by e.g. `SA[row-group g]` and `SB[row-group g]`**. Breaks R2 (three-way concurrent read); **not recommended**.
+- **(iii) Split width — `256 b × 8 × 1R1W` × 2 macros per row-group** = 16 macros per staging, 48 total. Doubles macro count (more peripheral overhead) but keeps any maximum-word-width constraint manageable. Usable but increases routing nets between macros and stage (A).
+
+At most target nodes a combination of (i) for `SA` / `SB` and a smaller / narrower variant for `SC` (whose 16 B / mask-beat bandwidth is actually much lower than 64 B) is workable.
+
+#### 9.1.5 Area comparison
+
+Rough gate-equivalent numbers at a modern FinFET node (cell areas vary ±30 % by node and compiler; treat as a ruler, not a synthesis number).
+
+| Storage choice | Cells / bit | Peripheral overhead | Per-macro area | Count | Total area |
+|---------------|-------------|---------------------|----------------|-------|------------|
+| **v2 baseline — flip-flop store** | ~8–10 NAND2 per bit (FF + local MUX) | — (no macro peripherals; just read-MUX tree) | n/a | — | **~1.0 M gate** for 96 K FF |
+| **SRAM-1R1W, 512 b × 8 nominal (ideal PDK)** | ~1.8 gate per bit (8T dual-port cell) | ~50–100 % of array area on a 4 Kb macro | ~12–18 K gate | 24 | **~0.3–0.4 M gate** |
+| **SRAM-1R1W, 512 b × 16 half-empty (realistic PDK)** | same cell | same overhead, doubled array | ~20–25 K gate | 24 | **~0.5–0.6 M gate** |
+| **SRAM-1R1W, 256 b × 8, 2-per-row-group (alt shape)** | same cell | peripheral fraction larger at half-width | ~9–12 K gate | 48 | **~0.4–0.5 M gate** |
+
+**Net savings**: even in the realistic half-empty-macro case, moving the staging payload from FF to 1R1W SRAM saves **roughly 0.4–0.6 M gate** per staging set, i.e. **~40–60 % of the staging area**, or **~10–15 % of the total VEC-4K-v2 gate count** (measured against the ~4.03 M gate FF-staged total of §10.1).
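+
+The staging-only part of this figure can be reproduced from the table above. The following is a rough worked sketch, assuming ~10 gate per FF bit (the upper end of the table's range) and the `512 b × 16` per-macro range; it is not a synthesis result:
+
+$$
+\begin{aligned}
+A_{\text{FF}} &\approx 12\ \text{KB} \times 8\ \tfrac{\text{b}}{\text{B}} \times 10\ \tfrac{\text{gate}}{\text{b}} = 98\,304 \times 10 \approx 0.98\ \text{M gate} \\
+A_{\text{SRAM}} &\approx 24 \times (20\ldots25)\ \text{K gate} = 0.48\ldots0.60\ \text{M gate} \\
+A_{\text{FF}} - A_{\text{SRAM}} &\approx 0.38\ldots0.50\ \text{M gate} \approx 40\ldots50\,\%\ \text{of } A_{\text{FF}}
+\end{aligned}
+$$
+
+Under these assumptions the realistic-PDK case lands in the lower half of the range quoted above.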
+
+Under the double-buffered option (§9.2), savings roughly double in absolute terms because the FF staging would be 24 KB (~2.0 M gate) while the SRAM version stays close to ~0.8–1.0 M gate.
+
+Feeding this back into the §10.1 totals:
+
+| Total (order-of-magnitude) | FF staging (v2 baseline) | SRAM staging (proposed) |
+|----------------------------|--------------------------|--------------------------|
+| Single-buffered | ~4.03 M gate | **~3.53 M gate** |
+| Double-buffered | ~5.03 M gate | **~4.03 M gate** |
+
+Compared to v1 (~4.82 M gate), SRAM-staged v2 is **~27 % smaller single-buffered** and still ~16 % smaller double-buffered, whereas FF-staged v2 grows past v1 (~5.03 M gate) once double-buffered. This makes SRAM migration especially attractive as a prerequisite for enabling the double-buffering option.
+
+#### 9.1.6 Secondary trade-offs and risks
+
+- **Timing closure**: SRAM sense-amp delay may eat more of the cycle budget than the v2 FF MUX tree. Mitigated by Option (b) above (extra pipeline stage after macro output); may still matter at high target frequencies.
+- **Floorplan**: 24 small macros are less dense and less routable than the 24 equivalent flip-flop islands (3 staging × 8 row-groups of FF). Macros must be co-located with their row-group MUX / rotator to keep the 512 b buses short. In practice the 24 macros can be placed in three 8-macro columns, one per staging register, each column flanked by its write rotator and read bank-select MUX.
+- **Power**: SRAM macros are usually lower dynamic power per bit-access than an equivalent FF array (the FF's CLK tree drives every bit every cycle, whereas SRAM toggles only the accessed word line + bit lines). Expect **significant dynamic-power reduction** on the staging store from the migration, in addition to the area win. Static (leakage) power depends on the specific macro compiler and threshold mix.
+- **Reset / unknown-state semantics**: FF arrays can be cold-reset to a known value; SRAM macros cannot. The strip-valid bitmap (kept in FF) already arbitrates "valid vs. garbage" at cycle 0, so no functional gap — but post-silicon debug, scan-chain coverage, and BIST hooks must be designed in from the start.
+- **Scan / DFT**: flip-flop stores have trivial scan coverage. SRAM macros need MBIST infrastructure (~1–2 % area overhead). Typically acceptable given the area win.
+- **Minimum retention voltage**: FF and dual-port SRAM have different minimum retention voltages. If VEC is expected to operate at deep-Vdd-scaling low-power points, FF may retain to lower voltage than the SRAM macro — but this is unlikely to matter if the whole chip's SRAMs gate this decision anyway.
+- **Yield**: 24 small SRAM macros add ~0.01 % yield loss (negligible at modern DPPM targets).
+
+#### 9.1.7 Summary
+
+- **Production baseline** (recommended and used in §10.1 totals): `SA` / `SB` / `SC` as **one 1R1W SRAM macro per row-group, 8 macros per staging register, 24 macros total**. Use 1 pipeline-stage Option (b) after each macro output. Area: **~500 K gate** for the payload store (vs. ~1.0 M gate for the FF alternative), plus a meaningful dynamic-power reduction because only accessed word lines toggle per cycle.
+- **Early-architecture / FPGA-prototype alternative**: flip-flop payload store. Read path is combinational (no pipeline-stage option needed), scan coverage is trivial, no MBIST, and the FPGA-mapping case automatically becomes BRAM for free. Area: **~1.0 M gate** — double the SRAM baseline, but still preferable in the contexts listed.
+- **Strip-valid bitmap, metadata registers, `SX` / `SY` scalar staging, and `SOP` control register must stay flip-flop** regardless of the SA/SB/SC choice — they're too small to amortise SRAM peripheral overhead and they need combinational read access for the scheduler FSM.
+
+**Functional equivalence is preserved** across both options. Every requirement in §9.1.1 is satisfied. The only microarchitectural change under the SRAM baseline is the optional extra pipeline stage on the staging read path (Option (b) of §9.1.3), which adds one cycle to first-beat fetch-to-retire latency but does not change steady-state throughput or any externally-visible ISA property. Because this latency is absorbed into the already-~10-cycle operand-fetch prologue (§6.2), **no PTO-level kernel cycle count anywhere in this document changes** between the two staging-payload implementations.
+
+---
+
+## 10. Comparison: VEC-4K-v2 vs VEC-4K (v1)
+
+| Axis | VEC-4K (v1) | **VEC-4K-v2** | Net effect |
+|------|-------------|---------------|------------|
+| Tile operands per instruction | 1–2 (all value tiles) | **1–3 value tiles**: `A`, `B`, **plus optional `C` with dual role: `c_role = MASK` (1-bit bitmask) or `c_role = VALUE` (full third value tile, added in v2.1, enables native `TFMA D = A·B + C` family, §7.6)** | Per-element predication / masking on every op (full-tile ops skip `C` fetch entirely, §3.3c, §6.2). **Native 3-source ternary FMA** via `c_role = VALUE` powers LayerNorm `γ·x̂ + β`, Welford updates, activation polynomials, etc., at ~2× throughput vs. emulated `TMUL + TADD` and with single rounding for narrow-format kernels (§7.6). Acc-feedback ternary FMA (`TFMA_ACC D = A·B + Acc`) remains available as the GEMM-epilogue / FMA-accumulate path. |
+| Tile results per instruction | 1 | **1–2** | `TSORT32`, `TROWARGMAX`, `TDIVMOD` emit value+index in a single op |
+| Read ports | 2 × 512 B | **3 × 512 B** (R0 = Port A, R4 = Port B, **R1 = Port C**, added in v2.1; §3.1). TRegFile-4K already has 8 read ports, so the change is a binding allocation with no SRAM / bank changes | Enables 1-epoch parallel fetch of three value tiles for native `TFMA` (§7.6); R1 is idle and energy-gated when no 3-source op is in flight (purely elementwise / cube-only / reduce-only kernels see v1-equivalent 2-port utilisation). |
+| Write ports | 2 × 512 B | 2 × 512 B | unchanged |
+| Transpose mechanism | — (none at VEC level; `TTRANS` materializes a row-mode copy) | **Two orthogonal chunk-grid transpose points: (i) per-operand `is_xpose_*` at TRegFile read ([`tregfile4k.md`](tregfile4k.md) §7); (ii) per-beat `tilelet_xpose` at staging read (§3.3b, §4.2.1).** The staging-side is the primary path and costs zero scheduling; the tregfile-side is optional and subject to [`tregfile4k.md`](tregfile4k.md) §6 rule R2 | `TCOL*` / `TROW*` become a single per-beat microcode pattern (§5.6); no `TTRANS` predecessor; element-level fixup for non-aligned `W` reuses the stage (A) permute block |
+| Compute-datapath front-end | crossbar + calendar-template decoder | **no crossbar** — 3 direct 512 B operand buses from `SA` / `SB` / `SC` to stage (A); per-beat microcode (§5.4) selects `{src, s, xp}` per slot | Simpler hardware; per-beat flexibility; reduction workloads expressed as straight-line microcode with accumulator RMW (§5.6) |
+| Accumulator ALU | **Separate** RMW adder block between stage (B) and the Acc register file | **Merged into stage (B) ALU** via per-operand-slot input MUX + Acc LO/HI sub-bank dual-read feedback path (§5.5, §5.8); no dedicated Acc adder exists | Same (or smaller) area; `ACCUM` / `MERGE_STAGE` automatically inherit every ALU op (ADD / MAX / MIN / OR / FMA, …) — e.g. cross-slot `MAX`-reduction for argmax, `FMA`-accumulate for GEMM epilogue, all without adding hardware |
+| Compute-group count `N_group` | 128, format-independent | **128, format-independent** (§5.8.1) | Unchanged; narrow formats scale by widening each group's SIMD (`W_ALU,i`), not by multiplying groups |
+| Intra-group tree (format widener) | present, depth 0 (FP32) … 3 (FP4); narrows `W_ALU,i` → `W_tree,i = 32 b` | **present, same role** (§5.8.1) — now explicitly documented as the format-narrow → FP32 widen step, not a general reduction | Required for a 32-b FP32-shaped Acc slot to serve every format uniformly; cannot be removed without fattening Acc or serialising narrow formats |
+| Cross-group reducer network (`N_tree` trees, `#waves_tree`) | always present; used by `TROW*`/`TCOL*` for one-beat 128 → k folds | **Removed in the recommended v2 baseline** (§5.8.1). All `TROW*` instructions are routed through `tilelet_xpose + ACCUM + MERGE_STAGE` (§5.6.2): +44% beats for wide `TROW*` (§5.7.2 vs. 5.7.2-alt), up to +5.4× for the narrowest-`W` `TROW*` (§5.7.4 vs. 5.7.4-alt). An alternative configuration that retains the network is fully documented but is **not** the recommended implementation | Saves a 128-input programmable adder network in stage (B); guarantees strict lane alignment on **every** stage (B) beat. Penalty is confined to the `TROW*` family — element-wise, FMA-accumulate, masked, and `TCOL*` paths are unchanged in beat count |
+| Tile-level metadata | implicit in opcode | **explicit 32 b per tile register** (shape.x, shape.y, format) | Simplifies cross-instruction chaining; opcode carries no shape |
+| Supported formats | FP32, FP16, BF16 | **FP32, FP16/BF16, FP8, FP4** | Low-precision paths re-enabled for inference |
+| Staging registers | implicit SRAM buffers in stages (A)/(B) | **explicit flip-flop `SA`, `SB` (value, 4 KB each, diagonal-skewed, 64 × 64 B sub-banks) + `SC` (mask buffer, 4 KB physical, 128 × 1 b lane-predicate read port, §4.2.2) + `SX`, `SY`, `SOP`** | Operand fetch and compute decouple; variable-length prologue; per-beat row/col-mode tilelet delivery at 512 B/cy (§4.2.1); per-element predication via `SC` with zero default mask fetch (§3.3c) |
+| Operand-fetch latency | fixed 8 cy (2 tiles) | **variable 8 / 16 cy** (`N_val` and tregfile-side `is_xpose_*` mix; `has_mask` is free in the common case) | 2-value-op uniform-xp path unchanged with or without mask; 2-value-op mixed-xp path costs +8 cy under rule R2 (§6.2) — avoidable by using staging-side `tilelet_xpose` instead. Masked vs. unmasked have identical `T_fetch` in the common case |
+| Compute datapath width | 512 B | 512 B | unchanged |
+| Acc ping-pong, `N_run`, tree | unchanged | unchanged | 256 × 32 b × 2, `N_run = 512` |
+| Compute cycle formulas (`#W`, `K_outer`, `f`) | [`vector4k.md`](vector4k.md) §5.3.2 | **same formulas, now expressed as microcode programs (§5.4)**: col-reduce = accumulate-loop with `acc_op = INIT / ACCUM`; row-reduce = same loop with `tilelet_xpose = 1` on the operand; narrow-tile = loop + `acc_op = MERGE_STAGE` finale (§5.6.3) | Replaces `TCOL*` strip replay with a flat microcode loop; narrow-tile shapes pay `log2(512/W)` extra merge beats instead of per-wave reduction control |
+| Dual-result retire extra cost | n/a (not supported) | **0 additional cycles** — second write port was already present | "free" when `retire_mask = 2'b11` |
+| Master table size (§9 of each doc) | 35 rows (23 unique) | same 35 rows × 4 formats (**table expands once FP8 / FP4 are re-included**) | extra shapes covered by same datapath |
+
+### 10.1 Circuit size (area) — block-level delta
+
+All numbers below are **order-of-magnitude** estimates for a 512 B / 128-lane FP32-equivalent datapath, expressed in NAND2-gate-equivalents. Treat as a comparison ruler between v1 and v2, not as a synthesis PDK number — absolute numbers are ±30 % sensitive to process node and compiler choices, but the **delta** between v1 and v2 is robust because both versions share stage (A) / stage (B) core / Acc storage / pipeline registers in common.
+
+**Cell calibration** (modern FinFET node, NAND2 equivalents):
+
+| Cell | Gate count | Notes |
+|------|------------|-------|
+| Flip-flop (D-FF, +local mux tree) | ~10 | Single bit of FF storage with routine surrounding combinational |
+| 1R1W SRAM bit cell (8T) with amortised peripheral | ~3–4 | For a *small* (~4 Kb) macro; bit-cell is ~1.5 gate, peripheral overhead roughly doubles it |
+| 32 b 2:1 MUX | ~50 | The "input-MUX-layer" building block; a 32 b 8:1 = ~150, a 32 b 16:1 = ~250 |
+| 32 b integer adder | ~150 | Pure carry-propagate |
+| **32 b FP32 adder** | **~1 500** | 24 b exponent-align shifter + 24 b mantissa add + LZD/normalise + round + special-value handling |
+| 32 b FP32 multiplier (standalone) | ~6 000–8 000 | 24×24 Booth / Wallace partial-product array dominates |
+| **32 b FP32 FMA** (fused mul-add) | **~10 000** | 24×24 mul + 48 b aligned add + shared exponent logic |
+| 32 b FP32 min / max / cmp / select | ~400 | Sign+exp+mantissa compare + mux |
+| Narrow-format SIMD per 32 b lane (FP16×2 / FP8×4 / FP4×8) | +30–50 % | Over a pure-FP32 lane: adds per-format unpack/widen + sub-beat pipeline |
+
+These numbers matter because the dominant per-lane area in a 128-lane compute core is the FP32 FMA, not the adder — so blocks that **only need ADD + MIN / MAX** (e.g. the v1 Acc RMW unit) are roughly 5–10× smaller per lane than a block that also supports FMA / multiplication (e.g. the stage (B) ALU). Earlier drafts of this section conflated the two and under-counted blocks whose dominant cell is a scalar FP32 adder.
+
+**Block-level delta table.** The v2 column uses the **§9.1 recommended production baseline: `SA` / `SB` / `SC` implemented as 1R1W SRAM** (24 macros, 4 Kb × 1R1W each, typically `512 b × 16` half-empty at a real PDK). A separate column is retained for the FF variant (early-architecture / FPGA prototype), since both options are fully documented in §9.1.
+
+| Block | v1 size | v2 size (SRAM-staging baseline) | v2 size (FF-staging alternative, §9.1) | Δ vs v1 (SRAM baseline) | Notes |
+|-------|---------|---------------------------------|----------------------------------------|-------------------------|-------|
+| **Staging payload store** (`SA` + `SB` + `SC`) | ~0 (no explicit staging; implicit pipeline regs ~50 K gate) | 24 × `512 b × 16 × 1R1W` macros ≈ **~500 K gate** | 3 × 4 KB FF = 12 KB FF (~100 K FF) ≈ **~1.0 M gate** | **+500 K gate** (SRAM) / +1.0 M gate (FF) | The single largest add in v2. SRAM baseline halves the FF cost. See §9.1 for the functional-equivalence analysis. |
+| **Staging read MUXes** (rotator + bank-select, × 3 staging) | — | 3 × (64 B 8:1 output rotator + 64 × 2:1 bank-select) ≈ **~30 K gate** | same ≈ **~30 K gate** | +30 K gate | Reused 1:1 from [`tregfile4k.md`](tregfile4k.md) §7.4. |
+| **Input crossbar** (read-port → stage A) | 2 × 8 strips × up to 3 operand slots × 512 B with unaligned-offset shuffle ≈ **~1.0 M gate** | **absent** — 3 point-to-point 512 B buses from `SA`/`SB`/`SC` to stage (A) | **absent** | **−1.0 M gate** | Dominant area block in v1 and the largest routing-congestion contributor (§10.2). |
+| **Microcode ROM** | — (calendar-template combinational decoder ≈ ~30 K gate) | 64 b × ~64 beats × ~256 programs ≈ 128 Kb small-SRAM/ROM ≈ **~200 K gate** | same ≈ **~200 K gate** | +170 K gate (net, after removing v1 calendar decoder) | Small SRAM-like structure beside `SOP`. |
+| **Stage (A) align / unpack / permute** | **~200 K gate** | ~200 K gate | ~200 K gate | **0** | Unchanged; v2 reuses for format fixup and for the non-512 B `W` regimes that [`tregfile4k.md`](tregfile4k.md) §7.5 delivers. |
+| **Stage (B) compute-group core** (128 × FP32 FMA + min/max/cmp/select + narrow-format widener; intra-group tree retained) | 128 × ~14 K/lane ≈ **~1.8 M gate** | same ≈ **~1.8 M gate** | same ≈ **~1.8 M gate** | **0** | Same core ALU in both versions. FMA is the dominant cell per lane (~10 K/lane). v2 reuses this ALU for accumulator RMW via feedback (§5.5). |
+| **Stage (B) value-input MUXes** | 2:1 per lane × 2 operands (static operand selection) ≈ **~12 K gate** | 6:1 per lane × 3 slots (`MUX_A` / `MUX_B` / `MUX_Z`) + 3:1 mask + output gate ≈ **~115 K gate** | same ≈ **~115 K gate** | +103 K gate | Absorbs the Acc feedback, `SX` / `SY` scalar broadcast, and masking controls into one per-lane block. |
+| **Cross-group reducer network** (`N_tree` trees, §5.3.2 of v1) | 8 × (128 → k) FP32-adder network ≈ **~1.1 M gate** (8 × ~127 FP adders + programmable fabric) | **absent** in recommended baseline (§5.8.1) | **absent** | **−1.1 M gate** | An alternative config that retains the net adds ~1.1 M gate back and re-introduces the `TROW*` cross-lane wiring corridor. |
+| **Dedicated Acc RMW adder** | 128 × (FP32 adder + min/max + mux) ≈ 128 × ~2 K/lane ≈ **~250 K gate** | **absent** — folded into the shared stage (B) ALU via feedback (§5.5) | **absent** | **−250 K gate** | v1's Acc RMW supports only ADD / MIN / MAX (no FMA, no multiply, no narrow-format SIMD) so its per-lane cell is ~7× smaller than the stage (B) ALU per lane — this is why the block is much smaller than stage (B) even at the same lane count. |
+| **Accumulator storage** | 256 × 32 b × 2 ping-pong = 16 Kb FF ≈ **~200 K gate** | same 16 Kb, split into 2 parity-indexed banks (§5.5.1) ≈ **~200 K gate** | same ≈ **~200 K gate** | **0** | Parity split is wiring-only; every bit still lives in exactly one FF. |
+| **Accumulator feedback MUX** (MERGE_STAGE pair-selection, §5.5.1) | — | 2 × 8-way 128 × 32 b read shuffle + per-lane 2:1 swap + 8-way 64 × 32 b write-back shuffle ≈ **~250 K gate** | same ≈ **~250 K gate** | +250 K gate | Structured log-stride butterfly — absorbs the function of the v1 dedicated Acc RMW adder, so the **net area vs. v1 for the "accumulator RMW service" (Acc adder + feedback MUX) is ≈ 0**. |
+| **Strip-valid bitmap + per-operand FSM** | — | ~500 FF + small FSM ≈ **~5 K gate** | same ≈ **~5 K gate** | +5 K gate | Enables variable-length fetch + overlapped compute (§6). |
+| **Calendar-template decoder** | **~30 K gate** (combinational decode per op; v1 only) | — (replaced by microcode ROM row above) | — | **−30 K gate** | Subsumed by the ROM's per-op table. |
+| **Write-port / retire path** | 2 write ports, 1 active per retire (no dual retire) | 2 write ports active | 2 write ports active | 0 | v1 already had 2 write ports; v2 activates both concurrently when `retire_mask = 2'b11`. |
+| **Reciprocal / RSQRT seed unit** (for `TINV`, §7.5.1) | — | Per-group LUT + mantissa-shift block sharing FMA mantissa path ≈ **~50 K gate** (system-wide, ~400 gate/group × 128 groups) | same ≈ **~50 K gate** | **+50 K gate** | New for `TINV`. Piggybacks on the FP32 FMA mantissa path — no separate divider. Also usable by future `TNORM` / softmax kernels. |
+| **Row-range predicate synthesiser** (for `TROWRANGE_MUL`, §7.5.2) | — | Combinational `lane_row ∈ [r_start, r_end)` generator ≈ **~100 gate** | same | **+100 gate** | Negligible. Feeds the per-lane mask gate (§5.8) as an alternative to `SC_mask`. Also serviceable for any future range-restricted reduction. |
+| **Sort primitive** — v1: hard-wired `TSORT32` + fixed `TMRGSORT` networks; v2: reconfigurable 256-lane shuffle+CAS (§7.5.3) | Hard-wired 32-block bitonic sort network + fixed-size merge-stage networks ≈ **~130 K gate** | 256-lane log-stride shuffle butterfly (~50 K) + 128 × value-CAS units (~70 K) + direction-mux + control (~10 K) ≈ **~130 K gate** | same ≈ **~130 K gate** | **~0 (breakeven)** | v2 replaces a family of per-block-size hard-wired networks with **one reconfigurable primitive** that supports every `N = 2^p` from 32 to 8192 (FP4 tile), plus value+index dual retire, narrow-format sort, and free partial-sort via `mask_src = SC_mask`. Silicon budget unchanged; flexibility dramatically increased. |
+
+**Net area (order-of-magnitude sums).**
+
+| Category | v1 | v2 (SRAM-staging baseline) | v2 (FF-staging alternative, §9.1) |
+|----------|-----|----------------------------|-------------------------------------|
+| Staging payload + read-path MUX | ~50 K (implicit) | ~530 K | ~1.03 M |
+| Input crossbar | ~1.0 M | 0 | 0 |
+| Stage (A) | ~200 K | ~200 K | ~200 K |
+| Stage (B) ALU core + input MUX | 1.8 M + 12 K = ~1.81 M | 1.8 M + 115 K = ~1.92 M | ~1.92 M |
+| **Stage (B) auxiliary units** (RECIP/RSQRT seed + row-range predicate synth) | 0 | ~50 K | ~50 K |
+| Cross-group reducer network | ~1.1 M | 0 | 0 |
+| Acc RMW service (v1: dedicated adder; v2: Acc feedback MUX) | ~250 K | ~250 K | ~250 K |
+| Accumulator storage | ~200 K | ~200 K | ~200 K |
+| **Sort primitive** (v1: hard-wired TSORT32+TMRGSORT nets; v2: 256-lane shuffle+CAS) | ~130 K | ~130 K | ~130 K |
+| Control (calendar decoder / µcode ROM + strip-valid FSM) | ~30 K | ~205 K | ~205 K |
+| Pipeline / misc | ~50 K | ~50 K | ~50 K |
+| **Total (order-of-magnitude)** | **~4.82 M gate** | **~3.53 M gate** | **~4.03 M gate** |
+| vs v1 | — | **−1.29 M gate (~27 %)** | **−0.79 M gate (~16 %)** |
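+
+The totals row can be cross-checked by summing the category rows in table order. This is a worked sketch in M gate, with zero entries omitted and the stage (B) row taken unrounded (1.812 for v1, 1.915 for v2):
+
+$$
+\begin{aligned}
+\text{v1} &= 0.05 + 1.0 + 0.20 + 1.812 + 1.1 + 0.25 + 0.20 + 0.13 + 0.03 + 0.05 \approx 4.82 \\
+\text{v2 (SRAM)} &= 0.53 + 0.20 + 1.915 + 0.05 + 0.25 + 0.20 + 0.13 + 0.205 + 0.05 = 3.53 \\
+\text{v2 (FF)} &= 1.03 + 0.20 + 1.915 + 0.05 + 0.25 + 0.20 + 0.13 + 0.205 + 0.05 = 4.03 \\
+\Delta_{\text{SRAM}} &= 4.82 - 3.53 = 1.29 \approx 26.8\,\%\ \text{of v1} \\
+\Delta_{\text{FF}} &= 4.82 - 4.03 = 0.79 \approx 16.4\,\%\ \text{of v1}
+\end{aligned}
+$$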
+
+**Alternative configurations.**
+
+| Config (v2 baseline ± options) | Total | Δ vs v1 | Rationale |
+|-------------------------------|-------|---------|-----------|
+| v2 SRAM-staging + cross-group net retained (§5.8.1 alt) | ~4.63 M | **−0.19 M** (~4 %) | Keeps 9-beat `TROW*` tree paths for latency-critical `TROW*` kernels; loses ~85 % of the v2 area advantage. |
+| v2 SRAM-staging + double-buffered staging (§9.2) | ~4.03 M | **−0.79 M** | Doubles staging macro count (48 macros), still smaller than v1. Double-buffering becomes affordable *because* of the SRAM migration — same feature on FF staging would push v2 to ~5.03 M gate (> v1). |
+| v2 FF-staging + double-buffered staging (§9.2) | ~5.03 M | **+0.21 M** | Marginally *larger* than v1; the FF-doubling is what tips the scale. |
+| v2 SRAM-staging + cross-group net + double-buffered | ~5.13 M | **+0.31 M** | Fully-featured "max throughput" config; ~6 % larger than v1 in exchange for 2× staging throughput plus 9-beat `TROW*`. |
+| v2 SRAM-staging **without** §7.5 novel ops (apples-to-apples vs. a hypothetical v1 that also lacks any sort support) | ~3.35 M vs. hypothetical v1 ~4.69 M | **−1.34 M** (~29 %) | Lower-cost variant that drops `TINV` / `TROWRANGE_MUL` / `TMRGSORT` entirely. Saves ~50 K (RECIP/RSQRT seed) + ~130 K (shuffle+CAS primitive) from the v2 side, and the matching ~130 K of hard-wired sort networks from the v1 side. Included for reference only: the §7.5 ops cost ~5 % of v2's total area (~180 K of ~3.53 M gate) and should be kept; the baseline above already pays for them. |
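+
+Each row above follows from the baseline totals by adding or removing one block. The following worked sketch (in M gate) assumes the ~0.5 M and ~1.0 M staging increments are the SRAM and FF payload-store sizes from the block-level table:
+
+$$
+\begin{aligned}
+\text{SRAM + net} &= 3.53 + 1.1 = 4.63, & 4.63 - 4.82 &= -0.19 \\
+\text{SRAM + double-buffer} &= 3.53 + 0.5 = 4.03, & 4.03 - 4.82 &= -0.79 \\
+\text{FF + double-buffer} &= 4.03 + 1.0 = 5.03, & 5.03 - 4.82 &= +0.21 \\
+\text{SRAM + net + double-buffer} &= 3.53 + 1.1 + 0.5 = 5.13, & 5.13 - 4.82 &= +0.31 \\
+\text{SRAM without §7.5 ops} &= 3.53 - 0.05 - 0.13 = 3.35, & 3.35 - (4.82 - 0.13) &= -1.34
+\end{aligned}
+$$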
+
+**Takeaway.** The recommended v2 baseline (SRAM staging, no cross-group net, single-buffered, full §7.5 ISA) is **~27 % smaller than v1** at strictly greater functionality (adds `TINV`, `TROWRANGE_MUL`, and a reconfigurable sort primitive that subsumes v1's `TSORT32` + `TMRGSORT`). The biggest contributors, in decreasing order:
+
+1. **−1.1 M gate** from removing the cross-group reducer network: the biggest single saving, enabled by the `tilelet_xpose + MERGE_STAGE` path (§5.6.2, §5.7.2).
+2. **−1.0 M gate** from removing the input crossbar; v2's point-to-point staging buses replace it entirely. This is also the biggest routing win (§10.2).
+3. **+500 K gate** for the SRAM staging payload store (baseline), half the FF-alternative cost.
+4. **−250 K gate** from removing the dedicated Acc RMW adder, offset by **+250 K gate** of Acc feedback MUX → net 0 for the "accumulator RMW service" category.
+5. **+170 K gate** net for the microcode ROM (after subtracting the v1 calendar decoder).
+6. **+103 K gate** for the wider stage (B) input MUXes that enable ALU unification and masking.
+7. **+50 K gate** for the §7.5 RECIP/RSQRT seed unit (plus ≈100 gate for the row-range predicate synthesiser) — amortised across 128 compute groups.
+8. **~0 net** for the sort primitive: v2's reconfigurable 256-lane shuffle+CAS replaces v1's family of hard-wired sort networks at roughly the same silicon cost (~130 K gate), but covers every power-of-2 sort size from 32 to 8192 plus value+index dual retire and free partial-sort (§7.5.3).
+
+**Why the v1 Acc RMW adder is much smaller than the stage (B) ALU, even at the same 128-lane count.** v1's Acc RMW only supports **ADD / MIN / MAX** on FP32 partials that have already been widened by the intra-group tree. Per lane:
+- Acc RMW: 1 × FP32 adder (~1 500) + 1 × min/max/cmp (~400) + 3:1 mux (~150) ≈ **~2 000 gate/lane** → 128 × 2 K ≈ ~250 K.
+- Stage (B) ALU: 1 × FP32 **FMA** (~10 000) + min/max/cmp (~400) + narrow-format SIMD overhead (+30–50 %) + wide input MUXes ≈ **~14 000 gate/lane** → 128 × 14 K ≈ ~1.8 M.
+
+The ~7× per-lane gap comes from: **no multiplier / FMA in Acc RMW** (~–8 000/lane, which is the dominant term), **no narrow-format SIMD in Acc RMW** (Acc slots are pure FP32 partials post-widener), and **no 6:1 input mux in Acc RMW** (only Acc-slot vs. stage-B output). Intuitively: the accumulator sits *after* all the FMA and format-widening work is done in stage (B), so its RMW only needs to fold a single FP32 number into a slot.
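+
+The per-lane figures can be re-derived from the cell calibration table. This is a worked sketch that assumes the +30–50 % narrow-format overhead applies multiplicatively to the FMA and compare cells, and that leaves the stage (B) input MUXes out because the block-level table counts them as a separate row:
+
+$$
+\begin{aligned}
+\text{Acc RMW} &= 1500 + 400 + 150 = 2050 \approx 2\ \text{K gate/lane}, & 128 \times 2050 &\approx 0.26\ \text{M gate} \\
+\text{Stage (B) ALU} &= (10\,000 + 400) \times (1.3\ldots1.5) \approx 13.5\ldots15.6\ \text{K gate/lane}, & 128 \times 14\ \text{K} &\approx 1.79\ \text{M gate} \\
+\text{ratio} &\approx 14\ \text{K} / 2\ \text{K} = 7
+\end{aligned}
+$$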
+
+### 10.2 Routing difficulty
+
+Area and routability are two different axes. Even for identical gate counts, one topology can be dramatically easier to route than another. The two versions differ structurally on **where the long wires go**.
+
+**v1 — centralised crossbar dominates global routing.**
+
+The v1 crossbar sits between the TRegFile read ports and stage (A). Its logical connectivity is "any of 2 × 8 incoming strips × 64 B sub-chunks may reach any of up to 3 × 8 outgoing operand slots × 64 B sub-chunks". Even with hierarchical decomposition, the interconnect has:
+
+- **Wire count ≈ `2 × 3 × 512 B` ≈ 24 K bit-granularity wire crossings** in the worst case (2 read ports × 3 operand slots × 4 096 b). Most practical layouts realise this as 64 B-granularity bus crossings (so ~384 per read port × 3 op slots = ~1.1 K 64 B-wide bus crossings), but they are all *global* wires that must fan from the left edge of the block (read ports) to the centre (stage A).
+- A **congestion hot spot** in the middle of the block where all read-port bundles, all operand-slot bundles, and the control signals for the calendar decoder converge.
+- **Balancing latency** across the crossbar requires either careful manual buffering or a deep pipelined decoder (the v1 calendar-template decoder absorbs two cycles of this).
+- Scaling to future variants (more operand slots, wider datapaths) amplifies the problem quadratically.
+
+The dedicated Acc RMW adder, the cross-group reducer network, and the final retire merge also add a *second* global routing corridor between stage (B) and the Acc / write port — smaller in scale than the crossbar but still global.
+
+**v2 — distributed, localised wiring; longest wires are fixed-width point-to-point buses.**
+
+v2 turns the crossbar into three **independent islands** plus one **predictable butterfly**:
+
+- Three **staging islands** (`SA`, `SB`, `SC`), each a self-contained 64 × 64 B flip-flop array with **local** write rotator + bank-select MUX + read rotator. Each island footprint is ~1 mm² at a modern node. The only long wires leaving an island are:
+  - One 512 B input bus from the associated TRegFile read port (fixed direction, always point-to-point with a single source and a single sink, no multiplexing in transit).
+  - One 512 B output bus to stage (A) (also fixed, point-to-point). This wire does not fan out or merge.
+- The **stage (B) input MUXes** are *per-compute-group* (128 independent 6:1 MUX + 3:1 mask MUX cells), so they sit *inside* the group and only absorb short local wires (the group's own lane of each 512 B bus). No cross-lane routing needed at this stage.
+- The **Acc feedback MUX** (§5.5.1) is a structured log-stride butterfly local to the Acc storage. Butterfly networks are known to have a layout-friendly regular pattern (two-level MUX tree maps cleanly to a rectangular grid); the longest wire in the butterfly is one "bank-flip" span = half the Acc plane = ~8 mm of routing at worst.
+- The **shuffle + compare-swap primitive** (§7.5.3) is a **second** structured butterfly, local to stage (B), activated only during `TMRGSORT` beats (`alu_op ∈ {SHUFFLE_CAS_UP, SHUFFLE_CAS_DOWN}`). Its inputs arrive via the existing stage (B) input MUXes and its outputs leave via the existing stage (B) output fabric, so it introduces **no new inter-block corridor** — only an internal 256-lane log-stride butterfly (8 levels of 2:1 swaps, regular pattern). The longest wire inside the butterfly is one "half-width" span ≈ the width of stage (B); same layout character as the Acc feedback MUX.
+- The **microcode ROM** sits next to `SOP`. Its 64 b output fans out to the compute-datapath control lines — a fan-out pattern that can be pipelined (the beat word is latched once per cycle; physical distribution of the control bits can use one pipeline stage of buffering if necessary, costing zero microarchitectural latency since the beat is issued one cycle in advance). Adding the 5-bit `alu_op` (§7.5.3 widened it from 4 b) plus the 3-bit `shuffle_stride` field costs only a handful of extra control wires in this fan-out.
+- The **RECIP/RSQRT seed unit** (§7.5.1) and the **row-range predicate synthesiser** (§7.5.2) are both *per-compute-group* or *per-lane* combinational blocks; they piggyback on the existing FMA mantissa path and the per-lane mask gate respectively, and add no new cross-group wiring.
+- The **strip-valid bitmap** is a small 8-bit × 3 FSM block co-located with each staging island; no long wires needed.
+
+The net effect: **v2 has no global routing hot spot**. The longest wires are the three 512 B staging-to-stage-(A) buses, which are *point-to-point with a single source and a single sink* and therefore straightforward to route as dedicated bus channels.
+
+**Side-by-side summary.**
+
+| Routing aspect | v1 | v2 | Comment |
+|----------------|-----|-----|---------|
+| Global connectivity hot spot | Input crossbar + reducer corridor | **None** | v2's longest wires are 3 × 512 B point-to-point buses; no many-to-many fabric |
+| Longest wire-bundle (worst-case) | `2 × 3 × 512 B` ≈ 24 Kb crossing the centre of the block | `3 × 512 B` ≈ 12 Kb, point-to-point (one bus channel per operand) | ~2× reduction, plus no crossbar-style fan-in/fan-out |
+| Cross-lane wires | Acc RMW corridor + cross-group reducer net + hard-wired `TSORT32` / `TMRGSORT` sort networks | **None** from reductions in recommended baseline (§5.8.1); **two structured log-stride butterflies**: in-Acc (§5.5.1, for `MERGE_STAGE`) and in-stage-(B) (§7.5.3, for `TMRGSORT` shuffle+CAS). Both are layout-friendly regular patterns | Strict lane alignment remains a layout invariant for value/FMA/reduce beats; only the two butterflies are cross-lane, and each is fully contained inside its host block |
+| Congestion classes | Crossbar middle, cross-group tree fan-in, Acc merge, sort-network fan-out | Acc butterfly (contained), shuffle+CAS butterfly (contained inside stage (B)), staging bus channels (dedicated) | v2's remaining nets are **structured**, not random-fabric; both butterflies have 2:1 swap granularity and regular log-stride topology |
+| Timing closure | Crossbar depth dominates critical path; 2-cycle pipeline already spent | Staging read + stage (A) is comparable to v1; beat-word fan-out is pipelinable | Similar clock rate achievable; v2 frees the 2 crossbar pipeline stages for other uses (e.g. ROM fan-out buffering) |
+| Floorplan modularity | Monolithic — hard to partition around the crossbar | High — each staging island is a drop-in block; stage (B) compute groups are identical tiles | v2 is easier to compose with `N_group` or `W_ALU,i` changes |
+| Scaling to wider datapath (e.g. 1024 B) | Crossbar grows as **O(W²)** — quadratic routing blowup | Staging FF grows as **O(W)**; stage (B) MUX grows as **O(W)**; Acc butterfly grows as **O(W log W)** | v2 scales essentially linearly; v1's crossbar is the primary limiter |
+| Scaling to more operand slots | Crossbar grows as **O(W × M)** where M = operand slots | Adds one more staging island + one more 512 B point-to-point bus | v2 adds **independent** islands, avoiding any cross-operand interference |
+| Sensitivity to PDK / node | Crossbar becomes relatively worse as wires become slower than gates | Improves as wires become slower (more of the silicon is in the short intra-island nets) | v2 ages better across process nodes |
+
+**Takeaway.** v2 is **both smaller *and* substantially easier to physically implement** than v1 (smaller by ~27 % at the SRAM-staging baseline, as tabulated in §10.1, even *after* paying for the §7.5 `TINV` / `TROWRANGE_MUL` / `TMRGSORT` additions). The crossbar is the single hardest block to route in v1, and its complete removal is the biggest quality-of-result win in v2 — more consequential for timing closure and power-grid planning than the headline gate counts alone suggest. The remaining v2 nets are either local (inside a staging island or inside a compute group) or structured (the Acc feedback butterfly and the shuffle+CAS butterfly — both regular log-stride patterns, each fully contained inside its host block, and never both active on the same cycle). An implementation team trading v1 for v2 typically gets **1 node's worth of frequency headroom back** in the critical path, **and/or** can choose to spend it on a tighter floorplan or a wider datapath variant.
+
+The v2 block that warrants the most routing attention is the **staging payload store**. At the §9.1 SRAM baseline this is 24 × 1R1W macros (8 per staging register); the layout concern is **macro placement and bus-channel planning** — each macro needs a short 512 b input bus from its bank-select MUX, a short 64 B output bus to the per-row-group staging read MUX, and an MBIST tap. Because macros must abut their row-group MUX + rotator, a natural floorplan is **three vertical columns of 8 macros each**, one column per staging register, with rotator + MUX strips flanking each column. Under the FF-staging alternative (§9.1), the same islands exist but the concern shifts from macro placement to a **12 KB / 100 K-FF clock tree** — also manageable thanks to the 3 × 4 KB island decomposition (three well-bounded, symmetrical sub-trees), just different in character.
+
+**Summary of when to choose v2 over v1:**
+
+- Kernels dominated by **FMA-accumulate** (attention softmax staging, GEMM post-processing with bias): v2 fuses `mul + accumulate` into one `alu_op = FMA` beat with the addend served by the Acc feedback path (`src_Z = ACC_READ_*`, §5.8), saving ~40% cycles.
+- Kernels with **paired value-index outputs** (argmax reductions, sort networks): v2 halves the instruction count.
+- **Predicated / masked kernels** (sparse activations, attention with lookahead masks, per-token dropout, conditional accumulation): v2 supports per-element masking natively through the `C` bitmask (§3.3c, §4.2.2, §5.8) with no fetch-phase penalty and no dedicated mask hardware beyond a 128 × 1 b lane-predicate unpack + a per-lane ALU output gate.
+- Kernels with **heavy `TCOL*` / `TROW*` on any legal shape** (not just chunk-aligned): the microcode accumulator loop + per-beat `tilelet_xpose` (§5.6, §5.7) collapses axis reductions into flat straight-line code, eliminating strip replay and `#W`-wave control.
+- Kernels with **FP8 / FP4 activations**: format support restored.
+- Kernels that today require frequent **explicit `TTRANS`** before GEMM / reduce: the staging-side per-beat `tilelet_xpose` (§3.3b) provides a zero-cost chunk-grid transpose at dispatch, so these predecessor instructions can be deleted outright.
+
+**When v1 is sufficient:**
+
+- Elementwise-heavy, 2-operand workloads on FP32 / FP16 / BF16 with no transpose — v2's extra staging area is dark silicon.
+- Area / power minimization on a fixed ISA compatibility target.
+
+---
+
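+## Appendix V1 — v1 Content Inherited Verbatim (content unchanged, copied in full from [`vector4k.md`](vector4k.md))
+
+> **Purpose of this appendix:** The main body (§1–§10) describes the complete new VEC-4K-v2 design and, in many places, refers to v1 baseline content as "unchanged from [`vector4k.md`](vector4k.md) §X". So that readers can obtain **complete and up-to-date design information from `vector4k_v2.md` alone**, this appendix **copies every relevant v1 section verbatim into the v2 document**, each explicitly marked `(v1 → v2: content unchanged, copied in full from v1 §X)`. Any "v2 delta" to the sections below is already given as its own section in the main body (§1–§10); this appendix carries only the v1 baseline.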
+
+### A.1 v1 §3.2 Physical Strip — minimum streaming latency derivation
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §3.2.)**
+
+A **strip** is a contiguous **512-byte** chunk at offset `s·512` for `s ∈ {0,…,7}` within the 4 KB tile. This matches one port transaction per strip.
+
+**Minimum streaming latency (full tile), ignoring bank conflicts:**
+
+- **Unary** (both read ports read **the same logical tile** at consecutive strip offsets): **2 strips/cycle** × **512 B** = **1024 B/cycle** → **4096 / 1024 = 4 cycles** to read one full tile.
+- **Binary elementwise** (typical: **Rd0 → src0** strip `s`, **Rd1 → src1** strip `s`): **1 strip per operand per cycle** → **8 strips/operand** → **8 cycles** to ingest **both** full tiles. Software/hardware can **reuse** a buffered operand (e.g. **op reuse buffer**) to hide half of the reads on back-to-back dependent ops.
+
+The micro-architecture below assumes **8 strip indices** per 4 KB tile and schedules **cross-strip** work where reductions, expands, or gathers require it.
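+
+The two latency figures above follow directly from the tile, strip, and port sizes. A minimal sketch of that arithmetic, assuming two 512 B read ports and no bank conflicts (the function name `min_read_cycles` is hypothetical, introduced only for illustration):
+
+```python
+TILE_BYTES = 4096    # one 4 KB tile
+STRIP_BYTES = 512    # one strip = one port transaction
+READ_PORTS = 2       # Rd0 and Rd1
+
+
+def min_read_cycles(num_operand_tiles: int) -> int:
+    """Lower bound on cycles needed to stream the given number of full tiles.
+
+    Assumes every read port delivers one strip per cycle and that there
+    are no bank conflicts (the same simplification as the text above).
+    """
+    strips_per_tile = TILE_BYTES // STRIP_BYTES           # 8 strips
+    total_strips = strips_per_tile * num_operand_tiles
+    return -(-total_strips // READ_PORTS)                 # ceiling division
+
+
+print(min_read_cycles(1))  # unary: both ports read the same tile
+print(min_read_cycles(2))  # binary: one port per operand
+```
+
+With these constants the unary case evaluates to 4 cycles and the binary case to 8 cycles, matching the two bullets.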
+
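+> **(v2 context: in VEC-4K-v2 these two latency figures still hold, but §6 "Variable-Length Operand Fetch" gives the more precise v2 fetch forms, including mask-operation piggybacking, the 16-cycle extension for 3-tile operations, and the extra epoch that mixed `is_transpose` incurs under constraint R2 of §6.)**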
+
+### A.2 v1 §4.2 Lanes vs Strips definitions
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §4.2.)**
+
+- **SIMD lane**: one parallel datapath processing **one logical element** after unpack (width depends on op; internal **FP32** is a reasonable unified width for expensive ops).
+- **Strip**: 512 B of **spatially contiguous** storage; SIMD width = `elem_per_strip`.
+- **Cross-lane** (within strip): reductions along a dimension that fits in one strip (partial row/col).
+- **Cross-strip**: **control-programmed** **crossbar** + **trees** combine strip `s` and `s′` contributions, or **multi-cycle accumulation** into **Acc** ping-pong / **staging**.
+
+> **(v2 context: VEC-4K-v2 removes the v1 crossbar; cross-strip combination is instead implemented by microcode-driven `tilelet_xpose` + `MERGE_STAGE` (§5.4 / §5.5 / §5.6.2).)**
+
+### A.3 v1 §4.3 Fiber ID and strip read calendar
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §4.3.)**
+
+**`fiber_id`** is the **logical index along the axis** that **reduce** and **expand/broadcast** class ops treat as a **fiber** — one **output slot** per fiber after a reduce, or one **scalar source** per fiber when expanding along that axis:
+
+| Opcode family | `fiber_id` | Range |
+|---------------|------------|--------|
+| **`TROW*`** (row reduce) | row index **`r`** | `0 … R−1` |
+| **`TCOL*`** (column reduce) | column index **`c`** | `0 … C−1` |
+| **`TROWEXPAND*`** | **`r`** (splat target row) | `0 … R−1` |
+| **`TCOLEXPAND*`** | **`c`** (splat target column) | `0 … C−1` |
+
+**Elementwise** tile–tile ops do **not** use a single global **`fiber_id`**; they are scheduled **strip-by-strip** only. **Gather/sort/merge** use their own index streams; where they write **per-row/col** state, that state can still be keyed like **`fiber_id`** for buffer allocation.
+
+**From strip bytes to `(r, c)` and `fiber_id`:** For strip index **`s ∈ {0,…,7}`** and lane/byte offset inside the **512 B** chunk, **row-major** layout fixes a linear element order; decode **`(r, c)`** from **`(R, C, E)`**. Then **`fiber_id = r`** or **`c`** according to the opcode's **axis**.
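+
+The decode described above can be sketched as follows. This is an illustration only, assuming a densely packed row-major tile with `E` bytes per element; the function names `decode_element` and `fiber_id_of` and the `axis` strings are hypothetical, not part of the ISA or RTL:
+
+```python
+STRIP_BYTES = 512
+
+
+def decode_element(s: int, delta: int, R: int, C: int, E: int) -> tuple:
+    """Map (strip index s, byte offset delta) to (r, c) in a row-major R x C tile."""
+    if not (0 <= s < 8 and 0 <= delta < STRIP_BYTES):
+        raise ValueError("strip index or byte offset out of range")
+    elem = (s * STRIP_BYTES + delta) // E     # linear element index
+    if elem >= R * C:
+        raise ValueError("offset lies outside the valid R x C region")
+    return divmod(elem, C)                    # (row, column)
+
+
+def fiber_id_of(s: int, delta: int, R: int, C: int, E: int, axis: str) -> int:
+    """Pick r for row-axis opcodes (TROW*) or c for column-axis opcodes (TCOL*)."""
+    r, c = decode_element(s, delta, R, C, E)
+    return r if axis == "row" else c
+
+
+# FP32 8x128: strip 4, offset 0 is the first element of row 4.
+print(decode_element(4, 0, R=8, C=128, E=4))
+# FP16 128x16: strip 4, offset 32 is the first element of row 65.
+print(decode_element(4, 32, R=128, C=16, E=2))
+```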
+
+**Strip read calendar:** A **calendar** is the **cycle-by-cycle** schedule that binds **what arrives on Rd0/Rd1** to **what the datapath does** — in particular, **which operands feed each lane** and **which `fiber_id`(s)** touch **Acc** or **per-fiber `v` buffers** that cycle.
+
+- **Port row (per cycle `t`):** specifies **`s(t)`** (which **512 B** chunk), **which logical tile** each port reads (**`src0`**, **`src1`**, narrow **`v`** tile, scalar tile, **ping-pong scratch** for merges/sorts, or **idle**), and optional **second-pass** phases. **`TCOL*`** does **not** use a **transpose scratchpad** — only **normal** `reg_idx` tiles in **row-major** strip order. TRegFile ports **cannot gather**; **`TCOL*`** may **repeat** the **same** `reg_idx` over **multiple TRegFile epochs** when **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉) > 1`**.
+- **Operand sources (`TROW*` reduce):** **Tile elements** arrive strip-serially from **read ports**; after **unpack → within-strip tree → cross-strip combine**, the reducer performs **RMW** on **Acc** at **`fiber_id = r`**. Physical slot **`ρ`** is **`fiber_id` remapped** into **`[0, N_run)`** for the current **Acc wave** (§A.8 §9.3.2).
+- **Operand sources (`TCOL*` reduce):** Ports still deliver **full rows** inside each **512 B** strip; **VEC** **selects** **`(r,c)`** for the scheduled **column band** from **strip buffers** (**no RF gather**). **Acc[`c`]** **+=** partial sums across strip-beats and, if needed, across **re-scans** of the tile.
+- **Operand sources (expand):** **`v[fiber_id]`** is supplied from a **narrow per-fiber vector** streamed on a read port, from **Acc / staging** after an in-place reduce, or from a **small buffer** filled in a **prefetch** phase; **`src`** elements still arrive **strip-major**. The calendar interleaves **`v`** strip reads with **`src`** strips so each cycle's SIMD sees a consistent **`(fiber_id, lane)`** map.
+
+**Templates:** Opcode decode picks a **calendar template** from **`(format, R, C, opcode)`**. The §A.8 metrics **`rS`, `rW`, `rK`, …** / **`cS`, `cW`, …** fix how many **within-strip** rounds and how **8-strip** walks align with **cross-strip** merge. The 47 distinct scheduling recipes in **§A.8 §9.5.1** are **calendar families** over the same datapath, not separate RTL blocks.
+
+> **(v2 context: VEC-4K-v2 replaces calendar templates with microcode beat words (§5.4). The v1 notion of a "calendar template" corresponds in v2 to one program in the microcode ROM. The `fiber_id` concept itself still applies and is mapped onto Acc through the `acc_slot` field.)**
+
+### A.4 v1 §4.4 Epoch-aligned fiber calendars vs `tregfile4k.md` (worked examples A–E)
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §4.4.)**
+
+Full **(format, shape)** enumeration would need **35** row-axis templates alone (§A.8); this subsection fixes **VEC ↔ TRegFile-4K** timing and shows **five** representative **`fiber_id`** calendars (**Examples A–E**; FP32 and FP16 / BF16 only). See [`tregfile4k.md`](tregfile4k.md): global **`e = cy[2:0]`** (phase within an **8-cycle** epoch); read port **Rp** presents bank-group **`G = (p + e) mod 8`** (**512 B** = one **strip** **Gs**).
+
+**Port binding for the tables:**
+
+| Logical name | TRegFile read port | Phase `p` | Strip delivered at phase `e` |
+|--------------|-------------------|-----------|--------------------------------|
+| **Port A** | **R0** | 0 | **`G_A = e`** |
+| **Port B** | **R4** | 4 | **`G_B = (4 + e) mod 8`** |
+
+**Epoch start:** **`t = 0`** is an **epoch boundary** (**e = 0**): the **`reg_idx`** for each port is **active** for the next 8 cycles (pending→active promotion, [`tregfile4k.md`](tregfile4k.md) §4).
+
+**Dual-port same `tile_idx` on A+B:** in cycles **`t = 0…3`** (**e = 0…3**), the pair **`(G_A, G_B)`** visits **`(0,4), (1,5), (2,6), (3,7)`** — every strip **Gs ∈ {0,…,7}** appears **exactly once** as **one** of the two 512 B beats. Cycles **`t = 4…7`** repeat the **same** strip schedule (second lap with the **same** latched tile on both ports); microcode **suppresses duplicate Acc** or reuses beats for **another operand** / **write path**. **Benefit:** **4 cycles** to see all **8** strips **once** with **two** 512 B reads/cycle vs **8 cycles** with **Port A only** (Port B **idle**).
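+
+The strip pairs above can be reproduced from the port rule `G = (p + e) mod 8` with `e = cy[2:0]`. A minimal sketch, assuming only that rule and the phases `p = 0` (Port A) and `p = 4` (Port B) from the binding table; the helper name `strip_at` is hypothetical:
+
+```python
+PORT_A_PHASE = 0   # R0
+PORT_B_PHASE = 4   # R4
+
+
+def strip_at(port_phase: int, cycle: int) -> int:
+    """Bank-group (strip) presented by a read port in the given cycle."""
+    e = cycle & 0b111                  # e = cy[2:0], phase within the epoch
+    return (port_phase + e) % 8
+
+
+pairs = [(strip_at(PORT_A_PHASE, t), strip_at(PORT_B_PHASE, t)) for t in range(8)]
+first_lap, second_lap = pairs[:4], pairs[4:]
+
+# Each lap of four cycles touches every strip 0..7 exactly once.
+for lap in (first_lap, second_lap):
+    assert sorted(g for pair in lap for g in pair) == list(range(8))
+
+print(first_lap)
+print(second_lap)
+```
+
+Under this rule the second lap delivers the same strip pairs with the two ports exchanged, which is why the duplicate beats can be masked or reused.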
+
+**Row-major element index** in a strip: byte offset **`512·Gs + δ`** with **`δ`** increasing along **columns** within the row segment; **`fiber_id = r`** for **`TROW*`** labels which **row**'s **C** elements are being reduced or expanded.
+
+#### Example A — `TROWSUM`, **FP32**, **8×128** (`C = 128`, **one row = one strip**)
+
+**Geometry:** **`row_B = 512 B`**, strip **Gs** holds **exactly row `r = Gs`**. **Dual-port** same **`src`** tile on **A+B** gives **two rows/cycle** for **`t = 0…3`**; **`t = 4…7`** are duplicate strip delivery (masked).
+
+| `t` | `e` | Port A | Port B | Fibers (A) | First elem @ A | `#elem` | Fibers (B) | First elem @ B | `#elem` | Reduce note |
+|----:|----:|--------|--------|------------|----------------|--------:|------------|----------------|--------:|-------------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0` | elem `(0,0)` | 128 | `r=4` | elem `(4,0)` | 128 | 2× `TROWSUM` lane-tree → **Acc** `r=0`, `r=4` |
+| 1 | 1 | `src@G1` | `src@G5` | `r=1` | `(1,0)` | 128 | `r=5` | `(5,0)` | 128 | **Acc** `r=1`, `r=5` |
+| 2 | 2 | `src@G2` | `src@G6` | `r=2` | `(2,0)` | 128 | `r=6` | `(6,0)` | 128 | **Acc** `r=2`, `r=6` |
+| 3 | 3 | `src@G3` | `src@G7` | `r=3` | `(3,0)` | 128 | `r=7` | `(7,0)` | 128 | **Acc** `r=3`, `r=7` |
+
+**Unique ingest complete at `t = 3`** (8 fibers, each **128** elements). **Single-port (A only):** stretch to **`t = 0…7`**, one row/cycle, **no** dual-port gain.
+
+#### Example B — `TROWSUM`, **FP32**, **32×32** (`C = 32`, **4 rows / strip**)
+
+**Geometry:** **`row_B = 128 B`**, **`512 / 128 = 4`** rows per strip. **Single Port A** is enough; **Port B idle**.
+
+| `t` | `e` | Port A | Port B | Fibers (A) | First elem @ A (each fiber) | `#elem` each | Reduce note |
+|----:|----:|--------|--------|------------|-----------------------------|----------------|-------------|
+| 0 | 0 | `src@G0` | — | `0,1,2,3` | `(0,0)`, `(1,0)`, `(2,0)`, `(3,0)` | 32 | 4× lane-tree (`K=32`) → **Acc** `r=0…3` |
+| 1 | 1 | `src@G1` | — | `4,5,6,7` | `(4,0)`…`(7,0)` | 32 | **Acc** `r=4…7` |
+| 2 | 2 | `src@G2` | — | `8,9,10,11` | … | 32 | **Acc** |
+| 3 | 3 | `src@G3` | — | `12…15` | … | 32 | **Acc** |
+| 4 | 4 | `src@G4` | — | `16…19` | … | 32 | **Acc** |
+| 5 | 5 | `src@G5` | — | `20…23` | … | 32 | **Acc** |
+| 6 | 6 | `src@G6` | — | `24…27` | … | 32 | **Acc** |
+| 7 | 7 | `src@G7` | — | `28…31` | … | 32 | **Acc** |
+
+**32** fibers, each **`#elem = C = 32`** from **Port A** only; **8 cycles** = one epoch, **one** `reg_idx` on **R0**.
+
+#### Example C — `TROWEXPANDADD`, **FP32**, **8×128** (`v[r]` + `src`)
+
+**`v` tile:** **8** row scalars as **FP32** = **32 B** at **byte offset 0** of strip **G0** (tile **`v`**; remaining bytes **don't-care**).
+
+**Why prefetch `v`:** If **`v@G0`** were on **Port B** while **`src`** streams on **Port A** only, **dual-port** cannot also deliver **`src@G4`** in the **same** cycle. **High bandwidth schedule:** (1) **Pre-epoch** or **`t_pre`:** **Port B** reads **`v@G0`** once; latch **`v[0]…v[7]`** (bytes **0–3**, **4–7**, …, **28–31**). (2) **`t = 0…3`:** **Port A** and **Port B** both carry **`src`** with **same** `reg_idx` as Example **A** — **two rows/cycle**.
+
+| `t` | `e` | Port A | Port B | Fibers (`src`) | First elem @ A | First elem @ B | `#elem` / fiber | **`v[fiber_id]`** | Expand |
+|----:|----:|--------|--------|----------------|----------------|----------------|-----------------|-------------------|--------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0`, `r=4` | `(0,0)` | `(4,0)` | 128 | **latched** **`v[0]`**, **`v[4]`** | **128** lanes/fiber: **`src` + v** |
+| 1 | 1 | `src@G1` | `src@G5` | `r=1`, `r=5` | `(1,0)` | `(5,0)` | 128 | **latched** **`v[1]`**, **`v[5]`** | … |
+| 2 | 2 | `src@G2` | `src@G6` | `r=2`, `r=6` | … | … | 128 | **latched** **`v[2]`**, **`v[6]`** | … |
+| 3 | 3 | `src@G3` | `src@G7` | `r=3`, `r=7` | … | … | 128 | **latched** **`v[3]`**, **`v[7]`** | … |
+
+**`t_pre` (one beat, e.g. previous epoch):** **Port B** = **`v@G0`**, **Port A** = **—** or next **`src`** prefetch; **first element of `v[0]`** = **byte 0** of **B**'s 512 B chunk.
+
+#### Example D — `TROWSUM`, **FP16**, **128×16** (`C = 16`, **`E = 2`**, **16 rows / strip**)
+
+**Geometry:** **`row_B = 32 B`** = **16** FP16 elements/row; **`512 / 32 = 16`** distinct **`fiber_id`** values per strip. **Dual-port** same **`src`**: **`t = 0…3`** covers all **128** rows (**32** partial trees/cycle = 16 fibers × 2 ports). **`K = 16`**, **`D_lane = 4`**.
+
+| `t` | `e` | Port A | Port B | Fiber IDs (A) | First elem @ A | `#elem` | Fiber IDs (B) | First elem @ B | `#elem` | Reduce note |
+|----:|----:|--------|--------|---------------|----------------|--------:|---------------|----------------|--------:|-------------|
+| 0 | 0 | `src@G0` | `src@G4` | `r=0…15` | row `r` at byte **`32r`** in **G0** | 16 | `r=64…79` | byte **`32(r−64)`** in **G4** | 16 | **32** lane-trees → **32** **Acc** RMW (**watch bank** = `r mod 8`) |
+| 1 | 1 | `src@G1` | `src@G5` | `16…31` | … | 16 | `80…95` | … | 16 | **32** **Acc** |
+| 2 | 2 | `src@G2` | `src@G6` | `32…47` | … | 16 | `96…111` | … | 16 | **32** **Acc** |
+| 3 | 3 | `src@G3` | `src@G7` | `48…63` | … | 16 | `112…127` | … | 16 | **32** **Acc** |
+
+**`t = 4…7`:** duplicate **`src`** strips (the redundant **Acc** re-reduce is suppressed) unless **`reg_idx`** advances. **BF16 128×16** reuses **exactly the same calendar** (same `E`, same byte layout, different payload semantics).
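+
+The fiber columns of Examples A, B, and D all follow one pattern: strip `G` holds `512 / row_B` consecutive rows. A minimal sketch that regenerates those columns, assuming a full 4 KB tile whose rows fit whole inside a strip (the function name `trow_calendar` is hypothetical and the sketch ignores Acc timing):
+
+```python
+STRIP_BYTES = 512
+TILE_BYTES = 4096
+
+
+def trow_calendar(R: int, C: int, E: int, dual_port: bool) -> list:
+    """Return [(t, strips, fibers_per_port), ...] for a TROW* unique ingest."""
+    row_bytes = C * E
+    assert R * row_bytes == TILE_BYTES, "sketch assumes a full 4 KB tile"
+    assert STRIP_BYTES % row_bytes == 0, "sketch assumes rows do not span strips"
+    rows_per_strip = STRIP_BYTES // row_bytes
+    cycles = 4 if dual_port else 8
+    calendar = []
+    for t in range(cycles):
+        strips = (t, t + 4) if dual_port else (t,)   # Port A, then Port B
+        fibers = [list(range(g * rows_per_strip, (g + 1) * rows_per_strip))
+                  for g in strips]
+        calendar.append((t, strips, fibers))
+    return calendar
+
+
+for t, strips, fibers in trow_calendar(R=8, C=128, E=4, dual_port=True):
+    print(t, strips, fibers)           # Example A: two rows per cycle
+```
+
+Calling `trow_calendar(32, 32, 4, dual_port=False)` gives the eight-cycle single-port walk of Example B, and `trow_calendar(128, 16, 2, dual_port=True)` gives the 16-fibers-per-port walk of Example D.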
+
+#### Example E — `TCOLSUM`, **FP16**, **16×128** (`R = 16`, **`C = 128`**, **native row-major**)
+
+**TRegFile:** Read ports emit **only** full **512 B** strips ([`tregfile4k.md`](tregfile4k.md)); **there is no gather** inside the tile RF. **`TCOLSUM`** cannot request "column **`c`** only" from the file — it must **accept whole strips** on **Port A / Port B**, then **select** the needed **`(r,c)`** in **VEC** (strip buffers → unpack → **column mux** / shifter network).
+
+**Policy:** **No transpose scratchpad**. Operand remains **one** `reg_idx`, row-major. **Acc[`c`]**: **read–modify–write** associative **add** so partials from each strip-beat **accumulate** until all **R = 16** row contributions for column **`c`** are seen.
+
+**Hardware parallelism:** Let **`N_tree`** = parallel **adder / reduce trees** per beat; **`N_acc`** = parallel **Acc** **RMW** slots per cycle (**`N_acc ≤ N_run`**); **`f`** = **effective `Acc[c]` commits per tree per full tile scan**. **Same-cycle** combine+retire is capped by **`P_beat = min(N_tree, N_acc)`**.
+
+**Wave count (both limits):**
+
+**`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`**
+
+- **`#waves_acc = ⌈C / N_acc⌉`**: **Acc-band** partitioning (**≤ `N_acc`** columns **finished** per **wave** if **trees** keep up).
+- **`#waves_tree = ⌈C / (N_tree · f)⌉`**: when **`N_tree ≪ N_acc`**, **tree throughput** may require **more** **full scans** than **`#waves_acc`** predicts.
+- **`#waves_Nrun = ⌈C / N_run⌉`**: **DFF capacity**; redundant with **`#waves_acc`** when **`N_acc ≤ N_run`**.
+
+**Illustrative numbers:** **`C = 128`**, **`N_acc = 64`**, **`N_run = 512`** → **`#waves_acc = 2`**, **`⌈C/N_run⌉ = 1`**. With **`N_tree = 8`**, **`f = 8`**: **`N_tree · f = 64`**, **`#waves_tree = 2`**, **`#W = max(2, 2, 1) = 2`**. With **`N_tree = 4`**, **`f = 8`**: **`N_tree · f = 32`**, **`#waves_tree = 4`**, **`#W = max(2, 4, 1) = 4`** — **tree**-limited.
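+
+The wave-count formula and the two illustrative cases can be checked with a few lines of arithmetic. A minimal sketch (the function name `wave_count` and its return shape are hypothetical; the symbols are those defined above):
+
+```python
+def ceil_div(a: int, b: int) -> int:
+    """Integer ceiling division."""
+    return -(-a // b)
+
+
+def wave_count(C: int, N_acc: int, N_tree: int, f: int, N_run: int) -> dict:
+    """Evaluate #W = max(ceil(C/N_acc), ceil(C/(N_tree*f)), ceil(C/N_run))."""
+    terms = {
+        "waves_acc": ceil_div(C, N_acc),
+        "waves_tree": ceil_div(C, N_tree * f),
+        "waves_Nrun": ceil_div(C, N_run),
+    }
+    terms["W"] = max(terms.values())
+    return terms
+
+
+print(wave_count(C=128, N_acc=64, N_tree=8, f=8, N_run=512))  # Acc-limited case
+print(wave_count(C=128, N_acc=64, N_tree=4, f=8, N_run=512))  # tree-limited case
+```
+
+The first call yields `W = 2` and the second `W = 4`, in line with the illustrative numbers.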
+
+**Geometry:** **`row_B = 256 B`** (**128** FP16 elements/row); **2** rows per **512 B** strip. **Dual** read (**R0+R4**): **`t = 0…3`** (**`e = 0…3`**) delivers **unique** strip pairs **`(0,4)…(3,7)`** and visits **all 16** rows **once per full scan**. Cycles **`t = 4…7`** of the **same** TRegFile **epoch** repeat the **same** strips.
+
+**Strip calendar (identical each wave; `c_base = N_acc · k`):**
+
+| `t` | `e` | Port A | Port B | Rows in A / B | **`fiber_id` band** | Per-**`c`** row samples this `t` | **Acc** |
+|----:|----:|--------|--------|---------------|---------------------|----------------------------------|---------|
+| 0 | 0 | `src@G0` | `src@G4` | **0–1** / **8–9**   | **`c ∈ [c_base, c_base + N_acc − 1]`** | **4** | **`Acc[c] +=`** partial from **4** **`(r,c)`** (mux from **A/B**) |
+| 1 | 1 | `src@G1` | `src@G5` | **2–3** / **10–11** | same band | **4** | **+=** … |
+| 2 | 2 | `src@G2` | `src@G6` | **4–5** / **12–13** | same band | **4** | **+=** … |
+| 3 | 3 | `src@G3` | `src@G7` | **6–7** / **14–15** | same band | **4** | **16** terms integrated per **`c`** in band |
+
+**Multi-epoch summary** (**Acc-limited** **`#W = 2`**; **`N_acc = 64`**, **`N_run = 512`**):
+
+| Wave `k` | Tile read policy | Active columns | Strip beats used |
+|----------|------------------|----------------|------------------|
+| 0 | Same `reg_idx`, epoch **E0** | **`c = 0…63`**   | **`t = 0…3`** as above |
+| 1 | **Re-read** same tile, epoch **E1** | **`c = 64…127`** | repeat calendar |
+
+**Cycle lower bound (illustrative):** **`#W × 4`** strip-pair cycles, **plus** **epoch** turnarounds. **Single-port A only** doubles strip cycles per scan. **BF16 16×128** behaves identically.
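+
+As a functional illustration of the wave scheme, the sketch below walks a row-major tile in dual-port strip order and accumulates one Acc band per wave. It models only the Acc-limited case (`#W = ⌈C / N_acc⌉`), ignores epoch turnarounds and tree throughput, and uses a plain Python list for the tile; the name `tcolsum_waves` is hypothetical:
+
+```python
+def tcolsum_waves(tile: list, n_acc: int) -> tuple:
+    """Column-sum a row-major tile by re-scanning it once per Acc band.
+
+    Returns (acc, strip_pair_cycles). Assumes the R rows fill the 8 strips
+    with a whole number of rows per strip, as in Example E.
+    """
+    R, C = len(tile), len(tile[0])
+    assert R % 8 == 0, "sketch assumes whole rows per strip"
+    rows_per_strip = R // 8
+    acc = [0] * C
+    cycles = 0
+    for c_base in range(0, C, n_acc):                  # one wave per band
+        band = range(c_base, min(c_base + n_acc, C))
+        for t in range(4):                             # unique strip pairs
+            for g in (t, t + 4):                       # Port A, Port B
+                for r in range(g * rows_per_strip, (g + 1) * rows_per_strip):
+                    for c in band:
+                        acc[c] += tile[r][c]           # Acc[c] += partial
+            cycles += 1
+    return acc, cycles
+
+
+tile = [[r * 128 + c for c in range(128)] for r in range(16)]   # 16 x 128
+acc, cycles = tcolsum_waves(tile, n_acc=64)
+print(cycles)   # strip-pair cycles for the two waves
+```
+
+For the 16×128 case with `n_acc = 64` the walk takes `2 × 4 = 8` strip-pair cycles, consistent with the `#W × 4` lower bound before turnarounds are added.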
+
+> **(v2 context: VEC-4K-v2 replaces v1's five calendar examples with the four complete microcode program examples of §5.7 (wide / narrow × col / row reduce). In v2, `TCOL*` becomes `TROW*` via `is_transpose=1` plus the stage-(A) correction path, so strip replay and multi-epoch `#W` are no longer needed; see §7.3. The v1 calendar patterns described in worked examples A–E remain the skeleton of the v2 microcode.)**
+
+### A.5 v1 §5 Instruction Categories and Cycle Sketches (baseline)
+
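+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §5. v2 §7 lists only the v2 deltas; this appendix gives the complete v1 baseline text so that the document is self-contained.)**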
+
+The following uses **R**/**C** notation, **S = 8** strips, and **read_pair** = one cycle with both 512 B read ports used. **Write_pair** = both write ports used (same or different tiles depending on retire policy).
+
+#### A.5.1 Elementwise (Tile–Tile) — v1 §5.1
+
+**Representative:** `TADD`, `TMUL`, `TAND`, `TCMP`, `TCVT` (unary ops such as `TABS`/`TRELU` omit one read port per strip).
+
+**Dataflow (non-pipelined sketch):**
+
+1. For each strip index `s = 0…7`: `read_pair` loads `src0[s]` and `src1[s]` → unpack → SIMD op → pack into `dst[s]` staging.
+2. Retire `dst[s]` with `write_pair` (same cycle as next strip's read if the pipeline supports **read–compute–write** overlap).
+
+**Latency:** **8 cycles** minimum to **read** both operand tiles strip-by-strip; with **pipelining** (overlap read `s+1` with write `s−1`), end-to-end often **~10–12 cycles** for a full `dst` tile (implementation-dependent buffering).
+
+**Cross-lane:** **none** for pure elementwise; SIMD lanes are independent within the strip.
+
+**Special:** `TCVT` may widen/narrow; internal path uses **wider SIMD** or **two-pass** if pack/unpack asymmetry exceeds one cycle.
+
+```mermaid
+flowchart LR
+  subgraph per_strip [Per strip s]
+    R0[Read src0 512B] --> U0[Unpack]
+    R1[Read src1 512B] --> U1[Unpack]
+    U0 --> ALU[Elementwise ALU]
+    U1 --> ALU
+    ALU --> P[Pack]
+    P --> W[Write dst 512B]
+  end
+```
+
+#### A.5.2 Tile–Scalar / Tile–Immediate — v1 §5.2
+
+**Representative:** `TADDS`, `TMULS`, `TCMPS`, `TEXPANDS`, `TADDSC`.
+
+**Scalar path:**
+
+- Immediate or **single-element tile** loaded once into **scalar broadcast reg** (optional **1-cycle** read of a dedicated scalar slot).
+- Each strip: SIMD op **lane_i = f(tile[s]_i, scalar)**.
+
+**Cycles:** same order of magnitude as A.5.1; **one fewer** long-latency operand read if scalar is in a control register.
+
+**Cross-lane:** none (unless scalar differs per row/col via side table — then becomes expand-like).
+
+#### A.5.3 Axis Reduce — v1 §5.3
+
+**Row-reduce examples:** `TROWSUM`, `TROWMAX`, `TROWARGMAX` (reduce across **columns** within each row).
+**Column-reduce examples:** `TCOLSUM`, `TCOLMAX`, `TCOLARGMAX`.
+
+**Key geometric fact:** a **512 B strip** spans a **contiguous run** of row-major storage; for **large C**, one row may span **multiple strips**; for **small C**, one strip may hold **multiple partial rows**. The control FSM computes `(row, col)` range per strip from `(R, C, E)`.
+
+##### A.5.3.1 Row-wise reduce (e.g. `TROWSUM`) — v1 §5.3.1
+
+For each **row r**, compute `acc[r] = reduce_{c} M[r,c]`.
+
+**Phase A – partial reduce within strip:** For strips that contain **multiple columns of the same row segment**, use **horizontal SIMD tree** within the strip (cross-lane inside SIMD).
+
+**Phase B – cross-strip combine for rows spanning strips:** Strips contributing to the same row feed a **segmented reduction network** or write **partial sums** to **Acc** (logical index `r`, physical slot **`ρ`** when **`N_run < R`**), then **second pass** reads back when row complete.
+
+**`TROWARGMAX` / `TROWARGMIN`:** each strip produces **(value, col_index)** pairs; cross-strip compare selects winner; **cross-lane compare tree** + **index mux**.
+
+##### A.5.3.2 Column-wise reduce (e.g. `TCOLSUM`, `TCOLMAX`) — v1 §5.3.2
+
+**Architectural rule (VEC-4K v1):** **`TCOL*`** **must** be implemented on the **operand tile's native row-major** layout as seen through **normal TRegFile** `reg_idx` / **512 B strip** reads. **No transpose scratchpad tile**.
+
+**Parallelism — `N_tree` vs `N_acc` (hardware):**
+
+| Symbol | Meaning |
+|--------|---------|
+| **`N_tree`** | Parallel **adder / reduce** paths after the **crossbar** (v1 §4.1). **Reference implementation:** **`N_tree = 128`**. |
+| **`N_acc`** | **Distinct** **`Acc[·]`** **RMW** slots **in the same cycle** — limited by **`N_tree`**, **adder feedback ports**, and **`N_run`**: **`N_acc ≤ N_run`** (**512**). |
+
+**Wave count:**
+
+- **`#waves_acc = ⌈C / N_acc⌉`**: **Acc-band** partitioning.
+- **`#waves_tree = ⌈C / (N_tree · f)⌉`**: when **`N_tree ≪ N_acc`**, **tree throughput** may require **more** full scans.
+- **`#waves_Nrun = ⌈C / N_run⌉`**: **DFF** capacity.
+
+**Combined:** **`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`**
+
+**`TROW*` mirror (row-axis output fibers, `fiber_id = r`):** **`#W_trow = max(⌈R / N_acc⌉, ⌈R / (N_tree · f)⌉, ⌈R / N_run⌉)`**.
+
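+> **(v2 note: through §3.3 `is_transpose` and the §5.6 microcode restructuring, VEC-4K-v2 converts `TCOL*` into a `TROW*` shape at element level, which in the whole-row form completely eliminates v1's strip replay and multi-epoch `#W`. The `#W` formula still applies to cases that need no transpose and to narrow-tile blocking.)**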
+
+#### A.5.4 Axis Expand / Broadcast — v1 §5.4
+
+**Representative:** `TROWEXPAND*`, `TCOLEXPAND*`.
+
+**Scheduling:** Expands use the same **`fiber_id`** convention: **`r`** or **`c`** selects **`v[fiber_id]`**; the **calendar** interleaves reads of the **narrow `v` tile** with **`src`** strips so each **512 B** write sees correct **splat** metadata per lane.
+
+**Row expand** (broadcast scalar along row): after computing or loading **per-row scalar** `v[r]`, for each strip determine row segments and **broadcast** `v[r]` across lane positions (SIMD **splat**).
+
+**Column expand:** same **no-transpose-scratch** rule as **`TCOL*`**: **row-major** strip walk + **multi-pass splat** / **lane scatter** with **column address generation**.
+
+**Cycles:** often **1× read** of narrow **per-row/col vector tile** + **1× read** of `src` + **streamed write** of `dst` → on the order of **8–16** cycles.
+
+#### A.5.5 Complex Instructions — v1 §5.5
+
+##### A.5.5.1 `TSORT32` — v1 §5.5.1
+
+Spec: sort **each 32-element block** of `src` with paired indices from `idx`.
+
+- **Within-strip:** if `32·E ≤ 512 B`, multiple blocks per strip; process blocks **in parallel SIMD sort networks** (bitonic / odd-even) of depth O(log² 32) comparators **pipelined**.
+- **Cross-block:** independent per block → **minimal cross-strip** except when a 32-block spans strip boundary → **microcode** stitches **tail/head** in a **staging register**.
+
+**Cycles:** **many** (tens), dominated by comparator stages × number of blocks `N/32`.
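+
+To make the comparator-stage count concrete, here is a software sketch of a bitonic network applied per 32-element block with paired indices. It assumes ascending order and sequentially emulates compare-exchanges that hardware would perform in parallel; the function names are hypothetical and nothing here describes the actual v1 RTL:
+
+```python
+def bitonic_sort_block(values: list, indices: list) -> tuple:
+    """Sort one power-of-two block ascending; returns (values, indices, stages)."""
+    n = len(values)
+    assert n > 0 and (n & (n - 1)) == 0, "block length must be a power of two"
+    v, ix = list(values), list(indices)
+    stages = 0
+    k = 2
+    while k <= n:                      # size of bitonic sequences being merged
+        j = k // 2
+        while j >= 1:                  # compare-exchange distance
+            for i in range(n):
+                partner = i ^ j
+                if partner > i:
+                    ascending = (i & k) == 0
+                    swap = v[i] > v[partner] if ascending else v[i] < v[partner]
+                    if swap:           # move value and its paired index together
+                        v[i], v[partner] = v[partner], v[i]
+                        ix[i], ix[partner] = ix[partner], ix[i]
+            stages += 1                # one parallel comparator stage
+            j //= 2
+        k *= 2
+    return v, ix, stages
+
+
+def tsort32(src: list, idx: list, block: int = 32) -> tuple:
+    """Sort each `block`-element run of src independently, carrying idx along."""
+    assert len(src) == len(idx) and len(src) % block == 0
+    out_v, out_i = [], []
+    for b in range(0, len(src), block):
+        v, ix, _ = bitonic_sort_block(src[b:b + block], idx[b:b + block])
+        out_v += v
+        out_i += ix
+    return out_v, out_i
+```
+
+For a 32-element block this network has `5 × 6 / 2 = 15` comparator stages, which is the `O(log² 32)` depth referred to above.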
+
+> **(v2 note: §7.5.3 `TMRGSORT` replaces v1's hard-wired `TSORT32` and `TMRGSORT` networks with a single reconfigurable 256-lane shuffle+CAS primitive that supports any `N = 2^p`, including the v1-compatible 32-element-block path.)**
+
+##### A.5.5.2 `TMRGSORT` (merge sort of multiple sorted lists) — v1 §5.5.2
+
+Typically **multi-list merge** with **k-way** comparator tree:
+
+1. **Load** list headers / pointers (implementation-defined in ISA).
+2. **Stream** strips from each list into **merge front buffers** (read ports time-multiplexed across lists).
+3. **Repeat:** compare **k** front elements, pick winner, push to **output strip**, refill from corresponding list.
+4. **Write** output strips via write ports.
+
+**Cross-lane / cross-strip:** **heavy**; merge **global** across lists, not SIMD-embarrassingly parallel. Expect **O(4096 / 1024) × (merge depth)** plus **compare tree** cycles — **hundreds** of cycles acceptable for a "complex" op.
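+
+The four steps above can be sketched in Python. Note the substitution: a binary heap from the standard `heapq` module stands in for the hardware k-way comparator tree, and `strip_elems` (assumed here to be 128, one FP32 strip) sets how many winners fill an output strip. The function name `kway_merge` is illustrative.
+
+```python
+import heapq
+
+
+def kway_merge(lists, strip_elems=128):
+    """Merge k sorted lists; emit output in strips of strip_elems elements."""
+    # Front buffers: one (value, list id, position) entry per non-empty list.
+    fronts = [(lst[0], k, 0) for k, lst in enumerate(lists) if lst]
+    heapq.heapify(fronts)
+    out, strip = [], []
+    while fronts:
+        value, k, pos = heapq.heappop(fronts)   # pick the winner among the fronts
+        strip.append(value)
+        if pos + 1 < len(lists[k]):             # refill from the winning list
+            heapq.heappush(fronts, (lists[k][pos + 1], k, pos + 1))
+        if len(strip) == strip_elems:           # output strip is full: write it
+            out.append(strip)
+            strip = []
+    if strip:
+        out.append(strip)                       # partial last strip
+    return out
+```
+
+Each output element depends on the current fronts of all lists, which is why the merge serializes across lists instead of splitting into independent SIMD lanes.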
+
+> **(v2 note: §7.5.3 redefines `TMRGSORT` as a single instruction that runs a microcode-driven Batcher bitonic network on one reconfigurable shuffle+CAS primitive; it supports variants that v1 lacks, such as partial sort, value+index dual retire, and narrow-format sort.)**
+
+##### A.5.5.3 `TGATHER` / `TGATHERB` / `TSCATTER` — v1 §5.5.3
+
+- **Index-driven** access: per element, **`addr = base + f(index)`**; within VEC-4K, **on-tile** gather means **cross-strip byte mux** driven by **index SIMD** (indices may come from second tile).
+- Realistic schedule: **batch** indices into **coalesced** groups that fall into **same or adjacent strips** to limit mux fanout.
+
+**Cycles:** **large variance**; worst case approaches **per-element** serialization if indices are random.
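+
+One way to picture the batching step is to bucket gather lanes by the source strip their index falls in. The sketch assumes `f(index)` is a plain element index into a row-major tile; `batch_by_strip` is a hypothetical helper, not an ISA-defined operation.
+
+```python
+from collections import defaultdict
+
+
+def batch_by_strip(indices, elem_bytes, strip_bytes=512):
+    """Group gather lanes by the source strip their index falls in.
+
+    Returns {strip_id: [(lane, offset_in_strip), ...]}.
+    """
+    per_strip = strip_bytes // elem_bytes
+    groups = defaultdict(list)
+    for lane, idx in enumerate(indices):
+        groups[idx // per_strip].append((lane, idx % per_strip))
+    return dict(groups)
+```
+
+The number of distinct keys is the number of strip reads the gather needs: indices that cluster in a few strips keep the mux fan-out small, while random indices tend toward one group per element.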
+
+##### A.5.5.4 `TCI`, `TTRI`, `TPART*` — v1 §5.5.4
+
+- **`TCI`:** **strip-parallel** index generation `base + stride` → **no cross-lane** dependency beyond broadcast of parameters.
+- **`TTRI`:** row/col counters compared to generate mask; **cross-lane** for diagonal boundary within strip only.
+- **`TPART*`:** valid-region mask intersects elementwise regions; same as A.5.1 with **predicate gating**.
+
+##### A.5.5.5 `TQUANT` / `TDEQUANT` — v1 §5.5.5
+
+Often **two-phase**: compute **scale/exp** per tile or per row (reduce), then **elementwise** scale. Combines A.5.3 + A.5.1.
+
+> **(v2 note: in v2 the scalar path is carried explicitly by `SX` / `SY`; together with the per-element bitmask `C` of §3.3c, this lets quant/dequant combine naturally with an optional calibration mask. See §7.4.)**
+
+### A.6 v1 §6 Cross-Lane / Cross-Strip Summary (baseline)
+
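+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §6. v2 §8 lists only the v2 incremental rows; this appendix reproduces the v1 baseline table so the document stays self-contained.)**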
+
+| Category | Cross-lane (within 512 B strip) | Cross-strip (among 8 strips) |
+|----------|----------------------------------|------------------------------|
+| Elementwise tile–tile | Independent lanes | None (strip order arbitrary) |
+| Tile–scalar | Independent | None |
+| Row reduce | Horizontal tree for row segment in strip | Combine partials for rows spanning strips |
+| Column reduce | **Strip read** → **VEC column mux** from row data in buffer (RF **no gather**) | **Heavy** strip **replay** / **`Acc` RMW** / multi-epoch (**no transpose scratch**) |
+| Row expand | Splat scalar across row segment | Repeat/broadcast metadata per strip |
+| Column expand | Partial splat (**column-major intent**, row-major storage) | **Heavy** multi-pass (**no transpose scratch**) |
+| `TSORT32` | Sort network per 32-block | Block spanning strip boundary |
+| `TMRGSORT` | Per-element compare in tree | **Global** merge across streams |
+| `TGATHER` | Mux selected elements | Arbitrary strip sources |
+
+> **(v2 note: §8 lists the new entries: masked elementwise, `TFMA_ACC`, `TCOL*` with `is_transpose=1`, `TGATHER` with index tile, `TSORT32` with dual output, `TMRGSORT` (v2 redesign), `TINV`, `TROWRANGE_MUL`.)**
+
+### A.7 v1 §7 Datapath Diagram — Row Reduce with Cross-Strip Combine
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §7.)**
+
+**Read sequencing** follows a **strip calendar** (§A.3). v1 §4.1 **dataflow:** **Rd0+Rd1** → **crossbar** (**1024 B**) → **(A) align/unpack/permute** → **128 slices `W_prep,i`** → **(B) 128×(ALU `W_ALU,i` → tree `W_tree,i`)** → **Acc** ping-pong → **half-select** → **Wr0+Wr1** (**`W_ALU,i` may exceed `W_tree,i`**).
+
+```mermaid
+flowchart TB
+  subgraph rf [TRegFile]
+    RD0[Rd0 512B]
+    RD1[Rd1 512B]
+  end
+  subgraph ctl [Control]
+    C[Opcode + shape]
+  end
+  subgraph xb [Crossbar]
+    XB[1024 B in]
+  end
+  subgraph modA ["(A) Align unpack permute"]
+    A["IN 1024 B OUT 128 x W_prep,i"]
+  end
+  subgraph modB ["(B) 128 groups N_group"]
+    G["Per i: slice W_prep,i to ALU to tree W_ALU,i then W_tree,i 32b typical"]
+  end
+  subgraph acc [Acc ping-pong]
+    ACCMEM["256x32bx2 RMW or bypass to DFF"]
+  end
+  subgraph out [Retire]
+    PACK[Pack]
+    WR[Wr0 Wr1 512B]
+  end
+  RD0 --> XB
+  RD1 --> XB
+  C --> XB
+  C --> A
+  C --> G
+  C --> ACCMEM
+  XB --> A
+  A --> G
+  G -->|W_tree to Acc rho| ACCMEM
+  ACCMEM -->|256 words| PACK --> WR
+```
+
+> **(v2 replacement diagram: §5.1 gives the equivalent VEC-4K-v2 dataflow diagram. The crossbar is removed and replaced by `SA`/`SB`/`SC` → a direct 3-way 512 B operand bus → stage (A) → stage (B) ALU (with Acc feedback) → Pack → Wr0/Wr1.)**
+
+### A.8 v1 §9 Legal `(format, R×C)` enumeration and axis-reduce complexity
+
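+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §9. This is the core reference table for sizing decisions in the v1 document; VEC-4K-v2 inherits these formulas and the 23 legal shapes in full, but expresses them in §5.6 / §5.7 as microcode programs rather than calendar templates.)**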
+
+This appendix **enumerates every** combination of **logical format** and **tile shape** from §2.1 and, for each, gives **paired** metrics for **row-axis** reductions (`TROWSUM`, `TROWMAX`, `TROWARGMAX`, …) and **column-axis** reductions (`TCOLSUM`, `TCOLMAX`, `TCOLARGMAX`, …). It then explains how a **single reconfigurable reduction tree** morphs with those parameters, and counts how many distinct **control shapes** appear in the table.
+
+#### A.8.1 Enumeration rules — v1 §9.1
+
+- Tile storage: **4096 bytes**, row-major, **R** and **C** powers of two.
+- **N = R·C = 4096 / E** with **E** bytes per logical element (**only two supported storage widths** in v1):
+  - **FP32:** `E = 4`, `N = 1024`, **11** shapes.
+  - **FP16** and **BF16:** `E = 2`, `N = 2048`, **12** shapes each (**24** table rows).
+
+**Master table rows:** **35**. **Unique `(E, R, C)` geometries:** **23** (FP16 and BF16 share shapes).
+
+`elem_per_strip = 512 / E` — **128** FP32 elements or **256** FP16 / BF16 elements per 512 B strip.
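+
+The row and geometry counts above follow directly from the enumeration rules. The short Python sketch below reproduces them; the names `FORMATS` and `legal_shapes` are illustrative.
+
+```python
+TILE_BYTES = 4096
+FORMATS = {"FP32": 4, "FP16": 2, "BF16": 2}   # bytes per element (v1 set)
+
+
+def legal_shapes(elem_bytes):
+    """All (R, C) with R and C powers of two and R * C * E = 4096."""
+    n = TILE_BYTES // elem_bytes
+    shapes = []
+    r = 1
+    while r <= n:
+        shapes.append((r, n // r))
+        r *= 2
+    return shapes
+
+
+# One master-table row per (format, shape); FP16 and BF16 share geometries.
+rows = [(fmt, r, c) for fmt, e in FORMATS.items() for r, c in legal_shapes(e)]
+geometries = {(FORMATS[fmt], r, c) for fmt, r, c in rows}
+
+print(len(legal_shapes(4)), len(legal_shapes(2)), len(rows), len(geometries))
+```
+
+The script prints `11 12 35 23`: 11 FP32 shapes, 12 shapes per 16-bit format, 35 master-table rows, and 23 unique `(E, R, C)` geometries.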
+
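+> **(v2 note: VEC-4K-v2 also supports FP8 (E=1) and FP4 (E=0.5). Both formats were explicitly out of scope in v1 and are introduced in v2 through the `format` metadata and stage-(B) per-lane SIMD widening. The additional FP8 / FP4 shapes simply extend the same formula family as the table below.)**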
+
+#### A.8.2 Row-axis metrics (`TROW*`) — v1 §9.2
+
+For each **row** fiber, reduce **C** elements. **Bytes per row** `row_B = 4096/R`.
+
+| Sym | Definition |
+|-----|------------|
+| **rS** | Strips per row `= ⌈row_B / 512⌉`. |
+| **rK** | Elements in one cross-lane segment: `C` if `rS = 1`, else `512/E`. |
+| **rDl** | Cross-lane depth `= max(0, ⌈log₂ rK⌉)`. |
+| **rDc** | Cross-strip depth `= max(0, ⌈log₂ rS⌉)`. |
+| **rW** | Per-strip serial work: `rDl` if `row_B ≥ 512`, else `(512/row_B)·rDl`. |
+| **rLB** | `4 + rDl + rDc` (optimistic). |
+| **rUB** | `4 + 8·rW + R·rDc` (conservative serial tree). |
+| **rAccB** / **rStgUB** | Partial state (bytes): A.8.3.1 — **`4·R`** logical running; **`4·R·rS`** staged upper bound; **physical running** **`N_run`**. |
+
+#### A.8.3 Column-axis metrics (`TCOL*`) — v1 §9.3
+
+For each **column** fiber, reduce **R** elements. **Logical bytes per column** `col_B = 4096/C = R·E`.
+
+**Formal substitution (metrics only — not a scratch layout):** Algebraically, `TCOL*` on **R×C** matches `TROW*` on a **fictitious C×R** row-major tile with the **same** **4096 B** element multiset.
+
+| Sym | Definition |
+|-----|------------|
+| **cS** | `⌈col_B / 512⌉`. |
+| **cK** | `R` if `cS = 1`, else `512/E`. |
+| **cDl** | `max(0, ⌈log₂ cK⌉)`. |
+| **cDc** | `max(0, ⌈log₂ cS⌉)`. |
+| **cW** | `cDl` if `col_B ≥ 512`, else `(512/col_B)·cDl`. |
+| **cLB** | `4 + cDl + cDc`. |
+| **cUB** | `4 + 8·cW + C·cDc` (note **`C`** column outputs, not `R`). |
+| **cAccB** / **cStgUB** | Partial state (bytes): A.8.3.1 — **`4·C`** logical running; **`4·C·cS`** staged upper bound; **physical running** **`N_run`**. |
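+
+The two metric tables can be read as one function, because the column table is the row table with the roles of `R` and `C` swapped. The sketch below is an illustrative Python transcription of the formulas in A.8.2 and A.8.3; the function name `axis_metrics` and the dictionary layout are assumptions made for this example.
+
+```python
+from math import ceil, log2
+
+
+def axis_metrics(elem_bytes, fibers, length):
+    """Metrics for reducing `length` elements in each of `fibers` fibers.
+
+    Row axis:    fibers = R, length = C  (row_B = C * E = 4096 / R).
+    Column axis: fibers = C, length = R  (col_B = R * E = 4096 / C).
+    """
+    fiber_bytes = length * elem_bytes
+    S = ceil(fiber_bytes / 512)                       # strips per fiber
+    K = length if S == 1 else 512 // elem_bytes       # cross-lane segment size
+    Dl = max(0, ceil(log2(K)))                        # cross-lane depth
+    Dc = max(0, ceil(log2(S)))                        # cross-strip depth
+    W = Dl if fiber_bytes >= 512 else (512 // fiber_bytes) * Dl
+    return {"S": S, "K": K, "Dl": Dl, "Dc": Dc, "W": W,
+            "LB": 4 + Dl + Dc, "UB": 4 + 8 * W + fibers * Dc}
+
+
+row = axis_metrics(4, fibers=8, length=128)   # FP32 8x128, TROW*
+col = axis_metrics(4, fibers=128, length=8)   # FP32 8x128, TCOL*
+```
+
+For FP32 8×128 this gives `rUB = 60` and `cUB = 388`: the 32 B columns pack 16 fibers per strip, so `cW` is 16 × 3 = 48.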
+
+##### A.8.3.1 Partial accumulator state (`TROW*` / `TCOL*`) — v1 §9.3.1
+
+**Assumption A — associative reduce (max / min / sum):** each output fiber keeps **one** **FP32-shaped** running partial.
+
+| Symbol | Formula | Meaning |
+|--------|---------|---------|
+| **rAccB** | **`4·R`** | Logical per-row state (bytes). Physical running file holds **`min(R, N_run)`** slots when **`N_run < R`**. |
+| **cAccB** | **`4·C`** | Logical per-column state (bytes). Physical **`min(C, N_run)`** when **`N_run < C`**. |
+
+**Implementation cap — `N_run = 512`:** when **`R > N_run`** or **`C > N_run`**, decode drives **`K_outer > 1` hardware loops**: each **outer** step maps **at most `N_run`** fibers to **`ρ`** and runs the **full** inner **strip** schedule; **completed** bands **retire** through **Pack → Wr0/Wr1**.
+
+**Assumption B — staged strip partials (upper bound):** microarchitecture retains **up to one FP32 partial per strip slot per fiber** before the **`⌈log₂ S⌉`** cross-strip tree drains them.
+
+| Symbol | Formula | Meaning |
+|--------|---------|---------|
+| **rStgUB** | **`4·R·rS`** | Row-axis staging upper bound (bytes). Never exceeds **8192 B (8 KiB)** over the legal v1 shapes (see A.8.6). |
+| **cStgUB** | **`4·C·cS`** | Column-axis analogue; peak **8 KiB**. |
+
+**`TROWARGMAX` / `TROWARGMIN` / `TCOLARG*`:** plan for **value + index** per fiber (e.g. **8 B** aligned entries). Scaling rule: **≈ `2 × rAccB`** / **`2 × cAccB`**.
+
+##### A.8.3.2 Accumulator organization (ping-pong DFF, `N_run = 512`) — v1 §9.3.2
+
+The **running partial** store: **two** ping-pong halves of **256 × 32 bit** each (**1024 B** / half, **2048 B** total). Each slot is a DFF word with two write modes:
+
+- **RMW accumulate:** new partial from (B) feeds an adder; second operand is feedback from same slot's DFF; sum writes back.
+- **Bypass combine (write-through):** new data from (B) muxed directly into DFF, skipping accumulate adder.
+
+**One** half accepts writes while control may select the **other** half for **Pack → Wr0+Wr1** (**512 B + 512 B** = **1024 B** = **256** words per retire phase).
+
+**Logical indexing:** **`ρ ∈ [0, N_run)`** with **`N_run = 512`**; e.g. **`ρ = h·256 + σ`** with half **`h ∈ {0,1}`** and **`σ ∈ [0,255]`**. **`fiber_id`** remaps to **`ρ`** across **Acc waves** when **`R` or `C > 512`**.
+
+**Capacity:** **`N_run = 512`** FP32-shaped partials across both halves; **`N_acc ≤ N_run`**; at most **256** distinct RMW targets per half per cycle.
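+
+The indexing rule can be made concrete with a few lines of Python. The split `ρ = h·256 + σ` is taken from the text; the band-wise mapping of `fiber_id` to Acc waves in `fiber_to_wave` is an assumption for illustration, since the text only requires that each outer step maps at most `N_run` fibers to `ρ`.
+
+```python
+from math import ceil
+
+N_RUN = 512      # running partial slots across both halves
+HALF = 256       # slots per ping-pong half
+
+
+def rho_to_slot(rho):
+    """Split rho into (half h, slot sigma) with rho = h * 256 + sigma."""
+    assert rho in range(N_RUN)
+    return divmod(rho, HALF)
+
+
+def fiber_to_wave(fiber_id):
+    """Assumed remap: consecutive bands of N_run fibers share one Acc wave."""
+    return divmod(fiber_id, N_RUN)      # (wave, rho)
+
+
+def acc_waves(fibers):
+    """Outer passes (K_outer) needed for `fibers` output fibers."""
+    return ceil(fibers / N_RUN)
+```
+
+Under this assumed remap, fiber 1500 of a 2048-row FP16 tile lands in wave 2 at `ρ = 476`, that is half 1, slot 220, and the whole tile needs `acc_waves(2048) = 4` outer passes.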
+
+**`TROWARG*` / `TCOLARG*` variant:** **64-bit** value∥index per slot → double width or sidecar index RAM (v1 §4.1 unchanged topology).
+
+> **(v2 note: the §5.5.1 `MERGE_STAGE` pair-selection network reorganizes the Acc into two parity-indexed banks to support cross-slot reduction at any fold stride; `N_run = 512`, ping-pong, and RMW vs bypass are all unchanged from v1.)**
+
+#### A.8.4 Cycle model (both axes) — v1 §9.4
+
+Both axes assume v1 §3.2 (now §A.1) unary ingest: **4 cycles** minimum to read the full tile with **two** 512 B read ports.
+
+- **Lower bound (*LB*):** ideal overlap of read, **one** wide pipelined `⌈log₂ K⌉`-stage tree, and cross-strip merge.
+- **Upper bound (*UB*):** **8** strips each pay **W** tree-stage units on **one** shared tree, plus **one cross-strip phase per output fiber** (`R` outputs for `TROW*`, `C` outputs for `TCOL*`).
+
+#### A.8.5 Reconfigurable reduction tree — how the hardware "shape" follows the table — v1 §9.5
+
+The datapath is **one logical pipeline** reused by all table rows; its **effective shape** is selected by microcode from the **`r*`** or **`c*`** fields.
+
+1. **Unpack** maps a 512 B strip to up to **256** logical lanes (FP16 / BF16) or **128** lanes (FP32) — physical SIMD may be narrower; logical tree depth is still **⌈log₂ K⌉**.
+2. **Cross-lane tree (variable fan-in K):** implement as **`D_lane = ⌈log₂ K⌉`** stages of pairwise reduce ops. **K** changes with `(format, R, C)`.
+3. **Cross-strip merger (variable S):** after each strip contributes a partial, a balanced tree of depth **`⌈log₂ S⌉`** combines partials. **S ∈ {1,2,4,8}** → at most **3** combine stages.
+4. **Temporal "stretch" (W):** when **`row_B < 512`**, multiple complete fibers land in one strip → **`W`** scales linearly with packed fiber count.
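+
+A small software model of steps 2 and 3, assuming one fiber that spans whole strips: each strip is reduced by a pairwise lane tree, then the strip partials go through a second pairwise tree. The names `tree_reduce` and `reduce_fiber` are hypothetical, and the model ignores pipelining and the Acc.
+
+```python
+def tree_reduce(lanes, op=max):
+    """Pairwise reduce; returns (result, number of stages used)."""
+    level, depth = list(lanes), 0
+    while len(level) > 1:
+        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
+        if len(level) % 2:
+            nxt.append(level[-1])      # odd lane passes through (stage bypass)
+        level, depth = nxt, depth + 1
+    return level[0], depth
+
+
+def reduce_fiber(strips, op=max):
+    """Reduce one fiber stored as a list of strips: lane tree, then strip tree."""
+    partials, d_lane = [], 0
+    for lanes in strips:
+        partial, d = tree_reduce(lanes, op)
+        partials.append(partial)
+        d_lane = max(d_lane, d)
+    result, d_cross = tree_reduce(partials, op)
+    return result, d_lane, d_cross
+```
+
+For an FP32 1×1024 row (8 strips of 128 lanes) the model reports `D_lane = 7` and `D_cross = 3`, matching `⌈log₂ 128⌉` and `⌈log₂ 8⌉`.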
+
+```mermaid
+flowchart LR
+  subgraph strip [Per 512B strip]
+    U[Unpack] --> MUX[Mux K active lanes]
+    MUX --> LT[log2 K-stage lane tree]
+    LT --> P[Partial per fiber ID]
+  end
+  P --> CS[Depth log2 S cross-strip tree]
+  CS --> ACC[Acc ping-pong 256×32b×2 RMW v1 §4.1]
+  ACC --> OUT[Reduced fiber value]
+```
+
+##### A.8.5.1 How many distinct "shapes" are needed? — v1 §9.5.1
+
+| Counting notion | Value | Meaning |
+|-----------------|------:|---------|
+| **Physical datapaths** | **1** | One reducer suffices if it supports max K = 256 (FP16 / BF16), max `D_lane` = 8, max `S` = 8 (`D_cross ≤ 3`), with per-stage bypass and programmable lane mask. |
+| **Unique `(D_lane, D_cross, W_strip)` tuples** | **22** | Distinct time-scheduling recipes for either axis, over all 23 geometries. |
+| **Unique `(S, K, D_lane, D_cross)` quartets** | **15** | Coarser strip + tree fingerprint per axis. |
+| **Unique paired `(row tuple, column tuple)`** | **23** | One pair per `(E,R,C)`; square shapes have identical row and column metrics. |
+
+So: **one** parameterized tree + cross-strip unit covers the whole table; firmware/microcode must hold **22** scheduling templates per axis, not **35** different RTL blocks.
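+
+The counts in the table can be checked by brute force over the 23 geometries. The sketch below re-derives the per-axis tuples from the A.8.2 / A.8.3 formulas; `axis_tuple` and the set names are illustrative, and the column axis is obtained by swapping `R` and `C`.
+
+```python
+from math import ceil, log2
+
+
+def axis_tuple(E, length):
+    """(S, K, D_lane, D_cross, W) for fibers of `length` elements."""
+    fiber_bytes = length * E
+    S = ceil(fiber_bytes / 512)
+    K = length if S == 1 else 512 // E
+    D_lane, D_cross = ceil(log2(K)), ceil(log2(S))
+    W = D_lane if fiber_bytes >= 512 else (512 // fiber_bytes) * D_lane
+    return S, K, D_lane, D_cross, W
+
+
+# 23 geometries: FP32 (E=4) plus the shared FP16/BF16 shapes (E=2).
+geoms = [(E, R, 4096 // (E * R))
+         for E in (4, 2) for p in range(13)
+         for R in [2 ** p] if R * E <= 4096]
+
+schedules, fingerprints, pairs = set(), set(), set()
+for E, R, C in geoms:
+    row, col = axis_tuple(E, C), axis_tuple(E, R)
+    pairs.add((row, col))
+    for S, K, D_lane, D_cross, W in (row, col):
+        schedules.add((D_lane, D_cross, W))
+        fingerprints.add((S, K, D_lane, D_cross))
+
+print(len(schedules), len(fingerprints), len(pairs))
+```
+
+Under these formulas the script prints `22 15 23`. The only schedule tuple shared between the FP32 and 16-bit families is the degenerate `(0, 0, 0)` for single-element fibers, which is why 11 + 12 collapses to 22.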
+
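+> **(v2 note: in v2 each of these 22 calendar templates corresponds to one microcode program (§5.4 / §5.7); there is still a single physical datapath, but the medium that expresses it changes from a calendar-template combinational decoder to a microcode ROM. **v2 increment:** §5.6.2 introduces the `tilelet_xpose + MERGE_STAGE` path, which eliminates the cross-group reducer network in the baseline configuration. In other words, v2 serializes the `D_cross + 1` steps of `TROW*` into a microcode loop, trading `+44%` to `+5.4×` more `TROW*` beats for the removal of the ~1.1 M gate cross-group adder network.)**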
+
+#### A.8.6 Summary by format (extrema over all legal shapes) — v1 §9.6
+
+Maxima over **both** axes are **identical** for each format family (swap **R↔C** maps extreme row cases to extreme column cases).
+
+| Format | N | # shapes | max **K** (either axis) | max **D_lane** | max **S** | max **D_cross** | min *LB* | max *LB* | max *UB* (r or c) | max **rAccB** / **cAccB** | max **rStgUB** / **cStgUB** |
+|--------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| FP32 | 1024 | 11 | 128 | 7 | 8 | 3 | 4 | 14 | 516 | 4096 | 4096 |
+| FP16 / BF16 | 2048 | 12 | 256 | 8 | 8 | 3 | 4 | 15 | 1028 | 8192 | 8192 |
+
+**Logical** peak **rAccB** / **cAccB** is **`4·R` / `4·C`** (up to **8 KiB** at **`R` or `C = 2048`**, FP16 / BF16). VEC-4K silicon: running partials = **`N_run = 512`** entries × **4 B** = **2048 B DFF**; **`R` or `C > 512`** uses **Acc waves**.
+
+#### A.8.7 Legal `(format, R×C)` enumeration (master table) — v1 §9.7
+
+**35 rows** — same master-table row count as A.8.1 (**23** distinct `(E, R, C)` geometries; FP16 vs BF16 duplicate shapes). Per-axis metrics defined in A.8.2–A.8.3; format extrema in A.8.6.
+
+| Format | E (B/elem) | N | R×C |
+|--------|------------|---|-----|
+| FP32 | 4 | 1024 | 1×1024 |
+| FP32 | 4 | 1024 | 2×512 |
+| FP32 | 4 | 1024 | 4×256 |
+| FP32 | 4 | 1024 | 8×128 |
+| FP32 | 4 | 1024 | 16×64 |
+| FP32 | 4 | 1024 | 32×32 |
+| FP32 | 4 | 1024 | 64×16 |
+| FP32 | 4 | 1024 | 128×8 |
+| FP32 | 4 | 1024 | 256×4 |
+| FP32 | 4 | 1024 | 512×2 |
+| FP32 | 4 | 1024 | 1024×1 |
+| FP16 | 2 | 2048 | 1×2048 |
+| FP16 | 2 | 2048 | 2×1024 |
+| FP16 | 2 | 2048 | 4×512 |
+| FP16 | 2 | 2048 | 8×256 |
+| FP16 | 2 | 2048 | 16×128 |
+| FP16 | 2 | 2048 | 32×64 |
+| FP16 | 2 | 2048 | 64×32 |
+| FP16 | 2 | 2048 | 128×16 |
+| FP16 | 2 | 2048 | 256×8 |
+| FP16 | 2 | 2048 | 512×4 |
+| FP16 | 2 | 2048 | 1024×2 |
+| FP16 | 2 | 2048 | 2048×1 |
+| BF16 | 2 | 2048 | 1×2048 |
+| BF16 | 2 | 2048 | 2×1024 |
+| BF16 | 2 | 2048 | 4×512 |
+| BF16 | 2 | 2048 | 8×256 |
+| BF16 | 2 | 2048 | 16×128 |
+| BF16 | 2 | 2048 | 32×64 |
+| BF16 | 2 | 2048 | 64×32 |
+| BF16 | 2 | 2048 | 128×16 |
+| BF16 | 2 | 2048 | 256×8 |
+| BF16 | 2 | 2048 | 512×4 |
+| BF16 | 2 | 2048 | 1024×2 |
+| BF16 | 2 | 2048 | 2048×1 |
+
+> **(v2 increment: VEC-4K-v2 appends FP8 (E=1, N=4096, 13 shapes from `1×4096` to `4096×1`) and FP4 (E=0.5, N=8192, 14 shapes from `1×8192` to `8192×1`) to the table, for a total of 35 + 27 = 62 master-table rows. All formulas in A.8.2 / A.8.3 still apply to the new formats; only `elem_per_strip = 512/E` rises to 512 (FP8) or 1024 (FP4).)**
+
+### A.9 v1 §8 Implementation Notes (baseline)
+
+> **(v1 → v2: content unchanged, copied in full from [`vector4k.md`](vector4k.md) §8. v2 §9 holds VEC-4K-v2's own implementation notes; this section keeps the v1 baseline for side-by-side reference.)**
+
+1. **Opcode decode** produces **control** for the v1 §4.1 **crossbar**, **(A)** **align/unpack/permute** (per-slice **`W_prep,i`**), **(B)** **128** groups (ALU **`W_ALU,i`**, tree **`W_tree,i`**, **`W_ALU,i ≥ W_tree,i`** allowed), **Acc** ping-pong addresses, per-slot **RMW** vs **bypass-to-DFF**, **Wr half-select**, and a **strip read calendar** (§A.3): per-cycle **Rd0/Rd1** targets, strip index phase, **`fiber_id`** / **Acc** side effects. Parameters include strip loop count, **`TCOL*`** wave / **`N_acc`** / **`N_tree`** / **`f`**, **`N_run`** / **`ρ` remap**, **`K_outer`**, write-side staging, splat / merge **k**, §A.8 **`r*`** / **`c*`** template id (22 families, see A.8.5.1). **`TCOL*`** **replays** over **`#W = max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)`** when **`#W > 1`**; no transpose-scratch; no RF gather.
+2. **Determinism:** PTO ops are expected to be **deterministic** at the tile level; multi-cycle internal scheduling is **invisible** if the instruction **retires atomically** from the programmer's view (barriers via **`TSYNC`** as needed).
+3. **Resource conflicts:** with only **two** read ports, **TMRGSORT** and **column-reduce** should **stall** other TRegFile clients or use **dedicated tiles** for **algorithmic** ping-pong (e.g. sort lists) — **not** for a transpose scratchpad forbidden by §A.5.3.2.
+4. **Numerics:** FP16 / BF16 reductions are evaluated with an **FP32-widened accumulator** and rounded per `TCVT` / ISA rules on retire.
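+
+The `TCOL*` replay count from note 1 is a plain maximum of three ceilings. A one-function Python sketch follows; `tcol_waves` is a hypothetical name, and the parameter values used in the comment are made up for illustration because `N_acc`, `N_tree`, and `f` are implementation parameters not fixed in this appendix.
+
+```python
+from math import ceil
+
+
+def tcol_waves(C, n_acc, n_tree, f, n_run=512):
+    """#W = max(ceil(C / N_acc), ceil(C / (N_tree * f)), ceil(C / N_run))."""
+    return max(ceil(C / n_acc), ceil(C / (n_tree * f)), ceil(C / n_run))
+
+
+# Made-up parameters: N_acc = 512, N_tree = 128, f = 2.
+waves = tcol_waves(2048, n_acc=512, n_tree=128, f=2)
+```
+
+With these made-up values a 2048-column tile needs 8 replays, because the tree term `⌈2048 / 256⌉` dominates the two accumulator terms.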
+
+> **(v2 replacement: the §9 implementation notes give the VEC-4K-v2-specific details, covering the microcode ROM, staging area budget (SRAM baseline vs. FF alternative, §9.1), strip-valid bitmap, staging transpose hardware, metadata write discipline, and more.)**
+
+---
+
+## 11. Related Documents
+
+- [`vector4k.md`](vector4k.md) — VEC-4K v1 (the direct predecessor). §3–§9 of this document inherit all compute-stage formulas and templates from v1; the verbatim copy of all v1 baseline content is in **Appendix V1** (§A.1–§A.9 above).
+- [`vector512.md`](vector512.md) — VEC-512: the 512 B-tile sibling (`S = 1`). The staging-register structure of v2 maps naturally onto the single-strip regime of VEC-512; a parallel v2 of VEC-512 would require only `SA`/`SB`/`SC`/`SOP`/`SX`/`SY` and no strip-valid bitmap.
+- [`tregfile4k.md`](tregfile4k.md) — 8R/8W tile RF with 8-cycle epoch and per-port `is_transpose` (§7). VEC-4K-v2's two-port binding uses R0/W0 (Port A) and R4/W4 (Port B), as in VEC-4K v1.
+- [`outerCube.md`](outerCube.md) — MXU / outer product engine; port count differs; not a sibling of v2.
+- [`PTOISA/README.md`](PTOISA/README.md) — authoritative PTO ISA.
+
+---
+
+## Document History
+
+| Version | Date | Notes |
+|---------|------|-------|
+| 0.18 | 2026-04-30 | **Native 3-source ternary FMA family added: `TFMA D = A·B + C`, `TFNMA`, `TLERP` (§7.6).** Operand `C` promoted to a **dual role** (mask **or** value tile) selected by a new 1-bit issue-time `c_role ∈ {MASK, VALUE}` flag in the instruction word's `funct6` field. With `c_role = VALUE`, `SC` is read as a 512 B/cy value-tilelet (same datapath as `SA` / `SB`) and routed into the existing per-lane FMA `MUX_Z` (§5.8) — so the FMA becomes `D = A·B + C` with `C` an arbitrary tile register, not the previous accumulator. **VEC-4K-v2 read-port binding expanded from 2 to 3 ports**: R0 = Port A, R4 = Port B, **R1 = Port C** (new); R1 is idle and energy-gated when no 3-source op is in flight, so v1-equivalent 2-port utilisation is preserved for purely elementwise / cube / reduce kernels. Three-port parallel fetch keeps `TFMA` at the same **8 cy fetch / ~10–12 cy end-to-end** as a binary `TADD`; rule R2 (uniform `is_transpose` per epoch) extends across all three operands — mixed-mode 3-source ops cost 16 cy (one odd-out epoch) or 24 cy (all three different — degenerate). **Motivation (from `FMA指令场景说明.md`)**: the canonical `y = γ·x̂ + β` LayerNorm / RMSNorm affine, Welford incremental update (`μ_new = δ·inv_n + μ_old`, `M2_new = δ·δ_2 + M2_old`), Welford state merge, activation polynomials (gelu, swiglu), and trigonometric polynomials (sin, cos) all need a third operand that is **not** the previous accumulator — v0.16's `TFMA_ACC D = A·B + Acc` does not apply. Without native 3-source FMA, every such kernel pays 2× throughput cost (separate `TMUL` + `TADD` instructions) plus 2 roundings instead of 1 (precision loss in FP16 / BF16 / FP8). **Hardware delta vs. 
v0.16 ≈ +6 K gate (~0.2 % of VEC-4K-v2 total area)**: ~5 K gate widening of `SC` bank-MUX to drive both 1-bit-mask and 512 B-value buses, ~1 K gate control-path widening for the `c_role` bit through Tile RAT / RS / dispatch; the FMA core itself, microcode beat machinery, and 8-port TRegFile already supported `A·B + Z` and the 3rd binding allocation. **Documentation updates**: §1 feature-list bullet 7 added; §3.1 ports table expanded to 3R; §3.3a operand-to-port table updated to show R1; §3.3c rewritten with §3.3c.1 mask-mode + §3.3c.2 value-mode + §3.3c.3 transpose-for-both-modes; §6.2 fetch-cycle table gains rows for `N_val = 3, c_role = VALUE` (uniform / mixed / all-different); §6.3 operand-to-epoch rule 6a added; §7.1 elementwise cycle table gains "Ternary-3-tile" row; **new §7.6 with full ISA semantics, microcode example, LayerNorm worked example showing 2× speedup vs. emulated TMUL+TADD, hardware-cost breakdown (§7.6.4), and interaction notes for masking / transpose / dual-retire (§7.6.5)**; §8 cross-lane summary gains `TFMA / TFNMA / TLERP` row; §10 comparison rows for "Tile operands" and "Read ports" updated to reflect the dual-role `C` and the 3rd read port. No change to compute datapath width, Acc, microcode ROM size, or any pre-existing instruction's cycle count. v1 binaries are unaffected (`c_role` defaults to `MASK`, R1 stays idle). |
+| 0.17 | 2026-04-30 | **Appendix V1 added (§A.1–§A.9): v1 baseline content inherited verbatim from [`vector4k.md`](vector4k.md), with explicit `(v1 → v2: 内容未变更, 完整复制自 v1 §X)` markers on each subsection.** The appendix consolidates the v1 sections referenced by the v2 main body but not previously copied in full: §A.1 v1 §3.2 physical strip + min-streaming-latency derivation; §A.2 v1 §4.2 lanes vs strips definitions; §A.3 v1 §4.3 fiber ID + strip read calendar; §A.4 v1 §4.4 epoch-aligned worked examples A–E; §A.5 v1 §5 instruction categories baseline (§A.5.1–§A.5.5 including `TSORT32`, `TMRGSORT` v1 forms, `TGATHER`, `TCI` / `TTRI` / `TPART*`, `TQUANT` / `TDEQUANT`); §A.6 v1 §6 cross-lane / cross-strip summary baseline table; §A.7 v1 §7 row-reduce datapath mermaid; §A.8 v1 §9 legal `(format, R×C)` enumeration with all formulas (§A.8.1 enumeration rules, §A.8.2 row-axis metrics, §A.8.3 column-axis metrics + Acc state, §A.8.4 cycle model, §A.8.5 reconfigurable reduction tree + shape count, §A.8.6 format extrema, §A.8.7 master 23-shape table); §A.9 v1 §8 implementation notes baseline. Each appendix subsection ends with a brief `(v2 注: …)` annotation pointing the reader to the v2 §X that supersedes / extends the v1 content. §1 "Self-contained reading" paragraph added at the end of the §1 introduction summarising the appendix layout and version-marker convention. No architectural, microcode, or cycle-count change — purely a documentation pass making `vector4k_v2.md` fully self-contained so that a reader does not need to consult `vector4k.md` (v1) for any design information. |
+| 0.16 | 2026-04-22 | **§10.1 and §10.2 refreshed to include the §7.5 novel-ISA hardware (`TINV`, `TROWRANGE_MUL`, `TMRGSORT`).** Three new rows added to the §10.1 block-level delta table: (a) **Reciprocal / RSQRT seed unit** (~50 K gate system-wide, per-group LUT+shift sharing the FMA mantissa path, for `TINV` pivot reciprocal in §7.5.1); (b) **Row-range predicate synthesiser** (~100 gate combinational block in the stage-(B) control path, for `TROWRANGE_MUL` in §7.5.2); (c) **Sort primitive** comparison — v1's hard-wired `TSORT32` + fixed-size `TMRGSORT` networks (~130 K gate combined) vs. v2's reconfigurable 256-lane shuffle + 128-way compare-swap primitive (~130 K gate: ~50 K shuffle butterfly + ~70 K CAS units + ~10 K direction/control), **breakeven** in silicon but covering all power-of-2 sort sizes from 32 to 8192 with value+index dual retire and free partial-sort. The Net-area small-summary table gains two new category rows (Stage (B) auxiliary units; Sort primitive) and revises totals: **v1 ~4.82 M gate**, **v2 SRAM-baseline ~3.53 M gate (~27 % smaller, was ~29 % before §7.5 additions)**, v2 FF-alternative ~4.03 M gate (~16 % smaller). All four alternative-config totals in the subsequent table re-tabulated accordingly. Takeaway gains items 7 (+50 K RECIP seed) and 8 (~0 net for the sort primitive, with qualitative flexibility upside). §10.2 routing discussion gains a new bullet for the shuffle+CAS primitive (a *second* structured log-stride butterfly, local to stage (B), activated only during `TMRGSORT` beats, inputs/outputs via existing stage (B) fabric — so **no new cross-block corridor**), a bullet noting the RECIP + predicate synth add no new cross-group wiring, and a note about the 5 b `alu_op` + 3 b `shuffle_stride` adding a handful of control wires in the microcode fan-out. 
The cross-lane-wires and congestion-classes rows in the §10.2 summary table updated to describe the two butterflies (Acc feedback + shuffle+CAS) as contained regular log-stride patterns that are **never both active on the same cycle**. §10.2 takeaway revised to reflect the ~27 % figure and to note v2 now delivers *strictly greater* functionality than v1 (adds `TINV`, `TROWRANGE_MUL`, subsumes v1's sort networks) while remaining smaller. No architectural, microcode, or externally-visible cycle-count change — a numerical-update pass bringing the cost/routing analysis in sync with v0.14 + v0.15 feature additions. |
+| 0.1 | 2026-04-22 | Initial **VEC-4K-v2** design: three tile operands (`A`/`B`/`C`), one or two tile results (`D0`/`D1`), per-read-port `is_transpose`, explicit tile-register metadata (32 b: shape.x, shape.y, format), explicit staging registers (`SA`/`SB`/`SC` tile, `SX`/`SY` scalar, `SOP` opcode/control), variable-length operand-fetch phase (§6), 512 B compute datapath width unchanged. Formats **FP32, FP16/BF16, FP8, FP4** supported via the `format` metadata field. §10 quantitative comparison vs VEC-4K v1. |
+| 0.2 | 2026-04-22 | **`is_transpose` relocated to the staging-register write path.** Because `SA`/`SB`/`SC` are flip-flop arrays (no bank-conflict constraint on the write port), the transpose is realized by a programmable byte-permutation network on the staging write (§3.3, §4.2.1) rather than by TRegFile col-mode. Consequences: (i) TRegFile-4K runs **row-mode only** for VEC-v2 reads; (ii) [`tregfile4k.md`](tregfile4k.md) §6 **rule R2** (uniform transpose per epoch) no longer constrains VEC scheduling; (iii) the `xpose_mismatch` epoch penalty in §6.2 is removed — any combination of per-operand `is_transpose` values costs the same `T_fetch`; (iv) transpose is now **element-grain** (stronger than §7 col-mode's chunk-grain delivery); (v) §5.2 loses its col-mode resolution step. §7.3 `TCOL*` path rewritten: staging-write transpose turns `TCOL*` into `TROW*` on the transposed tile for all shapes. §9 implementation notes updated with flip-flop area discussion and an option to use a simplified (no-col-mode) TRegFile-4K-v2. §10 comparison table and takeaways updated accordingly. |
+| 0.3 | 2026-04-22 | **Staging Beneš crossbar replaced by diagonal-skew sub-banks + row/col read mode.** Since each staging register absorbs only 512 B/cy (one TRegFile strip) and delivers only 512 B/cy into stage (A), a byte-Beneš is overkill. §4.2.1 redesigned: `SA`/`SB`/`SC` are 64 × 64 B flip-flop sub-banks using the [`tregfile4k.md`](tregfile4k.md) §2 diagonal skew `bank_id = 8·g + ((l + g) mod 8)`; the write side uses one 8-way 64 B rotator (same as [`tregfile4k.md`](tregfile4k.md) §3); the read side picks row-mode or col-mode per compute beat using a bank-select mux + optional output rotator (same as [`tregfile4k.md`](tregfile4k.md) §7.4). Permutation logic per staging register drops from 12 Beneš stages to **2 rotators + 1 bank-select mux**. Trade-off: transpose granularity reverts to **chunk-grain** (64 B sub-chunks, matching [`tregfile4k.md`](tregfile4k.md) §7.5), so §5.2 regains the element-level fixup path for non-aligned `W ∈ {128, 256, 1024, 2048, 4096}` regimes — but that path re-uses the stage-(A) align/unpack/permute logic that already exists for [`vector4k.md`](vector4k.md) §4.1 (A), so no new RTL block is added. §3.3, §4.2, §4.2.1, §5.2, §7.3, §9, §10 revised accordingly. `is_transpose` is now a **per-beat read-side control** (not a per-operand write-side control) — a kernel alternating row-order and column-order views of the same staged operand can flip the bit between beats at no cost. |
+| 0.4 | 2026-04-22 | **Transpose moved back to the TRegFile read port** (restoring the v0.1 approach). Rationale: [`tregfile4k.md`](tregfile4k.md) §7 already provides the chunk-grid col-mode read at full 512 B/cy and is paid for by other clients; duplicating a second transpose path inside VEC (v0.2 Beneš or v0.3 diagonal-skew + bank-select mux in `SA`/`SB`/`SC`) is redundant. §3.3 rewritten: `is_transpose` is per-read-port and **forwarded to the TRegFile read port**; VEC instantiates **no** transpose hardware. §4.2 / §4.2.1 simplified to **linear 8 × 512 B flip-flop buffers** — no diagonal skew, no write rotator, no bank-select mux, no read rotator; only an 8:1 512 B read mux + strip-valid bitmap + metadata. §5.2 keeps the col-mode element-level fixup (now driven by `(format, R, C, is_transpose)` from `SOP`), identical to the downstream fixup [`tregfile4k.md`](tregfile4k.md) §7.5 already requires. §6.2 reinstates the **`xpose_mismatch` term** ([`tregfile4k.md`](tregfile4k.md) §6 rule R2): mixed-`is_transpose` 2-op instructions now cost `N_epoch = 2` (16 cy) instead of 8 cy; 3-op instructions cost 16 cy in all binary-`is_transpose` partitions. §6.3 restored R2-aware epoch packing (ride-sharing one-odd-out with the matching pair). §6.4 / §6.5 give example calendars for uniform and mixed-mode cases. §7.3, §8, §9, §10 revised: TCOL* uses TRegFile §7 directly; "zero new transpose hardware in VEC" replaces the staging-side transpose area discussion; rule R2 is the only scheduling cost. `is_transpose` is per-operand (not per-beat) again — flipping views of the same staged tile across beats requires a re-read, reverting the v0.3 per-beat flip capability. |
+| 0.15 | 2026-04-22 | **Third novel PTO instruction added: `TMRGSORT` — full-tile mergesort / bitonic sort via a reconfigurable 256-lane shuffle + compare-swap primitive.** New §7.5.3 specifies the architecture, algorithm, microcode pseudo-code, and cycle counts. Central idea: replace v1's family of *hard-wired* sort networks (`TSORT32`, `TMRGSORT` at fixed block sizes) with **one reconfigurable primitive** — a 256-lane log-stride perfect-shuffle butterfly feeding 128 parallel value-compare-swap units — and express any Batcher-style bitonic / merge sort as a microcode layer-schedule. The primitive sits inside stage (B), activated by two new `alu_op` values `SHUFFLE_CAS_UP` / `SHUFFLE_CAS_DOWN` (added to §5.4, widening the `alu_op` field from 4 b to 5 b); an extra 3-bit `shuffle_stride` field (reusing the beat word's `xp_A`+`s_A` low bits when `alu_op` is a SHUFFLE_CAS op) selects the butterfly permutation across `{1, 2, 4, 8, 16, 32, 64, 128}`. Each CAS unit carries a 16-b index companion lane so `TMRGSORT` emits value + index in one instruction via dual retire (`D0 = values`, `D1 = indices`). Setting `mask_src = SC_mask` on a shuffle+CAS beat pins masked-out lanes to `±∞`, yielding a **free partial-sort / predicated-sort variant**. Cycle count follows the Batcher depth formula `D(N) = p(p+1)/2` layers with `⌈N/256⌉` beats per layer: **36 beats for N = 256**, **220 beats for N = 1024 FP32** (one tile), **2.9 K beats for N = 8192 FP4** (one tile). Hardware area: ~130 K gate total (≈ 50 K for the shuffle butterfly, ≈ 70 K for 128 compare-swap units, ≈ 10 K for direction-mux + control) — roughly breakeven with the combined silicon of v1's separate `TSORT32` + `TMRGSORT` networks, but supports all power-of-2 sort sizes from 32 to 8192 in one block. §1 feature list updated (item 7 now names three novel instructions), §8 cross-lane table gains a `TMRGSORT` row, §7.5.4 added as a consolidated summary table for all three §7.5 ops. No change to ALU core, Acc storage, Acc feedback MUX, staging store, or operand-fetch prologue. |
+| 0.14 | 2026-04-22 | **Two novel PTO instructions added: `TINV` (tile matrix inverse) and `TROWRANGE_MUL` (column-wise product over row sub-range).** New §7.5 specifies both ops with full algorithm, microcode pseudo-code, cycle counts, and hardware-reuse analysis. **`TINV`** supports square matrix inversion up to 32×32 FP32 / 64×64 FP8 in a single tile, up to **128×128 FP32 in 16 tiles** via multi-tile register-range encoding (2 b `log₂(num_tiles)` in the instruction word). Algorithm: in-place Gauss–Jordan with per-iteration `RECIP` + 2-NR refinement for the pivot reciprocal, row normalisation with `alu_op = MUL`, and elimination with `alu_op = FNMA` — each iteration's pivot-row and reciprocal scalar live in the Acc (parity-butterfly idle in this mode). The column-k gather in Phase Pk.3 uses per-beat `tilelet_xpose = 1` to collapse what would otherwise be 128 single-element reads into one strip. Total beats: ~2·N² · `S_row` + N · `S_col` + 3N (e.g. **~33 K beats for 128×128 FP32 ≈ 33 µs at 1 GHz**, ~100× faster than a naive CPU inverse). **`TROWRANGE_MUL`** computes `out[c] = ∏_{r ∈ [r_start, r_end)} Tsrc[r, c]` in ~10 beats total (1 INIT + ≤ 8 strip-accumulates + 1 retire). The row-range predicate is synthesised combinationally from `SX` / `SY` scalars by a ~100-gate block in the stage-(B) control path — no `SC` payload needed for the range (optional user mask still uses `SC`). `tilelet_xpose = 1` on the `SA` read converts the op to `TCOLRANGE_MUL` (column-range product into a single column) verbatim. Replaces the typical v1 emulation `exp(sum(log(...)))` (3 passes, precision loss) with one tile-resident pass. Hardware deltas: (i) `RECIP` and `RSQRT` added to the `alu_op` enum in §5.4 — implemented as a small per-group LUT+shift block sharing the FMA mantissa path (~50 K gate system-wide, amortised across 128 compute groups); (ii) a ~100-gate combinational row-range-predicate synthesiser in the stage-(B) control path. No change to ALU core, staging store, Acc storage, Acc feedback MUX, microcode ROM size (the two new programs fit in unused ROM capacity), or operand-fetch prologue. §1 feature list updated (new item 7) and §8 cross-lane/cross-strip summary gains two rows for the new ops. No externally-visible cycle-count change for any existing instruction. |
+| 0.13 | 2026-04-22 | **§10.1 gate-count recalibration + SRAM staging promoted to production baseline.** Recalibrated the cell-size ruler at the top of §10.1 with realistic modern-FinFET gate counts: **FP32 adder ≈ 1 500 NAND2** (not ~150 — that was the integer-adder figure), FP32 FMA ≈ 10 000, FP32 min/max/cmp ≈ 400, 32 b 2:1 MUX ≈ 50, FF (+ local mux) ≈ 10, 1R1W SRAM bit (small macro, peripheral-amortised) ≈ 3–4. Under the new ruler the v1 blocks whose dominant cell is a standalone FP32 adder (cross-group reducer network; Acc RMW adder) come out ~4–10× larger than previously stated (cross-group: 300 K → ~1.1 M; Acc RMW: 25 K → ~250 K); stage (B) ALU whose per-lane cell is dominated by the FMA comes out ~2× larger (~1.0 M → ~1.8 M). Importantly the **v1 – v2 delta stays robust** because both versions share stage (A) / stage (B) core / Acc storage in common — only the *removed* blocks (crossbar, cross-group net, Acc RMW adder) and the *added* blocks (staging, microcode ROM, Acc feedback MUX, wider stage (B) input MUXes) recalibrate. New totals: **v1 ~4.69 M gate**, **v2 with SRAM staging (baseline) ~3.35 M gate (~29 % smaller than v1)**, v2 with FF staging ~3.85 M gate (~18 % smaller). Four alternative-config totals tabulated: +cross-group net, +double-buffered staging, and the permutations between them. The §10.1 per-block delta table now has **three columns for v2 (SRAM baseline / FF alternative / Δ vs v1)** instead of a single v2 column. A new paragraph at the end of §10.1 answers "why is v1 Acc RMW so much smaller than stage (B) ALU at the same 128-lane count" — per-lane breakdown shows ~7× gap coming from the missing multiplier/FMA/narrow-SIMD in the accumulator path (Acc RMW only does ADD / MIN / MAX on post-widener FP32 partials). §10.2 takeaway updated to read "both smaller *and* easier to route" (v0.11 said "area-neutral but easier to route" — the area claim was too conservative). §10.2 final paragraph rewritten to discuss SRAM-baseline macro placement / MBIST / bus-channel planning instead of FF clock-tree planning; FF-alternative clock-tree guidance is still mentioned as a secondary concern. §9 item 2 (staging area budget) and §9.1 opening paragraph updated to call out SRAM as the **production baseline** and FF as an **alternative** (reversing the implicit default from v0.12, which had documented the migration but kept FF as baseline for the rest of the doc). No architectural, microcode, or externally-visible cycle-count change — this is a purely numerical-recalibration + baseline-labelling pass. |
+| 0.12 | 2026-04-22 | **Staging-store FF→SRAM migration analysis added (§9.1).** New subsection works through the functional requirements the staging payload store must satisfy (64 B-granularity col-mode addressability, three-way concurrent read across `SA`/`SB`/`SC`, same-cycle strip write + compute read, strip-valid tracking, `SC` mask-port compatibility) and maps them to an SRAM implementation. Conclusion: a **1R1W SRAM per row-group** (8 macros × 3 staging registers = 24 macros, 4 Kb × 1R1W each) is a **functional equivalent** of the v2 flip-flop baseline — per-beat `tilelet_xpose`, strip-valid bitmap, three-way concurrent read, mask unpack, and out-of-order strip write all preserved. Diagonal skew is an addressing property, unchanged. Only microarchitectural delta: one optional pipeline stage after the SRAM output (Option (b), §9.1.3), transparent to microcode, adds 1 cycle to first-beat fetch-to-retire latency but not to steady-state throughput. Area: **~0.4–0.6 M gate saved** (40–60 % of staging area, ~15–25 % of total VEC-4K-v2 area), shifting total from ~2.35 M → ~1.85 M gate single-buffered (v1 is ~2.57 M, so SRAM-staged v2 is ~28 % smaller than v1). Double-buffered v2 on SRAM stays at ~2.35 M — i.e. SRAM migration is a prerequisite that makes double-buffering affordable. Secondary wins: dynamic-power reduction on staging access (SRAM doesn't toggle every bit's CLK tree each cycle). Caveats documented: narrow-aspect `512 b × 8 × 1R1W` may not be a PDK-supplied shape (use `512 b × 16` half-empty, ~common), MBIST overhead ~1–2 %, extra pipeline stage, SRAM timing may eat more of the cycle budget. Flip-flop baseline recommended for early-architecture / FPGA prototypes; 1R1W SRAM recommended for production silicon. Metadata, strip-valid, `SX` / `SY`, `SOP` remain FF regardless. |
+| 0.11 | 2026-04-22 | **Area and routing comparison added (§10.1, §10.2).** Two new subsections quantify v2 vs. v1 at the block level. §10.1 gives an order-of-magnitude gate-count table: v2 adds ~1.0 M gate of staging flip-flop store (`SA`+`SB`+`SC` = 12 KB FF), ~15 K cells of Acc feedback MUX (§5.5.1), ~4 K cells of stage (B) value-input MUXes, ~3 K cells of staging read rotators, ~150 K gate of microcode ROM, and ~5 K gate of strip-valid FSM; v2 removes ~1.0 M gate of input crossbar, ~300 K gate of cross-group reducer network (recommended baseline, §5.8.1), and ~25 K gate of dedicated Acc RMW adder. Net: **v2 is area-neutral within ±10 %** at ~2.35 M vs. v1's ~2.57 M gate. Alternative configs (retained cross-group net: +300 K; double-buffered staging: +1.0 M) are called out. §10.2 gives a routing-difficulty comparison: v1's crossbar is the dominant global-routing hot spot (~24 K bit-granularity wire crossings through block centre, quadratic scaling with datapath width, monolithic floorplan); v2 has **no global hot spot** — the three longest wires are point-to-point 512 B buses from each staging island to stage (A), each with a single source and single sink. v2's remaining cross-lane wiring (Acc feedback butterfly, §5.5.1) is localised to the Acc bank and forms a structured log-stride shuffle with layout-friendly regular pattern. A side-by-side table covers: congestion classes, longest wire bundle, cross-lane wires, timing closure, floorplan modularity, scaling to wider datapaths (v1: O(W²); v2: O(W)), and sensitivity to process node (v2 ages better as wires become slower relative to gates). Bottom line: v2 is approximately area-neutral but **substantially easier to physically implement**, freeing typically one node's worth of frequency headroom relative to v1. No architectural or microcode changes. |
+| 0.10 | 2026-04-22 | **Accumulator feedback MUX specified: `MERGE_STAGE` pair-selection network.** New §5.5.1 pins down the physical layout and MUX tree that supports the `MERGE_STAGE` primitive across all fold strides `2^k, k ∈ 0..7`. Each Acc ping-pong plane is reorganised as a **parity-indexed 2-bank flip-flop store** (`bank(s) = s[7] ⊕ … ⊕ s[0]`; `intra(s) = s[6:0]`), which guarantees that any pair `(s, s ⊕ (1 << merge_bit))` always lands in opposite banks — eliminating bank conflicts for every fold stride. A per-lane **pair-address generator** computes `a_k = insert_bit_zero(k, merge_bit)`, `b_k = a_k ⊕ (1 << merge_bit)`; since `parity(a_k) = parity(k)` the per-lane bank routing is *static* (lane 0 → Bank 0, lane 1 → Bank 1, …), and only the intra-bank address depends on `merge_bit`. The read-side network is an **8-way 128 × 32 b shuffle** per bank (3-bit `merge_bit` select, two-level MUX tree, ≈ 0.5 × the stage-(B) ALU input layer in area) followed by a **per-lane 2:1 swap MUX** keyed on `parity(k)` that presents `(Acc[a_k], Acc[b_k])` as `(LO, HI)` to the ALU. Write-back uses the inverse of the read shuffle, one per-bank 64 × 32 b permutation. `ACCUM` / `INIT` / `READOUT` fall out as the degenerate `merge_bit = 7` case where the read shuffle collapses to identity and `LO / HI` regain their familiar `Acc[0..127] / Acc[128..255]` semantics — so no special-casing in the microcode assembler. §5.4 `acc_slot` field widened from 2 b to 4 b with dual semantics: for non-merge beats the low 2 b retain v0.9's LO/HI × ping-pong selection, for `MERGE_STAGE` beats the field encodes `merge_bit[2:0]` + an optional `merge_base_parity[3]` for partial-plane folds. §5.7.2 / §5.7.3 / §5.7.4 microcode pseudo-code updated to use `acc_slot={merge_bit=N}` notation consistent with §5.5.1. Net area change vs. v1: approximately 0 — the new MUX tree is absorbed by the simultaneous removal of the v1 dedicated Acc RMW adder (§5.8). No change to ALU, staging registers, or compute-group count. |
+| 0.9 | 2026-04-22 | **Scalar GPR register file named as a first-class source for `SX` / `SY` scalar operands.** §4.3 tag list expanded from three kinds (IMM / TILE / ACC) to four, with **Scalar GPR** added as the common case: the PTO instruction word encodes one 5-bit `sx_gpr` / `sy_gpr` index per active slot, and the scalar general-purpose register file (shared with the ordinary scalar pipeline) performs one 64 b read per slot at **issue time**, forwarding the value into `SX` / `SY` before the operand-fetch prologue (§6) begins. Instruction-word layout specified: 2-bit `sx_src ∈ {GPR, IMM, TILE, ACC}` selector + 5-bit GPR index / 16-bit immediate / tile-element pointer / ACC slot id payload (same for `sy_*`). §6.3 gains rule 7 (*Scalar operands are fetched off the TRegFile path entirely*): GPR and IMM scalars cost **0 cy** in `T_fetch` and are always valid by the first microcode beat; TILE-sourced scalars share an idle read-port cycle within a value epoch; ACC-sourced scalars read through the stage (B) → Acc feedback path and do not touch the TRegFile ports at all. §7.2 (Tile–Scalar) rewritten to make the scalar-source orthogonality explicit — cycle count unchanged at ~10 cy end-to-end regardless of scalar source kind. §1 feature 5 and §5.1 high-level block diagram updated to list the four scalar sources and to note that GPR / IMM scalars are captured at issue. No hardware changes — this clarifies the existing interface between the scalar pipeline and VEC staging, which was underspecified in prior versions. |
+| 0.8 | 2026-04-22 | **Recommended baseline: drop the cross-group reducer network; keep the intra-group tree.** §5.8 opening and §5.8.1 restructured to declare the recommended v2 configuration: (i) **retain** the per-compute-group intra-group tree (format widener — FP32 bypass, FP16 2→1, FP8 4→1, FP4 4→1 with sub-beats); (ii) **remove** the cross-group reducer network (`N_tree` trees of [`vector4k.md`](vector4k.md) §5.3.2). All `TROW*` instructions now route through `tilelet_xpose + ACCUM + MERGE_STAGE` (§5.6.2), keeping stage (B) strictly lane-aligned on every beat. §5.6.2 rewritten as the recommended path with an explicit `log₂(W/E)`-beat `MERGE_STAGE` fold phase following the `xp = 1` accumulate loop; §5.6.3 narrow-tile row-reduce discussion updated with the `K_outer` / MERGE path and a note that the scaling penalty is largest at the narrowest extreme. §5.7.2 wide row-reduce rewritten as a **13-beat** MERGE-based recommended path, with the old **9-beat** cross-group-tree version preserved verbatim as §5.7.2-alt (alternative configuration, *not* recommended). §5.7.4 narrow row-reduce (worst-case `R=64, C=16` shape) rewritten as a **49-beat** `K_outer = 8` MERGE-based recommended path, with a scaling table for wider narrow shapes (`W=128 B`: ~28 beats, `W=256 B`: ~16 beats, `W=512 B`: ~13 beats = §5.7.2), and the old **9-beat** tree version preserved as §5.7.4-alt. §5.8.1 reorganised into four named subsections — *Recommended baseline*, *Per-group width by format* (unchanged), *Strict lane-alignment guarantee*, *Alternative configuration* — with explicit quantitative penalties (+44% wide, up to 5.4× narrowest, shape-dependent in between) and the statement that **all non-`TROW*` families are unaffected** by this trade. §4.4 `N_tree_en` annotated as "tied to 0 at synthesis time in the recommended baseline". §10 comparison table row for the cross-group reducer network rewritten to reflect the recommended-removed / alternative-retained split. No microcode field renames, no staging / fetch / mask / Acc changes — this is an implementation-variant choice with full pseudo-code provided for both paths. |
+| 0.7 | 2026-04-22 | **Clarification pass on compute-group structure (no hardware change).** §5.8.1 added — explicitly documents that `N_group = 128` is format-independent (a datapath/Acc alignment constant), while each group's SIMD width `W_ALU,i` varies with format (FP32 → 1, FP16 → 2, FP8 → 4, FP4 → 4 × 2 sub-beats). Adds a per-format table for `W_ALU,i` / intra-group fold / sub-beat count / `W_tree,i`. Distinguishes the two kinds of "tree" in stage (B): (i) the **intra-group tree**, which is really a format widener (narrow-format SIMD → FP32 partial for a 32-b Acc slot) and is *always* needed; (ii) the **cross-group reducer network** (`N_tree` trees), which is only touched by `TROW*` / cross-strip-combine `TCOL*` beats and is **bypassed** on every element-wise / col-reduce-via-`tilelet_xpose` beat. States the strict-lane-alignment guarantee for element-wise beats, and notes that an implementation may drop the cross-group network entirely and route all row-reduce via `tilelet_xpose + ACCUM + MERGE_STAGE` at a cost of ~40–50% more beats for `TROW*` only (all other op families untouched). §10 comparison table gains three new rows (`N_group`, intra-group tree, cross-group network). No microcode / staging / fetch changes. |
+| 0.6 | 2026-04-22 | **Operand `C` re-interpreted as a per-element bitmask (predicate).** Previously `C` was a generic third value tile used for ternary ops like `TFMA D = A·B + C`; v0.6 fixes its semantics as a **per-element enable mask** indicating which lanes of `A` / `B` participate in the ALU op. The packed mask layout (§3.3c) carries 1 bit per logical element of the active format (FP32 → 1 b / 32-b lane, FP4 → 1 b / 4-b lane); total payload is at most 1024 B (≤ 2 strips) for any 4 KB tile. `SC` keeps the 4 KB diagonal-skew physical layout for uniformity (§4.2) but adds a **128 × 1 b lane-predicate read port** (§4.2.2) that feeds the stage (B) ALU mask input (§5.8). Instructions that do not use a mask set `SOP.has_mask = 0` (§4.4), which (a) skips the `SC` fetch entirely — saving all read-port cycles the C operand would otherwise consume — and (b) statically ties the mask MUX to `IMM_ALL_ONES` so every lane participates. When `has_mask = 1`, the mask fetch piggybacks on an idle read-port cycle within a value-tile epoch (§6.2, §6.3) — **masked and unmasked variants of the same op have identical fetch latency** in the common case. Microcode beat word (§5.4) gains `mask_src ∈ {SC_mask, IMM_ALL_ONES, IMM_FROM_SOP}` and `mask_strip`; the three value-input fields are renamed `src_A` / `src_B` / `src_Z` (the `Z` slot carries the FMA addend and typically sources `ACC_READ_*` for FMA-accumulate — not the C tile). Stage (B) (§5.8) adds a per-lane output gate `out[lane] = M[lane] ? alu_core_out[lane] : A[lane]` so `SELECT`-style and masked-`ACCUM` semantics are available on any `alu_op` with no extra hardware. §7.1 adds masked-op cycle rows; §7.4 lists `TSELECT`, masked reductions, and masked gather; §10 comparison row for tile operands rewritten to "2 value + 1 optional bitmask"; §9 updated for `SOP.has_mask` and the `SC` sizing rationale. |
+| 0.5 | 2026-04-22 | **Crossbar removed; compute datapath driven by per-beat microcode.** §5 entirely rewritten. Each compute cycle is now a **microcode beat word** in `SOP.ucode_*` (§5.4) that nominates `{src, strip_index s, tilelet_xpose xp}` per ALU operand slot plus `{alu_op, acc_op, acc_slot, wr_en_{D0,D1}, wr_strip_{D0,D1}}`. Each staging register directly drives one of three 512 B operand buses into stage (A) — no crossbar, no calendar-template decoder. **Two orthogonal transpose points** (§3.3): (a) per-operand `is_xpose_*` forwarded to TRegFile §7 (retained from v0.4), (b) per-beat `tilelet_xpose` at the staging read port (restored from v0.3; the staging register returns to the v0.3 diagonal-skew 64 × 64 B sub-bank layout with one write rotator + one read bank-select mux + one read output rotator). The staging-side `tilelet_xpose` is the **primary** transpose mechanism and costs zero scheduling (independent per operand slot per beat, unaffected by rule R2); the tregfile-side `is_xpose_*` is optional and used when one view of a tile is needed many times. §5.5 introduces `acc_op ∈ {NONE, INIT, ACCUM, MERGE_STAGE, READOUT}` for microcode-driven accumulator control. §5.6 is a new subsection deriving the axis-reduction recipes: col-reduce = straight accumulate loop with `tilelet_xpose = 0`; row-reduce = same loop with `tilelet_xpose = 1`; narrow-tile reductions (`W = C·E < 512 B`) additionally need a `log2(512/W)`-beat **final accumulator merge** phase with `acc_op = MERGE_STAGE`. §5.7 walks through four full microcode pseudo-code examples: `R=8,C=128` wide col-reduce (9 beats); `R=8,C=128` wide row-reduce via `tilelet_xpose` (9 beats); `R=64,C=16` narrow col-reduce with 3-stage merge (12 beats); `R=64,C=16` narrow row-reduce with in-strip reducer-tree (9 beats). §4.2 / §4.2.1 reverted to diagonal-skew staging; §4.4 adds `ucode_{base,len}` to `SOP`. §7.3 and §10 updated to reference the microcode programs instead of calendar templates. §9.3 updated to describe the small amount of transpose hardware inside the staging register (reused from [`tregfile4k.md`](tregfile4k.md) §7). **Accumulator ALU unified with stage (B) ALU** (§5.5, §5.8): the dedicated Acc RMW adder of [`vector4k.md`](vector4k.md) §4.1 is removed; a per-operand-slot input MUX (`{operand_pipe, ACC_READ_LO, ACC_READ_HI, SX, SY}`) plus a split-bank Acc register file (LO = slots 0..127, HI = slots 128..255, each 1R1W) let the main stage (B) ALU serve as accumulator adder, `MERGE_STAGE` cross-slot reducer (both inputs from Acc), `INIT`/`READOUT` load/store, and compute ALU on alternating beats. Any ALU op (ADD / MAX / MIN / OR / FMA) is automatically available at `ACCUM` and `MERGE_STAGE` beats — e.g. `MAX`-reduction merges for argmax, or `FMA`-accumulate with `src_C = ACC_READ_*` for GEMM epilogues — with no new hardware. |
diff --git a/designs/outerCube/vector512.md b/designs/outerCube/vector512.md
new file mode 100644
index 00000000..a848e069
--- /dev/null
+++ b/designs/outerCube/vector512.md
@@ -0,0 +1,764 @@
+# VEC-512: Vector Unit for 512 B PTO Tiles (PTO ISA Subset)
+
+## 1. Purpose and Scope
+
+This document specifies a **vector execution unit (VEC-512)** that implements the same **software-visible subset** of the PTO Tile Lib ISA as [`vector4k.md`](vector4k.md) — elementwise tile–tile ops, tile–scalar ops, axis reduce / expand, and selected **complex** instructions (**TMRGSORT**, **TSORT32**, **TGATHER**, **TCI**) — but paired with a **tile register file (TRegFile-512)** whose **tile size is 512 B** (**8× smaller** than VEC-4K).
+
+**Non-goals (this document, as in [`vector4k.md`](vector4k.md)):** matrix multiply (**TMATMUL** / **TGEMV**), global-memory **TLOAD/TSTORE**, and **comm** collectives.
+
+**Why 512 B tiles?** A 512 B tile is **exactly one physical strip** at the reference **512 B port width**. Collapsing **tile = strip** removes the **8-cycle** strip walker, the **cross-strip** merger, the **strip-pair calendar**, and the **multi-epoch** replay logic that dominate **`TCOL*`** and large-fiber reductions in **VEC-4K**. **§10** compares **datapath complexity** and **per-instruction cycle count** against VEC-4K.
+
+---
+
+## 2. Tile and Format Model
+
+### 2.1 Storage Invariant
+
+Each logical tile occupies exactly **512 bytes** in the TRegFile. The logical shape is **R × C** with **R** and **C** powers of two (and **R·C = N = 512 / E**).
+
+**Row-major** layout (same rule as VEC-4K §2.1).
+
+Let **E** be the **storage bytes per logical element**. **Only two storage widths are supported** — FP32 and FP16 / BF16. Smaller-precision formats (FP8, MXFP4, HiFP4) are explicitly **out of scope** for VEC-512, just as for VEC-4K §2.1:
+
+| Logical format | `E` (B/elem) | Elements per 512 B tile (**N = 512/E**) | Max(R, C) |
+|----------------|--------------|-----------------------------------------|-----------|
+| FP32           | 4            | **128**                                 | 128       |
+| FP16 / BF16    | 2            | **256**                                 | 256       |
+
+Internal ALU / reducer operands are widened to FP32 where required by ISA numerics; pack / unpack is narrow-to-narrow only (FP16 ↔ FP32 cast per `TCVT`). There are **no packed-nibble lanes** anywhere in the datapath.
+
+**Valid shape examples** (each `R·C = N`):
+
+- FP32: 1×128, 2×64, 4×32, 8×16, 16×8, 32×4, 64×2, 128×1 (**8** shapes).
+- FP16 / BF16: 1×256, 2×128, 4×64, 8×32, 16×16, 32×8, 64×4, 128×2, 256×1 (**9** shapes each).
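+
+A minimal Python sketch of the shape rule above, assuming only that `R` and `C` are powers of two with `R·C = 512 / E`. The helper name `valid_shapes` is hypothetical and not part of the PTO ISA. It enumerates the legal shapes and reproduces the counts listed above (8 for FP32, 9 for FP16 / BF16).
+
+```python
+def valid_shapes(elem_bytes, tile_bytes=512):
+    """Return every (R, C) with R, C powers of two and R*C = tile_bytes / elem_bytes."""
+    n = tile_bytes // elem_bytes      # N = 512 / E logical elements per tile
+    shapes = []
+    r = 1
+    while r <= n:                     # R walks 1, 2, 4, ..., N
+        shapes.append((r, n // r))    # C = N / R is also a power of two
+        r *= 2
+    return shapes
+
+
+fp32_shapes = valid_shapes(elem_bytes=4)
+fp16_shapes = valid_shapes(elem_bytes=2)
+print(len(fp32_shapes), fp32_shapes)
+print(len(fp16_shapes), fp16_shapes)
+```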
+
+### 2.2 Metadata
+
+Each issued op carries **format**, **R**, **C**, and **opcode**. Microcode derives:
+
+- `strip_count = 512 / 512 = 1` — **one** physical strip per tile. **No strip walker** needed.
+- `elem_per_strip = elem_per_tile = 512 / E`.
+- `row_B = 512 / R`, `col_B = R · E` (both **≤ 512 B**, so both axes fit in one strip).
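+
+The derivation above can be sketched in Python as follows. This is an illustration only: `TileMeta` and `derive_meta` are hypothetical names, and the check that `R·C·E = 512` is taken from the storage invariant in §2.1.
+
+```python
+from dataclasses import dataclass
+
+TILE_BYTES = 512
+STRIP_BYTES = 512
+ELEM_BYTES = {"FP32": 4, "FP16": 2, "BF16": 2}   # the two storage widths of §2.1
+
+
+@dataclass(frozen=True)
+class TileMeta:
+    strip_count: int      # physical strips per tile
+    elem_per_strip: int   # equals elements per tile because strip_count is 1
+    row_bytes: int        # row_B
+    col_bytes: int        # col_B
+
+
+def derive_meta(fmt, rows, cols):
+    """Derive the per-op quantities from (format, R, C)."""
+    e = ELEM_BYTES[fmt]
+    if rows * cols * e != TILE_BYTES:
+        raise ValueError(f"{rows}x{cols} {fmt} does not fill a {TILE_BYTES} B tile")
+    return TileMeta(
+        strip_count=TILE_BYTES // STRIP_BYTES,
+        elem_per_strip=TILE_BYTES // e,
+        row_bytes=TILE_BYTES // rows,
+        col_bytes=rows * e,
+    )
+
+
+print(derive_meta("FP32", 8, 16))
+```
+
+For the FP32 8×16 shape this gives `row_B = 64 B` and `col_B = 32 B`, both within one strip.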
+
+---
+
+## 3. TRegFile Interface (TRegFile-512)
+
+### 3.1 Ports (Design Assumption)
+
+| Direction | Width | Count | Aggregate |
+|-----------|-------|-------|-----------|
+| Read      | 512 B | 2     | **1024 B/cycle** = **2 tiles/cycle** |
+| Write     | 512 B | 2     | **1024 B/cycle** = **2 tiles/cycle** |
+
+One **read port** delivers **a full tile per cycle**. Dual reads therefore deliver **both operands of a binary elementwise op in one cycle** (vs. **8 cycles** of strip-pair reads in VEC-4K §3.2).
+
+**Read semantics:** Each read port presents the whole 512 B of a tile — **no gather**, same as VEC-4K §3.1. **`TCOL*`** still performs **column selection in VEC** (strip-buffer → column mux), but now against **one** 512 B strip, never across multiple strips of a single tile.
+
+**Minimum streaming latency:**
+
+- **Unary** (second port idle or merged): **1 cycle** to read a full tile.
+- **Binary elementwise** (`src0`, `src1`): **1 cycle** to ingest both full tiles.
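+
+As a rough check of the latency figures above, the sketch below counts operand-ingest cycles under the assumption that each operand streams through its own 512 B read port with no port merging. The function name `ingest_cycles` is hypothetical. The 4096 B VEC-4K tile size follows from the "8× smaller" statement in §1.
+
+```python
+def ingest_cycles(tile_bytes, port_bytes=512):
+    """Cycles for one read port to stream one full tile (ceiling division)."""
+    return -(-tile_bytes // port_bytes)
+
+
+vec512 = ingest_cycles(512)    # tile = strip, so one beat per operand
+vec4k = ingest_cycles(4096)    # eight 512 B strips per operand
+print(vec512, vec4k)
+```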
+
+### 3.2 Physical Strip
+
+A **strip** is a contiguous **512-byte** chunk — **identical to a tile**. The strip index **`s`** ranges over **`{0}`** only (**S = 1**). All references in §4–§5 drop strip walks and strip calendars.
+
+### 3.3 Epoch / Calendar
+
+Because **one tile = one port beat**, the TRegFile does **not** need an **8-cycle rotating calendar**. The simplest implementation uses **2R+2W SRAM banks** at **512 B width**, accepting a **new `reg_idx` every cycle** per port. No pending→active promotion is required. Port latency is a single read cycle plus pipeline alignment.
+
+**Optional high-density variant:** repurpose the [`tregfile4k.md`](tregfile4k.md) 64-bank array as **2048 × 512 B tiles** (each tile = 1 bank-group = 64 B × 8). In that mode the rotating calendar is **inverted**: a port can issue a **different `reg_idx` every cycle**, so tile throughput matches the simple variant.
+
+### 3.4 On-Chip Buffers
+
+- **A strip buffer** and **B strip buffer** (512 B each, Rd0 / Rd1). **No double-buffering required** for strip pipelining within a single tile, though double-buffering is still useful to overlap two **independent** instructions.
+- **Acc** — **64 × 32 b × 2** ping-pong (**N_run = 128**, §4.1 / §9.3.2).
+- **Scalar broadcast register**.
+
+---
+
+## 4. Vector Datapath Overview
+
+### 4.1 Block Diagram (dataflow level)
+
+Reference micro-architecture (per-cycle counts for scheduling / §5.3.2 **`N_tree`**):
+
+```text
+  ┌────────────────────────────────────────────────────────────────────┐
+  │                     TRegFile-512 (512 B tiles)                       │
+  │              Rd0 (512 B)              Rd1 (512 B)                  │
+  └──────────────────┬──────────────────────────┬──────────────────────┘
+                     │                          │
+                     └──────────┬───────────────┘
+                                │ 1024 B / cycle = 2 full tiles
+                                ▼
+  ┌────────────────────────────────────────────────────────────────────┐
+  │  Instruction opcode + shape (format, R, C, …)  ──►  CONTROL         │
+  └────────────────────────────────────────────────────────────────────┘
+                                │
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  CROSSBAR                           │
+              │  1024 B in → distribute to compute  │
+              └─────────────────┬───────────────────┘
+                                │
+              ┌─────────────────▼───────────────────────────────────────────┐
+              │  (A) ALIGN / UNPACK / PERMUTE  (control-selected)          │
+              │  IN:  1024 B / cycle from crossbar                          │
+              │  OUT: 128 slices  (slice i width = W_prep,i bits)           │
+              └─────────────────┬───────────────────────────────────────────┘
+                                │ 128 parallel slice buses
+                                ▼
+              ┌─────────────────▼───────────────────────────────────────────┐
+              │  (B) N_group = 128  INDEPENDENT COMPUTE GROUPS  i = 0…127   │
+              │  ┌─────────────────────────────────────────────────────┐   │
+              │  │  Group i (representative):                          │   │
+              │  │    IN:   W_prep,i                                   │   │
+              │  │    ┌──────────────────┐      ┌────────────────────┐  │   │
+              │  │    │ Elementwise ALU  │ ───► │ Reduction tree     │  │   │
+              │  │    │ OUT:  W_ALU,i    │      │ OUT:  W_tree,i     │  │   │
+              │  │    └──────────────────┘      └─────────┬──────────┘  │   │
+              │  │    Tree depth D_lane,i ≤ ⌈log₂ 256⌉ = 8             │   │
+              │  │    NO cross-strip stage (S = 1)                     │   │
+              │  └─────────────────────────────────────────────────────┘   │
+              │  Typical: W_tree,i = 32 b (FP32-shaped partial → Acc)        │
+              └─────────────────┬───────────────────────────────────────────┘
+                                │ 128 × W_tree,i to Acc per beat
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  ACCUMULATOR (DFF, ping-pong)       │
+              │  64 × 32 b × 2 halves  ≈ 512 B      │
+              │  per slot: DFF + optional combine    │
+              │    • RMW: adder (new ⊕ feedback DFF)  │
+              │    • BYPASS: new → DFF (no combine) │
+              │  N_run = 128 logical slots (§9.3.2) │
+              └─────────────────┬───────────────────┘
+                                │ mux: one 64-word half
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  Pack (FP32 → FP16 / BF16 cast, per TCVT) │
+              └─────────────────┬───────────────────┘
+                                │
+                    ┌───────────▼───────────┐
+                    │ Wr0 (512 B)  Wr1 (512 B) │
+                    │  = 1024 B retire / phase   │
+                    └───────────────────────────┘
+```
+
+**Flow:** the modules are the same as in **VEC-4K §4.1**, except that the **cross-strip merger** block is **removed** (there is no cross-strip dimension). **Control** still drives crossbar routing, **(A)** unpack / permute masks, **(B)** per-group **ALU** opcode and tree depth (or **bypass**), and **Acc** addressing. The dominant changes vs. VEC-4K are the **Acc** size (**512 B vs 2048 B**, **`N_run = 128` vs 512**) and the **absence** of a strip calendar.
+
+### 4.2 "Lanes" vs "Strips"
+
+- **SIMD lane**: one parallel datapath processing **one logical element** after unpack.
+- **Strip**: 512 B — **equals the entire tile**. SIMD width = `elem_per_strip = elem_per_tile`.
+- **Cross-lane** (within strip): reductions along a dimension that fits in one strip — **always** the case for VEC-512, on **either axis**.
+- **Cross-strip**: **none** within a single tile. Reduce / expand ops that combine **multiple tiles** are expressed at the **ISA** level (two instructions) — not as hardware cross-strip machinery.
+
+### 4.3 Fiber ID and strip read calendar
+
+**`fiber_id`** retains the same meaning as VEC-4K §4.3:
+
+| Opcode family | `fiber_id` | Range |
+|---------------|------------|--------|
+| **`TROW*`** (row reduce) | `r` | `0 … R−1` |
+| **`TCOL*`** (column reduce) | `c` | `0 … C−1` |
+| **`TROWEXPAND*`** | `r` | `0 … R−1` |
+| **`TCOLEXPAND*`** | `c` | `0 … C−1` |
+
+**Strip read calendar:** **collapses to a single beat per operand** — a one-row calendar in the shape of VEC-4K §4.3. Each cycle:
+
+| Field | Content |
+|-------|---------|
+| **`t`** | Cycle index in the micro-sequence |
+| **Rd0 / Rd1** | `src0` / `src1` (full 512 B tile each) or **idle** |
+| **`s`** | **`0`** (only value) |
+| **Lane → `(r, c)`** | §2.1 row-major map over the full tile |
+| **`fiber_id`s updated this cycle** | **All rows** (if `TROW*`) or **all columns** in the active band (if `TCOL*`) |
+| **Acc** | **RMW** at decoded slot for each retiring partial |
+
+There is **no notion of "strip `s` contributes to fiber `r`"** — **every fiber that exists in the tile** is visited in the **single** compute cycle (modulo `TCOL*` banding, §5.3.2 / §4.4 Example F-512).
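+
+A small Python sketch of the single-beat calendar described above: it maps each lane to `(r, c)` by the row-major rule of §2.1 and lists the `fiber_id` values touched in the one compute cycle. The names `lane_to_rc` and `fibers_in_beat` are hypothetical, and `TCOL*` banding (§5.3.2) is deliberately ignored here.
+
+```python
+ROW_FAMILIES = ("TROW", "TROWEXPAND")   # fiber_id = r
+COL_FAMILIES = ("TCOL", "TCOLEXPAND")   # fiber_id = c
+
+
+def lane_to_rc(lane, cols):
+    """Row-major map: lane index -> (row, column)."""
+    return divmod(lane, cols)
+
+
+def fibers_in_beat(family, rows, cols):
+    """fiber_id values visited when the whole tile is processed in one beat."""
+    if family in ROW_FAMILIES:
+        axis = 0
+    elif family in COL_FAMILIES:
+        axis = 1
+    else:
+        raise ValueError(f"unknown opcode family: {family}")
+    return sorted({lane_to_rc(lane, cols)[axis] for lane in range(rows * cols)})
+
+
+print(lane_to_rc(17, 16))               # lane 17 of an 8x16 tile
+print(fibers_in_beat("TROW", 8, 16))    # every row fiber in one beat
+```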
+
+### 4.4 Epoch-aligned fiber calendars vs TRegFile-512 (four worked examples)
+
+TRegFile-512 does **not** carry an 8-cycle calendar; **there is no `e`**. Each read port delivers a full tile on cycle **`t`**. The tables below therefore collapse to a **single** operand-ingest cycle followed by **compute / retire** stages (the number of retire cycles depends on **`#W`** for `TCOL*`; **§5.3.2**).
+
+**Port binding for the tables:**
+
+| Logical name | Read port | Delivered per cycle |
+|--------------|-----------|---------------------|
+| **Port A**   | **Rd0**   | one **512 B** tile  |
+| **Port B**   | **Rd1**   | one **512 B** tile  |
+
+**Column shorthand:**
+
+| Column | Meaning |
+|--------|---------|
+| **`t`** | Core cycle from op start. |
+| **Port A / B** | **`T`** = tile `reg_idx` T (full 512 B), **`—`** = port idle. |
+| **Fibers (this beat)** | `fiber_id` values touched this cycle. |
+| **#elem** | Logical elements per fiber contributing this cycle. |
+| **Reduce / expand** | Arithmetic summary. |
+
+---
+
+#### Example A-512 — `TROWSUM`, **FP32**, **8×16** (`C = 16`, single tile)
+
+**Geometry:** `row_B = 512/8 = 64 B`, **8** rows per tile, **16** FP32 elements per row.
+
+| `t` | Port A | Port B | Fibers | #elem | Reduce |
+|----:|--------|--------|--------|------:|--------|
+| 0 | `src` | — | `r = 0…7` | 16 each | **8** lane-trees (`K = 16`, `D_lane = 4`) → **Acc** `r = 0…7` |
+
+**Unique ingest complete at `t = 0`**. Compare: VEC-4K Example B (32×32) needs **8 cycles** of strip reads.
+
+---
+
+#### Example B-512 — `TROWSUM`, **FP32**, **32×4** (`C = 4`, narrow row)
+
+**Geometry:** `row_B = 16 B`, **32** rows per tile, **4** elements per row.
+
+| `t` | Port A | Port B | Fibers | #elem | Reduce |
+|----:|--------|--------|--------|------:|--------|
+| 0 | `src` | — | `r = 0…31` | 4 each | **32** lane-trees (`K = 4`, `D_lane = 2`) → **Acc** `r = 0…31` |
+
+**All 32** output fibers finalized in one beat. Compare: equivalent VEC-4K `TROWSUM` on 32×32 uses **8 strip cycles** + per-strip horizontal trees.
+
+---
+
+#### Example C-512 — `TROWEXPANDADD`, **FP32**, **8×16**
+
+**`v` tile:** 8 FP32 row scalars = **32 B** at byte offset 0. It is loaded once, either in a pre-cycle or on Port B in the same cycle in which Port A reads `src`.
+
+| `t` | Port A | Port B | Fibers | #elem | `v[fiber_id]` | Expand |
+|----:|--------|--------|--------|------:|---------------|--------|
+| −1 *(pre)* | — | `v@0` | — | — | latch `v[0…7]` | prefetch |
+| 0 | `src` | — | `r = 0…7` | 16 each | latched | **`src + v[r]`** across all 128 lanes |
+
+Alternative **zero-prefetch** schedule: Port A = `src`, Port B = `v` in **the same** cycle — the unpack stage (A) extracts `v[r]` from Port B's 32 B window while Port A feeds lanes; **1 cycle** total. VEC-4K (Example C) needs a **pre-epoch** beat + **4 cycles** of strip-pair ingest.
+
+---
+
+#### Example D-512 — `TROWSUM`, **FP16**, **32×8**
+
+**Geometry:** `row_B = 8 · 2 = 16 B` = **8** FP16 elements per row, **32** rows per tile (`N = 256`). `K = 8`, `D_lane = 3`.
+
+| `t` | Port A | Port B | Fibers | #elem | Reduce |
+|----:|--------|--------|--------|------:|--------|
+| 0 | `src` | — | `r = 0…31` | 8 each | **32** lane-trees (`K = 8`, `D_lane = 3`) → **Acc** `r = 0…31` |
+
+Every fiber retired in **one** beat (contrast VEC-4K Example D which, even after the 128×16 FP16 rewrite, still takes **4** strip-pair cycles across dual ports). **BF16 32×8** behaves identically (same `E`, same byte layout).
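+
+The **Geometry** lines of the `TROWSUM` examples above (A-512, B-512, D-512) all follow the same rule. A minimal sketch, assuming a hypothetical helper `row_geometry` (the name and signature are illustrative, not part of the ISA or any tool):
+
+```python
+TILE_BYTES = 512  # VEC-512 tile size used throughout this document
+
+
+def row_geometry(elem_bytes, rows, cols):
+    """Return (row_B, K, D_lane) for a TROW* op on a rows x cols tile."""
+    assert rows * cols * elem_bytes == TILE_BYTES, "shape must fill the tile"
+    row_b = TILE_BYTES // rows      # bytes per row
+    k = cols                        # elements reduced per row fiber
+    d_lane = (k - 1).bit_length()   # exact ceil(log2(K)) for K >= 1
+    return row_b, k, d_lane
+
+
+examples = {
+    "A-512": row_geometry(4, 8, 16),   # FP32 8x16
+    "B-512": row_geometry(4, 32, 4),   # FP32 32x4
+    "D-512": row_geometry(2, 32, 8),   # FP16 32x8
+}
+for name, geom in examples.items():
+    print(name, geom)
+```
+
+The printed tuples are `(row_B, K, D_lane)` and should match the values quoted in each example's geometry line and table row.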
+
+---
+
+#### Example E-512 — `TCOLSUM`, **FP16**, **8×32** (`R = 8`, `C = 32`, native row-major, single-wave)
+
+**Geometry:** `row_B = 32 · 2 = 64 B` = **32** FP16 elements/row; **8** rows per tile. Column reduce produces **32** fibers.
+
+**Hardware parallelism (inherits §5.3.2 symbols):**
+- **`N_tree = 128`** trees, **`N_acc = 128`** Acc slots, **`N_run = 128`**, **`P_beat = 128`**.
+- **`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`**. With `C = 32 ≤ N_run`, `f ≥ 1`: **`#W = 1`** — **one** pass suffices.
+
+| `t` | Port A | Rows | `fiber_id` band | Samples/c this `t` | Acc |
+|----:|--------|------|------------------|-------------------:|-----|
+| 0 | `src` | **0–7** | `c ∈ [0, 31]` (all) | **8 per c** | `Acc[c] += sum_r M[r,c]` over all 8 rows |
+
+A **single wave** (1 ingest beat, the pipelined column tree, 1 retire beat) covers the whole `TCOLSUM`. Compare VEC-4K Example E (FP16 16×128, `#W = 2`): **2** waves × **4** strip-pair cycles = **≥ 8** beats, plus epoch turnarounds.
+
+---
+
+#### Example F-512 — `TCOLSUM`, **FP16**, **1×256** (extreme: `C = 256 > N_run`)
+
+**Geometry:** `row_B = 256 · 2 = 512 B` = 256 FP16 elements/row; **1** row per tile. **`C = 256 > N_run = 128`** ⇒ `TCOL*` needs **`#W = ⌈256 / 128⌉ = 2`** waves — the **only** wave-generating shape in the FP32 + FP16 / BF16 set on VEC-512. (With `R = 1`, the column reduce is degenerate — one addend per fiber — so the Acc RMW is essentially a straight copy, but it still exercises the wave-splitter in control.)
+
+**Calendar per wave `k` (column band `c ∈ [128·k, 128·k + 127]`):**
+
+| `t` | Port A | Rows | `fiber_id` band | Samples/c this `t` | Acc |
+|----:|--------|------|------------------|-------------------:|-----|
+| 0 | `src` | **0** | `[128·k, 128·k + 127]` | **1 per c** | `Acc[c] ← M[0, c]` for that band (R = 1 ⇒ direct write) |
+| 1 | — (or next wave) | — | — | — | retire band-k to Wr0 / Wr1 |
+
+**Total:** **`#W × 2 = 4`** cycles. This is the **worst-case** `TCOL*` latency under the simplified format set; removing FP8 / FP4 caps `N` at **256** and therefore `#W` at **2**. **BF16 1×256** behaves identically.
+
+---
+
+## 5. Instruction Categories and Cycle Sketches
+
+Same operand-count conventions as VEC-4K §5; **S = 1** everywhere.
+
+### 5.1 Elementwise (Tile–Tile)
+
+**Representative:** `TADD`, `TMUL`, `TAND`, `TCMP`, `TCVT`.
+
+**Dataflow (non-pipelined sketch):**
+
+1. `read_pair` loads `src0` and `src1` in **one cycle** → unpack → SIMD op → pack into `dst` staging.
+2. Retire `dst` on **the next** cycle via `write_pair` (or through same-cycle pass-through on simpler pipelines).
+
+**Latency (typical):** **~3 cycles** (read → compute → write), vs. **~10–12 cycles** on VEC-4K. Pipelined throughput: **1 binary op / cycle**.
+
+**Cross-lane:** none. **Cross-strip:** **N/A** (S = 1).
+
+```mermaid
+flowchart LR
+  subgraph per_tile [Per tile]
+    R0[Read src0 512B] --> U0[Unpack]
+    R1[Read src1 512B] --> U1[Unpack]
+    U0 --> ALU[Elementwise ALU]
+    U1 --> ALU
+    ALU --> P[Pack]
+    P --> W[Write dst 512B]
+  end
+```
+
+### 5.2 Tile–Scalar / Tile–Immediate
+
+**Representative:** `TADDS`, `TMULS`, `TCMPS`, `TEXPANDS`.
+
+- Scalar loaded into broadcast register once; each cycle applies `lane_i = f(tile_i, scalar)`.
+- **Latency:** **~2–3 cycles** (unary read → compute → retire; §10.2 counts all three stages for `TADDS`). Throughput: **1 op / cycle**.
+
+### 5.3 Axis Reduce
+
+**Key geometric fact (new):** for a 512 B tile, **every row and every column fits inside one strip**, so **`rS = cS = 1`** for all shapes. There is **no cross-strip merger**, **no staged partials across strips**, and **no partial-state SRAM beyond Acc**.
+
+#### 5.3.1 Row-wise reduce (e.g. `TROWSUM`)
+
+For each **row `r`**, compute `acc[r] = reduce_c M[r, c]`.
+
+**Phase A (only phase):** after unpack, a horizontal SIMD tree of depth **`D_lane = ⌈log₂ C⌉`** reduces the row segment directly into **Acc[`r`]**. Every FP32 shape has `R ≤ 128 = N_run`; every FP16 / BF16 shape has `R ≤ 256`, so only the single shape `256×1` exceeds `N_run`. **No waves** are needed for any other shape.
+
+| Phase | Action |
+|-------|--------|
+| 1 | Read tile on Rd0. |
+| 2 | Unpack → **`R`** parallel lane trees of width `C` → **`R`** Acc writes (one per row). |
+| 3 | For `dst = R×1` vector tile, **Pack** + **Wr0** emit the R-way result. |
+
+**`TROWARGMAX` / `TROWARGMIN`:** each lane tree carries **(value, col_index)**; same single-cycle structure.
+
+**When `R > N_run`:** the only legal shape that hits this branch under the simplified format set is FP16 / BF16 **`256×1`**, where `K_outer = ⌈256 / 128⌉ = 2`. Microcode re-reads the same tile on each pass, as in VEC-4K §5.3.2. **Cost: `K_outer` re-reads of a 512 B tile**, which is **one cycle each**.
+
+#### 5.3.2 Column-wise reduce (e.g. `TCOLSUM`, `TCOLMAX`)
+
+**Architectural rule (carries over from VEC-4K §5.3.2):** `TCOL*` runs on the operand's **native row-major** layout. **No transpose scratchpad**. TRegFile has **no gather** (§3.1). Column selection happens in **VEC** (strip buffer → unpack → column mux).
+
+**Parallelism symbols** (same definitions as VEC-4K §5.3.2):
+
+| Symbol | VEC-512 value |
+|--------|---------------|
+| **`N_tree`** | **128** |
+| **`N_acc`** | **128** (= `N_run`) |
+| **`N_run`** | **128** |
+| **`P_beat`** | `min(N_tree, N_acc) = 128` |
+
+**Wave count:**
+
+**`#W = max(⌈C / N_acc⌉, ⌈C / (N_tree · f)⌉, ⌈C / N_run⌉)`**
+
+Because **tile = 1 strip**, **`f ≥ 1`** is trivially achievable: a single tile read exposes **all R row samples** of every column in the band simultaneously, so a single tree invocation can sustain one commit per column per scan. For the reference values above:
+
+| `C` | `⌈C / N_acc⌉` | `⌈C / N_run⌉` | `#W` | Example legal shapes (FP32 / FP16–BF16) |
+|----:|--------------:|--------------:|-----:|-----------------------------------------|
+| ≤ 128 | 1 | 1 | **1** | All FP32 shapes (`C ≤ 128`); all FP16 / BF16 shapes with `C ≤ 128` |
+| 256 | 2 | 2 | **2** | FP16 / BF16 `1×256` only |
+
+Larger `C` values are not reachable because `N ≤ 256` under the two supported storage widths.
+
+**Within a wave, no tile replay from the RF is required** for the lane-tree pass. A multi-wave op **re-reads the same `reg_idx` once per wave** (1 cycle each), whereas on **VEC-4K** **each wave costs 4 strip-pair cycles plus an epoch turnaround**.
+
+**Row-axis mirror (`TROW*`):**
+
+**`#W_trow = max(⌈R / N_acc⌉, ⌈R / (N_tree · f)⌉, ⌈R / N_run⌉)`**
+
+Same table with `R` substituted for `C`.
+
+**Cycle lower bound:** **`#W × 1 + retire_tail`**, typically **`#W + 1…2`** cycles total.
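+
+The wave-count formula and the cycle lower bound can be checked numerically. The sketch below is a hedged model only: it assumes `f` is an integer `≥ 1`, uses the VEC-512 reference values from the symbol table as defaults, and the function names are hypothetical.
+
+```python
+def ceil_div(a, b):
+    """Integer ceiling division for positive integers."""
+    return -(-a // b)
+
+
+def wave_count(fibers, n_acc=128, n_tree=128, n_run=128, f=1):
+    """#W for `fibers` output fibers (C for TCOL*, R for TROW*)."""
+    return max(
+        ceil_div(fibers, n_acc),
+        ceil_div(fibers, n_tree * f),
+        ceil_div(fibers, n_run),
+    )
+
+
+def cycle_lower_bound(fibers, retire_tail=1):
+    """#W x 1 ingest cycle per wave, plus the retire tail."""
+    return wave_count(fibers) + retire_tail
+
+
+print(wave_count(32))                          # C = 32  (FP16 8x32)
+print(wave_count(256))                         # C = 256 (FP16 1x256)
+print(cycle_lower_bound(256, retire_tail=2))   # two waves + 2-cycle tail
+```
+
+With the defaults, every `C ≤ 128` yields one wave and only `C = 256` yields two, which is the content of the table above.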
+
+### 5.4 Axis Expand / Broadcast
+
+**Row expand** (`TROWEXPAND*`): load **per-row `v[r]`** from a narrow vector tile (or Acc after an in-place reduce); splat across column lanes; combine with `src`; pack; write.
+
+**Column expand:** same **no-transpose-scratch** rule as `TCOL*` (§5.3.2). Single-strip column addressing is **one** mux stage, not a multi-strip scatter.
+
+**Cycles:** **~2 cycles** (read + write) when `v` fits alongside `src` on dual ports, or **~3 cycles** with a prefetch beat.
+
+### 5.5 Complex Instructions
+
+#### 5.5.1 `TSORT32`
+
+- With **128** FP32 or **256** FP16 / BF16 elements in one tile, the number of **32-element blocks** per tile is **4** (FP32) or **8** (FP16 / BF16) respectively.
+- All blocks sort **in parallel** via pipelined bitonic / odd-even networks of depth O(log² 32).
+- **No cross-strip stitching** (every block is fully in-tile).
+- **Cycles:** depth-dominated (tens of cycles for sort-network pipeline); ingest is **1 cycle**.
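+
+To make the block count and the "depth-dominated" remark concrete, here is a small sketch. It assumes a textbook bitonic network, whose compare-exchange depth on `n = 2^k` inputs is `k(k+1)/2`; the actual VEC-512 pipeline depth depends on the chosen network and register placement, which this document does not fix. Function names are hypothetical.
+
+```python
+def sort_blocks_per_tile(elem_bytes, tile_bytes=512, block=32):
+    """Number of 32-element TSORT32 blocks in one tile."""
+    return (tile_bytes // elem_bytes) // block
+
+
+def bitonic_stages(n):
+    """Compare-exchange stages of a textbook bitonic sorter, n = 2**k."""
+    k = n.bit_length() - 1
+    assert 1 << k == n, "n must be a power of two"
+    return k * (k + 1) // 2
+
+
+print(sort_blocks_per_tile(4))   # FP32
+print(sort_blocks_per_tile(2))   # FP16 / BF16
+print(bitonic_stages(32))        # stages per 32-element block
+```
+
+Under that assumption a 32-element block needs 15 compare-exchange stages, which is consistent with a sort pipeline of "tens of cycles" once staging registers are added.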
+
+#### 5.5.2 `TMRGSORT`
+
+Multi-list merge. Because each list-head tile is 512 B, **list headers can be loaded in one port beat each**; the merge front has **lower fanout pressure** per cycle, but throughput per emitted output tile remains merge-depth bound.
+
+- **Cross-lane:** per-element compare in k-way tree.
+- **Cross-strip:** **none within a tile**; the global merge state spans **multiple tiles** in RF as before.
+
+#### 5.5.3 `TGATHER` / `TGATHERB` / `TSCATTER`
+
+- Index-driven byte mux within a **single 512 B** buffer → **one-level** crossbar (no cross-strip fanout).
+- **Cycles:** **~2–4 cycles** typical (1 index + 1 data + pack), vs. worst-case per-element serialization on VEC-4K.
+
+#### 5.5.4 `TCI`, `TTRI`, `TPART*`
+
+- **`TCI`:** `base + stride` index generation — **1 cycle**.
+- **`TTRI`:** row/col counter compare to mask — **1 cycle**.
+- **`TPART*`:** elementwise + predicate gating — **1–2 cycles**.
+
+#### 5.5.5 `TQUANT` / `TDEQUANT`
+
+Two-phase: (1) reduce for scale/exp (**1 cycle**, §5.3); (2) elementwise scale (**1 cycle**, §5.1). **Total ~3–4 cycles** vs. **~16–20 cycles** on VEC-4K.
+
+---
+
+## 6. Cross-Lane and Cross-Strip Summary
+
+| Category | Cross-lane (within 512 B strip = tile) | Cross-strip (within a tile) |
+|----------|-----------------------------------------|------------------------------|
+| Elementwise tile–tile | Independent lanes | **N/A** |
+| Tile–scalar | Independent | **N/A** |
+| Row reduce | Horizontal tree across `C` lanes (`D_lane = ⌈log₂ C⌉`) | **N/A** |
+| Column reduce | **VEC column mux** across `R` row samples in buffer (`D_lane = ⌈log₂ R⌉`) | **N/A** |
+| Row / column expand | Splat across row / column segment | **N/A** |
+| `TSORT32` | Sort network per 32-block, all in-tile | **N/A** |
+| `TMRGSORT` | Per-element compare in merge tree | **N/A** within a tile (cross-tile lives at ISA level) |
+| `TGATHER` | Mux selected elements within 512 B | **N/A** |
+
+**Contrast VEC-4K §6:** every **"Heavy"** cross-strip entry in VEC-4K becomes **N/A** here.
+
+---
+
+## 7. Datapath Diagram — Row Reduce (single strip)
+
+```mermaid
+flowchart TB
+  subgraph rf [TRegFile-512]
+    RD0[Rd0 512B]
+    RD1[Rd1 512B]
+  end
+  subgraph ctl [Control]
+    C[Opcode + shape]
+  end
+  subgraph xb [Crossbar]
+    XB[1024 B in]
+  end
+  subgraph modA ["(A) Align unpack permute"]
+    A["IN 1024 B OUT 128 x W_prep,i"]
+  end
+  subgraph modB ["(B) 128 groups N_group"]
+    G["Per i: slice W_prep,i to ALU to tree W_ALU,i then W_tree,i 32b typical NO cross-strip stage"]
+  end
+  subgraph acc [Acc ping-pong]
+    ACCMEM["64x32bx2 RMW or bypass to DFF"]
+  end
+  subgraph out [Retire]
+    PACK[Pack]
+    WR[Wr0 Wr1 512B]
+  end
+  RD0 --> XB
+  RD1 --> XB
+  C --> XB
+  C --> A
+  C --> G
+  C --> ACCMEM
+  XB --> A
+  A --> G
+  G -->|W_tree to Acc rho| ACCMEM
+  ACCMEM -->|64 words| PACK --> WR
+```
+
+Compare with VEC-4K §7: the **8-strip serial walk** and the **cross-strip tree** vanish, and the Acc shrinks **4×** (from **`256×32b×2`** to **`64×32b×2`**).
+
+---
+
+## 8. Implementation Notes
+
+1. **Opcode decode** produces control for the crossbar, **(A)** unpack/permute, **(B)** 128 groups, Acc ping-pong (RMW / bypass-to-DFF), Wr half-select, and a **single-beat** operand calendar (no strip walker, no epoch). Parameters: **`TCOL*`** wave count **`#W`** (§5.3.2), `K_outer` when **`max(R, C) > N_run = 128`**, splat / merge `k`, §9 `r*` / `c*` template id.
+2. **Determinism:** PTO ops retire atomically as in VEC-4K §8.
+3. **Resource conflicts:** with 2R+2W, two **independent** instructions can overlap if they do not share Acc halves or write ports. No multi-epoch ping-pong is needed within a single instruction.
+4. **Numerics:** FP16 / BF16 reductions are evaluated with an FP32-widened accumulator and rounded per `TCVT` / ISA rules on retire, same as VEC-4K §8.
+
+---
+
+## 9. Legal `(format, R×C)` enumeration and axis-reduce complexity
+
+### 9.1 Enumeration rules
+
+- Tile storage: **512 B**, row-major, **R** and **C** powers of two.
+- **N = R·C = 512 / E** (**only two supported storage widths**):
+  - **FP32:** `E = 4`, `N = 128`, **8** shapes.
+  - **FP16** and **BF16:** `E = 2`, `N = 256`, **9** shapes each (**18** rows).
+
+**Master table rows:** **26** (vs. **35** in VEC-4K §9.1). **Unique `(E, R, C)` geometries:** **17** (vs. **23**).
+
+`elem_per_strip = elem_per_tile = 512 / E` — **128** FP32 elements or **256** FP16 / BF16 elements per tile.
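+
+The enumeration rules are simple enough to replay in code. The following is a minimal sketch (the helper `legal_shapes` is hypothetical) that regenerates the master-table row count and the unique-geometry count from the rules above:
+
+```python
+TILE_BYTES = 512
+FORMATS = {"FP32": 4, "FP16": 2, "BF16": 2}  # format -> E (bytes/element)
+
+
+def legal_shapes(elem_bytes):
+    """All (R, C) with R, C powers of two and R * C = 512 / E."""
+    n = TILE_BYTES // elem_bytes
+    shapes = []
+    r = 1
+    while r <= n:
+        shapes.append((r, n // r))
+        r *= 2
+    return shapes
+
+
+rows = [(fmt, e, r, c) for fmt, e in FORMATS.items() for r, c in legal_shapes(e)]
+unique = {(e, r, c) for _, e, r, c in rows}
+
+print(len(legal_shapes(4)), len(legal_shapes(2)))  # shapes per storage width
+print(len(rows), len(unique))                      # master rows, unique (E, R, C)
+```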
+
+### 9.2 Row-axis metrics (`TROW*`)
+
+For each **row** fiber, reduce **C** elements. **Bytes per row:** `row_B = 512 / R`.
+
+| Sym | Definition | Value range |
+|-----|------------|-------------|
+| **rS** | Strips per row `= ⌈row_B / 512⌉` | **= 1 always** |
+| **rK** | Elements in lane segment: `C` (since `rS = 1`) | `1 … 256` |
+| **rDl** | Cross-lane depth `= max(0, ⌈log₂ rK⌉)` | `0 … 8` |
+| **rDc** | Cross-strip depth `= max(0, ⌈log₂ rS⌉)` | **= 0 always** |
+| **rW** | Per-strip serial work: `rDl` (since `row_B ≤ 512`) | `0 … 8` |
+| **rLB** | `1 + rDl` (ingest 1 cycle + lane tree) | `1 … 9` |
+| **rUB** | `1 + rW + R_wave_tail` (single shared tree pipelined over packed fibers) | `1 … (8 + K_outer)` |
+| **rAccB** / **rStgUB** | `4·R` / `4·R·rS = 4·R` (no strip staging), **logical** bytes | physical running slots = **`min(R, N_run = 128)`** |
+
+### 9.3 Column-axis metrics (`TCOL*`)
+
+For each **column** fiber, reduce **R** elements. **Logical bytes per column:** `col_B = R·E`.
+
+| Sym | Definition | Value range |
+|-----|------------|-------------|
+| **cS** | `⌈col_B / 512⌉` | **= 1 always** (`col_B = R·E ≤ 512`) |
+| **cK** | `R` | `1 … 256` |
+| **cDl** | `⌈log₂ cK⌉` | `0 … 8` |
+| **cDc** | `⌈log₂ cS⌉` | **= 0 always** |
+| **cW** | `cDl` | `0 … 8` |
+| **cLB** | `1 + cDl` | `1 … 9` |
+| **cUB** | `1 + cW + C_wave_tail` | `1 … (8 + #W)` |
+| **cAccB** / **cStgUB** | `4·C` / `4·C·cS = 4·C` (no strip staging), **logical** bytes | physical running slots = **`min(C, N_run = 128)`** |
+
+**Row-major hardware path:** single-beat read, lane tree, **VEC column mux** over the 512 B buffer. **No transpose scratch** (§5.3.2). **No multi-epoch replay** from the RF within a wave; multi-wave **re-reads the same `reg_idx`** across cycles when `C > N_run` (1 extra read cycle per wave).
+
+#### 9.3.1 Partial accumulator state
+
+Assumption A — associative reduce (max / min / sum):
+
+| Symbol | Formula | Meaning |
+|--------|---------|---------|
+| **rAccB** | `4·R` | Logical per-row state (bytes). Physical = `min(R, N_run = 128) × 4 B`. |
+| **cAccB** | `4·C` | Logical per-column state. Physical = `min(C, N_run = 128) × 4 B`. |
+
+Assumption B — **not applicable**: `rS = cS = 1`, so `rStgUB = rAccB`, `cStgUB = cAccB`. **No strip-staging buffer needed.**
+
+**`TROWARG*` / `TCOLARG*`:** value∥index doubles the per-slot width as in VEC-4K §9.3.1.
+
+#### 9.3.2 Accumulator organization (ping-pong DFF, `N_run = 128`)
+
+The running partial store is **two** ping-pong halves of **64 × 32 b** each (**256 B / half**, **512 B total**). Modes: **RMW accumulate** or **bypass-to-DFF** (identical semantics to VEC-4K §9.3.2, scaled to 64 words per half).
+
+| Property | VEC-512 | VEC-4K |
+|----------|---------|--------|
+| DFF slots per half | **64** | 256 |
+| Bytes per half | **256** | 1024 |
+| Total Acc storage | **512 B** | 2048 B |
+| **`N_run`** | **128** | 512 |
+| Peak logical fiber count | **256** (FP16 / BF16) | **2048** (FP16 / BF16) |
+| `N_run` vs peak | **2×** deficit (single `256×1` / `1×256` shape) | **4×** deficit (`R` or `C > 512` shapes) |
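+
+The storage figures in the table are plain arithmetic on the ping-pong geometry. A short sketch, with hypothetical helper names, that reproduces them:
+
+```python
+def acc_bytes(slots_per_half, word_bits=32, halves=2):
+    """Total ping-pong accumulator storage in bytes."""
+    return slots_per_half * (word_bits // 8) * halves
+
+
+def n_run_deficit(peak_fibers, n_run):
+    """How many passes the peak logical fiber count needs over N_run."""
+    return -(-peak_fibers // n_run)
+
+
+print(acc_bytes(64), acc_bytes(256))   # VEC-512, VEC-4K total Acc bytes
+print(n_run_deficit(256, 128))         # VEC-512: FP16 / BF16 peak vs N_run
+print(n_run_deficit(2048, 512))        # VEC-4K:  FP16 / BF16 peak vs N_run
+```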
+
+### 9.4 Cycle model (both axes)
+
+Both axes assume §3.2 ingest: **1 cycle** to read a full tile (vs. **4 cycles** in VEC-4K).
+
+- **Lower bound (LB):** `1 (read) + ⌈log₂ K⌉ (lane tree) + 1 (retire)`, i.e. **2 … 10 cycles** (`K = 1` gives 2; `K = 256` gives 10).
+- **Upper bound (UB):** `1 (read) + W (serialized tree) + #W (waves on N_run overflow) + 1 (retire)`.
+
+Compare VEC-4K §9.4: **LB = 4 + D_lane + D_cross** (= 4 + up to 11 = up to 15 cycles); **UB = 4 + 8·W + R·D_cross** which can still run into thousands for extreme FP16 shapes.
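+
+The two bounds translate directly into code. This is a sketch under the formulas stated above, with hypothetical function names; `K` is `C` for `TROW*` and `R` for `TCOL*`.
+
+```python
+def ceil_log2(k):
+    """Exact integer ceil(log2(k)) for k >= 1."""
+    return (k - 1).bit_length()
+
+
+def lower_bound(k):
+    """1 (read) + ceil(log2 K) (lane tree) + 1 (retire)."""
+    return 1 + ceil_log2(k) + 1
+
+
+def upper_bound(w, waves):
+    """1 (read) + W (serialized tree) + #W (waves) + 1 (retire)."""
+    return 1 + w + waves + 1
+
+
+# TROWSUM lower bounds (K = C) for three shapes used in §10.2.
+print(lower_bound(16))   # FP32 8x16
+print(lower_bound(4))    # FP32 32x4
+print(lower_bound(8))    # FP16 32x8
+# Upper bounds at the deepest tree of each storage width.
+print(upper_bound(7, 1))   # FP32, W = 7, one wave
+print(upper_bound(8, 2))   # FP16 / BF16, W = 8, two waves
+```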
+
+### 9.5 Reconfigurable reduction tree
+
+The tree **simplifies** relative to VEC-4K §9.5:
+
+1. **Unpack:** up to **256** logical lanes (FP16 / BF16) or **128** lanes (FP32) — all fibers are live in one cycle instead of spread over 8 strips as in VEC-4K.
+2. **Cross-lane tree (variable `K`):** `D_lane = 0 … 8` stages.
+3. **Cross-strip merger: DELETED.** No stages. Saves up to **3** compare-stage latches per group vs. VEC-4K.
+4. **Temporal stretch `W`:** still applies when a single shared tree must cover multiple packed fibers (`R > 1` rows in the tile for `TROW*`). With **`N_tree = 128`** trees in parallel, `W` only activates when `R > N_tree`, i.e. the single shape **FP16 / BF16 `256×1`**.
+
+```mermaid
+flowchart LR
+  subgraph tile [Per tile = 512B]
+    U[Unpack] --> MUX[Mux K active lanes]
+    MUX --> LT[log2 K stage lane tree]
+    LT --> P[Partial per fiber ID]
+  end
+  P --> ACC[Acc ping-pong 64x32bx2 RMW]
+  ACC --> OUT[Reduced fiber value]
+```
+
+**Cross-strip tree removed** — not drawn, not present.
+
+#### 9.5.1 How many distinct "shapes" are needed?
+
+| Counting notion | VEC-512 | VEC-4K |
+|-----------------|--------:|--------:|
+| Physical datapaths | **1** | 1 |
+| Unique `(D_lane, W_strip)` tuples (no `D_cross` axis) | **9** | 22 |
+| Unique `(K, D_lane)` doublets | **9** | 15 |
+| Unique paired (row tuple, column tuple) | **17** | 23 |
+
+The **reduction in scheduling-template count** (**9 vs 22**, roughly **2.5×** fewer) is the direct consequence of **removing the cross-strip axis** (`D_cross ∈ {0,1,2,3}` → `{0}`) **and** the **temporal-stretch** axis collapsing to `W = D_lane` (no row-spanning splits across strips). FP16 / BF16 shapes cover `D_lane = 0 … 8` and FP32 shapes cover `D_lane = 0 … 7`, both on the diagonal `W = D_lane`, so the union is **9** tuples.
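+
+The count of **9** tuples can be reproduced by walking every legal shape and collecting the tree depths on both axes. A minimal sketch, assuming the diagonal `W = D_lane` stated above (helper names are hypothetical):
+
+```python
+TILE_BYTES = 512
+
+
+def ceil_log2(k):
+    """Exact integer ceil(log2(k)) for k >= 1."""
+    return (k - 1).bit_length()
+
+
+def d_lane_values(elem_bytes):
+    """Set of D_lane values over all legal shapes, both axes."""
+    n = TILE_BYTES // elem_bytes
+    depths = set()
+    r = 1
+    while r <= n:
+        c = n // r
+        depths.add(ceil_log2(c))   # TROW*: K = C
+        depths.add(ceil_log2(r))   # TCOL*: K = R
+        r *= 2
+    return depths
+
+
+fp32 = d_lane_values(4)
+fp16 = d_lane_values(2)
+templates = {(d, d) for d in fp32 | fp16}   # (D_lane, W) on the diagonal
+
+print(sorted(fp32))
+print(sorted(fp16))
+print(len(templates))
+```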
+
+### 9.6 Summary by format (extrema over all legal shapes)
+
+| Format | N | # shapes | max **K** | max **D_lane** | max **S** | max **D_cross** | min *LB* | max *LB* | max *UB* | max **rAccB** / **cAccB** | max **rStgUB** / **cStgUB** |
+|--------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| FP32         | 128  | 8  | 128  | 7  | **1** | **0** | 1 | 8  | ≤ 9  + K_outer | 512  | 512  |
+| FP16 / BF16  | 256  | 9  | 256  | 8  | **1** | **0** | 1 | 9  | ≤ 10 + K_outer (K_outer ≤ 2 on `256×1`) | 1024 | 1024 |
+
+**Logical** peak **rAccB** / **cAccB** in the table is `4·R` / `4·C` (up to **1024 B** at `R` or `C = 256`, FP16 / BF16). **VEC-512 silicon:** `N_run = 128` × 4 B = **512 B DFF**; only the FP16 / BF16 shapes `256×1` (row axis, `K_outer = 2`) and `1×256` (column axis, `#W = 2`) exceed `N_run` and use **Acc waves** (§5.3.2 / §9.3.1). Every `*StgUB` equals `*AccB` (no strip staging).
+
+Compare VEC-4K §9.6: peak `*AccB` up to **8 KiB**, peak `*StgUB` **8 KiB**; VEC-512 caps both at **1 KiB logical / 512 B physical**.
+
+### 9.7 Legal `(format, R×C)` enumeration
+
+**26** rows (17 unique `(E, R, C)`; FP16 vs. BF16 duplicate shapes).
+
+| Format | E (B/elem) | N | R×C |
+|--------|------------|---|-----|
+| FP32 | 4   | 128  | 1×128 |
+| FP32 | 4   | 128  | 2×64  |
+| FP32 | 4   | 128  | 4×32  |
+| FP32 | 4   | 128  | 8×16  |
+| FP32 | 4   | 128  | 16×8  |
+| FP32 | 4   | 128  | 32×4  |
+| FP32 | 4   | 128  | 64×2  |
+| FP32 | 4   | 128  | 128×1 |
+| FP16 | 2   | 256  | 1×256 |
+| FP16 | 2   | 256  | 2×128 |
+| FP16 | 2   | 256  | 4×64  |
+| FP16 | 2   | 256  | 8×32  |
+| FP16 | 2   | 256  | 16×16 |
+| FP16 | 2   | 256  | 32×8  |
+| FP16 | 2   | 256  | 64×4  |
+| FP16 | 2   | 256  | 128×2 |
+| FP16 | 2   | 256  | 256×1 |
+| BF16 | 2   | 256  | 1×256 |
+| BF16 | 2   | 256  | 2×128 |
+| BF16 | 2   | 256  | 4×64  |
+| BF16 | 2   | 256  | 8×32  |
+| BF16 | 2   | 256  | 16×16 |
+| BF16 | 2   | 256  | 32×8  |
+| BF16 | 2   | 256  | 64×4  |
+| BF16 | 2   | 256  | 128×2 |
+| BF16 | 2   | 256  | 256×1 |
+
+---
+
+## 10. Comparison: VEC-512 vs VEC-4K
+
+This section compares the two designs directly: **datapath complexity** (§10.1), **per-instruction cycle counts** (§10.2), and **aggregate throughput** per unit of data (§10.3), followed by selection guidance (§10.4) and a note on intermediate tile sizes (§10.5). Metric names inherit from §9 / [`vector4k.md`](vector4k.md) §9.
+
+### 10.1 Datapath complexity
+
+Both designs share the same **reference** compute engine (crossbar → (A) align/unpack → (B) 128 groups of ALU + reduce tree → Acc ping-pong → Pack → Wr0/Wr1). The differences are **how many auxiliary structures** each design needs and **how big** each one is.
+
+| Structure | VEC-4K | VEC-512 | Net change |
+|-----------|--------|---------|------------|
+| Strips per tile (**S**) | 8 | **1** | –8× |
+| Strip walker FSM | required | **removed** | – |
+| Strip read calendar (cycle ↔ `s` mapping) | required, 8-cycle template | **removed** (1-beat ingest) | – |
+| Strip buffers A / B | **512 B × 2** (often double-buffered) | 512 B × 2 (single) | simpler control |
+| Cross-strip reduction tree | `⌈log₂ S⌉ = 3` stages, per-group | **removed** (`D_cross = 0`) | –3 stages × 128 groups |
+| Staged strip-partial SRAM (`rStgUB`/`cStgUB`) | up to **8 KiB** (FP16 / BF16 `256×8` or `8×256`) | **0 B** (equal to Acc) | –8 KiB worst |
+| Accumulator DFF (ping-pong) | **256 × 32 b × 2 = 2048 B** | **64 × 32 b × 2 = 512 B** | –4× |
+| **`N_run`** | 512 | **128** | –4× |
+| **`N_acc`** | ≤ 512 | 128 | proportional |
+| **`N_tree`** | 128 (ref) | 128 (ref) | unchanged |
+| TRegFile port width × count | 2R+2W × 512 B | 2R+2W × 512 B | unchanged |
+| TRegFile 8-cycle rotating calendar | required (tregfile4k §2) | **removed** (1 tile / port / cycle) | large simplification |
+| TRegFile per-port pending/active register | required | **removed** | – |
+| `TCOL*` tile replays `#W` (§5.3.2) | `max(⌈C/N_acc⌉, ⌈C/(N_tree·f)⌉, ⌈C/N_run⌉)` × **4 strip-pair cycles** each | `#W × 1 cycle` each | –4× per wave |
+| `TCOL*` f parameter (commits/tree/scan) | complex, RTL-measured | trivially `f ≥ 1` (single strip) | – |
+| `K_outer` outer-loop nest (§5.3.2) | triggered for `R`/`C > 512` (FP16 / BF16 `1024×2`, `2048×1`, `1×2048`, `2×1024`) | triggered for `R`/`C > 128` (FP16 / BF16 `256×1` or `1×256` only) | lower trigger threshold, but each iteration re-reads 8× less data (one 512 B tile in 1 cycle) |
+| Write-side staging across outer loops | required for merged `dst` | optional (waves typically write disjoint bands) | – |
+| Distinct scheduling templates (§9.5.1 unique tuples) | **22** | **9** | –~2.5× |
+| Master table rows in §9.7 | 35 (23 unique `(E,R,C)`) | 26 (17 unique) | –26% |
+
+**Qualitative summary:** VEC-512 collapses the **two orthogonal temporal axes** that dominate VEC-4K's control logic: (i) the **8-strip walk per tile** disappears, and (ii) the **`#W` multi-epoch tile replay** for wide column / row reductions shrinks to at most **one extra single-cycle re-read** (`#W ≤ 2`, FP16 / BF16 `1×256` and `256×1` only). What remains is essentially the **per-strip** core of VEC-4K, scaled to **¼** Acc capacity. Scheduling templates collapse from **22** to **9** because the `(D_cross, W_strip_packing)` product of axes disappears, leaving only the diagonal **`W = D_lane ∈ {0, …, 8}`** (covering FP32 `D_lane ≤ 7` and FP16 / BF16 `D_lane ≤ 8`).
+
+### 10.2 Per-instruction cycle count (typical representative shapes)
+
+Numbers below are **minimum cycle counts** under the common reference parameters (`N_tree = 128`, `N_acc = 128`, dual-port reads, pipelined tree). "VEC-4K" numbers are taken from [`vector4k.md`](vector4k.md) §4.4 / §5 / §9.
+
+**Notation:** `Cin` = ingest cycles, `Tree` = tree-stage cycles (pipelined, amortized), `Retire` = write-back cycles, **`Total`** = `Cin + Tree + Retire` when no waves, else include wave multiplier.
+
+| Op | Shape (VEC-4K / VEC-512) | VEC-4K `Cin` + `Tree` + `Retire` + waves | **VEC-4K Total** | VEC-512 `Cin` + `Tree` + `Retire` + waves | **VEC-512 Total** | Speedup |
+|----|-------|------------------------------------------|-----------------:|-------------------------------------------|------------------:|--------:|
+| `TADD` (FP32) | 32×32 / 8×16 | 8 + 1 + 1 + 0 | **10** | 1 + 1 + 1 + 0 | **3** | **3.3×** |
+| `TADD` (FP16) | 32×64 / 16×16 | 8 + 1 + 1 + 0 | **10** | 1 + 1 + 1 + 0 | **3** | **3.3×** |
+| `TABS` (FP32, unary) | 32×32 / 8×16 | 4 + 1 + 1 + 0 | **6** | 1 + 1 + 1 + 0 | **3** | **2×** |
+| `TADDS` (scalar) | any | 4 + 1 + 1 + 0 | **6** | 1 + 1 + 1 + 0 | **3** | **2×** |
+| `TROWSUM` (FP32) | 8×128 / 8×16 | 4 + 7 + R-stream ≈ **12** | **12** | 1 + 4 + 1 | **6** | **2×** |
+| `TROWSUM` (FP32) | 32×32 / 32×4 | 4 + 5 + R-stream ≈ **10** | **10** | 1 + 2 + 1 | **4** | **2.5×** |
+| `TROWSUM` (FP16) | 128×16 / 32×8 (VEC-4K Ex. D) | 4 + 4 + R-stream ≈ **10** | **10** | 1 + 3 + 1 | **5** | **2×** |
+| `TROWSUM` (FP16) | 256×8 / 32×8 (R > 128 on 4K, R ≤ 128 on 512) | 4 + 3 + **K_outer = 1** × R-stream ≈ **12** | **12** | 1 + 3 + 1 | **5** | **2.4×** |
+| `TROWSUM` (FP16) | 2048×1 / 256×1 (`R > N_run` both sides) | 4 + 0 + `K_outer ≈ 4` × R-stream ≈ **24+** | **≥ 24** | 1 + 0 + `K_outer = 2` × R-stream ≈ **5** | **~5** | **~4.8×** |
+| `TCOLSUM` (FP32) | 32×32 / 8×16 | 4 × `#W=1` + tree + retire ≈ **10** | **10** | 1 + 3 + 1 (`cDl = ⌈log₂ R⌉ = 3`) | **5** | **2×** |
+| `TCOLSUM` (FP16) | 16×128 / 8×32 (VEC-4K Ex. E) | **#W = 2** × 4 + tree + retire ≈ **14** | **14** | 1 + 3 + 1 (`cDl = ⌈log₂ R⌉ = 3`; single wave, `C = 32 ≤ N_run`) | **5** | **2.8×** |
+| `TCOLSUM` (FP16) | 8×256 / 1×256 (extreme C at each tile size) | **#W = 4** × 4 + 2 ≈ **20** | **20** | `#W = 2` × (1 + 0 + 1) ≈ **4** | **~4** | **~5×** |
+| `TROWEXPANDADD` | 8×128 / 8×16 | 1 (prefetch) + 4 + 1 ≈ **6–8** | **6–8** | 1 + 1 | **2–3** | **~3×** |
+| `TCVT` (FP16 → FP32, FP32 → BF16) | any | **8–12** | **~10** | **2–3** | **~3** | **~3×** |
+| `TSORT32` | all | **4 + ~20** (sort network pipe) | **~24** | **1 + ~20** | **~21** | **~1.15×** (sort-network-bound) |
+| `TMRGSORT` (k=4) | per emitted tile | many × 4 + k·depth | **high** | many × 1 + k·depth | **lower** | 2–4× typical |
+| `TGATHER` (random) | any | worst ≈ per-element serialization | **high** | single-strip mux, **~4** | **low** | **large** |
+| `TCI` / `TTRI` | any | 4 + 1 ≈ **5** | **5** | 1 + 1 | **2** | **~2.5×** |
+| `TQUANT` / `TDEQUANT` (FP32 ↔ FP16 / BF16) | any | **≈ 16–20** | **~18** | **3–4** | **~4** | **~4.5×** |
+
+**Takeaways:**
+
+- **Elementwise and scalar ops** (≥ 60 % of typical workload mix) run **2×–3.3× fewer** cycles per instruction. The improvement comes almost entirely from collapsing **ingest** from 4–8 strip cycles to **1**.
+- **Axis reduces** see roughly **2×–3×** speedups for typical shapes, dominated by the removal of the **cross-strip combine** phase and the trivialization of `#W`.
+- **Wide-axis extremes** now only reach `C = 2048` (FP16 `1×2048`) on VEC-4K and `C = 256` (FP16 `1×256`) on VEC-512 — with FP8 / FP4 removed, **VEC-512 never runs slower per instruction than VEC-4K** on any supported shape. The extreme column reduce row in the table demonstrates this: VEC-512's 2-wave replay still beats VEC-4K's 4-wave strip-pair walk.
+- **Complex instructions:** `TGATHER`, `TMRGSORT`, and `TQUANT` see roughly **2×–4.5×** lower per-instruction latency. `TSORT32` gains only **~1.15×** because sort-network depth, not ingest, dominates its latency.
+
+### 10.3 Aggregate throughput (per unit of data)
+
+The preceding per-instruction numbers can mislead when workloads span large operand sets. To process **4 KB of data**:
+
+| Scenario | VEC-4K | VEC-512 |
+|----------|--------|---------|
+| Instructions required | **1** tile-op | **8** tile-ops |
+| Cycles per op (typical elementwise) | ~10 | ~3 |
+| Cycles for 4 KB (serial) | **~10** | **~24** |
+| Cycles for 4 KB (pipelined, 1 op/cy throughput) | **~10** | **~8 + 3 = 11** |
+| Read port bandwidth utilized | ≥ 80 % (strip-packed) | ≤ 25 % per isolated op; up to 100 % with back-to-back independent ops |
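+
+The serial and pipelined rows of the table follow from two one-line formulas. The sketch below uses the table's own first-order pipelined estimate (issue one op per cycle, then drain one op's latency); the helper names are hypothetical and the latencies are the "typical elementwise" values quoted above.
+
+```python
+def tile_ops(data_bytes, tile_bytes):
+    """Tile-level instructions needed to cover data_bytes."""
+    return -(-data_bytes // tile_bytes)
+
+
+def serial_cycles(n_ops, latency):
+    """No overlap: every op waits for the previous one to retire."""
+    return n_ops * latency
+
+
+def pipelined_estimate(n_ops, latency):
+    """First-order estimate used in the table: issue n_ops, then drain."""
+    return n_ops + latency
+
+
+DATA = 4096                       # 4 KB of operand data
+ops_4k = tile_ops(DATA, 4096)     # VEC-4K tile-ops
+ops_512 = tile_ops(DATA, 512)     # VEC-512 tile-ops
+
+print(ops_4k, ops_512)
+print(serial_cycles(ops_4k, 10), serial_cycles(ops_512, 3))
+print(pipelined_estimate(ops_512, 3))
+```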
+
+**Rules of thumb:**
+
+- **Data-bandwidth-bound workloads** (dense elementwise over large activations) favor **VEC-4K**: **fewer instructions**, **higher sustained port utilization**, and **amortized** control overhead. A single `TADD` on 4 KB completes in ~10 cycles vs ~24 cycles of 8 separate `TADD`s on 512 B tiles, **unless** the VEC-512 scheduler can **fully pipeline** (1 op/cy) which brings VEC-512 to **~11 cycles** — competitive but **no throughput advantage**.
+- **Latency-critical, small-operand, or irregular workloads** (short sequences, attention over small K/V tiles, per-token quantization, sort / gather on small blocks) favor **VEC-512**: **faster wall-clock per instruction**, **at most one extra 1-cycle `#W` re-read**, **smaller Acc DFF** (lower leakage / area).
+- **`TCOL*`-heavy kernels** (layernorm, softmax reductions across features) benefit from VEC-512 when `C ≤ 128`; the only wave-generating shape (FP16 / BF16 `1×256`) still beats the VEC-4K equivalent because the 2 VEC-512 waves each cost 1 ingest cycle instead of 4.
+- **Programming model cost:** VEC-512 needs **8× more tile-level instructions** per unit of data compared to VEC-4K. This shifts pressure onto **dispatch / decode / scheduling**. A narrow-issue core may not sustain 1 op/cycle throughput on VEC-512.
+
+### 10.4 When to choose which
+
+| Design goal | Preferred unit | Reason |
+|-------------|----------------|--------|
+| Peak **throughput** on large dense tensors | **VEC-4K** | 8× fewer instructions, high port utilization |
+| Minimum **per-op latency** | **VEC-512** | 1-cycle ingest, no strip walk, `#W ≤ 2` with 1-cycle re-reads |
+| **Silicon area / leakage** at similar performance | **VEC-512** | 4× smaller Acc, no cross-strip tree, no calendar FSM |
+| **Worst-case `TCOL*`** on FP16 `1×256` / `2×128` shapes | **VEC-512** | `#W ≤ 2` × 1-cycle ingest beats VEC-4K `#W × 4` strip-pair cycles |
+| Fine-grain **scheduling flexibility** (many small tiles in flight) | **VEC-512** | 8× more tile addresses, 1-cycle port turnaround |
+| Simplified **RTL verification scope** | **VEC-512** | 9 templates vs 22, no multi-epoch corner cases |
+| **TRegFile** simplicity | **VEC-512** | no 8-cycle calendar, no pending/active registers |
+
+### 10.5 Hybrid / intermediate tile sizes
+
+The two designs are **endpoints** of a family parameterized by `S`. Tiles of **1 KB** (`S = 2`) or **2 KB** (`S = 4`) retain some strip-walk cost but reduce Acc and `#W` pressure relative to 4 KB. Those points are not analyzed here; for first-order estimates, the comparison tables in §10.1 / §10.2 can be linearly interpolated in `S` (`S = 2`: halve the strip-related rows of §10.1, halve the ingest columns in §10.2, keep the cross-strip tree at `⌈log₂ 2⌉ = 1` stage).
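+
+As a first-order sketch of this family, the relations below tabulate tile size, tile-level instruction count per 4 KB of data, and cross-strip tree depth for each `S`. Two things here are assumptions rather than statements from the analysis above: that VEC-4K corresponds to `S = 8` (4096 B / 512 B), and that the `⌈log₂ S⌉` tree depth quoted for `S = 2` generalizes to the other points. The symbols `B`, `N`, and `D` are introduced only for this illustration.
+
+$$
+B(S) = 512\,S \ \text{bytes}, \qquad N(S) = \frac{4096}{B(S)} = \frac{8}{S}, \qquad D(S) = \lceil \log_2 S \rceil
+$$
+
+$$
+\begin{array}{c|ccc} S & B(S) & N(S) & D(S) \\ \hline 1 & 512\ \text{B} & 8 & 0 \\ 2 & 1\ \text{KB} & 4 & 1 \\ 4 & 2\ \text{KB} & 2 & 2 \\ 8 & 4\ \text{KB} & 1 & 3 \end{array}
+$$
+
+The `S = 1` row matches the VEC-512 simplifications (8× more instructions, no cross-strip tree); the intermediate rows are estimates only, not analyzed results.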
+
+---
+
+## 11. Related Documents
+
+- [`vector4k.md`](vector4k.md) — 4 KB tile sibling design; §10 of the present document draws numbers and terminology from `vector4k.md` §3–§9.
+- [`tregfile4k.md`](tregfile4k.md) — 4 KB tile RF (8R/8W, 8-cycle epoch). VEC-512 assumes a **TRegFile-512** variant described in §3 above; a detailed spec is out of scope here.
+- [`outerCube.md`](outerCube.md) — MXU / outer-product engine (independent of tile size chosen here).
+- [`PTOISA/README.md`](PTOISA/README.md) — authoritative ISA list.
+
+---
+
+## Document History
+
+| Version | Date | Notes |
+|---------|------|-------|
+| 0.1 | 2026-04-20 | Initial VEC-512 architecture derived from [`vector4k.md`](vector4k.md) v0.23. §1–§9 ported with **S = 1** simplifications (strip walker, cross-strip tree, epoch calendar removed; **Acc = 64×32b×2 ping-pong**, **`N_run = 128`**). §10 quantitative comparison vs VEC-4K across datapath complexity, per-instruction cycle counts, aggregate throughput, and design-goal selection. |
+| 0.2 | 2026-04-20 | **Format simplification**: drop **FP8 / MXFP4 / HiFP4**; supported storage widths are now **FP32 (E=4)** and **FP16 / BF16 (E=2)** only. §2.1 / §4.4 (Example D → FP16 `TROWSUM` 32×8, Example E → FP16 `TCOLSUM` 8×32, Example F → FP16 `TCOLSUM` 1×256 two-wave) / §5.3 / §8 / §9.1 / §9.5.1 / §9.6 / §9.7 (26 rows, 17 unique shapes) / §10.1–§10.4 updated. Peak **K = 256**, **D_lane = 8**, only wave-generating shape is FP16 / BF16 `1×256` (`#W = 2`). |