From 4eb41781af8d98bf5b87f71ad887df50a47b38c2 Mon Sep 17 00:00:00 2001
From: Nick Nassiri <nick@knnlabs.com>
Date: Sat, 20 Jun 2026 22:36:17 -0700
Subject: [PATCH] docs(#856): add Performance section to STATUS.md; document
 #874 cancellation-check inlining

Compiled output now meets/beats Node on 5/7 benchmark workloads; the two
stragglers (count-primes ~1.3x, factorial ~3x) are bounded by separate
non-codegen factors. Records the #874 inline-volatile loop-cancellation win
(1.6x tight loops / 1.12x sieve) and the rejected throttle variant.
---
 STATUS.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/STATUS.md b/STATUS.md
index d6152588..496da4ac 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -2,7 +2,7 @@
 
 This document tracks TypeScript language features and their implementation status in SharpTS.
 
-**Last Updated:** 2026-04-18 (Embedded TypeScript stdlib — 14 Node modules migrated from C#/IL to `.ts`; `@DotNetType` full parity — interpreter + compiled with delegates and events)
+**Last Updated:** 2026-06-20 (Perf epic [#856](https://github.com/nickna/SharpTS/issues/856) — compiled output now meets or beats Node.js on most of the cross-runtime benchmark suite; loop-backedge cancellation check inlined, [#874](https://github.com/nickna/SharpTS/pull/874))
 
 ## Legend
 - ✅ Implemented
@@ -512,6 +512,28 @@ The dominant bucket is now `Fail` — tests that parse and reach the type checke
 
 ---
 
+## 18. PERFORMANCE (compiled output vs Node.js)
+
+Epic [#856](https://github.com/nickna/SharpTS/issues/856) tracks closing the compiled-IL gap to Node.js on the cross-runtime benchmark suite (`benchmarks/scripts/`, run via `benchmarks/run-benchmarks.ps1`), **without** regressing .NET interop or language conformance (Test262 + `microsoft/TypeScript`). The table compares compiled output against Node.js at warm steady state, at the largest input size:
+
+| Workload | Status | vs Node |
+|---|---|---|
+| fibonacci | ✅ | **faster than Node** — recursion/call core |
+| array-methods | ✅ | **faster than Node** — typed `List<double>` HOF pipeline ([#872](https://github.com/nickna/SharpTS/issues/872)) |
+| strings | ✅ | ≈ parity — `StringBuilder` accumulator promotion ([#870](https://github.com/nickna/SharpTS/issues/870)) + `charCodeAt` box-elision ([#873](https://github.com/nickna/SharpTS/issues/873)) |
+| closures | ✅ | done — non-escaping local arrows de-virtualized to direct calls ([#858](https://github.com/nickna/SharpTS/issues/858)) |
+| objects | ✅ | done — object literals as shape structs ([#862](https://github.com/nickna/SharpTS/issues/862)) |
+| count-primes | ⚠️ | ~1.3× slower (sieve; array-heavy loop) |
+| factorial | ⚠️ | ~3× slower (tight numeric loop; µs-scale at benchmark sizes) |
+
+The original catastrophic gaps (14–117× slower) are closed. Every win came from **re-exposing static types that the naive lowering erased** (boxing, `object`/`List<object>` representations, reflective dispatch, O(n²) string concatenation) so that RyuJIT can optimize typed code. The emitter's job is to choose the algorithm, representation, and dispatch without erasing known types; the JIT then optimizes the typed operations it is given.
+
+**Loop-backedge cancellation cost ([#874](https://github.com/nickna/SharpTS/pull/874)):** every compiled loop polls a cooperative-cancellation flag at its backedge so the runner can unwind runaway loops (issue [#74](https://github.com/nickna/SharpTS/issues/74)). The poll used to be an unconditional `call $Runtime.CheckCancellation()`. RyuJIT won't inline that helper (it contains `newobj`+`throw`), so the call sat in every loop body as a per-iteration optimization barrier and accounted for roughly half the runtime of a tight numeric loop. The poll is now an inlined `volatile` field test that calls the throwing helper only on the cold cancel path. The `volatile.` prefix keeps LICM from hoisting the loop-invariant flag read out of the loop, which would silently break cancellation. Result: a **1.6×** speedup on tight numeric loops and **1.12×** on the sieve, with cancellation semantics unchanged. A throttle-every-N-iterations variant was tried and **rejected**: it merely ties the inline-volatile version, because a volatile static-field read is nearly free on x86-64 while the per-loop counter adds an equal per-iteration cost of its own.
+
+The remaining sub-parity workloads (count-primes, factorial) are dominated by separate, non-codegen factors: the residual per-iteration cancellation poll, non-inlined user-function calls, and boxed top-level `var`s.
+
+---
+
 ## Breaking Changes (2026-04-18)
 
 The embedded-stdlib migration removed implicit global bindings for several classes previously created as compile-time fallbacks. User code must now `import` these from the owning module explicitly (matches ESM-strict semantics and Node's own behavior):