Port SKMatrix math to managed C# to remove native interop (#2779)#4241
Port SKMatrix math to managed C# to remove native interop (#2779)#4241mattleibow wants to merge 11 commits into
Conversation
SKMatrix previously round-tripped through the Skia C API for every math operation (Invert, Concat, MapPoint, MapVector, MapRect, MapRadius, MapPoints/MapVectors). Each call paid the P/Invoke transition cost even though the underlying math is trivial. This reimplements the math in managed C#, faithfully mirroring Skia's SkMatrix algorithms so the result is bit-for-bit identical to the native path (verified against the Skia C API across ~230 matrices including perspective, degenerate, -0, Inf and NaN inputs). The batch MapPoints/MapVectors paths use System.Numerics.Vector4 (two points per lane, matching skvx::float4) with a hardware Vector128.Shuffle swizzle on modern runtimes and a portable fallback elsewhere. No public API changes; only method bodies and private helpers change. Benchmarks (Apple M3 Pro, .NET 10, native vs managed): Invert 25.6 ns -> 9.0 ns (2.8x) MapVector 13.5 ns -> 1.8 ns (7.5x) MapPoint 8.0 ns -> ~0 ns (interop eliminated) MapPointPersp 9.5 ns -> 0.4 ns (24x) MapRect 17.3 ns -> 9.3 ns (1.86x) MapRadius 24.9 ns -> 8.8 ns (2.8x) Concat 18.3 ns -> 11.6 ns (1.6x) MapPoints 1024 269 ns -> 261 ns (on par) Adds SKMatrixManagedTests (native-vs-managed bit-exact equivalence, incl. odd/small batch counts, in-place mapping and Inf/NaN edges) and a before/after SKMatrixBenchmark. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📦 Try the packages from this PRWarning Do not run these scripts without first reviewing the code in this PR. Step 1 — Download the packages bash / macOS / Linux: curl -fsSL https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.sh | bash -s -- 4241PowerShell / Windows: iex "& { $(irm https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.ps1) } 4241"Step 2 — Add the local NuGet source dotnet nuget add source ~/.skiasharp/hives/pr-4241/packages --name skiasharp-pr-4241More options
Or download manually from Azure Pipelines — look for the Remove the source when you're done: dotnet nuget remove source skiasharp-pr-4241 |
|
📖 Documentation Preview The documentation for this PR has been deployed and is available at: 🔗 View Staging Site This preview will be updated automatically when you push new commits to this PR. This comment is automatically updated by the documentation staging workflow. |
Adds the missing native-vs-managed benchmarks that the original PR lacked: MapVectors batch (1024 and small), small-count MapPoints/MapVectors (where the P/Invoke transition is not amortized), and MapRect for the scale and translate fast paths. These confirm 2-2.8x wins on small/scalar workloads. Benchmarking the new scale batch revealed the managed MapScaleBatch/ MapTranslateBatch processed only two points per iteration, which regressed the 1024-point scale case. Reworked both to mirror Skia's Scale_pts/Trans_pts exactly: a scalar lead-in for the odd point, one float4 pair, then a main loop of two float4 (four points) per iteration so the multiplies pipeline. The result stays bit-for-bit identical (verified by the equivalence tests over varied counts) and the scale batch is now on par with / faster than native. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Splits the benchmark into a scalar class (single-value ops) and a parameterised batch class so the array-taking operations (MapPoints affine/scale/translate and MapVectors) are measured across a range of item counts: 4 (low), 64 (medium), 1024 (cache-resident) and 1,000,000 (very high). This shows how the managed path scales with the number of items versus Skia's native procs. Result: the managed advantage is largest at small counts (1.3-3.2x, where the fixed per-call P/Invoke transition is not amortised) and converges to parity at very large counts (memory-bandwidth bound, ~1.0x). Cache-resident sizes (~1024) are dominated by CPU frequency scaling and sit at parity within run-to-run variance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address two faithfulness gaps found in review of the managed SKMatrix port: - MapVector / affine batch tail now compute the y lane as y*scaleY + x*skewY, matching Skia's Affine_vpts proc (the path native mapVector routes through) rather than mapPointAffine. Bit-identical for finite inputs; only NaN-payload propagation differed, but this keeps the managed path bit-exact with native. - Perspective MapVectors now iterates back-to-front to match SkMatrix::mapVectors, so overlapping dst/src spans (dst shifted ahead of src) produce the same result as the native path for all inputs, not just the in-place (dst==src) case. Add an overlapping-span equivalence test (proven to fail with forward iteration) and broaden batch counts to 6/10/11 so the single-pair lead-in and unrolled 4-point loop are exercised together. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend MapVectorsOverlappingSpansMatchNative to exercise both shift directions (dst ahead of src, and dst behind src) across all test matrices. The shifted-overlap topology is the only aliasing layout that makes perspective iteration direction observable; disjoint and same-offset in-place buffers are direction-invariant and cannot catch a wrong loop order. Verified non-vacuous: reverting the perspective loop to forward iteration fails this test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the MapVectors overlap test for MapPoints. Every native mapPoints proc iterates front-to-back (only mapVectors special-cases perspective with a backward walk), so the managed batch path must stay forward for all matrix types. Disjoint and same-offset in-place buffers are direction-invariant and cannot observe this; the shifted-overlap layout can. Guards against a future 'optimize to backward' or buffer-the-source regression that would silently diverge from native on aliased spans. Verified non-vacuous: snapshotting src before mapping fails this test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add extreme-magnitude and determinant-threshold matrices to the shared test set. Native SkMatrix computes some cofactor/concat cross-products in float (scross, rowcol3) and others in double (dcross, muladdmul); the two only diverge near float.MaxValue overflow or at the cubed nearly-zero determinant threshold, which the random +/-100 matrices never reached. These lock the per-path width contract for Determinant, Invert and Concat. Verified non-vacuous: widening rowcol3 to double fails ConcatMatchesNative. Also document why SortAsRect keeps its hand-rolled skvx min/max instead of SKRect.Standardized or MathF.Min/Max: the skvx 'NaN keeps first operand' semantics differ from both, the scale/translate map paths have no finiteness probe so the difference is value-visible, and MathF is absent on net462/net48/netstandard2.0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The managed SKMatrix port is bit-for-bit identical to native Skia on every runtime that rounds each float operation to single precision (x64 .NET Framework and all .NET Core / .NET 5+ on x86, x64, ARM). It diverges only on x86 .NET Framework, whose legacy JIT targets the x87 FPU and keeps 80-bit extended-precision intermediates (permitted by ECMA-334 §8.3.7). In cancellation-prone paths that widening amplifies sub-ULP product-rounding differences into thousands of ULP, and 0*Inf style inputs differ categorically (native SSE NaN vs x87 finite), so no numeric tolerance can bridge it without changing rendering output. To stay byte-for-byte identical there, gate the diverging public math methods (TryInvert/IsInvertible, Concat/PreConcat/PostConcat/Concat(ref), MapPoint, MapVector, MapPoints, MapVectors, MapRect, MapRadius) on a single static readonly UseNativeMath flag that detects x86 .NET Framework and routes through the original native C API, mirroring the pre-port implementation. The flag is a predictable always-false branch on every modern runtime, so the managed fast path (and its benchmark wins) is unaffected. No public API changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
90461a4 to
e1c88f5
Compare
…2779-skmatrix-managed-math # Conflicts: # benchmarks/SkiaSharp.Benchmarks/Program.cs
…2779-skmatrix-managed-math
There was a problem hiding this comment.
Pull request overview
This PR ports SKMatrix math operations from native Skia C API calls to a managed C# implementation to eliminate P/Invoke overhead on hot paths, while preserving native behavior (including a runtime gate that keeps x86 .NET Framework on the native path to avoid x87 precision divergences). It also adds parity tests and benchmarks to validate/measure the change.
Changes:
- Replaced native interop in
SKMatrixoperations (invert/concat/map/map-batch/map-radius/map-rect fast paths) with managed implementations intended to be bit-for-bit identical to Skia. - Added
SKMatrixManagedTeststhat compare managed results against direct calls intoSkiaApi.sk_matrix_*across a broad corpus, including aliasing/overlap cases. - Added BenchmarkDotNet benchmarks to compare old (direct native C API) vs new (managed) paths for scalar and batch operations.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/Tests/SkiaSharp/SKMatrixManagedTests.cs | Adds native-vs-managed equivalence tests for SKMatrix math, including edge cases and aliasing/overlap scenarios. |
| binding/SkiaSharp/SKMatrix.cs | Implements managed SKMatrix math + runtime UseNativeMath gate; replaces prior per-call native interop for most operations. |
| benchmarks/SkiaSharp.Benchmarks/Benchmarks/SKMatrixBenchmark.cs | Adds before/after benchmarks comparing direct C API calls (“Native”) to the new managed path (“Managed”). |
| private static bool NativeTryInvert (SKMatrix m, out SKMatrix inverse) | ||
| { | ||
| SKMatrix result; | ||
| var ok = SkiaApi.sk_matrix_try_invert (&m, &result); | ||
| inverse = result; | ||
| return ok; | ||
| } |
| private static bool BitEqual (float a, float b) => | ||
| (float.IsNaN (a) && float.IsNaN (b)) || | ||
| FloatBits (a) == FloatBits (b); |
| // Match the native shim: on failure the result is left as the | ||
| // default-constructed (identity) matrix. | ||
| inverse = Identity; |
| if ((n & 1) != 0) { | ||
| *(Vector4*)d = *(Vector4*)s + trans; | ||
| s += 4; d += 4; | ||
| } | ||
| n >>= 1; | ||
| for (var i = 0; i < n; i++) { | ||
| *(Vector4*)(d + 0) = *(Vector4*)(s + 0) + trans; | ||
| *(Vector4*)(d + 4) = *(Vector4*)(s + 4) + trans; | ||
| s += 8; d += 8; | ||
| } |
| if ((n & 1) != 0) { | ||
| *(Vector4*)d = *(Vector4*)s * scale + trans; | ||
| s += 4; d += 4; | ||
| } | ||
| n >>= 1; | ||
| for (var i = 0; i < n; i++) { | ||
| *(Vector4*)(d + 0) = *(Vector4*)(s + 0) * scale + trans; | ||
| *(Vector4*)(d + 4) = *(Vector4*)(s + 4) * scale + trans; | ||
| s += 8; d += 8; | ||
| } |
| for (var i = 0; i < pairs; i++) { | ||
| var o = i << 2; | ||
| var v = *(Vector4*)(s + o); | ||
| *(Vector4*)(d + o) = v * scale + SwapXY (v) * skew + trans; | ||
| } |
Address review feedback on the managed SKMatrix port. The SIMD batch procs (translate/scale/affine) read and write Vector4 via `*(Vector4*)ptr`, which the CLI treats as a naturally-aligned ldobj/stobj. SKPoint[] buffers are only 8-byte aligned, so this is technically undefined behaviour and could fault on strict-alignment targets. RyuJIT happens to emit unaligned moves on x64/ARM64 so it works today, but route the loads/stores through Unsafe.ReadUnaligned/WriteUnaligned to be correct by construction. The lowering is identical on x64/ARM64, so it is bit-exact and zero-cost (verified: batch benchmark ratios unchanged within noise; all 12 equivalence + 20 existing matrix tests still pass). Also tighten two test/comment items flagged in review: - NativeTryInvert left its out matrix uninitialised. The native shim always writes it (identity on the non-invertible path, since its C++ out matrix is default-constructed to identity), so this was benign, but initialise to Identity so the test never depends on that. - Document that BitEqual deliberately treats any two NaNs as equal: native and managed legitimately reach NaN by different routes and disagree on the sign bit (e.g. MapRect of a rotation over an infinite rect: native 0x7FC00000 vs managed 0xFFC00000), which Skia never inspects. Reword the misleading "default-constructed (identity)" comment that read as C++ semantics in a C# file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #2779.
Problem
SKMatrixis a pure managed struct, but every math operation (Invert,Concat,MapPoint,MapVector,MapRect,MapRadius,MapPoints/MapVectors) round-tripped through the Skia C API. Each call paid the full P/Invoke transition cost — pinning, marshalling and a managed→native→managed hop — even though the underlying math is a handful of floating-point operations. For hot paths (mapping points, inverting/concatenating transforms in a render loop) this interop dominates the cost.What this does
Reimplements the math in managed C#, faithfully mirroring Skia's
SkMatrixalgorithms so results are bit-for-bit identical to the native path. The implementation follows the exact C++ code Skia runs through the C shim (AsMatrix/MakeAll→ proc-table dispatch), matching float-vs-double usage, operation order, type-mask computation, the±0/Inf/NaNedge behaviour, and theskvxmin/max/shuffle semantics.The batch
MapPoints/MapVectorspaths useSystem.Numerics.Vector4(two points per vector lane, mirroring Skia'sskvx::float4procsTrans_pts/Scale_pts/Affine_vpts), with a hardwareVector128.Shuffleswizzle on modern runtimes (net8+) and a portableVector4fallback on older TFMs (netstandard2.0, net4x, net6). The scale/translate batch procs unroll to four points per iteration (twofloat4) exactly like Skia'sScale_pts/Trans_ptsso the multiplies pipeline. Perspective mapping stays scalar (as it is in Skia), and perspectiveMapRectstill delegates to native (Skia uses a non-trivial path-clipping routine there).No public API changes — only method bodies changed, plus one new
private static readonly boolgate. ABI is unchanged: everypublicsignature is byte-identical tomain(verified by diff); the onlypublic-line edits are expression-body ↔ block-body reformatting.Correctness
Bit-exactness is the hard requirement (SkiaSharp must not change rendering output). New
SKMatrixManagedTestscall the Skia C API directly (SkiaApi.sk_matrix_*) and assert bit-equality against the managed result across ~230 matrices (static + 200 seeded-random), including:-0,Inf,NaN,MaxValue,Epsilonmemberssort_as_rectNaN semantics{0,1,2,3,5,6,8,10,11,12}to hit the SIMD pair loop, the unrolled 4-point loop and the scalar tail (incl. counts that combine the single-pair lead-in with the unrolled loop)dst == src) for both points and vectorsdstone element ahead ofsrcin a shared buffer) for vectors, which pins the reverse-iteration order of the perspective pathInf/NaNpoints and rectsAll 12 equivalence tests pass bit-exact, the 20 existing
SKMatrixTests stay green, and 151 additional matrix-consuming tests (SKMatrix44,SKPath,SKCanvas,SKShader,SKRoundRect,SKRotationScaleMatrix) pass.This was independently reviewed against the pinned Skia submodule sources (
SkMatrix.cpp,SkRect.cpp,SkVx.h,SkFloatingPoint.h) and by two AI reviewers (GPT-5.5 and Opus 4.8). Two narrow native-parity gaps were found and fixed in this PR:MapVectoraffine operand order — native singlemapVectorroutes through the SIMDAffine_vptsproc (y' = y*scaleY + x*skewY), notmapPointAffine(y' = x*skewY + y*scaleY). The managed vector path and the affine batch tail now use theAffine_vptsorder. Bit-identical for finite inputs; this only affects NaN-payload propagation but makes the "bit-for-bit identical" claim hold for the vector path too.MapVectorsiteration direction — native walks the perspective case back-to-front; the managed path now does the same, so overlapping shifted spans match native for all inputs (previously only thedst == srcin-place case matched). Covered by a new test that is proven to fail with forward iteration.Platform note: x86 .NET Framework (x87 FPU)
The managed math is bit-for-bit identical to native Skia on every runtime that rounds each floating-point operation to single precision: x64 .NET Framework, and all .NET Core / .NET 5+ on x64 / x86 / ARM64. Those use SSE2 / NEON scalar math, exactly like native Skia's compiled C++, and the equivalence tests assert strict bit-for-bit equality there.
The one exception is x86 .NET Framework, whose legacy JIT targets the x87 FPU. x87 evaluates
floatexpressions with 80-bit intermediates and only rounds tofloat32on store — behaviour explicitly permitted by ECMA-334 §8.3.7: "Floating-point operations may be performed with higher precision than the result type of the operation." In cancellation-prone paths (e.g.x*skewY + y*scaleYwhere the result nearly cancels) that widening amplifies sub-ULP product-rounding differences into thousands of ULP, and0*∞-style inputs diverge categorically (native SSE yieldsNaN, x87's wider range yields a finite value). No numeric tolerance can bridge that without effectively disabling the test and shipping different rendering output on that one runtime.So on x86 .NET Framework only, the diverging public methods (
TryInvert/IsInvertible,Concat/PreConcat/PostConcat/Concat(ref),MapPoint,MapVector,MapPoints,MapVectors,MapRect,MapRadius) keep routing through the original native C API, so output stays byte-for-byte identical to previous releases. The routing is a singlestatic readonly bool UseNativeMath(detected once viaRuntimeInformation) — a predictable, always-false branch on every modern runtime, so the managed fast path and its benchmark wins are unaffected. The equivalence tests therefore assert strict bit-equality on all platforms.Benchmarks
Before/after harness added (
SKMatrixBenchmark,SKMatrixMapBatchBenchmark): the*_Nativemethods call the Skia C API directly (the old code path), the*_Managedmethods use the new math. Apple M3 Pro, .NET 10, default job. All ops allocate zero in both before and after.Scalar / single-value ops
Interop dominated the old cost, so removing it is a clear, reproducible win:
¹ The scalar single-point ops are now cheap enough that the JIT folds the constant-input benchmark away; the point is that the P/Invoke transition is gone.
Batch ops vs item count
The array-taking ops are parameterised by count — 4 (low), 64 (medium), 1,024 (cache-resident) and 1,000,000 (very high) — to show how the managed SIMD path scales against Skia's native procs (
affine → Affine_vpts,scale → Scale_pts,translate → Trans_pts). Ratio = managed / native (< 1.0 = managed faster):Takeaway: the managed advantage is largest at small counts and converges to parity as the array grows — exactly as expected, since the removed cost is a fixed per-call P/Invoke transition amortised over more elements. At 1,000,000 points the work is memory-bandwidth bound and the two paths are identical (0.98–1.01×).
² The 1,024-point arrays are 8 KB and stay L1-resident, so absolute times are tens-to-low-hundreds of ns and dominated by CPU frequency scaling on this laptop. The native baseline for the same op swings run-to-run (e.g. translate@1024 native measured 117 ns isolated vs 187 ns in the full sweep → managed ratio 1.48× vs 0.75× for identical code; scale@1024 swung the other way, 0.68× vs 1.46×). This size is a noisy transition point and is effectively on par either way.
Future improvements (out of scope here)
On every modern runtime the single remaining interop call in
SKMatrixis perspectiveMapRect(SKMatrix.cs, thehasPerspectivebranch); everything else is fully managed. (On x86 .NET Framework, all of the ported math additionally routes back to native — see the platform note above.) Tracked follow-ups:MapRect(the last interop) — tracked in [perf] Port perspective SKMatrix.MapRect to managed (remove last SKMatrix native interop) #4249. For a perspective matrix, mapping the four corners and taking their bounds is not correct: the transform can push corners onto or behind thew = 0plane, and edges crossing that plane must be clipped. Skia builds a path from the rect and runsSkPathBuilder::transform→SkPathPriv::PerspectiveClip→SkHalfPlane+SkEdgeClipper::ClipPath(a full edge-clipping engine). A faithful port pulls in hundreds of lines of path-clipping infrastructure to remove one P/Invoke on the rare perspective path, so it stays native for now.System.Numericsbridges (additive API) — tracked in Add System.Numerics bridges: SKMatrix↔Matrix3x2, SKColorF↔Vector4, and related #4250.SKPoint↔Vector2,SKPoint3↔Vector3andSKMatrix44↔Matrix4x4already exist; the useful gaps areSKMatrix↔Matrix3x2(the 2×3 affine equivalent; explicitSKMatrix→Matrix3x2drops perspective, implicit the other way) andSKColorF↔Vector4(layout is already identicalR,G,B,A). Purely additive, so ABI-safe, but expands public surface and needs API-review sign-off. (SKRect↔Vector4,Plane,Vector<T>were considered and recommended against.)Vector4(two points per 128-bit lane). On x64 with AVX,Vector256could pack four points per lane. Measured on ARM64/NEON it is a non-starter:Vector256is emulated and ran 5.7–6.5× slower, while ILP-unrolling (4/8 points per iteration) gave no gain because large arrays are memory-bandwidth bound (the 1,000,000-point cases are already at parity above). Any benefit is x64-only, would need anAvx-gated second bit-exact path, and is unverified on the dev hardware — so it is deliberately not done.Vector4swizzle codegen (affects the affineSwapXY): Add Matrix3x3 dotnet/runtime#16226.Deliberate non-goal — FMA: the managed math intentionally does not use fused multiply-add (
MathF.FusedMultiplyAdd/Vector*FMA). FMA roundsa*b + conce instead of twice, so it would diverge from Skia's separate multiply/add and break the bit-exactness guarantee (the equivalence tests confirm the native build does not contract these either). The scale/affine inner loops should not be "optimised" with FMA.(Note:
SKMatrix44is already fully managed — backed bySystem.Numerics.Matrix4x4with no interop — so it needs no similar work.)Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com