Initialize SKObject ownedObjects/keepAliveObjects ConcurrentDictionary with concurrency level and capacity#4182

Open
nietras wants to merge 5 commits into
mono:mainfrom
nietras:patch-1

Conversation

@nietras

@nietras nietras commented Jun 17, 2026

Just a suggestion for what a change addressing #4181 might look like.
It needs to be evaluated against whether normal usage matches these assumptions.
I am assuming the normal/default use case is one UI thread, as in Avalonia. This does not prevent usage from more threads, but are more buckets/lock objects really necessary for what is simple "pointer" storage?
@github-actions

Contributor

📦 Try the packages from this PR

Warning

Do not run these scripts without first reviewing the code in this PR.

Step 1 — Download the packages

bash / macOS / Linux:

curl -fsSL https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.sh | bash -s -- 4182

PowerShell / Windows:

iex "& { $(irm https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.ps1) } 4182"

Step 2 — Add the local NuGet source

dotnet nuget add source ~/.skiasharp/hives/pr-4182/packages --name skiasharp-pr-4182
More options

| Option                              | Description                                  |
|------------------------------------ |--------------------------------------------- |
| --successful-only / -SuccessfulOnly | Only use successful builds                   |
| --force / -Force                    | Overwrite previously downloaded packages     |
| --list / -List                      | List available artifacts without downloading |
| --build-id ID / -BuildId ID         | Download from a specific build               |

Or download manually from Azure Pipelines — look for the nuget artifact on the build for this PR.

Remove the source when you're done:

dotnet nuget remove source skiasharp-pr-4182

@mattleibow

Contributor

📊 Benchmark: allocations for SKSurface.Canvas

I added a BenchmarkDotNet benchmark to benchmarks/SkiaSharp.Benchmarks that exercises the exact path from #4181 — accessing SKSurface.Canvas, which lazily creates the owner's OwnedObjects ConcurrentDictionary. A fresh surface is created per invocation so the dictionary allocation happens on every operation. [MemoryDiagnoser] reports managed allocations per op.

Both runs are identical except for SKObject.cs: before = main (default new ConcurrentDictionary<…>()), after = this PR (concurrencyLevel: 1, capacity: 1).

Before (default constructor — main)

BenchmarkDotNet=v0.13.5, OS=macOS 26.5 (25F71) [Darwin 25.5.0]
Apple M3 Pro, 1 CPU, 12 logical and 12 physical cores
.NET SDK=10.0.201
  [Host] : .NET 10.0.5 (10.0.526.15411), Arm64 RyuJIT AdvSIMD

Toolchain=InProcessEmitToolchain  

|           Method |     Mean |     Error |    StdDev |   Median |   Gen0 | Allocated |
|----------------- |---------:|----------:|----------:|---------:|-------:|----------:|
| GetSurfaceCanvas | 3.746 us | 0.2367 us | 0.6677 us | 3.936 us | 0.1869 |   1.54 KB |

After (this PR — concurrencyLevel: 1, capacity: 1)

BenchmarkDotNet=v0.13.5, OS=macOS 26.5 (25F71) [Darwin 25.5.0]
Apple M3 Pro, 1 CPU, 12 logical and 12 physical cores
.NET SDK=10.0.201
  [Host] : .NET 10.0.5 (10.0.526.15411), Arm64 RyuJIT AdvSIMD

Toolchain=InProcessEmitToolchain  

|           Method |     Mean |     Error |    StdDev |   Gen0 | Allocated |
|----------------- |---------:|----------:|----------:|-------:|----------:|
| GetSurfaceCanvas | 3.496 us | 0.0444 us | 0.0371 us | 0.0687 |     600 B |

Result

Allocations for SKSurface.Canvas drop from 1.54 KB → 600 B (~62% less, ≈977 B/op) and Gen0 collections fall from 0.1869 → 0.0687 per 1000 ops, with no measurable change in mean time. The dictionary's default concurrencyLevel equals the CPU core count (12 here, so 12 lock objects), so the saving grows on machines with more cores — this is why @nietras observed ~31 object allocations.
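
A rough way to look at the constructor-only part of this saving without BenchmarkDotNet is sketched below. This is a minimal sketch and not the benchmark used above: DictionaryAllocationProbe and its methods are hypothetical names, object stands in for SKObject, and it measures only the ConcurrentDictionary construction through GC.GetAllocatedBytesForCurrentThread(), not SKSurface.Canvas. The byte counts it returns depend on the runtime version and the core count.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper; not part of SkiaSharp or of the benchmark project.
public static class DictionaryAllocationProbe
{
    // Returns the managed bytes allocated on the current thread by one call to `create`.
    public static long Measure(Func<ConcurrentDictionary<IntPtr, object>> create)
    {
        create(); // warm-up call so first-use costs are not counted
        long before = GC.GetAllocatedBytesForCurrentThread();
        var dictionary = create();
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(dictionary);
        return after - before;
    }

    // Parameterless constructor: default capacity and concurrency level.
    public static long MeasureDefault() =>
        Measure(() => new ConcurrentDictionary<IntPtr, object>());

    // Explicit constructor, as used in this PR with (1, 1).
    public static long MeasureSized(int concurrencyLevel, int capacity) =>
        Measure(() => new ConcurrentDictionary<IntPtr, object>(concurrencyLevel, capacity));
}
```

Calling DictionaryAllocationProbe.MeasureDefault() and DictionaryAllocationProbe.MeasureSized(1, 1) on the same machine gives the two numbers to compare.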

Benchmark source
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Toolchains.InProcess.Emit;

namespace SkiaSharp.Benchmarks;

[MemoryDiagnoser]
[Config(typeof(Config))]
public class SKObjectBenchmark
{
	private class Config : ManualConfig
	{
		public Config() =>
			AddJob (Job.Default.WithToolchain (InProcessEmitToolchain.Instance));
	}

	private readonly SKImageInfo info = new SKImageInfo (256, 256);

	[Benchmark]
	public SKCanvas GetSurfaceCanvas ()
	{
		using var surface = SKSurface.Create (info);
		return surface.Canvas;
	}
}

Run with: dotnet run -c Release --project benchmarks/SkiaSharp.Benchmarks -- --filter '*SKObjectBenchmark*'

Copilot AI left a comment

Contributor

Pull request overview

This PR tweaks SKObject’s lazy-initialized OwnedObjects and KeepAliveObjects dictionaries to use explicit ConcurrentDictionary constructor parameters, aiming to reduce the allocations caused by the default capacity/concurrency settings (related to #4181’s allocation observations).

Changes:

  • Initialize OwnedObjects with ConcurrentDictionary(concurrencyLevel: 1, capacity: 1) instead of the default constructor.
  • Initialize KeepAliveObjects with ConcurrentDictionary(concurrencyLevel: 1, capacity: 1) instead of the default constructor.
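
For readers without the diff open, the pattern being changed looks roughly like the sketch below. Owner is a hypothetical stand-in for SKObject (the real class carries more state), and object replaces SKObject as the value type so the snippet compiles on its own. Only the two-argument ConcurrentDictionary constructor call mirrors this PR.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for SKObject; only the lazy-init pattern is shown.
public class Owner
{
    private readonly object locker = new object();
    private ConcurrentDictionary<IntPtr, object>? ownedObjects;

    public ConcurrentDictionary<IntPtr, object> OwnedObjects
    {
        get
        {
            if (ownedObjects == null)
            {
                lock (locker)
                {
                    // One lock object and a minimal initial size,
                    // instead of the parameterless constructor's defaults.
                    ownedObjects ??= new ConcurrentDictionary<IntPtr, object>(
                        concurrencyLevel: 1, capacity: 1);
                }
            }
            return ownedObjects;
        }
    }
}
```

The capacity argument is only an initial size: the dictionary still grows when more children are added, at the cost of a resize.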

Comment thread binding/SkiaSharp/SKObject.cs Outdated
  lock (locker) {
- 	ownedObjects ??= new ConcurrentDictionary<IntPtr, SKObject> ();
+ 	ownedObjects ??= new ConcurrentDictionary<IntPtr, SKObject> (
+ 		concurrencyLevel: 1, capacity: 1);
Comment thread binding/SkiaSharp/SKObject.cs Outdated
  lock (locker) {
- 	keepAliveObjects ??= new ConcurrentDictionary<IntPtr, SKObject> ();
+ 	keepAliveObjects ??= new ConcurrentDictionary<IntPtr, SKObject> (
+ 		concurrencyLevel: 1, capacity: 1);

Author

@mattleibow thanks for looking at this, and I have no problem as such committing changes suggested by copilot, but would it perhaps be better to have these parameters forwarded from different children of SKObject so parameters match usage? SKCanvas keeps capacity 1, others can have more?
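
To make the question concrete, forwarding the capacity from subclasses could look something like the sketch below. Everything here is hypothetical: OwnerBase, ExpectedOwnedObjects, SingleChildOwner, and ManyChildOwner are illustrative names and not existing SkiaSharp members, and object stands in for SKObject.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical base class; not SkiaSharp's actual SKObject.
public abstract class OwnerBase
{
    private readonly object locker = new object();
    private ConcurrentDictionary<IntPtr, object>? ownedObjects;

    // Hypothetical hook: each subclass reports how many children it typically owns.
    protected virtual int ExpectedOwnedObjects => 1;

    public ConcurrentDictionary<IntPtr, object> OwnedObjects
    {
        get
        {
            if (ownedObjects == null)
            {
                lock (locker)
                {
                    // The initial capacity comes from the subclass, not a hard-coded value.
                    ownedObjects ??= new ConcurrentDictionary<IntPtr, object>(
                        concurrencyLevel: 1, capacity: ExpectedOwnedObjects);
                }
            }
            return ownedObjects;
        }
    }
}

// Keeps the default of one expected child (the surface -> canvas style of owner).
public sealed class SingleChildOwner : OwnerBase
{
}

// Asks for more room up front to avoid early resizes.
public sealed class ManyChildOwner : OwnerBase
{
    protected override int ExpectedOwnedObjects => 16;
}
```

The trade-off is one extra virtual call on first access in exchange for sizing that matches each type's usage.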

@mattleibow

Contributor

Have you run this in a real app and have data that shows improvement? Also, SkiaSharp can run in a web server in multiple threads. There typically should not be cross-thread usages of these fields.

However, do you have a scenario where these changes help besides the allocations?

What are the downsides of this PR and what scenarios will it impact?

@ramezgerges

Contributor

I've tested before and after this PR with an Uno Platform sample app, once measuring only the startup allocations and once measuring a steady-state animation playing on repeat. The difference is very minor and is indistinguishable from noise, so I'm not sure it's worth it unless we have a concrete scenario where these allocations cause significant GC time or something similar. @nietras did you encounter a real scenario where this PR would noticeably affect allocations and/or GC time?

@mattleibow

Contributor

📉 Perf impact note: capacity: 1 trades a smaller common case for a costlier "many children" case

I wanted to quantify the one downside of hard-coding capacity: 1: the OwnedObjects / KeepAliveObjects dictionaries start with a single bucket, so an owner that accumulates several children pays repeated resize + rehash, where the old default pre-sized 31 buckets. I benchmarked the exact lifecycle SKObject uses — create → insert N children via the indexer → enumerate + Clear() on dispose — comparing the current default, this PR's (concurrencyLevel: 1, capacity: 1), and a middle-ground (1, 4).

Note: this is a synthetic microbenchmark of the dictionary itself (no native Skia allocation), to isolate the resize cost. Allocations are deterministic; the time columns are noisy in this ShortRun config (errors frequently exceed the means), so I'm only drawing conclusions from Allocated.

BenchmarkDotNet=v0.13.5, OS=macOS 26.5 (25F71) [Darwin 25.5.0]
Apple M3 Pro, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 8.0.23 (8.0.2325.60607), Arm64 RyuJIT AdvSIMD
Job=ShortRun  Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1  WarmupCount=3

|     Method |  N |   Gen0 | Allocated | Alloc Ratio |
|----------- |--- |-------:|----------:|------------:|
|    Default |  1 | 0.1702 |    1424 B |        1.00 |
| Cap1_Conc1 |  1 | 0.0861 |     720 B |        0.51 |
| Cap4_Conc1 |  1 | 0.0899 |     752 B |        0.53 |
|    Default |  2 | 0.1760 |    1472 B |        1.00 |
| Cap1_Conc1 |  2 | 0.0918 |     768 B |        0.52 |
| Cap4_Conc1 |  2 | 0.0956 |     800 B |        0.54 |
|    Default |  4 | 0.1874 |    1568 B |        1.00 |
| Cap1_Conc1 |  4 | 0.1459 |    1224 B |        0.78 |
| Cap4_Conc1 |  4 | 0.1068 |     896 B |        0.57 |
|    Default |  8 | 0.2098 |    1760 B |        1.00 |
| Cap1_Conc1 |  8 | 0.2441 |    2048 B |        1.16 |
| Cap4_Conc1 |  8 | 0.2050 |    1720 B |        0.98 |
|    Default | 16 | 0.2556 |    2144 B |        1.00 |
| Cap1_Conc1 | 16 | 0.2899 |    2432 B |        1.13 |
| Cap4_Conc1 | 16 | 0.2508 |    2104 B |        0.98 |
|    Default | 32 | 0.3471 |    2912 B |        1.00 |
| Cap1_Conc1 | 32 | 0.5341 |    4472 B |        1.54 |
| Cap4_Conc1 | 32 | 0.4940 |    4144 B |        1.42 |
|    Default | 64 | 0.9155 |    7688 B |        1.00 |
| Cap1_Conc1 | 64 | 1.0338 |    8656 B |        1.13 |
| Cap4_Conc1 | 64 | 0.9956 |    8328 B |        1.08 |

Reading the numbers (Allocated):

  • N ≤ 2 (the dominant real case): most owners keep exactly one child (surface→canvas, document→stream, colorspace→profile), so this is what actually happens in practice. (1,1) allocates 49% less (720 B vs 1424 B at N = 1). This is the win this PR is about. ✅
  • Crossover between N = 4 and N = 8: at N = 4, (1,1) still allocates less (0.78×), but once an owner holds ~8+ children it flips to a regression (1.16× at N = 8) because it resizes repeatedly while the default's 31 buckets absorb the inserts.
  • Worst case at N = 32: right around the default capacity boundary, (1,1) allocates +54% (4472 B vs 2912 B). (The ShortRun timings are noisy, but this row was also consistently ~1.5× slower.)
  • N = 64: both resize repeatedly, so the gap narrows again (+13%).

Takeaway: for the case this PR targets (0–1 children), it's a clear, safe win and there's no threading downside — each SKObject has its own dictionaries, so concurrencyLevel: 1 only ever matters if the same owner is mutated from multiple threads at once, which is already unsupported (sharing a non-thread-safe Skia object concurrently). The only real cost is per-object allocation for owners that accumulate many children — uncommon, but a measurable regression in that tail.

Suggestion: capacity: 4 captures essentially the full small-N win (752 B vs 720 B at N=1) while removing the N=8–16 regression (back to parity) and halving the N=32 penalty. If we expect any owners to hold more than a couple of children, (concurrencyLevel: 1, capacity: 4) looks like the safer pick than capacity: 1.
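
For anyone who wants to try other capacities, the lifecycle described above (create, insert N children via the indexer, enumerate, Clear()) can be approximated with the sketch below. It is not the BenchmarkDotNet code behind the table: OwnerLifecycleProbe is a hypothetical name, object stands in for SKObject, and allocations are read from GC.GetAllocatedBytesForCurrentThread(), so absolute numbers may differ from the table.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper approximating the create -> insert -> enumerate -> Clear lifecycle.
public static class OwnerLifecycleProbe
{
    // Returns the managed bytes allocated on the current thread by one full lifecycle.
    public static long Run(Func<ConcurrentDictionary<IntPtr, object>> create, int children)
    {
        var child = new object();
        Lifecycle(create, children, child); // warm-up so first-use costs are not counted
        long before = GC.GetAllocatedBytesForCurrentThread();
        Lifecycle(create, children, child);
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    private static void Lifecycle(
        Func<ConcurrentDictionary<IntPtr, object>> create, int children, object child)
    {
        var owned = create();
        for (int i = 1; i <= children; i++)
            owned[(IntPtr)i] = child;      // insert via the indexer, one key per child
        foreach (var pair in owned)
            GC.KeepAlive(pair.Value);      // enumerate, as a dispose pass would
        owned.Clear();
    }
}
```

For example, OwnerLifecycleProbe.Run(() => new ConcurrentDictionary<IntPtr, object>(1, 4), 8) measures the (1, 4) configuration with eight children.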

@nietras

nietras commented Jun 29, 2026

Author

A little detail on one example app: a simple app showing one Bitmap whose pixels are updated (a simple fill of all bytes) at some fixed rate, e.g. 50 Hz. Using Visual Studio's ".NET Object Allocation Tracking" with in-code UserMarks, I then select a period of 1000 updates after warmup.

image

There are a LOT of allocations for this simple scenario in the libraries used (not my/user code). But for a single type, the OwnedObjects allocations are significant. Note how there are exactly 1000 allocations for some things, and that 31000 is exactly 31 x 1000, as I've written about.

image

> difference is very minor and is indistinguishable from noise

This depends on the scenario, but it is also the main question here: there are so many allocations in so many places that any given part may look small next to the total. If one always dismisses those "small" parts as "very minor", the number of allocations will never get small. That is why I asked whether this was viewed as a priority: with so many allocations, reducing them means addressing each one in turn. :)

Here there are about 140 "reference type" allocations per image update. 140 is a lot, I think 😅, at around 2500 bytes per update. This includes dispatcher timer stuff, though; not all of it is bitmap-related, but most of it is the bitmap display code.

image

I hope to open source this simple benchmark app at some point as I am comparing different .NET UI libraries, many use SkiaSharp so any improvement here would help all.


4 participants