
Fixed type instability in array_cache for LazyArray#1303

Open
JordiManyer wants to merge 2 commits into master from optimisations

@JordiManyer

JordiManyer commented May 19, 2026

Member

I've been tracking this type instability for some time: with the current implementation, we have

function array_cache(dict::Dict,a::LazyArray)
  cache = _get_cache(dict,a)
  if cache === nothing
    _cache = _array_cache!(dict,a)
    dict[objectid(a)] = (a,_cache)
  else
    _cache = cache
  end
  _cache
end

The issue here is that, when we hit the cache and have !isnothing(cache), the only thing the compiler can see is that the cache comes from a Dict{UInt,Any} and thus sets the return type to Any.

This is HORRIBLE, since it generates a type instability that hinders the evaluation of every single subsequent call to getindex!, since that Any typing is baked into the cache.
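The effect can be reproduced outside Gridap with a toy lookup. This is only a sketch: `lookup` and `sum_unstable` are hypothetical names, and a plain `Vector{Float64}` stands in for the real cache.

```julia
# Toy stand-in for the memoization dict; none of these names are Gridap API.
lookup(dict::Dict{UInt,Any}, key::UInt) = dict[key]

# The loop works on `v::Any`, so each `v[i]` and `+` is resolved at runtime.
function sum_unstable(dict::Dict{UInt,Any}, key::UInt)
  v = lookup(dict, key)
  s = 0.0
  for i in 1:length(v)
    s += v[i]
  end
  return s
end

dict = Dict{UInt,Any}(UInt(1) => [1.0, 2.0, 3.0])

# Ask inference what `lookup` returns for these argument types.
inferred = only(Base.return_types(lookup, (Dict{UInt,Any}, UInt)))
println(inferred)
println(sum_unstable(dict, UInt(1)))
```

Running `@code_warntype sum_unstable(dict, UInt(1))` in a REPL should highlight `v` and `s` as non-concrete, which is the same kind of instability described above.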

After exploring several solutions, I think we should just pay the price of re-computing the caches. We keep the memoization and the memory savings (the recomputed cache gets thrown away), but we are now able to tell the compiler what type it should expect. The final version is:

function array_cache(dict::Dict,a::LazyArray)
  cache = _array_cache!(dict,a)
  existing = _get_cache(dict,a)
  T = typeof(cache)
  if isnothing(existing)
    dict[objectid(a)] = (a,cache)
    return cache::T
  else
    dict[objectid(a)] = (a,existing)
    return existing::T
  end
end
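A reduced version of the same idea, assuming a hypothetical `build_cache` whose return type depends only on `typeof(a)` (as `_array_cache!` is expected to behave here). The freshly built value is used only to obtain `T`; the stored value is returned with a runtime type assertion.

```julia
# Hypothetical cache builder; stands in for _array_cache! in this sketch.
build_cache(a::Vector{T}) where {T} = zeros(T, length(a))

function cached(dict::Dict{UInt,Any}, a::Vector)
  fresh = build_cache(a)        # always recomputed, so its type is inferable
  T = typeof(fresh)
  existing = get(dict, objectid(a), nothing)
  if isnothing(existing)
    dict[objectid(a)] = fresh
    return fresh
  else
    return existing::T          # checked at runtime, but inference now knows T
  end
end

dict = Dict{UInt,Any}()
a = [1.0, 2.0]
c1 = cached(dict, a)   # miss: stores the fresh cache
c2 = cached(dict, a)   # hit: returns the stored cache, fresh one is discarded
println(c1 === c2)
println(only(Base.return_types(cached, (Dict{UInt,Any}, Vector{Float64}))))
```

The cost is visible in the sketch: `build_cache` runs on every call, including hits.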

This is quite delicate, so I would appreciate a review @amartinhuertas @Antoinemarteau

@codecov

codecov Bot commented May 19, 2026


Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.88%. Comparing base (e8e64da) to head (88003c4).
⚠️ Report is 44 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Arrays/LazyArrays.jl | 94.44% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1303   +/-   ##
=======================================
  Coverage   88.87%   88.88%           
=======================================
  Files         228      228           
  Lines       29988    29982    -6     
=======================================
- Hits        26653    26650    -3     
+ Misses       3335     3332    -3     
| Flag | Coverage Δ |
|---|---|
| drivers | 39.53% <94.44%> (+<0.01%) ⬆️ |
| extensions | 5.03% <0.00%> (+<0.01%) ⬆️ |
| unit-adaptivity | 39.75% <94.44%> (-0.01%) ⬇️ |
| unit-basics | 14.49% <77.77%> (+<0.01%) ⬆️ |
| unit-celldata | 20.95% <94.44%> (-0.01%) ⬇️ |
| unit-fespaces-1 | 32.79% <94.44%> (-0.01%) ⬇️ |
| unit-fespaces-2 | 39.35% <94.44%> (-0.01%) ⬇️ |
| unit-fields | 17.56% <94.44%> (-0.01%) ⬇️ |
| unit-geometry | 28.52% <94.44%> (+<0.01%) ⬆️ |
| unit-multifield | 30.62% <94.44%> (-0.01%) ⬇️ |
| unit-odes | 28.47% <94.44%> (-0.01%) ⬇️ |
| unit-referencefes | 34.08% <94.44%> (-0.01%) ⬇️ |
| unit-visualization | 11.76% <55.55%> (+0.01%) ⬆️ |

@Antoinemarteau

Antoinemarteau commented May 20, 2026

Collaborator

My thoughts on this:

  • This PR trades a dynamic dispatch on getindex! for a type assertion in array_cache (which will always be checked at runtime, since typeof(existing) is inferred as Any). I genuinely don't know which is cheaper; did you benchmark it? I suspect they have similar costs.
  • If there is a function barrier between getting the cache and calling getindex! on it, the type stability of array_cache may not matter: you pay one dynamic dispatch at the barrier, and inside it the type of the cache is known.

But since there are many places in the code without a function barrier, often with several (or many) getindex! calls for a single array_cache, I think this PR is a good idea.
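For reference, a function barrier in this sense looks roughly like the sketch below. The names `fill_and_sum!` and `process` are made up for illustration, and a one-element vector stands in for the cache.

```julia
# Kernel behind the barrier: Julia compiles a specialization for the concrete
# type of `cache`, so the loop body is statically dispatched.
function fill_and_sum!(cache::AbstractVector, n::Int)
  s = zero(eltype(cache))
  for i in 1:n
    cache[1] = i
    s += cache[1]
  end
  return s
end

# Caller: `cache` is inferred as Any, so the call below is the one dynamic dispatch.
function process(dict::Dict{UInt,Any}, key::UInt, n::Int)
  cache = dict[key]
  return fill_and_sum!(cache, n)
end

dict = Dict{UInt,Any}(UInt(1) => zeros(1))
println(process(dict, UInt(1), 3))
```

Without the barrier (the loop written directly in `process`), every iteration would pay the dynamic dispatch instead of paying it once.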

@Antoinemarteau

Antoinemarteau commented May 20, 2026

Collaborator

Hold on a second: there are zero calls to array_cache(dict,a) in the whole code base outside the Arrays module. Are you sure this memoization is ever used? It seems to me that every array_cache(a) call (on the "outermost" LazyArray) creates a new dict, puts its cache in there, and the dict is never reused. Or am I missing something?

Another avenue: since array_cache(a) specializes on typeof(a), if we assume that typeof(cache) is unique for each typeof(a), then we can change this generic method

_default_array_cache(a,s::Val{true}) = array_cache(Dict{UInt,Any}(),a)

to actually create a first cache and use typeof(cache) instead of Any as the value type of that dict, recovering type stability when getting from the dict.
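A minimal sketch of that idea, with a hypothetical `build_cache` in place of the real cache construction. The toy stores a single kind of cache; whether one concrete value type suffices for a whole operation tree is exactly the assumption stated above.

```julia
# Hypothetical cache builder, not Gridap API.
build_cache(a::Vector{T}) where {T} = zeros(T, length(a))

function typed_cache(a::Vector)
  cache = build_cache(a)
  dict = Dict{UInt,typeof(cache)}()   # concrete value type instead of Any
  dict[objectid(a)] = cache
  return dict[objectid(a)]            # lookup now has a concrete inferred type
end

a = [1.0, 2.0, 3.0]
println(only(Base.return_types(typed_cache, (Vector{Float64},))))
```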

@Antoinemarteau

Collaborator

More info after some discussion with @JordiManyer and some work:

  • The point of the dict is to keep only one cache for each repeated identical operation in the operation tree of the parent array, in particular to share the memoized IndexItemPair value in the lazy arrays' caches.
  • The goal here is to make the concrete typeof(cache) inferable at compile time (to avoid dynamic dispatches in getindex! in the caller of array_cache); it is not only about making array_cache type stable.
  • The current solution above achieves this, but at the price of recomputing all the caches that may already be stored. I benchmarked array_cache(dict, lazyarray) for a large domain-contribution LazyArray from the PoissonDGTests, and got:
    -- 2.138 μs (8 allocations: 5.36 KiB), and inferred type Any, for the existing code
    -- 44.919 μs (1688 allocations: 153.92 KiB), and correctly inferred typeof(cache), with the solution above
    -- 3.720 μs (10 allocations: 6.98 KiB), and correctly inferred typeof(cache), with the very hacky solution (described below)
    -- 1.449 μs (1 allocation: 576 bytes), and correctly inferred typeof(cache), with the very hacky solution and same_branch removed
    -- 24.496 μs (1445 allocations: 118.17 KiB), and correctly inferred typeof(cache), with the solution above and same_branch removed
    (These benchmarks were made with @btime, so they show the best case, where all caches are already in the dict.)

Hacky solution: define _array_cache_type and use an aggressive Base.@assume_effects to constant-fold it (it is like Core.Compiler.return_type, but taking argument values instead of their types, on steroids):

function _array_cache_type(dict,a::LazyArray)
  cache = _array_cache!(dict, a)
  return typeof(cache)
end

function array_cache(dict::Dict,a::LazyArray)
  T = begin Base.@assume_effects :foldable :nothrow _array_cache_type(dict,a) end
  existing = _get_cache(dict,a)
  if isnothing(existing)
    cache = _array_cache!(dict,a)
    dict[objectid(a)] = (a,cache)
    return cache::T
  else
    dict[objectid(a)] = (a,existing)
    return existing::T
  end
end

We also discussed the opportunity to get rid of the same_branch logic, which seems to be a not-always-beneficial micro-optimization that complicates the code. It was added in #701, but unfortunately no benchmark was added there that would detect a performance regression from dropping it. If we remove same_branch, we can replace the Dict with an IdDict and elegantly rewrite the function as:

function array_cache(dict::IdDict, a::LazyArray)
  T = begin Base.@assume_effects :foldable :nothrow _array_cache_type(dict,a) end
  # T = _array_cache_type(dict,a) # this actually allocates all the children's caches..
  # exchange the two lines above to get the "existing solution with removed same_branch benchmark"

  cache = get!(dict, a) do
    _array_cache!(dict, a)
  end

  return cache::T
end

The Arrays and FESpaces tests passed after I removed same_branch.
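The `get!(dict, a) do ... end` form on an `IdDict` memoizes by object identity: the block runs only when `a` itself is not yet a key. A small sketch with made-up names (`expensive_cache`, `memoized`, `CALLS`):

```julia
const CALLS = Ref(0)   # counts how often the toy cache is actually built

function expensive_cache(a::Vector{Float64})
  CALLS[] += 1
  return similar(a)
end

# The do-block is evaluated only on a miss; its result is stored and returned.
function memoized(dict::IdDict, a::Vector{Float64})
  return get!(dict, a) do
    expensive_cache(a)
  end
end

dict = IdDict()
a = [1.0, 2.0]
b = [1.0, 2.0]          # equal to `a` by value, but a different object
c1 = memoized(dict, a)  # miss
c2 = memoized(dict, a)  # hit: same object comes back
c3 = memoized(dict, b)  # miss: IdDict compares keys with ===
println((c1 === c2, c1 === c3, CALLS[]))
```

Note that `IdDict()` has value type Any, so the value returned by `get!` is still inferred as Any on its own.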

@Antoinemarteau

Antoinemarteau commented May 21, 2026

Collaborator

OK, I think I found an ideal solution: caches are only computed once, and the only dynamic operation is the type assertion cache::T at the end of array_cache. The code is as follows:

function array_cache(dict::IdDict, a::LazyArray)
  T = _allocate_cache_and_compute_its_type(dict,a)
  cache = get(dict, a) do
    @unreachable "The cache should be stored already."
  end
  return cache::T
end

function _allocate_cache_and_compute_its_type(dict, a::LazyArray)
  cache = get!(dict, a) do
    _array_cache!(dict, a)
  end
  return typeof(cache)
end

This code assumes that same_branch was removed, and that the Dict has been replaced with an IdDict, but the trick can be adapted even if we keep same_branch.
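The allocate-once behaviour of this pattern can be checked on a toy, under some assumptions: `build_cache`, `allocate_and_type!`, `toy_array_cache` and `BUILDS` are hypothetical names, `error` replaces Gridap's `@unreachable`, and the three-argument `get` is used for the lookup.

```julia
const BUILDS = Ref(0)   # counts how often the toy cache is actually built

function build_cache(a::Vector{Float64})
  BUILDS[] += 1
  return zeros(length(a))
end

# Stores the cache on a miss and reports the type of whatever is stored.
function allocate_and_type!(dict::IdDict, a::Vector{Float64})
  cache = get!(dict, a) do
    build_cache(a)
  end
  return typeof(cache)
end

function toy_array_cache(dict::IdDict, a::Vector{Float64})
  T = allocate_and_type!(dict, a)
  cache = get(dict, a, nothing)
  cache === nothing && error("The cache should be stored already.")
  return cache::T
end

dict = IdDict()
a = [1.0, 2.0]
c1 = toy_array_cache(dict, a)
c2 = toy_array_cache(dict, a)
println((c1 === c2, BUILDS[]))
# Prints what inference concludes for the toy; no particular result is assumed here.
println(Base.return_types(toy_array_cache, (IdDict{Any,Any}, Vector{Float64})))
```

The last line is there because the inferred return type is the property this thread cares about; it is worth checking it the same way (or with `@code_warntype`) on the real array_cache.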

Some benchmarks using @btime (caches are already in the dict):

  • If the cache is not put in the dict while computing T: 24.446 μs (1445 allocations: 118.17 KiB) (equivalent to the solution at the top)
  • If the cache is stored in the dict while computing T: 2.863 μs (2 allocations: 1.12 KiB) (this solution)
  • Hacky solution with Base.@assume_effects: 1.447 μs (1 allocation: 576 bytes)

Benchmarks using @time (caches are not in the dict):

  • Previous solution: 200-300 μs (6.59 k allocations: 533.250 KiB)
  • New solution: 150-170 μs (1.49 k allocations: 126.172 KiB)
  • Super hacky: 110-130 μs (1.45 k allocations: 121.281 KiB)

@Antoinemarteau

Collaborator

@JordiManyer I think I should implement the last solution on this branch. Do we agree that we can remove same_branch?

@JordiManyer

Member Author
function array_cache(dict::IdDict, a::LazyArray)
  T = _allocate_cache_and_compute_its_type(dict,a)
  cache = get(dict, a) do
    @unreachable "The cache should be stored already."
  end
  return cache::T
end

function _allocate_cache_and_compute_its_type(dict, a::LazyArray)
  cache = get!(dict, a) do
    _array_cache!(dict, a)
  end
  return typeof(cache)
end

Oh I like this, I like it very much indeed. Let me look at this tomorrow and test it myself.

Concerning same_branch, let me also have a look. In the PR you mention, the conclusion seems to be that it matters. Maybe @ericneiva remembers in which cases it mattered?

@ericneiva

Member

Hello, @JordiManyer and @Antoinemarteau,

Maybe @ericneiva remembers in which cases it mattered?

I barely remember and I feel bad about this, but I do know the short answer is it matters in weak forms with a large number of terms.

IIRC, here is the example that motivated this optimization.

transient_navierStokes.txt

It's an implementation of an Embedded ALE formulation of the time-dependent Navier-Stokes equations.

Hope this gives you some hints.

3 participants