feat: implement array_exists with lambda support via JVM UDF bridge by andygrove · Pull Request #4223 · apache/datafusion-comet

andygrove · 2026-05-05T08:39:33Z

Which issue does this PR close?

Part of #4193

Rationale for this change

This PR adds a new Comet JVM UDF feature, where Comet can have JVM implementations of expressions that operate on Arrow data.

array_exists is implemented as the first example.

The advantage of this approach is that we can quickly implement these features with 100% Spark compatibility without re-implementing the expressions in native code. We just call existing Java/Spark code, but operate on Arrow data, and avoid an expensive transition when falling back to Spark.
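
To make "operate on Arrow data" concrete, here is a minimal sketch of evaluating a predicate over a list column stored in an Arrow-style layout (an offsets buffer plus a flat child values buffer). It uses plain Java arrays rather than the Arrow library, and the class and method names are hypothetical, not taken from this PR. Nulls are ignored here for brevity.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

public class ListColumnExistsSketch {
    // Row i of the list column spans values[offsets[i]] .. values[offsets[i + 1] - 1],
    // mirroring how Arrow stores a list column (offsets buffer + child values).
    public static boolean[] existsPerRow(int[] offsets, int[] values, IntPredicate predicate) {
        int rowCount = offsets.length - 1;
        boolean[] result = new boolean[rowCount];
        for (int row = 0; row < rowCount; row++) {
            for (int i = offsets[row]; i < offsets[row + 1]; i++) {
                if (predicate.test(values[i])) {
                    result[row] = true; // short-circuit on the first match
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three rows: [1, 2], [], [3, -4]
        int[] offsets = {0, 2, 2, 4};
        int[] values = {1, 2, 3, -4};
        boolean[] out = existsPerRow(offsets, values, x -> x < 0);
        System.out.println(Arrays.toString(out)); // [false, false, true]
    }
}
```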

In the microbenchmark below, Comet (Scan + Exec) runs about 1.8x faster than Spark.

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 26.2
Apple M3 Max
array_exists - int array (x -> x < 0):    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               584            591           7          7.2         139.3       1.0X
Comet (Scan)                                        588            623          40          7.1         140.1       1.0X
Comet (Scan + Exec)                                 322            329           3         13.0          76.8       1.8X
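
The Relative column can be reproduced from the Best Time column, assuming (this is an assumption, not stated in the PR) that it is the ratio of the baseline's best time to each case's best time, rounded to one decimal place. The class name is hypothetical.

```java
import java.util.Locale;

public class RelativeSpeedupSketch {
    // Ratio of baseline best time to candidate best time, formatted like the table.
    public static String relative(double baselineMs, double candidateMs) {
        return String.format(Locale.ROOT, "%.1fX", baselineMs / candidateMs);
    }

    public static void main(String[] args) {
        System.out.println(relative(584, 584)); // Spark:               1.0X
        System.out.println(relative(584, 588)); // Comet (Scan):        1.0X
        System.out.println(relative(584, 322)); // Comet (Scan + Exec): 1.8X
    }
}
```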

What changes are included in this PR?

Experimental support for Spark's exists(array, x -> predicate(x)) — the first lambda-based expression accelerated by Comet.

CometLambdaRegistry: static concurrent map bridging plan-time lambda expressions to execution-time UDF lookup
ArrayExistsUDF: iterates ListVector elements, evaluates the lambda predicate via Spark's NamedLambdaVariable, implements three-valued null logic
CometArrayExists serde: registers the lambda, emits JvmScalarUdf proto
Scope: single-argument lambdas referencing only the array element; primitive + string element types
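
As a rough illustration of the registry idea (a static concurrent map that hands out a key at plan time and resolves it at execution time), here is a minimal sketch. The class name, method names, and the use of a UUID key are assumptions made for illustration and are not the actual CometLambdaRegistry API. It does not cover cleanup policy or how entries would be made visible to another process, which the PR text does not describe.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

public class LambdaRegistrySketch {
    // Shared, thread-safe map from an opaque key to a registered lambda.
    private static final Map<String, Predicate<Integer>> REGISTRY = new ConcurrentHashMap<>();

    // Plan time: store the lambda and return a key that can be embedded in the plan.
    public static String register(Predicate<Integer> lambda) {
        String key = UUID.randomUUID().toString();
        REGISTRY.put(key, lambda);
        return key;
    }

    // Execution time: resolve the key back to the lambda.
    public static Predicate<Integer> lookup(String key) {
        Predicate<Integer> lambda = REGISTRY.get(key);
        if (lambda == null) {
            throw new IllegalStateException("No lambda registered for key " + key);
        }
        return lambda;
    }

    // Drop the entry once it is no longer needed.
    public static void remove(String key) {
        REGISTRY.remove(key);
    }

    public static void main(String[] args) {
        String key = register(x -> x < 0);
        System.out.println(lookup(key).test(-5)); // true
        remove(key);
    }
}
```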

How are these changes tested?

5 end-to-end tests in CometArrayExpressionSuite covering integer predicates, string predicates, null elements with three-valued logic, all-match case, and empty arrays. All use checkSparkAnswerAndOperator to verify both correctness and native execution.
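
The null-element and empty-array cases above follow from three-valued logic: the result is true if any element matches, null if nothing matches but at least one predicate evaluation was null, and false otherwise. The sketch below illustrates those semantics in plain Java; it is not the PR's ArrayExistsUDF, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class ExistsThreeValuedSketch {
    // The predicate returns TRUE, FALSE, or null (unknown).
    public static Boolean exists(List<Integer> array, Function<Integer, Boolean> predicate) {
        if (array == null) {
            return null; // a null array yields a null result
        }
        boolean sawNull = false;
        for (Integer element : array) {
            Boolean result = predicate.apply(element);
            if (result == null) {
                sawNull = true; // remember the unknown, keep looking for a match
            } else if (result) {
                return Boolean.TRUE;
            }
        }
        return sawNull ? null : Boolean.FALSE;
    }

    public static void main(String[] args) {
        // x -> x < 0, where comparing a null element gives null
        Function<Integer, Boolean> isNegative = x -> x == null ? null : x < 0;
        System.out.println(exists(Arrays.asList(1, -2, null), isNegative)); // true
        System.out.println(exists(Arrays.asList(1, null, 3), isNegative));  // null
        System.out.println(exists(Arrays.asList(1, 2, 3), isNegative));     // false
        System.out.println(exists(Collections.emptyList(), isNegative));    // false
    }
}
```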

andygrove · 2026-05-05T15:06:21Z

@hsiang-c fyi

Adds a new JVM UDF bridge framework that allows Spark expressions to be evaluated on the JVM side via the Arrow C Data Interface, while keeping the native execution pipeline intact. Includes array_exists as the first lambda-based expression using this framework.

comphead · 2026-05-05T17:00:22Z

Are we planning to merge it asap or wait DF 54.0?

comphead · 2026-05-05T17:02:40Z

We can try to use apache/datafusion#21903 directly or create a datafusion-spark counterpart.

andygrove · 2026-05-05T18:44:11Z

> Are we planning to merge it asap or wait DF 54.0?

I would love to get the JVM UDF framework in (once reviewed).

There are many applications where it can help us get acceleration by default rather than opt-in

lambdas
JSON
regex

What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas?

andygrove · 2026-05-05T19:04:32Z

> Are we planning to merge it asap or wait DF 54.0?
>
> I would love to get the JVM UDF framework in (once reviewed).
>
> There are many applications where it can help us get acceleration by default rather than opt-in
>
> lambdas
> JSON
> regex
>
> What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas?

I could split the JVM UDF work out into a separate PR but there would be no tests if we don't have an example of an expression using it

comphead · 2026-05-05T21:15:16Z

> Are we planning to merge it asap or wait DF 54.0?
>
> I would love to get the JVM UDF framework in (once reviewed).
> There are many applications where it can help us get acceleration by default rather than opt-in
>
> lambdas
> JSON
> regex
>
> What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas?
>
> I could split the JVM UDF work out into a separate PR but there would be no tests if we don't have an example of an expression using it

Having no tests for lambda is fine IMO as we do not expose the feature to users right away.

> Does that give us 100% compatibility for array_exists with lambdas?

That's the entire intention of the datafusion-spark module, but unfortunately the compatibility does not always hold.

My main concern is that we could end up with multiple lambda implementations, in DF and in Comet, which might cause confusion and conflicts. The small PoC PR showed that array_exists works with basic lambdas (no column capture, no nested lambdas): https://github.com/apache/datafusion-comet/pull/4127/changes#diff-7411f5845a2488bb1509b95d8ad1e014422e21d70e0b802bd7624eabc4621c66

For customers, we can build another branch on top of the DF54 migration branch and include lambda functions there so they can test it. WDYT?

andygrove · 2026-05-05T21:28:46Z

> Having no tests for lambda is fine IMO as we do not expose the feature to users right away.

Ok, here is a new PR with just the framework - #4232

Moving this PR to draft

andygrove mentioned this pull request May 5, 2026

Support higher-order array functions via JVM UDF bridge #4224

Open

andygrove marked this pull request as ready for review May 5, 2026 15:06

andygrove requested a review from comphead May 5, 2026 15:06

andygrove force-pushed the array-exists-lambda branch from a57dd14 to f1ece6c Compare May 5, 2026 16:50

andygrove mentioned this pull request May 5, 2026

feat: Add Comet UDF framework, and accelerate RLike with 100% Spark compatibility #4170

Draft

style: fix import ordering in jni-bridge lib.rs

ba15abe

andygrove marked this pull request as draft May 5, 2026 21:28

andygrove wants to merge 2 commits into apache:main from andygrove:array-exists-lambda

2 participants