feat: implement array_exists with lambda support via JVM UDF bridge#4223
feat: implement array_exists with lambda support via JVM UDF bridge#4223andygrove wants to merge 2 commits intoapache:mainfrom
Conversation
|
@hsiang-c fyi |
Adds a new JVM UDF bridge framework that allows Spark expressions to be evaluated on the JVM side via Arrow C Data Interface, while keeping the native execution pipeline intact. Includes array_exists as the first lambda-based expression using this framework.
a57dd14 to
f1ece6c
Compare
|
Are we planning to merge it asap or wait DF 54.0? |
|
we can try use apache/datafusion#21903 directly or create |
I would love to get the JVM UDF framework in (once reviewed). There are many applications where it can help us get acceleration by default rather than opt-in
What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas? |
I could split the JVM UDF work out into a separate PR but there would be no tests if we don't have an example of an expression using it |
Having no tests for lambda is fine IMO as we do not expose the feature to users right away.
Thats the entire intention of My main concern we could end up with multiple lambda implementation in DF and in Comet and might cause confusion and conflicts. The small poc PR shown the For customers we can build another branch on top of DF54 migration branch and including lambda functions there, so they can test it, WDYT? |
Ok, here is new PR with just the framework - #4232 Moving this PR to draft |
Which issue does this PR close?
Part of #4193
Rationale for this change
This PR adds a new Comet JVM UDF feature, where Comet can have JVM implementations of expressions that operate on Arrow data.
array_existsis implemented as the first example.The advantage of this approach is that we can quickly implement these features with 100% Spark compatibility without re-implementing the expressions in native code -we just call existing Java/Spark code, but operator on Arrow data, and avoid an expensive transition falling back to Spark.
Performance is 1.8x of Spark.
What changes are included in this PR?
Experimental support for Spark's
exists(array, x -> predicate(x))— the first lambda-based expression accelerated by Comet.CometLambdaRegistry: static concurrent map bridging plan-time lambda expressions to execution-time UDF lookupArrayExistsUDF: iterates ListVector elements, evaluates the lambda predicate via Spark'sNamedLambdaVariable, implements three-valued null logicCometArrayExistsserde: registers the lambda, emitsJvmScalarUdfprotoHow are these changes tested?
5 end-to-end tests in
CometArrayExpressionSuitecovering integer predicates, string predicates, null elements with three-valued logic, all-match case, and empty arrays. All usecheckSparkAnswerAndOperatorto verify both correctness and native execution.