feat(experimental): ScalaUDF and Java UDF support via Janino codegen by mbutrovich · Pull Request #4267 · apache/datafusion-comet

mbutrovich · 2026-05-08T15:36:39Z

Which issue does this PR close?

Closes #.

Rationale for this change

Builds on the JVM UDF bridge (#4232) and the per-task-attempt-id bridge cache (#4345). Adds CometScalaUDFCodegen, a CometUDF that compiles a specialized Arrow batch kernel per bound ScalaUDF and input schema via Janino. Without it, any plan containing a ScalaUDF falls back to Spark for the enclosing operator.

Any ScalaUDF whose argument and return types are in the supported surface routes through native, no hand-written CometUDF required.
The whole argument tree binds together, so Catalyst sub-expressions (upper(s), concat(c1, c2), monotonically_increasing_id(), HOFs like transform / filter / array_max) compile into the same per-row loop as the user function.
Surrounding native operators stay native. The UDF is no longer a whole-operator fallback boundary.

Opt-in via spark.comet.exec.scalaUDF.codegen.enabled (default false, experimental).

Iceberg ships ScalaUDFs that real workloads hit: IcebergSpark.registerBucketUDF / registerTruncateUDF for partition-aligned predicates, and RewriteDataFiles with sort-strategy=zorder for compaction. With this PR enabled, those run natively and the surrounding project / exchange / sort stay on the Comet path. Without it, the operator falls back and the shuffle gets demoted from CometExchange to CometColumnarExchange.

The dispatcher is one of potentially many CometUDF implementations the bridge can route to. Hand-written CometUDFs for specific expression families remain a parallel path, dispatched by class name from the proto.

What changes are included in this PR?

| Area | Where |
| --- | --- |
| Core codegen + dispatcher + serde | `spark/src/main/scala/org/apache/comet/codegen/`, `.../udf/codegen/`, `.../serde/CometScalaUDF.scala` |
| Tests | 4 suites + shared assertions under `spark/src/test/scala/org/apache/comet/CometCodegen*.scala` |
| Cross-version shims | `spark/src/main/spark-{3.x,4.0,4.1,4.2,4.x}/.../shims/` |
| Native (Rust) | FFI / planner / jni-bridge cleanup |
| Docs | `docs/source/user-guide/latest/scala_java_udfs.md` |
| CI | New suite names in `pr_build_*.yml` |

Where to focus review

Codegen template: CometBatchKernelCodegen plus the input/output emitters.
Dispatcher lifetime, caching, synchronization: CometScalaUDFCodegen.
Plan-time gating and bound-tree serialization: CometScalaUDF.
Test helpers: CometCodegenAssertions.

How are these changes tested?

CometCodegenSourceSuite: generated-source assertions for each optimization, complex-type shapes, null-guard contract per Struct / Array / Map element and field, and CacheKey discrimination on ArrowColumnSpec.nullable.
CometCodegenSuite: end-to-end correctness across scalar and complex type surfaces, composed UDF trees, subquery reuse, TaskContext propagation, per-task cache isolation, kernel-cache reuse across batches, ScalaUDF as a child of a native Spark expression, maxFields plan-time gate, null-guard contract via array_max(flatten(...)) for Binary / String / Decimal short / Decimal long.
CometCodegenHOFSuite: ArrayTransform / ArrayFilter regressions plus a per-task isolation regression that runs the same HOF query twice and asserts each matches Spark.
CometCodegenFuzzSuite: schema-driven fuzz over random parquet: identity ScalaUDF on every primitive column, cardinality probe on every complex column, per-column array_max element fuzz, array_max(flatten(...)) over Array<Array<primitive>>, array_max(map_keys / map_values(...)) over Map<primitive, primitive>, array_distinct over Array<Struct<primitives>>, randomized decimal sweep across the MAX_LONG_DIGITS=18 boundary at varying null densities.
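
The "Decimal short / Decimal long" split and the MAX_LONG_DIGITS=18 boundary mentioned above can be illustrated with a small stdlib-only sketch. It is not code from this PR: the class and method names are hypothetical, and it only shows why 18 digits is the natural cut-over (every 18-digit unscaled value fits in a signed 64-bit long, while some 19-digit values do not).

```java
import java.math.BigDecimal;

final class DecimalBoundarySketch {
    // Boundary named in the fuzz-suite description; the constant is redeclared here for illustration.
    static final int MAX_LONG_DIGITS = 18;

    /** True when every decimal of this precision has an unscaled value that fits in a long. */
    static boolean fitsInLong(int precision) {
        return precision <= MAX_LONG_DIGITS;
    }

    /** "Short" path: read the unscaled value as a long; throws ArithmeticException on overflow. */
    static long unscaledLong(BigDecimal value) {
        return value.unscaledValue().longValueExact();
    }

    public static void main(String[] args) {
        BigDecimal p18 = new BigDecimal("9999999999999999.99");  // precision 18, scale 2
        BigDecimal p19 = new BigDecimal("99999999999999999.99"); // precision 19, scale 2

        // Precision 18 can take the long-backed path.
        System.out.println(fitsInLong(p18.precision()) + " -> " + unscaledLong(p18));

        // Precision 19 has to stay on the BigDecimal-backed path.
        System.out.println(fitsInLong(p19.precision()));
    }
}
```

A fuzz sweep that straddles precision 18 and 19 exercises both paths, which is presumably why the suite randomizes across that boundary.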

mbutrovich · 2026-05-08T21:49:51Z

There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward.

andygrove

I did a first-pass review without using AI and this is looking great! It is disabled by default, so I'd be fine with merging and continuing to iterate. I would like to run some benchmarks at scale.

I'll review again with AI assistance next.

andygrove · 2026-05-19T15:28:51Z

AI review:

On CometScalaUDFCodegen.scala:149 (ensureKernel):
"Could you walk me through what happens to Rand and MonotonicallyIncreasingID state when a partition has batches with different nullability profiles? The cache key includes per-batch nullable(v) =
v.getNullCount != 0, so a nullability flip inside the partition will land on a different activeKey, drop the existing activeKernel, and re-init(partitionId) the new one. If batches flip back and forth,
each transition reseeds XORShiftRandom and resets the ID counter, which could produce duplicate rand sequences or overlapping IDs. Is there a test that pins a single partition's rand /
monotonically_increasing_id output across a nullability flip?"
On CometScalaUDFCodegen.scala:213 (nullable(v)):
"Arrow Java's `getNullCount` can return `-1` for vectors whose null count hasn't been computed yet (depending on how the vector was constructed). On the FFI import path from native, is there a
guarantee `getNullCount` is materialized before we observe it here? A `-1` would compare `!= 0` and the kernel would compile as nullable, which is harmless. The opposite direction (`0` returned for a vector that does
have nulls) is what would worry me."
On CometScalaUDF.scala:62 (expr.collect { case a: AttributeReference => a }.distinct):
"AttributeReference equality includes exprId so this should dedupe correctly, but it might be worth a comment noting that the resulting ordinal order is determined by tree traversal order (so the executor
side has to recompute the same bound positions from the same tree)."
On CometBatchKernelCodegen.scala:212-218 (freshReferences):
"The freshReferences closure captures boundExpr and inputSchema and re-runs generateSource each time. The doc says this is microseconds versus milliseconds for Janino compile, which is fine, but if a
partition flips between two kernels often this becomes per-flip work. Worth a follow-up TODO to cache the references array per cache entry and only regenerate when ScalaUDF stateful encoders force it?"
On the PR description (Spark test diffs bullet):
"The PR description mentions dev/diffs/*.diff updates totaling 198 lines, but I don't see them in the current diff (looks like commit a159357 rolled them back). If Matt's earlier comment about "4 Spark
SQL test failures" still applies, those diff updates might want to come back. If those failures resolved themselves with the rebase, the bullet in the description can drop."
General (no specific anchor):
"All CI green, very thorough test coverage, well-documented internal invariants. Disabled by default makes the experimental landing low-risk. Once the rand/MonotonicallyIncreasingID lifetime question is
settled (either with a test or with a code clarification) I think this is good to merge and iterate from main."

mbutrovich · 2026-05-19T18:41:56Z

On the PR description (Spark test diffs bullet):
"The PR description mentions dev/diffs/*.diff updates totaling 198 lines, but I don't see them in the current diff (looks like commit a159357 rolled them back). If Matt's earlier comment about "4 Spark
SQL test failures" still applies, those diff updates might want to come back. If those failures resolved themselves with the rebase, the bullet in the description can drop."

This is old from when I was testing with the feature enabled by default. I'll update the description and address the other feedback shortly. Thanks @andygrove!

mbutrovich · 2026-05-19T22:17:46Z

Thanks @andygrove for the careful review and the AI pass, both turned up real issues.

Changes pushed:

Doc: scalar UDF intro now links to Spark's scalar UDF guide and avoids the temporally fragile "no longer falls back" phrasing.
Dispatcher: stopped deriving spec nullability from per-batch getNullCount. The cache key is now a function of the bound expression bytes and the schema-stable input shapes only. BoundReference.nullable (Catalyst's schema-tracked flag, baked into the serialized bytes on the driver) is the sole source of nullability information, so schema-declared non-null columns still get full isNullAt elision via Spark's own doGenCode.
Dispatcher: replaced the single activeKernel slot with a kernel instance stashed in each CacheEntry. The instance is init(partitionId)'d once at compile time and reused for every batch of that (expression, schema). Removed ensureKernel, rewriteBoundReferences, and the active-slot bookkeeping.
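
A minimal sketch of the per-entry kernel idea described above, using only the Java standard library. Everything here is hypothetical and stands in for the real classes: `KernelCacheSketch`, `Kernel`, and the string key are illustrative, and the counter layout merely mimics how a per-partition ID generator might be seeded. The point it shows is that a kernel stored with its cache entry is initialized once and keeps its state even when batches for different keys interleave.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class KernelCacheSketch {
    /** Stand-in for a compiled kernel; the counter models per-partition state. */
    static final class Kernel {
        private long nextId;
        int initCalls;

        void init(int partitionId) {
            // Assumption: partition id in the upper bits, row counter in the lower bits.
            nextId = ((long) partitionId) << 33;
            initCalls++;
        }

        long[] process(int numRows) {
            long[] out = new long[numRows];
            for (int i = 0; i < numRows; i++) {
                out[i] = nextId++;
            }
            return out;
        }
    }

    private final Map<String, Kernel> entries = new HashMap<>();
    private final int partitionId;

    KernelCacheSketch(int partitionId) {
        this.partitionId = partitionId;
    }

    /** Hypothetical key: expression bytes plus schema-stable input types, nothing per-batch. */
    static String key(byte[] exprBytes, String... inputTypes) {
        return Arrays.toString(exprBytes) + "|" + String.join(",", inputTypes);
    }

    /** Create-and-init happens once per key; later batches reuse the same instance. */
    Kernel kernelFor(String key) {
        return entries.computeIfAbsent(key, k -> {
            Kernel kernel = new Kernel();
            kernel.init(partitionId);
            return kernel;
        });
    }

    public static void main(String[] args) {
        KernelCacheSketch cache = new KernelCacheSketch(0);
        String udfA = key(new byte[] {1}, "int");
        String udfB = key(new byte[] {2}, "int");

        // Batches for the two expressions interleave, yet each counter keeps advancing.
        System.out.println(Arrays.toString(cache.kernelFor(udfA).process(3)));
        System.out.println(Arrays.toString(cache.kernelFor(udfB).process(2)));
        System.out.println(Arrays.toString(cache.kernelFor(udfA).process(3)));
    }
}
```

Because the key contains no per-batch property, a batch with or without nulls maps to the same entry, so there is nothing that could trigger a re-init mid-partition in this model.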

The AI review caught two real correctness bugs the previous design had:

A nullability flip mid-partition (one batch has nulls, the next does not) reset the kernel and replayed any per-partition stateful counters (MonotonicallyIncreasingID, Rand's XORShiftRandom). Reproduced with a 200-row single-partition parquet at batch size 8 with a null range; the new test pins the invariant.
A plan with two distinct ScalaUDFs in one operator thrashed the single active-kernel slot per batch and reset state on every flip. Reproduced with two UDFs each wrapping monotonically_increasing_id(); new test pins it. Confirmed the same shape also fires when one UDF is applied to two columns of differing schema nullability (different BoundReference ordinals produce different cache keys), covered by a third test.
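
The first failure mode can be modeled in a few lines. This is a simplified, hypothetical reconstruction, not the removed code: `runWithActiveSlot` keys a single active slot on a per-batch "has nulls" flag and resets the counter whenever the flag changes, while `runWithStableKey` keeps one counter for the whole partition. Rand reseeding is not modeled; the counter stands in for any per-partition state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

final class ActiveSlotBugSketch {
    /** Previous behavior as described: a key change drops the kernel and re-inits its state. */
    static List<Long> runWithActiveSlot(boolean[] batchHasNulls, int rowsPerBatch) {
        List<Long> ids = new ArrayList<>();
        Boolean activeKey = null; // per-batch nullability used as the cache key
        long nextId = 0;
        for (boolean hasNulls : batchHasNulls) {
            if (activeKey == null || activeKey.booleanValue() != hasNulls) {
                activeKey = hasNulls;
                nextId = 0; // re-init: the counter restarts from the beginning
            }
            for (int i = 0; i < rowsPerBatch; i++) {
                ids.add(nextId++);
            }
        }
        return ids;
    }

    /** Fixed behavior: the key ignores per-batch nullability, so state is never reset. */
    static List<Long> runWithStableKey(boolean[] batchHasNulls, int rowsPerBatch) {
        List<Long> ids = new ArrayList<>();
        long nextId = 0;
        for (int b = 0; b < batchHasNulls.length; b++) {
            for (int i = 0; i < rowsPerBatch; i++) {
                ids.add(nextId++);
            }
        }
        return ids;
    }

    static boolean hasDuplicates(List<Long> ids) {
        return new HashSet<>(ids).size() != ids.size();
    }

    public static void main(String[] args) {
        // Three batches of 8 rows; only the middle batch contains nulls.
        boolean[] flips = {false, true, false};
        System.out.println("active slot duplicates: " + hasDuplicates(runWithActiveSlot(flips, 8)));
        System.out.println("stable key duplicates:  " + hasDuplicates(runWithStableKey(flips, 8)));
    }
}
```

The two-UDF case has the same shape: replace the nullability flag with "which expression is this batch for" and the single slot resets on every alternation.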

The other AI suggestions turned out to be moot:

getNullCount == -1 no longer matters since we no longer read getNullCount.
The freshReferences per-flip cost is no longer a concern since the kernel is instantiated exactly once per cache entry rather than per partition-flip.
The AttributeReference.collect.distinct ordering note: the existing comment already documents the load-bearing invariant (ordinals align with the data args we ship), and the framing the AI suggested would have been slightly misleading about what the executor does.
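
For readers unfamiliar with the invariant behind `collect { ... }.distinct`, the following stdlib-only sketch shows how ordinals fall out of a pre-order traversal with first-occurrence de-duplication. The `Node` type and `exprId` field are hypothetical stand-ins for Catalyst's expression tree, and the sketch makes no claim about what the executor side does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OrdinalBindingSketch {
    /** Tiny expression node; exprId < 0 marks a non-attribute (function call) node. */
    static final class Node {
        final String name;
        final long exprId;
        final List<Node> children;

        Node(String name, long exprId, List<Node> children) {
            this.name = name;
            this.exprId = exprId;
            this.children = children;
        }

        static Node attr(String name, long exprId) {
            return new Node(name, exprId, Collections.<Node>emptyList());
        }

        static Node call(String name, Node... children) {
            return new Node(name, -1L, Arrays.asList(children));
        }

        boolean isAttribute() {
            return exprId >= 0;
        }
    }

    /** Maps each distinct attribute (by exprId) to its ordinal in first-seen, pre-order order. */
    static Map<Long, Integer> ordinals(Node root) {
        Map<Long, Integer> out = new LinkedHashMap<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, Map<Long, Integer> out) {
        if (node.isAttribute()) {
            out.putIfAbsent(node.exprId, out.size()); // repeats keep their first ordinal
        }
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // udf(concat(c1, c2), upper(c1)): c1 appears twice but is bound once.
        Node tree = Node.call("udf",
            Node.call("concat", Node.attr("c1", 10L), Node.attr("c2", 11L)),
            Node.call("upper", Node.attr("c1", 10L)));
        System.out.println(ordinals(tree));
    }
}
```

In this model the ordinal of each attribute is also its position in the list of data arguments, which is the alignment the existing comment is said to document.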

PR description updated to drop the stale dev/diffs/*.diff bullet (artifact of when this was enabled by default).

feat: Arrow-direct codegen dispatcher for Spark expressions and Scala…

1746bcc

…UDFs

This was referenced May 8, 2026

feat: add experimental support for Spark regexp expressions via JVM UDF framework #4239

Open

feat: add user-facing CometUDF registration for custom JVM UDFs #4233

Draft

mbutrovich and others added 9 commits May 8, 2026 11:44

prettier, add new suites to CI checks.

08d6b78

make format, fix shims for 4.0+

557752e

make format, fix shims for 4.0+

896f61f

Merge branch 'main' into codegen_scala_udf

a82e160

strengthen tests for composed expressions

2a158f4

make format, again.

654bbad

fix pr_benchmark_check.yml

10df7e0

fix arrow shading issue in CI.

7afe69f

fix Spark 4.0 collation expression shim

0dc5855

mbutrovich and others added 17 commits May 8, 2026 19:44

apply common subexpression elimination, add tests for subqueries in UDFs

43a7b0c

make format

9640897

decimal fast path. document 64KB limitation right now

f0c8296

pass through task context to get around tokio worker pool calling ove…

2173f40

…r JNI

fix compilation on scala 2.12, fix format issue

2f9585b

Merge branch 'main' into codegen_scala_udf

582cd17

decimal output, utf8 output, non-nullable output optimizations

22f3256

optimization menu

7666715

estimate binaryview and binary size

0a34636

fix "CSE collapses a repeated subtree to one evaluation in the genera…

e94b6db

…ted body" on Spark 3.5

Merge remote-tracking branch 'origin/codegen_scala_udf' into codegen_…

d0f1f27

…scala_udf

add some complex type support, remove apache#4239 code. update docs.

07e37ea

split codegen input and output, basic struct WIP

ebf77c4

split massive codegen file, handle recursive nested types

6836c30

map input

5d91a8f

more struct support

2a28aaf

revert some benchmark changes

0c6586a

mbutrovich and others added 15 commits May 18, 2026 09:03

Merge branch 'main' into codegen_scala_udf

b1fbbb8

# Conflicts: # dev/diffs/3.4.3.diff # dev/diffs/3.5.8.diff

upmerge main, regenerate diffs

2be5f73

Merge branch 'main' into codegen_scala_udf

4d471e1

cleanup round 1

e19683e

cleanup round 2

ec42809

remove benchmark

9089fa1

remove cast from JNI layer that was a bandaid for List types

2259ff6

Merge branch 'main' into codegen_scala_udf

83096e7

fix scala 2.12

5ee1ddf

Merge remote-tracking branch 'origin/codegen_scala_udf' into codegen_…

2102f62

…scala_udf

set config to false by default since it's experimental

e98164c

Update fallback message.

ca4cd41

c12096e

roll back diff changes

a159357

Merge branch 'main' into codegen_scala_udf

a68ba53

andygrove reviewed May 19, 2026

Comment thread docs/source/user-guide/latest/scala_java_udfs.md Outdated

andygrove reviewed May 19, 2026

mbutrovich added 2 commits May 19, 2026 16:02

Merge branch 'main' into codegen_scala_udf

8a651e5

address PR feedback

63573ba

mbutrovich requested a review from andygrove May 20, 2026 00:59

mbutrovich and others added 6 commits May 20, 2026 07:26

tighten comments, fix planner.rs builder changes to align to codebase…

c9d2960

…, update user guide

Merge branch 'main' into codegen_scala_udf

41ea025

swap init and process in CometBatchKernel

3edba99

fix format

79a4e98

update shading comments after apache#4325

58757cb

clean up more comments

0b57f11

mbutrovich marked this pull request as ready for review May 20, 2026 12:28

mbutrovich wants to merge 93 commits into apache:main from mbutrovich:codegen_scala_udf
