feat: add native support for pmod expression#4277
Conversation
Implement Spark's pmod (positive modulo) expression natively in Comet. This avoids falling back to Spark for pmod operations on all numeric types.
TRY mode behaves identically to LEGACY (returns NULL on division by zero), so there was no reason to reject it. Also adds boundary value, NaN/Infinity, negative zero, and high-precision decimal test cases from the audit.
…e enabled When spark.comet.exec.strictFloatingPoint=true and the input type is float/double, pmod now falls back to Spark. This avoids the negative zero incompatibility where Comet returns 0.0 but Spark returns -0.0.
mbutrovich
left a comment
There was a problem hiding this comment.
First round of feedback, thanks @andygrove!
| .map_err(|e| arrow::error::ArrowError::ComputeError(e.to_string()))? | ||
| .to_array_of_size(left.len()) | ||
| .map_err(|e| arrow::error::ArrowError::ComputeError(e.to_string()))?; | ||
| let result = rem(left, right)?; |
There was a problem hiding this comment.
Could the negative-zero incompatibility be eliminated by branching with zip instead of doing an unconditional add/rem?
The current shape
let result = rem(left, right)?;
let neg = arrow::compute::kernels::cmp::lt(&result, &zero)?;
let plus = arrow::compute::kernels::zip::zip(&neg, right, &zero)?;
let result = arrow::compute::kernels::numeric::add(&plus, &result)?;
rem(&result, right)normalizes -0.0 through add(+0.0, -0.0) = +0.0, which is the source of the documented strict-floating-point caveat. Spark's reference implementation is if (r < 0) (r + n) % n else r, i.e. the non-negative branch returns r untouched.
A direct vectorized translation:
let result = rem(left, right)?;
let neg = arrow::compute::kernels::cmp::lt(&result, &zero)?;
let adjusted = rem(&arrow::compute::kernels::numeric::add(&result, right)?, right)?;
arrow::compute::kernels::zip::zip(&neg, &adjusted, &result)That preserves the original r for r >= 0, including -0.0, and matches Spark exactly. Quick sanity check on the interesting inputs:
| input | r |
neg |
result |
|---|---|---|---|
pmod(-0.0, 3.0) |
-0.0 |
false | -0.0 ✓ |
pmod(NaN, 3.0) |
NaN |
false (NaN cmp) | NaN ✓ |
pmod(-10, 3) |
-1 |
true | 2 ✓ |
pmod(-10, -3) |
-1 |
true | -1 ✓ |
If this works, CometPmod can drop getIncompatibleReasons / getSupportLevel and just be Compatible().
| let left_data_type = lhs.data_type(); | ||
| let right_data_type = rhs.data_type(); | ||
|
|
||
| if left_data_type.is_nested() { |
There was a problem hiding this comment.
The nested-type branch dispatches to apply_cmp_for_nested(Operator::Modulo, ...), but two things look off:
Pmodis numeric-only in Spark, andCometPmod.supportedDataTypealready rejects non-numeric types. This branch should be unreachable.- Even if it were reached, it would compute Modulo, not Pmod.
Suggest dropping the whole if left_data_type.is_nested() block, or replacing the body with internal_err!("spark_pmod does not support nested types") to make the invariant explicit.
| } | ||
| } | ||
|
|
||
| override def convert( |
There was a problem hiding this comment.
The PR description says "rejecting TRY eval mode", but CometPmod.convert doesn't have the expr.evalMode == EvalMode.TRY guard that CometRemainder has at line 270.
There's no try_pmod in Spark's function registry, so the gap is unlikely to be reachable today. But matching CometRemainder's shape keeps the serde defensive in case Pmod ever gets constructed with a TRY context, and aligns the code with the PR description.
- Use a zip-based `if (r < 0) (r + n) % n else r` so the non-negative branch (including -0.0) is returned untouched. Mask the right-hand side to zero on non-negative rows so the `add` does not overflow on values whose adjusted branch is discarded by the zip (e.g. pmod(2L, Long.MaxValue)). - Arrow's `cmp::lt` uses total ordering and reports `lt(-0.0, 0.0)` as true, so compare against -0.0 for floats to recover IEEE semantics. - Drop the unreachable nested-type branch; Pmod is numeric-only. - Add the EvalMode.TRY guard that CometRemainder has. - Drop the strict-FP incompatibility now that -0.0 round-trips, and unignore the negative-zero SQL test.
Which issue does this PR close?
N/A - new expression support
Rationale for this change
Spark's
pmod(positive modulo) expression currently falls back to Spark when running under Comet. Adding native support avoids the fallback overhead for this commonly used function.What changes are included in this PR?
Implements
pmodend-to-end through the Comet stack:MathExpr pmod = 71to theExproneof inexpr.protoCometPmodobject inarithmetic.scalausing theMathExpr/MathBasepattern, rejecting TRY eval modeclassOf[Pmod] -> CometPmodinQueryPlanSerde.scalaspark_pmodinmodulo_expr.rs— computes((a % b) + b) % bto ensure positive results, with proper null/error handling for division by zerocreate_pmod_exprwith null-if-zero wrapping for non-ANSI mode and Decimal256 overflow handling (mirrorscreate_modulo_expr)PmodBuilderandExpressionType::Pmodvariantpmodas supported inspark_expressions_support.mdThe
implement-comet-expressionskill was used to scaffold this implementation.How are these changes tested?
pmod.sql: Tests all supported numeric types (int, long, float, double, decimal, short, byte) with positive/negative dividends and divisors, NULL inputs, and division by zero in non-ANSI mode (returns NULL). UsesConfigMatrix: parquet.enable.dictionary=false,true.pmod_ansi.sql: Tests that division by zero raises an error in ANSI mode (spark.sql.ansi.enabled=true).