
[SPARK-56940][SQL] Extend OptimizeRand optimizer to support arithmetic expressions#55982

Open
xuzifu666 wants to merge 1 commit into apache:master from xuzifu666:rand_opt_1

xuzifu666 (Member) commented May 19, 2026

What changes were proposed in this pull request?

  1. Support for Arithmetic Expression Optimization
  • Previously: Only direct comparisons such as rand() < 1 were optimized
  • Now: Arithmetic around rand() is also handled, e.g. rand() * 2 < 1, rand() + 1 < 2, rand() - 1 < 0, rand() / 2 < 1
  • Nested expressions are supported as well: 2 * rand() + 1 < 3
  2. Support for Equality Comparisons
  • Added optimization for the == (EqualTo) operator
  • Example: rand() == 2 optimizes to false, since rand() only returns values in [0, 1)
  3. Implementation Details

    Added three key helper functions:

  • extractDouble(): Safely extracts a Double value from various Literal types
    case DoubleLiteral(v) => Some(v)
    case Literal(v: Double, _) => Some(v)
    case Literal(v: java.lang.Number, _) => Some(v.doubleValue())
  • extractRandCoeffOffset(): Recursively extracts the coefficient and offset, normalizing an arithmetic expression into the canonical form coeff * rand() + offset
    • Supports: Rand, Multiply, Add, Subtract, Divide
  • optimizeWithCoeffOffset(): Performs the mathematical transformation and constant folding
    • Zero coefficient: the expression reduces to the constant offset, so offset is compared with value directly
    • Positive coefficient: computes the threshold t = (value - offset) / coeff
    • Negative coefficient: reverses the inequality direction, since dividing by a negative coeff flips it
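
The normalization and folding steps listed above can be illustrated with a minimal, self-contained sketch. It uses a toy expression tree rather than Catalyst classes, and every name in it (Expr, RandE, Lit, Linear, RandLinear, extract, foldLessThan) is hypothetical and not taken from the patch. It also assumes that an expression containing rand() on both sides of an operator is left alone, because two rand() calls are independent draws and cannot be merged into a single coefficient.

```scala
// Toy expression tree standing in for Catalyst expressions (hypothetical names).
sealed trait Expr
case object RandE extends Expr                        // stands in for rand(), values in [0, 1)
final case class Lit(v: Double) extends Expr          // numeric literal
final case class Add(l: Expr, r: Expr) extends Expr
final case class Sub(l: Expr, r: Expr) extends Expr
final case class Mul(l: Expr, r: Expr) extends Expr
final case class Div(l: Expr, r: Expr) extends Expr

// Canonical form: coeff * rand() + offset.
final case class Linear(coeff: Double, offset: Double)

object RandLinear {
  // Normalize e into coeff * rand() + offset.
  // Returns None when rand() appears on both sides of an operator,
  // or when the divisor is not a non-zero constant.
  def extract(e: Expr): Option[Linear] = e match {
    case RandE  => Some(Linear(1.0, 0.0))
    case Lit(v) => Some(Linear(0.0, v))
    case Add(l, r) =>
      for (a <- extract(l); b <- extract(r); if a.coeff == 0.0 || b.coeff == 0.0)
        yield Linear(a.coeff + b.coeff, a.offset + b.offset)
    case Sub(l, r) =>
      for (a <- extract(l); b <- extract(r); if a.coeff == 0.0 || b.coeff == 0.0)
        yield Linear(a.coeff - b.coeff, a.offset - b.offset)
    case Mul(l, r) =>
      // At least one side is constant, so one of the two coefficient terms is zero.
      for (a <- extract(l); b <- extract(r); if a.coeff == 0.0 || b.coeff == 0.0)
        yield Linear(a.coeff * b.offset + a.offset * b.coeff, a.offset * b.offset)
    case Div(l, r) =>
      for (a <- extract(l); b <- extract(r); if b.coeff == 0.0 && b.offset != 0.0)
        yield Linear(a.coeff / b.offset, a.offset / b.offset)
  }

  // Fold `e < value` into Some(true) or Some(false) when the result is the same
  // for every rand() in [0, 1); return None when it depends on the drawn value.
  def foldLessThan(e: Expr, value: Double): Option[Boolean] = extract(e).flatMap {
    case Linear(c, o) if c == 0.0 =>
      Some(o < value)                        // no rand() left: direct comparison
    case Linear(c, o) if c > 0.0 =>
      val t = (value - o) / c                // c * r + o < value  <=>  r < t
      if (t >= 1.0) Some(true) else if (t <= 0.0) Some(false) else None
    case Linear(c, o) =>
      val t = (value - o) / c                // c < 0 flips the inequality: r > t
      if (t < 0.0) Some(true) else if (t >= 1.0) Some(false) else None
  }
}
```

With these definitions, `RandLinear.foldLessThan(Add(Mul(Lit(2.0), RandE), Lit(1.0)), 3.0)` normalizes 2 * rand() + 1 to `Linear(2.0, 1.0)`, computes t = 1 and returns `Some(true)`. A threshold strictly between 0 and 1 returns `None`, meaning the predicate cannot be replaced by a constant.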

Why are the changes needed?
Before optimization (without this change):

  Query: SELECT * FROM large_table WHERE 2 * rand() < 1
  ├─ Parser: Creates expression tree
  ├─ Analyzer: Type checking
  ├─ Optimizer: Cannot simplify (no arithmetic support)
  └─ Executor: Evaluates 2 * rand() for every row (millions of times!)
      ├─ Draws the next rand() value (the seed is fixed once, not per row)
      ├─ Multiplies by 2
      └─ Compares with 1
  Result: INEFFICIENT - unnecessary computation on every row

After optimization (with this change):

  Query: SELECT * FROM large_table WHERE 2 * rand() < 1
  ├─ Parser: Creates expression tree
  ├─ Analyzer: Type checking
  ├─ Optimizer: Recognizes pattern, applies mathematical transformation
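  │   ├─ Calculates: t = (1 - 0) / 2 = 0.5
  │   └─ Since 0 < 0.5 < 1, the predicate is equivalent to rand() < 0.5, which is not constant
  └─ Executor: Still draws rand() for every row (at best as rand() < 0.5, without the multiplication)
  Result: Only the arithmetic is removed here. Folding to a constant requires a threshold outside (0, 1): 2 * rand() < 2 gives t = 1, so the predicate is always TRUE and rand() can be skipped entirely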
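
For reference, the threshold rule used in this walkthrough can be written out for the general less-than comparison. This is a derivation from the stated range of rand(), not text from the patch. The symbols r, c, o and v are shorthand introduced here for rand(), coeff, offset and the literal on the right-hand side.

```latex
% r = rand() \in [0, 1), \quad c = coeff, \quad o = offset, \quad v = compared literal
\[
c\,r + o < v \iff
\begin{cases}
r < t, & c > 0 \\
r > t, & c < 0 \\
o < v, & c = 0
\end{cases}
\qquad \text{where } t = \frac{v - o}{c}
\]

\[
r < t \text{ is }
\begin{cases}
\text{always true},  & t \ge 1 \\
\text{always false}, & t \le 0 \\
\text{row-dependent}, & 0 < t < 1
\end{cases}
\qquad
r > t \text{ is }
\begin{cases}
\text{always true},  & t < 0 \\
\text{always false}, & t \ge 1 \\
\text{row-dependent}, & 0 \le t < 1
\end{cases}
\]
```

Applied to the examples from the description: rand() + 1 < 2 gives t = 1 and rand() / 2 < 1 gives t = 2, so both are always true. Only a threshold outside the row-dependent range lets the comparison collapse to a constant.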

Does this PR introduce any user-facing change?
No

Was this patch authored or co-authored using generative AI tooling?
No.
