fix: Prefer numeric in type coercion for comparisons#20426
fix: Prefer numeric in type coercion for comparisons#20426neilconway wants to merge 1 commit intoapache:mainfrom
Conversation
|
TODOs:
|
|
One example of a weird behavior that we get with the new coercion rules proposed here: This follows from the "coerce string to numeric" change in general, but it's a bit weird. I think we should probably reject queries like this entirely, on the basis of a type mismatch. If we did that and nothing else, we'd break a ton of queries (e.g., anything comparing a numeric column with a string literal). But we could fix that up by treating comparisons with string literals specially (e.g., "if x OP y mismatch in types, check if either x or y is a string literal; if so, try to coerce it to a numeric type"). So that's an alternative approach worth considering. |
|
The ClickBench queries are just weird: they have expressions like |
|
Thanks for putting this PR together @neilconway! Regarding your dilemma about |
|
@AlonSpivack The challenge with rejecting If we want to reject
(1) is probably more principled, but I think it's too big of a change to undertake on short notice. (2) sounds plausible to me at first blush, but it needs more consideration. |
|
Hey @neilconway , Regarding the bad query plans you mentioned (like CAST(text_col AS Int64) < 5) and the ClickBench issues: I think we should tackle this in two separate PRs to respect the project's architecture: PR 1 (This PR - Type Level): Your current approach (coercing string→numeric for comparisons) is correct at the type-abstraction level. Generating a CAST here is the right logical step to ensure mathematical correctness. As @alamb noted in the past (e.g., on PR #15482), type coercion rules should be based strictly on (DataType, DataType) pairs, not on checking for Expr::Literal. So mixing literal evaluation into this specific phase would be architecturally incorrect. PR 2 (Follow-up - Optimizer Level): Once this PR lands, I'd like to submit a follow-up PR targeting the simplifier (specifically extending cast_literal_to_type_with_op in unwrap_cast.rs). Since the simplifier is expected to be expression-aware, we can safely intercept the CAST there: Early Literal Folding: Fold Literal(Utf8("10")) → Literal(Int64(10)) at planning time when compared against a numeric column. This gives zero runtime cost and restores performance/predicate pushdown. Fail-Fast for Invalid Literals: Throw an instant plan_err! for unparseable literals like int_col < 'hello' instead of deferring to a runtime CAST failure. Expanded Coverage: Extend this logic beyond =/!= to handle <, >, <=, >=, and include Float types. This two-phase approach gets the critical bug (#15161) fixed quickly, keeps the type coercion pure, and handles the performance/AST-awareness in the correct optimizer pass. Happy to help with the follow-up PR. Thanks again for driving this fix! |
Which issue does this PR close?
Rationale for this change
In a comparison between a numeric column and a string literal (e.g.,
WHERE int_col < '10'), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results.Instead, we split type coercion into two situations: type coercion for comparisons, where we want string->numeric coercion, and type coercion for places like UNION or IN lists, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings.
What changes are included in this PR?
comparison_coercionto coerce strings to numericscomparison_coercion_numericfunctiontype_union_coercion, and use it when appropriate (UNION, CASE, NVL, IN lists, and struct fields)Are these changes tested?
Yes. New tests added, existing tests pass. (TODO: ClickBench queries need to be fixed.)
Are there any user-facing changes?
Yes: