Skip to content

fix: Prefer numeric in type coercion for comparisons#20426

Open
neilconway wants to merge 1 commit intoapache:mainfrom
neilconway:neilc/type-coercion-str-numeric
Open

fix: Prefer numeric in type coercion for comparisons#20426
neilconway wants to merge 1 commit intoapache:mainfrom
neilconway:neilc/type-coercion-str-numeric

Conversation

@neilconway
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

In a comparison between a numeric column and a string literal (e.g., WHERE int_col < '10'), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results.

Instead, we split type coercion into two situations: type coercion for comparisons, where we want string->numeric coercion, and type coercion for places like UNION or IN lists, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings.

What changes are included in this PR?

  • Update comparison_coercion to coerce strings to numerics
  • Remove previous comparison_coercion_numeric function
  • Add a new function, type_union_coercion, and use it when appropriate (UNION, CASE, NVL, IN lists, and struct fields)
  • Add unit and SLT tests for new coercion behavior
  • Update existing SLT tests for changes in coercion behavior

Are these changes tested?

Yes. New tests added, existing tests pass. (TODO: ClickBench queries need to be fixed.)

Are there any user-facing changes?

Yes:

  • Comparing a numeric column to a string literal will now behave differently (the previous behavior was wrong)
  • Comparing a string column with a numeric literal will now attempt to cast the string column to the numeric type (instead of coercing the numeric literal to string and then doing lexicographic comparison); if the column contains any non-numeric values, the query will fail at runtime
  • Comparing a string column with a numeric column will now cast the string column to numeric, similar behavior to the previous case

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate functions Changes to functions implementation labels Feb 19, 2026
@neilconway
Copy link
Contributor Author

neilconway commented Feb 19, 2026

TODOs:

@neilconway
Copy link
Contributor Author

One example of a weird behavior that we get with the new coercion rules proposed here:

> create table t1 (a text, b int, c float);
0 row(s) fetched.
Elapsed 0.018 seconds.

> explain select * from t1 where a < 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │    CAST(a AS Int64) < 5   │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          bytes: 0         │ |
|               | │       format: memory      │ |
|               | │          rows: 0          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.016 seconds.

This follows from the "coerce string to numeric" change in general, but it's a bit weird. I think we should probably reject queries like this entirely, on the basis of a type mismatch.

If we did that and nothing else, we'd break a ton of queries (e.g., anything comparing a numeric column with a string literal). But we could fix that up by treating comparisons with string literals specially (e.g., "if x OP y mismatch in types, check if either x or y is a string literal; if so, try to coerce it to a numeric type"). So that's an alternative approach worth considering.

@neilconway
Copy link
Contributor Author

The ClickBench queries are just weird: they have expressions like "EventDate" >= '2013-07-01', where EventDate is a uint16. Previously, EventDate would be cast to a string and the comparison would have been done lexicographically, which is wrong but didn't fail. Now we'll try to coerce the string literal to a numeric, which fails. So this is intended behavior -- the ClickBench queries need fixing up some other way.

@AlonSpivack
Copy link

Thanks for putting this PR together @neilconway!
Splitting out type_union_coercion is definitely the right architectural move.

Regarding your dilemma about text_col < 5:
I strongly agree with your intuition that we should reject these queries entirely with a clear type-mismatch/planning error, rather than trying to add special-case fallback logic. If a user actually wants to compare a string column to a number, they should be required to add an explicit CAST.

@neilconway
Copy link
Contributor Author

@AlonSpivack The challenge with rejecting text_col < 5 under this approach is that in comparison_coercion, we've already lost the fact that we're comparing a column with a literal. All we see is a comparison between a string-typed value and a numeric-typed value, which this approach would allow.

If we want to reject text_col < 5 and similar expressions, I think we need to remove the comparison_coercion logic for coercing string -> numeric and the like. We'd then have two options:

  1. Introduce an "unknown" type. All string literals would have unknown type, and then would be coerced into a concrete type depending on the context in which they are used. This is roughly what Postgres does.
  2. Don't introduce a distinct type, but just add some (perhaps ad-hoc) rules for coercing binary expressions where exactly one operand is a string literal and the other operand is a numeric-typed column reference.

(1) is probably more principled, but I think it's too big of a change to undertake on short notice.

(2) sounds plausible to me at first blush, but it needs more consideration.

@AlonSpivack
Copy link

Hey @neilconway ,
thanks for moving so quickly on this. Splitting the coercion rules between comparison_coercion and type_union_coercion is definitely the right architectural foundation, and it aligns with what @alamb suggested previously about context-aware coercion.

Regarding the bad query plans you mentioned (like CAST(text_col AS Int64) < 5) and the ClickBench issues:

I think we should tackle this in two separate PRs to respect the project's architecture:

PR 1 (This PR - Type Level): Your current approach (coercing string→numeric for comparisons) is correct at the type-abstraction level. Generating a CAST here is the right logical step to ensure mathematical correctness. As @alamb noted in the past (e.g., on PR #15482), type coercion rules should be based strictly on (DataType, DataType) pairs, not on checking for Expr::Literal. So mixing literal evaluation into this specific phase would be architecturally incorrect.

PR 2 (Follow-up - Optimizer Level): Once this PR lands, I'd like to submit a follow-up PR targeting the simplifier (specifically extending cast_literal_to_type_with_op in unwrap_cast.rs). Since the simplifier is expected to be expression-aware, we can safely intercept the CAST there:

Early Literal Folding: Fold Literal(Utf8("10")) → Literal(Int64(10)) at planning time when compared against a numeric column. This gives zero runtime cost and restores performance/predicate pushdown.

Fail-Fast for Invalid Literals: Throw an instant plan_err! for unparseable literals like int_col < 'hello' instead of deferring to a runtime CAST failure.

Expanded Coverage: Extend this logic beyond =/!= to handle <, >, <=, >=, and include Float types.

This two-phase approach gets the critical bug (#15161) fixed quickly, keeps the type coercion pure, and handles the performance/AST-awareness in the correct optimizer pass.

Happy to help with the follow-up PR. Thanks again for driving this fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect cast of integer columns to utf8 when comparing with utf8 constant

2 participants

Comments