Skip to content

fix: HashJoin panic with String dictionary keys (don't flatten keys)#20505

Merged
alamb merged 4 commits intoapache:mainfrom
alamb:alamb/cleanup_inlist_builder
Mar 4, 2026
Merged

fix: HashJoin panic with String dictionary keys (don't flatten keys)#20505
alamb merged 4 commits intoapache:mainfrom
alamb:alamb/cleanup_inlist_builder

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 23, 2026

Which issue does this PR close?

Rationale for this change

#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all

What changes are included in this PR?

  1. Remove the special case dictionary handling

Are these changes tested?

Yes by CI

Are there any user-facing changes?

No (though maybe some queries get faster)

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Feb 23, 2026
@adriangb
Copy link
Contributor

I think the question is going to be if it still prunes all the same (including stats). I'd guess it should but we also probably added it there for a reason. I'd guess it was me and I did not add a failing test, which is not good.

@alamb alamb force-pushed the alamb/cleanup_inlist_builder branch from 42fa12b to 5d17c18 Compare February 24, 2026 12:11
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 24, 2026
@alamb
Copy link
Contributor Author

alamb commented Feb 24, 2026

I think the question is going to be if it still prunes all the same (including stats). I'd guess it should but we also probably added it there for a reason. I'd guess it was me and I did not add a failing test, which is not good.

Makes sense to me -- what type of pruning do you have in mind? I can write a test.

Like it would be "join a dictionary column and show that the predicate that got pushed down could prune"?

Or is the concern that when reading a string column from a parquet file as a dictionary it won't be able to prune 🤔

@alamb alamb marked this pull request as ready for review February 24, 2026 13:49
@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2026

It actually turns out that we hit a bug in our testing at InfluxDB that this PR fixes

SO it isn't just a cleanup, it is actually a bug fix.

I have added a reproducer to this PR

@alamb alamb changed the title Avoid flattening dictionaries in Join InLists fix: HashJoin panic with String dictionary keys Mar 4, 2026
statement ok
set datafusion.execution.parquet.pushdown_filters = true;

query RRR
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test fails with

Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))

Prior to this fix

@alamb alamb changed the title fix: HashJoin panic with String dictionary keys fix: HashJoin panic with String dictionary keys (don't flatten keys) Mar 4, 2026
@alamb alamb added the bug Something isn't working label Mar 4, 2026
pub(super) fn build_struct_inlist_values(
join_key_arrays: &[ArrayRef],
) -> Result<Option<ArrayRef>> {
// Flatten any dictionary-encoded arrays
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite pleased that the "fix" is to delete code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mark of a great engineer 😉

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a bug and I can't immediately think of any bugs this would cause / why this was needed in the first place, and there are no failing tests. Let's move forward with this and if we find a future bug caused by this change we can come back and think of a better way to fix it than the current code which is buggy.

@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2026

This fixes a bug and I can't immediately think of any bugs this would cause / why this was needed in the first place, and there are no failing tests. Let's move forward with this and if we find a future bug caused by this change we can come back and think of a better way to fix it than the current code which is buggy.

Thanks @adriangb -- I will also spend some time trying to research when it came in and why

I think the question is going to be if it still prunes all the same (including stats). I'd guess it should but we also probably added it there for a reason. I'd guess it was me and I did not add a failing test, which is not good.

@adriangb
Copy link
Contributor

adriangb commented Mar 4, 2026

Thanks @adriangb -- I will also spend some time trying to research when it came in and why

You should check with the author of #18393 that introduced that code.

I think maybe I just added that under the premise that "what's the point of having a dictionary array for InList if we are building a HashMap in front of it anyway?"

@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2026

You should check with the author of #18393 that introduced that code.

😆

Indeed it was added in c0e8bb5 and as you say there is no test coverage / failure for it.

I think maybe I just added that under the premise that "what's the point of having a dictionary array for InList if we are building a HashMap in front of it anyway?"

Yeah, I also asked Codex about the PR and it didn't seem to find any problems

@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2026

So I'll merge this one and then make a backport to branch-53

@alamb alamb added this pull request to the merge queue Mar 4, 2026
Merged via the queue into apache:main with commit d025869 Mar 4, 2026
34 of 36 checks passed
alamb added a commit to alamb/datafusion that referenced this pull request Mar 4, 2026
apache#20505)

## Which issue does this PR close?

- Fixes apache#20696
- Follow on to apache#20441

## Rationale for this change


apache#20441 (review)
fixes the special case DictionaryArray handling in Joins. However, I
don't think we need to special case DictionaryArrays at all

## What changes are included in this PR?

1. Remove the special case dictionary handling

## Are these changes tested?

Yes by CI
## Are there any user-facing changes?

No (though maybe some queries get faster)
@alamb alamb deleted the alamb/cleanup_inlist_builder branch March 4, 2026 20:55
alamb added a commit to alamb/datafusion that referenced this pull request Mar 4, 2026
apache#20505)

- Fixes apache#20696
- Follow on to apache#20441

apache#20441 (review)
fixes the special case DictionaryArray handling in Joins. However, I
don't think we need to special case DictionaryArrays at all

1. Remove the special case dictionary handling

Yes by CI

No (though maybe some queries get faster)
@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

INNER JOIN with dictionary keys fails when run on parquet with pushdown_filters = true.

2 participants