fix: HashJoin panic with String dictionary keys (don't flatten keys)#20505
fix: HashJoin panic with String dictionary keys (don't flatten keys)#20505alamb merged 4 commits intoapache:mainfrom
HashJoin panic with String dictionary keys (don't flatten keys)#20505Conversation
|
I think the question is going to be if it still prunes all the same (including stats). I'd guess it should but we also probably added it there for a reason. I'd guess it was me and I did not add a failing test, which is not good. |
42fa12b to
5d17c18
Compare
Makes sense to me -- what type of pruning do you have in mind? I can write a test. Like it would be "join a dictionary column and show that the predicate that got pushed down could prune"? Or is the concern that when reading a string column from a parquet file as a dictionary it won't be able to prune 🤔 |
|
It actually turns out that we hit a bug in our testing at InfluxDB that this PR fixes SO it isn't just a cleanup, it is actually a bug fix. I have added a reproducer to this PR |
HashJoin panic with String dictionary keys
| statement ok | ||
| set datafusion.execution.parquet.pushdown_filters = true; | ||
|
|
||
| query RRR |
There was a problem hiding this comment.
this test fails with
Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
Prior to this fix
HashJoin panic with String dictionary keysHashJoin panic with String dictionary keys (don't flatten keys)
| pub(super) fn build_struct_inlist_values( | ||
| join_key_arrays: &[ArrayRef], | ||
| ) -> Result<Option<ArrayRef>> { | ||
| // Flatten any dictionary-encoded arrays |
There was a problem hiding this comment.
I am quite pleased that the "fix" is to delete code
There was a problem hiding this comment.
The mark of a great engineer 😉
adriangb
left a comment
There was a problem hiding this comment.
This fixes a bug and I can't immediately think of any bugs this would cause / why this was needed in the first place, and there are no failing tests. Let's move forward with this and if we find a future bug caused by this change we can come back and think of a better way to fix it than the current code which is buggy.
Thanks @adriangb -- I will also spend some time trying to research when it came in and why
|
You should check with the author of #18393 that introduced that code. I think maybe I just added that under the premise that "what's the point of having a dictionary array for InList if we are building a HashMap in front of it anyway?" |
😆 Indeed it was added in c0e8bb5 and as you say there is no test coverage / failure for it.
Yeah, I also asked Codex about the PR and it didn't seem to find any problems |
|
So I'll merge this one and then make a backport to |
apache#20505) ## Which issue does this PR close? - Fixes apache#20696 - Follow on to apache#20441 ## Rationale for this change apache#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all ## What changes are included in this PR? 1. Remove the special case dictionary handling ## Are these changes tested? Yes by CI ## Are there any user-facing changes? No (though maybe some queries get faster)
apache#20505) - Fixes apache#20696 - Follow on to apache#20441 apache#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all 1. Remove the special case dictionary handling Yes by CI No (though maybe some queries get faster)
Which issue does this PR close?
INNER JOINwith dictionary keys fails when run on parquet withpushdown_filters = true. #20696Rationale for this change
#20441 (review) fixes the special case DictionaryArray handling in Joins. However, I don't think we need to special case DictionaryArrays at all
What changes are included in this PR?
Are these changes tested?
Yes by CI
Are there any user-facing changes?
No (though maybe some queries get faster)