perf: Optimize regexp match and not match for .*foo.* cases#20610
perf: Optimize regexp match and not match for .*foo.* cases#20610petern48 wants to merge 3 commits intoapache:mainfrom
.*foo.* cases#20610Conversation
7b367d2 to
053dc9b
Compare
|
built on top of #20581, so wait for it to merge |
I wish githhub had stacked PR's. |
|
A few notes:
|
Just merged |
| datafusion-common = { workspace = true, default-features = true } | ||
| datafusion-expr = { workspace = true } | ||
| datafusion-expr-common = { workspace = true } | ||
| datafusion-functions = { workspace = true } |
There was a problem hiding this comment.
Please let's not add this dependency -- doing so makes it harder for others to extend the datafusion optimizer and functions as the optimizer now assumes the built in functions
There was a problem hiding this comment.
got it, i guess if we were to do this optimization, we'd implement the rewrite inside of fn simplify() { } inside of regexpmatch.rs (or maybe in the like implementation). I'll look into it
| use datafusion_common::tree_node::Transformed; | ||
| use datafusion_common::{DataFusionError, Result}; | ||
| use datafusion_expr::{BinaryExpr, Expr, Like, Operator, lit}; | ||
| use datafusion_functions::expr_fn::contains; |
There was a problem hiding this comment.
I thought that contains is the same as the `LIKE '%foo%' implementation -- they both use the same underlying optimized arrow kernel don't they?
There was a problem hiding this comment.
hmm, i hadn't checked that actually. sorry, I honestly just blindly followed the idea from the issue. Though now that I take a quick skim at the arrow code, I actually don't quite think that's the case (they are separate kernels that were placed in the same file). I'll investigate a little more to check that there's not some logic somewhere else that effectively does this optimization already.
Yeah, even after I tried copying PR changes over, GitHub still gives me merge conflicts when trying to re-apply the changes again 🤦 |
|
Will revisit this in a few days and see if it still makes sense to implement this. I also pulled out the bug fix to a separate PR #20702, so that we can isolate the behavior change separately. |
Which issue does this PR close?
.*foo.*#20579Rationale for this change
Improved query performance by optimizing the logical plan
What changes are included in this PR?
Added optimization rules to perform the following logic
s ~ '.*foo.*'->contains(s, foo)s !~ '.*foo.*'->not(contains(s, foo))s ~ '.*.*'->is_not_null(s)s !~ '.*.*'->falseAre these changes tested?
Added tests.
Are there any user-facing changes?