perf: Optimize regexp match and not match for .*foo.* cases #20610

Draft
petern48 wants to merge 3 commits into apache:main from petern48:regexp_simplify_optim

@petern48
Contributor

@petern48 petern48 commented Feb 27, 2026

Which issue does this PR close?

Rationale for this change

Improves query performance by rewriting simple regex matches in the logical plan into cheaper expressions.

What changes are included in this PR?

Adds optimizer rules that perform the following rewrites:

  • s ~ '.*foo.*' -> contains(s, 'foo')
  • s !~ '.*foo.*' -> not(contains(s, 'foo'))
  • s ~ '.*.*' -> is_not_null(s)
  • s !~ '.*.*' -> false
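
To make the four rewrites above concrete, here is a minimal sketch of the pattern check in plain Rust. It is not the code from this PR: `Rewrite`, `is_regex_meta`, and `rewrite_regex_match` are hypothetical names, and the toy `Rewrite` enum only stands in for the real logical expressions. The sketch assumes a pattern qualifies only when the text between the leading and trailing `.*` contains no regex metacharacters, and it mirrors the rewrite list exactly as stated above.

```rust
// Toy model of the four rewrites listed above. `Rewrite` and
// `rewrite_regex_match` are hypothetical names for illustration only;
// they are not DataFusion types or functions.

/// What a regex match on column `s` could be replaced with.
#[derive(Debug, PartialEq)]
enum Rewrite {
    /// `contains(s, '<literal>')`
    Contains(String),
    /// `not(contains(s, '<literal>'))`
    NotContains(String),
    /// `is_not_null(s)`
    IsNotNull,
    /// literal `false`
    False,
}

/// Returns true if `c` has special meaning in regex syntax, so a pattern
/// containing it cannot be treated as a plain substring.
fn is_regex_meta(c: char) -> bool {
    matches!(
        c,
        '.' | '*' | '+' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|' | '^' | '$' | '\\'
    )
}

/// Tries to rewrite `s ~ pattern` (or `s !~ pattern` when `negated`).
/// Returns `None` when the pattern is not of the form `.*<literal>.*`.
fn rewrite_regex_match(pattern: &str, negated: bool) -> Option<Rewrite> {
    // Peel off the leading and trailing `.*`; bail out if either is missing.
    let inner = pattern.strip_prefix(".*")?.strip_suffix(".*")?;

    if inner.is_empty() {
        // The pattern was exactly `.*.*`.
        return Some(if negated { Rewrite::False } else { Rewrite::IsNotNull });
    }
    if inner.chars().any(is_regex_meta) {
        // Anything beyond a plain literal is left to the regex engine.
        return None;
    }
    Some(if negated {
        Rewrite::NotContains(inner.to_string())
    } else {
        Rewrite::Contains(inner.to_string())
    })
}
```

For example, `rewrite_regex_match(".*foo.*", false)` returns `Some(Rewrite::Contains("foo".to_string()))`, while a pattern such as `.*f.o.*` returns `None` and would stay a regex match. Rejecting every metacharacter, including the backslash, is deliberately conservative: it also rules out escaped sequences that a fuller implementation might choose to unescape.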

Are these changes tested?

Added tests.

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 27, 2026
@petern48 petern48 force-pushed the regexp_simplify_optim branch from 7b367d2 to 053dc9b Compare March 1, 2026 21:32
@petern48 petern48 marked this pull request as ready for review March 2, 2026 00:13
@petern48
Contributor Author

petern48 commented Mar 2, 2026

built on top of #20581, so wait for it to merge

@Omega359
Contributor

Omega359 commented Mar 2, 2026

built on top of #20581, so wait for it to merge

I wish GitHub had stacked PRs.

@Omega359
Contributor

Omega359 commented Mar 2, 2026

A few notes:

@alamb
Contributor

alamb commented Mar 3, 2026

built on top of #20581, so wait for it to merge

Just merged

datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
datafusion-expr-common = { workspace = true }
datafusion-functions = { workspace = true }
Contributor

Please let's not add this dependency. Doing so makes it harder for others to extend the DataFusion optimizer and functions, because the optimizer would then assume the built-in functions.

Contributor Author

Got it. I guess if we were to do this optimization, we'd implement the rewrite in fn simplify() { } inside regexpmatch.rs (or maybe in the LIKE implementation). I'll look into it.

use datafusion_common::tree_node::Transformed;
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::{BinaryExpr, Expr, Like, Operator, lit};
use datafusion_functions::expr_fn::contains;
Contributor

I thought that contains is the same as the `LIKE '%foo%'` implementation. Don't they both use the same underlying optimized arrow kernel?

Contributor Author

Hmm, I hadn't actually checked that. Sorry, I honestly just followed the idea from the issue without verifying it. Now that I've skimmed the arrow code, I don't think that's the case: they are separate kernels that happen to live in the same file. I'll investigate a little more to check that there isn't some logic elsewhere that effectively does this optimization already.

@petern48
Contributor Author

petern48 commented Mar 4, 2026

I wish GitHub had stacked PRs.

Yeah, even after I tried copying PR changes over, GitHub still gives me merge conflicts when trying to re-apply the changes again 🤦

@petern48
Contributor Author

petern48 commented Mar 4, 2026

Will revisit this in a few days and see if it still makes sense to implement this.

I also pulled the bug fix out into a separate PR, #20702, so that the behavior change can be reviewed in isolation.

@petern48 petern48 marked this pull request as draft March 4, 2026 18:21
Development

Successfully merging this pull request may close these issues.

Expr. simplification / rewrite: regex .*foo.*