POC: CTE materialization for multi-referenced CTEs#22551

Draft
nathanb9 wants to merge 6 commits into
apache:mainfrom
nathanb9:cte-materialization-poc
@nathanb9 nathanb9 commented May 27, 2026

Which issue does this PR close?

Rationale for this change

This PR adds a materialized CTE path so selected CTEs can be computed once and reused.

The heuristic respects explicit MATERIALIZED / NOT MATERIALIZED hints, keeps single-reference CTEs inline, and generally materializes multi-reference CTEs unless they are cheap to inline or are consumed below a top-level limit. Aggregate, distinct, window, and complex multi-scan CTEs remain materialization candidates.
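The decision order described above can be sketched roughly as follows. This is an illustrative model only, not the code in this PR: `CteInfo`, `Hint`, and `should_materialize` are hypothetical names, and the `cheap_to_inline` and `under_top_level_limit` flags stand in for whatever analysis the planner actually performs. The sketch also does not model how aggregate, distinct, window, or multi-scan CTEs interact with those two exceptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Hint(Enum):
    """Explicit SQL hint on a CTE (hypothetical representation)."""
    MATERIALIZED = "MATERIALIZED"
    NOT_MATERIALIZED = "NOT MATERIALIZED"


@dataclass
class CteInfo:
    """Hypothetical summary of what a planner knows about one CTE."""
    reference_count: int
    hint: Optional[Hint] = None
    cheap_to_inline: bool = False        # e.g. a literal-only CTE
    under_top_level_limit: bool = False  # consumed below a top-level LIMIT


def should_materialize(cte: CteInfo) -> bool:
    # 1. Explicit hints take priority over the automatic decision.
    if cte.hint is Hint.MATERIALIZED:
        return True
    if cte.hint is Hint.NOT_MATERIALIZED:
        return False
    # 2. Single-reference CTEs stay inline.
    if cte.reference_count <= 1:
        return False
    # 3. Multi-reference CTEs are materialized unless an exception applies.
    if cte.cheap_to_inline or cte.under_top_level_limit:
        return False
    return True


reused_scan = CteInfo(reference_count=2)
cheap_literal = CteInfo(reference_count=3, cheap_to_inline=True)
print(should_materialize(reused_scan), should_materialize(cheap_literal))
```

Checking hints first keeps the automatic rules from ever overriding an explicit user request, which matches the "respects explicit hints" wording above.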

What changes are included in this PR?

  • Materializes all partitions of the CTE input and exposes the materialized reader as a single-partition stream.
  • Adds logical and physical extension nodes for materialized CTE producers/readers.
  • Adds a shared once-only cache for materialized CTE batches.
  • Refreshes reader schemas after optimizer rewrites so type coercion is preserved.
  • Avoids counting nested alias internals as extra CTE references.
  • Uses the automatic multi-reference CTE decision described above.
  • Adds datafusion.execution.enable_materialized_ctes, defaulting to true.
  • Updates config documentation and SHOW ALL sqllogictest expectations.
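To illustrate the "shared once-only cache" and the "all partitions in, single stream out" behavior listed above, here is a minimal language-neutral sketch in Python. It is an assumption-laden model rather than the PR's implementation: `OnceCache` and `run_producer` are hypothetical names, and plain lists stand in for record batches.

```python
import threading
from typing import Callable, List, Optional

Batch = List[int]  # stand-in for a record batch


class OnceCache:
    """Runs `compute` at most once and hands the same result to every reader."""

    def __init__(self, compute: Callable[[], List[Batch]]):
        self._compute = compute
        self._lock = threading.Lock()
        self._batches: Optional[List[Batch]] = None

    def get(self) -> List[Batch]:
        # The lock ensures concurrent readers cannot trigger a second run.
        with self._lock:
            if self._batches is None:
                self._batches = self._compute()
            return self._batches


calls = {"n": 0}


def run_producer() -> List[Batch]:
    """Hypothetical producer: drains every input partition into one batch list."""
    calls["n"] += 1
    partitions = [[[1, 2]], [[3], [4, 5]]]  # two partitions of batches
    return [batch for partition in partitions for batch in partition]


cache = OnceCache(run_producer)
first = cache.get()   # triggers the producer
second = cache.get()  # served from the cache
print(calls["n"], first)
```

Holding the lock for the whole computation is the simplest way to get once-only semantics. A real engine would more likely use an async-aware primitive so waiting readers do not block a thread.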

Are these changes tested?

Yes

Additional tests cover multi-partition CTE reuse, semi-join/schema cases that previously exposed CI failures, and heuristic behavior for reused table-scan CTEs, cheap literal CTEs, and top-level LIMIT consumers.

Benchmark notes

Compared materialized CTEs enabled vs disabled on the same branch/build, 10 iterations each.

TPC-DS SF1

  • Overall: with materialized CTEs enabled, the summed average runtime is 0.913x that of disabled (7904.18 ms vs 8655.19 ms), i.e. about 1.095x faster.
  • Largest improvements: Q47 2.72x, Q57 2.63x, Q2 2.41x, Q74 2.37x, Q59 1.86x, Q64 1.61x, Q4 1.59x, Q75 1.58x faster.
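The overall figures are two views of the same pair of numbers: the runtime ratio and its reciprocal, the speedup. A quick check using only the summed runtimes quoted above:

```python
enabled_ms = 7904.18   # summed average runtime, materialized CTEs enabled
disabled_ms = 8655.19  # summed average runtime, materialized CTEs disabled

runtime_ratio = enabled_ms / disabled_ms  # fraction of the disabled runtime
speedup = disabled_ms / enabled_ms        # reciprocal of the ratio

print(f"runtime ratio: {runtime_ratio:.3f}x, speedup: {speedup:.3f}x")
```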

Are there any user-facing changes?

Yes. This adds CTE materialization behavior controlled by datafusion.execution.enable_materialized_ctes, and SQL hints can opt individual CTEs in or out where supported by the dialect.

Add support for materializing Common Table Expressions (CTEs) that are
referenced more than once in a query. When a CTE ends in an expensive
operation (Aggregate, Distinct, Window, or Union), the CTE is computed
once and its results are cached in memory for reuse by multiple consumers.

This implements a DuckDB-inspired heuristic: only materialize CTEs that
end in expensive operations, avoiding regressions where predicate pushdown
through the CTE would be more beneficial.

The implementation uses Extension nodes (UserDefinedLogicalNode) to avoid
modifying the core LogicalPlan enum, and introduces:
- MaterializedCteProducer/Reader logical nodes
- MaterializedCteExec/ReaderExec physical operators
- MaterializedCtePlanner extension planner
- Dependency-ordered execution for nested materialized CTEs
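As a rough illustration of what dependency-ordered execution for nested materialized CTEs means, the sketch below orders CTEs so that each one runs only after the CTEs it reads from. It uses Python's standard `graphlib` purely for illustration. The CTE names and the dependency map are made up, and the PR's actual ordering logic may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical nested CTEs: each key maps to the CTEs it reads from.
deps = {
    "base": set(),
    "by_region": {"base"},
    "ranked": {"by_region", "base"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['base', 'by_region', 'ranked']
```

Running producers in this order means a reader never looks for a cached result that has not been produced yet.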

Benchmarked on TPC-DS SF1 (10 iterations):
- Q47: 2.85x speedup (401ms → 141ms)
- Q57: 2.67x speedup (112ms → 42ms)
- Q2:  1.58x speedup (101ms → 64ms)
- Q74: 1.90x speedup (311ms → 164ms)

Relates to: apache#17737
@github-actions github-actions Bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate common Related to common crate physical-plan Changes to the physical-plan crate labels May 27, 2026
@nathanb9 nathanb9 changed the title feat: support CTE materialization for multi-referenced CTEs POC: CTE materialization for multi-referenced CTEs May 27, 2026
@github-actions github-actions Bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels May 27, 2026
@neilconway

@nathanb9 Cool! Can you at-me when this is ready for review?

Development

Successfully merging this pull request may close these issues.

Make datafusion support materializing option for CTE

2 participants