Preserve ORDER BY in Unparser for projection -> order by pattern by adriangb · Pull Request #19483 · apache/datafusion

adriangb · 2025-12-24T23:42:49Z

Because of #15886 a parse -> unparse -> parse loop changed the query so that it would give incorrect results.

adriangb · 2025-12-25T16:06:42Z

@alamb @goldmedal @y-f-u could you folks take a look at this since you originally added this bit of code in #11527? As far as I can tell this has kept all of those tests passing and only produced some formatting changes in one test's SQL, but I'm not familiar with the Unparser code in general so this needs some critical thought.

adriangb · 2025-12-25T16:10:14Z

+        SELECT
+          col * 2 as x_bucket,
+          count(*)
+        FROM t1
+        GROUP BY x_bucket
+        ORDER BY x_bucket, count(*)


We can probably move this to a test in plan_to_sql.rs, but I struggled a bit translating it since there are limited functions available (e.g. count(*)). I also think e2e tests with data are useful because they don't require a specific SQL representation as long as query semantics are maintained. I will try to port it again once we get some initial feedback here.

adriangb · 2025-12-26T05:25:44Z

        assert_snapshot!(
            sql,
-            @"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id END ASC NULLS LAST) LIMIT 100"
+            @r#"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string)) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id END ASC NULLS LAST LIMIT 100"#


Formatted difference:

- @"SELECT
-   j1.j1_id,
-   j1.j1_string,
-   lochierarchy
- FROM (
-   SELECT
-     j1.j1_id,
-     j1.j1_string,
-     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
-     grouping(j1.j1_string),
-     grouping(j1.j1_id)
-   FROM j1
-   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
-   ORDER BY
-     lochierarchy DESC NULLS FIRST,
-     CASE
-       WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id
-     END ASC NULLS LAST
- )
- LIMIT 100"
+ @r#"SELECT
+   j1.j1_id,
+   j1.j1_string,
+   lochierarchy
+ FROM (
+   SELECT
+     j1.j1_id,
+     j1.j1_string,
+     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
+     grouping(j1.j1_string),
+     grouping(j1.j1_id)
+   FROM j1
+   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
+ )
+ ORDER BY
+   lochierarchy DESC NULLS FIRST,
+   CASE
+     WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id
+   END ASC NULLS LAST
+ LIMIT 100"#

As you can see the ORDER BY got moved outside of the subquery, which is what we want.

adriangb · 2025-12-26T05:31:31Z

I've added a property-based test asserting that, given the same input data, a query returns the same results after being unparsed and re-parsed*. I think this is a good test because:

- It uses real-world queries and data
- It's a property-based test on the thing users care about in general (correct results) instead of e.g. asserting that the unparsed SQL matches some shape

*: Not all queries have a deterministic sort order. I check if the original query has a known output ordering and if it doesn't I sort both outputs.
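The comparison rule in the footnote can be sketched roughly as follows. This is a minimal illustration only, not the code from the PR: the `Row` type, the `results_match` function, and the `has_known_ordering` flag are hypothetical names, and rows are modelled as plain strings rather than Arrow record batches.

```rust
/// One result row rendered as strings (a hypothetical stand-in for real batches).
type Row = Vec<String>;

/// Compare the results of the original query and the round-tripped query.
/// If the original query has a known output ordering, rows must match
/// position by position; otherwise both sides are sorted first so that
/// only the set of rows (with duplicates) is compared.
fn results_match(
    mut original: Vec<Row>,
    mut roundtripped: Vec<Row>,
    has_known_ordering: bool,
) -> bool {
    if !has_known_ordering {
        original.sort();
        roundtripped.sort();
    }
    original == roundtripped
}
```

With this rule, a query that has an ORDER BY fails the check if the round trip changes the row order, while a query without one only fails if the rows themselves differ.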

These tests show that without these fixes there are two issues for ClickBench queries:

- Column name quoting is missing for columns with uppercase letters
- The ORDER BY bug
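To illustrate the quoting issue: unquoted identifiers are normalized to lowercase when SQL is parsed, so a column such as "UserID" must keep its double quotes to survive a round trip. The sketch below shows one conservative quoting rule under that assumption. The function name `quote_if_needed` is hypothetical and this is not the Unparser's actual implementation; it also ignores reserved keywords.

```rust
/// Quote an identifier unless it is already safe to emit unquoted.
/// Assumption for this sketch: unquoted identifiers are folded to lowercase,
/// so only names of the form [a-z_][a-z0-9_]* can be left bare.
fn quote_if_needed(ident: &str) -> String {
    let mut chars = ident.chars();
    let safe_start = matches!(chars.next(), Some(c) if c.is_ascii_lowercase() || c == '_');
    let safe_rest = chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
    if safe_start && safe_rest {
        ident.to_string()
    } else {
        // Embedded double quotes are escaped by doubling them.
        format!("\"{}\"", ident.replace('"', "\"\""))
    }
}
```

Under this rule `UserID` and `count(*)` are emitted quoted, while an alias such as `m` is left as-is.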

Here is the failure output (also relevant to judge since the tests are being added):

3 Clickbench test(s) failed:

Results mismatch for q15.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q16.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q18.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  m,
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      date_part('MINUTE', to_timestamp_seconds(hits."EventTime")) AS m,
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", date_part('MINUTE', to_timestamp_seconds(hits."EventTime")), hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

kosiew

LGTM

kosiew · 2025-12-28T13:55:45Z

+const BENCHMARKS_PATH_1: &str = "../../benchmarks/";
+
+/// Fallback path to benchmark query files (when running from different working directories).
+const BENCHMARKS_PATH_2: &str = "./benchmarks/";


const BENCHMARK_PATHS: &[&str] = &["../../benchmarks/", "./benchmarks/"];

With a single constant like this, you won't have to write
let paths = [BENCHMARKS_PATH_1, BENCHMARKS_PATH_2];
in clickbench_queries and tpch_queries.
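As a rough sketch of how a consolidated constant might be used, the helper below tries each candidate base directory in order and returns the first one under which a given relative path exists. The function `find_benchmark_dir` and its `relative` parameter are hypothetical names for illustration and are not taken from the PR.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations of the benchmark files, tried in order
/// (the test may run from different working directories).
const BENCHMARK_PATHS: &[&str] = &["../../benchmarks/", "./benchmarks/"];

/// Return the first `base/relative` path that exists on disk, if any.
/// `relative` is a hypothetical argument such as a queries subdirectory.
fn find_benchmark_dir(candidates: &[&str], relative: &str) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|base| Path::new(base).join(relative))
        .find(|path| path.exists())
}
```

A caller would pass `BENCHMARK_PATHS` directly, so neither clickbench_queries nor tpch_queries needs to build its own array of paths.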

- Rewrite PostgreSQL regex operators (~, ~*, !~, !~*) to regexp_like() calls since Spark doesn't support the ~ operator that DF 52's unparser now generates
- Sort DataFrames before comparison in Spark e2e tests to handle non-deterministic GROUP BY ordering from DF 52's changed unparser output (see apache/datafusion#19483)
- Add unit test for the regex rewrite

github-actions Bot added the core Core DataFusion crate label Dec 24, 2025

adriangb mentioned this pull request Dec 24, 2025

[DISCUSSION] Sorts being removed from subqueries #15886

Closed

github-actions Bot added the sql SQL Planner label Dec 25, 2025

adriangb changed the title ~~Demonstarte that Unparser inserts subquery which looses order~~ Preserve ORDER BY in Unparser for projection -> order by pattern Dec 25, 2025

adriangb mentioned this pull request Dec 25, 2025

Preserve ordering from subqueries #19484

Closed

adriangb marked this pull request as ready for review December 25, 2025 16:05

adriangb requested review from alamb and goldmedal December 25, 2025 16:05

adriangb added 2 commits December 25, 2025 20:44

add roundtrip tests for Unparser using clickbench / tpch

062e170

fixed

3019e75

adriangb force-pushed the orderby-bug branch from 7c40d24 to 3019e75 Compare December 26, 2025 05:21

kosiew approved these changes Dec 28, 2025

consolidate benchmark paths

4d6b179

kosiew added this pull request to the merge queue Dec 29, 2025

Merged via the queue into apache:main with commit 6ac7b89 Dec 29, 2025
28 checks passed

kosiew merged 3 commits into apache:main from pydantic:orderby-bug
