Preserve ORDER BY in Unparser for projection -> order by pattern#19483
Preserve ORDER BY in Unparser for projection -> order by pattern#19483kosiew merged 3 commits intoapache:mainfrom
Conversation
|
@alamb @goldmedal @y-f-u could you folks take a look at this since you originally added this bit of code in #11527? As far as I can tell this has kept all of those tests passing and only produced some formatting changes in one test's SQL, but I'm not familiar with the Unparser code in general so this needs some critical thought. |
| SELECT | ||
| col * 2 as x_bucket, | ||
| count(*) | ||
| FROM t1 | ||
| GROUP BY x_bucket | ||
| ORDER BY x_bucket, count(*) |
There was a problem hiding this comment.
We can probably move this to a test in plan_to_sql.rs but I struggled a bit translating it since there's limited functions available (e.g. count(*)). I do also think e2e tests with data are useful in that they don't require a specific SQL representation as long as query semantics are maintained. But I will try to port again once we get some initial feedback here.
7c40d24 to
3019e75
Compare
| assert_snapshot!( | ||
| sql, | ||
| @"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id END ASC NULLS LAST) LIMIT 100" | ||
| @r#"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string)) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id END ASC NULLS LAST LIMIT 100"# |
There was a problem hiding this comment.
Formatted difference:
- @"SELECT
- j1.j1_id,
- j1.j1_string,
- lochierarchy
- FROM (
- SELECT
- j1.j1_id,
- j1.j1_string,
- (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
- grouping(j1.j1_string),
- grouping(j1.j1_id)
- FROM j1
- GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
- ORDER BY
- lochierarchy DESC NULLS FIRST,
- CASE
- WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id
- END ASC NULLS LAST
- )
- LIMIT 100"
+ @r#"SELECT
+ j1.j1_id,
+ j1.j1_string,
+ lochierarchy
+ FROM (
+ SELECT
+ j1.j1_id,
+ j1.j1_string,
+ (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
+ grouping(j1.j1_string),
+ grouping(j1.j1_id)
+ FROM j1
+ GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
+ )
+ ORDER BY
+ lochierarchy DESC NULLS FIRST,
+ CASE
+ WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id
+ END ASC NULLS LAST
+ LIMIT 100"#As you can see the ORDER BY got moved outside of the subquery, which is what we want.
|
I've added a property based test that asserts the property that results should be the same after unparsing and re-parsing a query given the same input data*. I think this is a good test because:
*: Not all queries have a deterministic sort order. I check if the original query has a known output ordering and if it doesn't I sort both outputs. These tests show that without these fixes there are two issues for ClickBench queries:
Here is the failure output (also relevant to judge since the tests are being added): |
| const BENCHMARKS_PATH_1: &str = "../../benchmarks/"; | ||
|
|
||
| /// Fallback path to benchmark query files (when running from different working directories). | ||
| const BENCHMARKS_PATH_2: &str = "./benchmarks/"; |
There was a problem hiding this comment.
const BENCHMARK_PATHS: &[&str] = &["../../benchmarks/", "./benchmarks/"];
and you won't have to
let paths = [BENCHMARKS_PATH_1, BENCHMARKS_PATH_2];
in clickbench_queries, tpch_queries.
…che#19483) Because of apache#15886 a parse -> unparse -> parse loop changed the query so that it would give incorrect results.
- Rewrite PostgreSQL regex operators (~, ~*, !~, !~*) to regexp_like() calls since Spark doesn't support the ~ operator that DF 52's unparser now generates - Sort DataFrames before comparison in Spark e2e tests to handle non-deterministic GROUP BY ordering from DF 52's changed unparser output (see apache/datafusion#19483) - Add unit test for the regex rewrite
Because of #15886 a parse -> unparse -> parse loop changed the query so that it would give incorrect results.