Introduce way to customize prefix of multi file outputs#19262
Introduce way to customize prefix of multi file outputs#19262DoumanAsh wants to merge 3 commits intoapache:mainfrom
Conversation
ethan-tyler
left a comment
There was a problem hiding this comment.
@DoumanAsh Thanks for adding this - being able to prefix output filenames has come up before and this is a clean approach.
Found one bug and a test gap:
Bug: The format string arguments in plan_to_parquet, plan_to_csv, and plan_to_json are reversed. With an empty prefix this produces paths like /part-0.ext (leading slash, no separator before part-). With a prefix like "foo" it produces foo/part-0.ext (prefix becomes a directory). See inline comments for the fix.
Test gap: The test uses with_partition_by() which goes through demux.rs (that code is correct), but doesn't exercise the plan_to_* functions where the bug is. A test using SessionContext::write_parquet or DataFrame::write_parquet without partitioning would catch this.
The demux.rs changes look good.
Let me know if you want me to put together a test case for the other code path.
|
@ethan-tyler Thank you for review! |
|
Ah, nevermind my question, I didn't understand original behavior of parquet writer as I always used |
d0c052d to
8a909fe
Compare
|
@ethan-tyler I figured out how to make |
| use async_trait::async_trait; | ||
| use datafusion_catalog::Session; | ||
|
|
||
| #[derive(Clone)] |
There was a problem hiding this comment.
Convenient for test code, but generally there seems to be nothing wrong with having it clonable?
There was a problem hiding this comment.
Agreed, nothing wrong with it. It's a plain data struct with no resources or
invariants that cloning would violate. Makes the builder pattern nicer to
work with too.
ethan-tyler
left a comment
There was a problem hiding this comment.
LGTM. The named format parameters are cleaner than the helper function
I suggested.
Tests cover all three write paths.
One small thing: the configs.md entry has false in the default column but
it should be empty string. Not blocking, can be fixed in a follow-up.
Thanks for working through all the feedback.
a175fbb to
033761c
Compare
033761c to
f12160a
Compare
|
Squashed commits, rebased on latest master branch |
f12160a to
b78b4b0
Compare
|
I kicked off the CI checks |
|
Looks like there are some ci failures. Marking as draft |
|
Just for reference I'm unable to run |
a12980d to
1b8f203
Compare
4d75f52 to
a691668
Compare
|
I was able to find out hidden dependencies of |
a691668 to
98620e1
Compare
Add test to illustrate prefixed parquet files Update docs with new execution's parameter partitioned_file_prefix_name
98620e1 to
cb9a581
Compare
Which issue does this PR close?
Rationale for this change
As per issue, this is most simple approach to allow user to have control over file outputs when writing partitioned parquet/csv
I'm not certain if it would be useful for
part-{idx}or not as I do not understand code base well enough to see context of where it is used (for my part I'm mostly interested in making sure randomised file names have unique prefix that I can use to identify these files)What changes are included in this PR?
Introduces new option
partitioned_file_prefix_namewithinExecutionOptionswith default value empty to retain current behavior by defaultThis option is used to generate prefix of the file name in writes of
datasource' anddatasource-*cratesAre these changes tested?
I included basic test to illustrate behaviour of partitioned file output
Are there any user-facing changes?
These changes do not change existing behaviour