Skip to content

Introduce way to customize prefix of multi file outputs#19262

Open
DoumanAsh wants to merge 3 commits intoapache:mainfrom
DoumanAsh:customize_writer_file_name_gen
Open

Introduce way to customize prefix of multi file outputs#19262
DoumanAsh wants to merge 3 commits intoapache:mainfrom
DoumanAsh:customize_writer_file_name_gen

Conversation

@DoumanAsh
Copy link

Which issue does this PR close?

Rationale for this change

As per issue, this is most simple approach to allow user to have control over file outputs when writing partitioned parquet/csv

I'm not certain if it would be useful for part-{idx} or not as I do not understand code base well enough to see context of where it is used (for my part I'm mostly interested in making sure randomised file names have unique prefix that I can use to identify these files)

What changes are included in this PR?

Introduces new option partitioned_file_prefix_name within ExecutionOptions with default value empty to retain current behavior by default

This option is used to generate prefix of the file name in writes of datasource' and datasource-* crates

Are these changes tested?

I included basic test to illustrate behaviour of partitioned file output

Are there any user-facing changes?

These changes do not change existing behaviour

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels Dec 10, 2025
Copy link
Contributor

@ethan-tyler ethan-tyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DoumanAsh Thanks for adding this - being able to prefix output filenames has come up before and this is a clean approach.

Found one bug and a test gap:

Bug: The format string arguments in plan_to_parquet, plan_to_csv, and plan_to_json are reversed. With an empty prefix this produces paths like /part-0.ext (leading slash, no separator before part-). With a prefix like "foo" it produces foo/part-0.ext (prefix becomes a directory). See inline comments for the fix.

Test gap: The test uses with_partition_by() which goes through demux.rs (that code is correct), but doesn't exercise the plan_to_* functions where the bug is. A test using SessionContext::write_parquet or DataFrame::write_parquet without partitioning would catch this.

The demux.rs changes look good.

Let me know if you want me to put together a test case for the other code path.

@DoumanAsh
Copy link
Author

@ethan-tyler Thank you for review!
I actually wanted to ask when code paths with part-{idx} are invoked because as newbie to the code base it was not obvious to me
Is there way to control how datafusion decides to split output?

@DoumanAsh
Copy link
Author

Ah, nevermind my question, I didn't understand original behavior of parquet writer as I always used with_single_file_output(true) when not partitioning!
I will add tests and verify my code is correct before asking for review

@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from d0c052d to 8a909fe Compare December 12, 2025 10:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 12, 2025
@DoumanAsh
Copy link
Author

@ethan-tyler I figured out how to make datasource-* plans to work and added comprehensive tests to cover all possible scenarios where prefix would be used I think
Please take a look when you have time

use async_trait::async_trait;
use datafusion_catalog::Session;

#[derive(Clone)]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convenient for test code, but generally there seems to be nothing wrong with having it clonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, nothing wrong with it. It's a plain data struct with no resources or
invariants that cloning would violate. Makes the builder pattern nicer to
work with too.

Copy link
Contributor

@ethan-tyler ethan-tyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The named format parameters are cleaner than the helper function
I suggested.

Tests cover all three write paths.

One small thing: the configs.md entry has false in the default column but
it should be empty string. Not blocking, can be fixed in a follow-up.

Thanks for working through all the feedback.

@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from a175fbb to 033761c Compare December 13, 2025 02:19
@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from 033761c to f12160a Compare December 20, 2025 13:36
@DoumanAsh
Copy link
Author

Squashed commits, rebased on latest master branch

@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from f12160a to b78b4b0 Compare January 31, 2026 08:35
@alamb
Copy link
Contributor

alamb commented Feb 2, 2026

I kicked off the CI checks

@alamb alamb marked this pull request as draft February 3, 2026 18:12
@alamb
Copy link
Contributor

alamb commented Feb 3, 2026

Looks like there are some ci failures. Marking as draft

@DoumanAsh
Copy link
Author

DoumanAsh commented Feb 3, 2026

Just for reference I'm unable to run cargo test in root of repository since substrait build is broken out of box so I only run tests that are affecting functionality.
There seems to be some hidden magic to run checks against config.md so I tried my best to update affected files

@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from a12980d to 1b8f203 Compare February 3, 2026 22:57
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 3, 2026
@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from 4d75f52 to a691668 Compare February 4, 2026 14:26
@DoumanAsh
Copy link
Author

I was able to find out hidden dependencies of subtrait and verified all tests are passing

@DoumanAsh DoumanAsh marked this pull request as ready for review February 4, 2026 14:27
@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from a691668 to 98620e1 Compare February 24, 2026 23:55
Add test to illustrate prefixed parquet files

Update docs with new execution's parameter partitioned_file_prefix_name
@DoumanAsh DoumanAsh force-pushed the customize_writer_file_name_gen branch from 98620e1 to cb9a581 Compare March 4, 2026 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add way to specify custom prefix for partitioned file outputs

3 participants