Introduce way to customize prefix of multi file outputs by DoumanAsh · Pull Request #19262 · apache/datafusion

DoumanAsh · 2025-12-10T14:36:40Z

Which issue does this PR close?

Closes #19261 (Add way to specify custom prefix for partitioned file outputs)

Rationale for this change

As per the issue, this is the simplest approach to give the user control over file outputs when writing partitioned parquet/csv.

I'm not certain whether it would be useful for part-{idx} or not, as I do not understand the code base well enough to see the context where it is used. (For my part, I'm mostly interested in making sure randomised file names have a unique prefix that I can use to identify these files.)

What changes are included in this PR?

Introduces a new option, partitioned_file_prefix_name, within ExecutionOptions. Its default value is an empty string, which retains the current behavior by default.

This option is used to generate the prefix of the file name in writes performed by the datasource and datasource-* crates.
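To make the intended semantics concrete, here is a minimal sketch of how an empty-by-default prefix could be applied to a randomised file name. It is not the code from this PR: ExecutionOptionsSketch, partitioned_file_name, and the random_id argument are hypothetical stand-ins, and the assumption that the prefix is prepended directly to the generated name is mine.

```rust
// Hypothetical stand-in for the relevant part of ExecutionOptions;
// this is NOT the real DataFusion type.
#[derive(Debug, Clone, Default)]
struct ExecutionOptionsSketch {
    // Empty by default, so file names are unchanged unless the user opts in.
    partitioned_file_prefix_name: String,
}

// Builds the name of one output file inside a partition directory.
// `random_id` stands in for whatever unique token the writer generates.
// Assumption: the prefix is prepended directly, with no separator added.
fn partitioned_file_name(opts: &ExecutionOptionsSketch, random_id: &str, ext: &str) -> String {
    format!("{}{}.{}", opts.partitioned_file_prefix_name, random_id, ext)
}

fn main() {
    // Default options: the name is exactly what it was before.
    let default_opts = ExecutionOptionsSketch::default();
    println!("{}", partitioned_file_name(&default_opts, "a1b2c3", "parquet"));

    // With a prefix, files from this job can be identified by name.
    let prefixed = ExecutionOptionsSketch {
        partitioned_file_prefix_name: "job42-".to_string(),
    };
    println!("{}", partitioned_file_name(&prefixed, "a1b2c3", "parquet"));
}
```

Because the sketch adds no separator, a caller who wants one would include it in the prefix itself, as "job42-" does here.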

Are these changes tested?

I included basic test to illustrate behaviour of partitioned file output

Are there any user-facing changes?

These changes do not change existing behaviour

ethan-tyler

@DoumanAsh Thanks for adding this - being able to prefix output filenames has come up before and this is a clean approach.

Found one bug and a test gap:

Bug: The format string arguments in plan_to_parquet, plan_to_csv, and plan_to_json are reversed. With an empty prefix this produces paths like /part-0.ext (leading slash, no separator before part-). With a prefix like "foo" it produces foo/part-0.ext (prefix becomes a directory). See inline comments for the fix.

Test gap: The test uses with_partition_by() which goes through demux.rs (that code is correct), but doesn't exercise the plan_to_* functions where the bug is. A test using SessionContext::write_parquet or DataFrame::write_parquet without partitioning would catch this.

The demux.rs changes look good.

Let me know if you want me to put together a test case for the other code path.
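For readers following along, the reversed-argument bug described above can be reproduced in isolation. The sketch below is a hypothetical reconstruction, not the actual plan_to_* code: the function names, the "out" directory, and the exact format strings are assumptions chosen to match the symptoms in the review.

```rust
// Hypothetical reconstruction of the reversed-argument bug;
// NOT the actual DataFusion plan_to_* code.
fn buggy_part_path(dir: &str, prefix: &str, idx: usize, ext: &str) -> String {
    // Positional arguments are swapped: the prefix lands where the
    // output directory belongs, and the directory is glued onto "part-".
    format!("{}/{}part-{}.{}", prefix, dir, idx, ext)
}

fn fixed_part_path(dir: &str, prefix: &str, idx: usize, ext: &str) -> String {
    // Named (inline) format parameters make the intended order explicit.
    format!("{dir}/{prefix}part-{idx}.{ext}")
}

fn main() {
    // Empty prefix: leading slash, no separator before "part-".
    println!("{}", buggy_part_path("out", "", 0, "parquet"));
    // Non-empty prefix: the prefix becomes a directory component.
    println!("{}", buggy_part_path("out", "foo", 0, "parquet"));

    // Fixed version: empty prefix reproduces the pre-existing layout.
    println!("{}", fixed_part_path("out", "", 0, "parquet"));
    println!("{}", fixed_part_path("out", "foo-", 0, "parquet"));
}
```

A test on the non-partitioned write path with the default empty prefix would flag this kind of mistake, since the expected path is simply the old one.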

datafusion/datasource-parquet/src/writer.rs

datafusion/datasource-csv/src/source.rs

datafusion/datasource-json/src/source.rs

datafusion/datasource/src/write/demux.rs

datafusion/core/tests/dataframe/mod.rs

datafusion/common/src/config.rs

DoumanAsh · 2025-12-12T00:37:21Z

@ethan-tyler Thank you for the review!
I actually wanted to ask when the code paths with part-{idx} are invoked, because as a newbie to the code base it was not obvious to me.
Is there a way to control how datafusion decides to split output?

DoumanAsh · 2025-12-12T00:39:25Z

Ah, never mind my question. I didn't understand the original behavior of the parquet writer, as I always used with_single_file_output(true) when not partitioning!
I will add tests and verify my code is correct before asking for review.

DoumanAsh · 2025-12-12T10:23:52Z

@ethan-tyler I figured out how to make the datasource-* plans work and added comprehensive tests that, I think, cover all scenarios where the prefix would be used.
Please take a look when you have time.

DoumanAsh · 2025-12-12T10:24:30Z

datafusion/core/src/dataframe/mod.rs

 use async_trait::async_trait;
 use datafusion_catalog::Session;

+#[derive(Clone)]


Convenient for test code, but generally there seems to be nothing wrong with having it clonable?

Agreed, nothing wrong with it. It's a plain data struct with no resources or
invariants that cloning would violate. Makes the builder pattern nicer to
work with too.

ethan-tyler

LGTM. The named format parameters are cleaner than the helper function
I suggested.

Tests cover all three write paths.

One small thing: the configs.md entry has false in the default column but
it should be empty string. Not blocking, can be fixed in a follow-up.

Thanks for working through all the feedback.

DoumanAsh · 2025-12-20T13:36:39Z

Squashed commits, rebased on latest master branch

alamb · 2026-02-02T23:38:12Z

I kicked off the CI checks

alamb · 2026-02-03T18:13:19Z

Looks like there are some CI failures. Marking as draft.

DoumanAsh · 2026-02-03T22:57:07Z

Just for reference, I'm unable to run cargo test in the root of the repository because the substrait build is broken out of the box, so I only run the tests that cover the affected functionality.
There seems to be some hidden magic to run checks against config.md, so I tried my best to update the affected files.

DoumanAsh · 2026-02-04T14:27:00Z

I was able to find out the hidden dependencies of substrait and verified that all tests are passing.

github-actions bot added core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels Dec 10, 2025

ethan-tyler suggested changes Dec 11, 2025

DoumanAsh force-pushed the customize_writer_file_name_gen branch from d0c052d to 8a909fe Compare December 12, 2025 10:23

github-actions bot added the documentation Improvements or additions to documentation label Dec 12, 2025

ethan-tyler reviewed Dec 12, 2025

ethan-tyler approved these changes Dec 12, 2025

DoumanAsh force-pushed the customize_writer_file_name_gen branch from a175fbb to 033761c Compare December 13, 2025 02:19

DoumanAsh force-pushed the customize_writer_file_name_gen branch from 033761c to f12160a Compare December 20, 2025 13:36

DoumanAsh force-pushed the customize_writer_file_name_gen branch from f12160a to b78b4b0 Compare January 31, 2026 08:35

alamb marked this pull request as draft February 3, 2026 18:12

DoumanAsh force-pushed the customize_writer_file_name_gen branch from a12980d to 1b8f203 Compare February 3, 2026 22:57

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 3, 2026

DoumanAsh force-pushed the customize_writer_file_name_gen branch from 4d75f52 to a691668 Compare February 4, 2026 14:26

DoumanAsh marked this pull request as ready for review February 4, 2026 14:27

DoumanAsh force-pushed the customize_writer_file_name_gen branch from a691668 to 98620e1 Compare February 24, 2026 23:55

DoumanAsh added 3 commits March 5, 2026 00:59

Introduce way to customize prefix of multi file outputs

d9b60e2

Add test to illustrate prefixed parquet files Update docs with new execution's parameter partitioned_file_prefix_name

Update config related stuff

4ce1428

prettifier

cb9a581

DoumanAsh force-pushed the customize_writer_file_name_gen branch from 98620e1 to cb9a581 Compare March 4, 2026 15:59
