@rynewang rynewang commented Jan 3, 2026

Rationale for this change

Fixes #36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

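import io
import pyarrow as pa
from pyarrow.csv import write_csv
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
buf = io.BytesIO()  # in-memory sink for the CSV output
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice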

What changes are included in this PR?

The bug happens because:

  1. Header is written to data_buffer_ and flushed during CSVWriterImpl initialization
  2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_
  3. The WriteTable/WriteRecordBatch loop then writes data_buffer_ which still contains the stale header

The fix clears the buffer (resize to 0) when encountering an empty batch in TranslateMinimalBatch, so the subsequent write outputs nothing.
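
The mechanism and the fix can be illustrated with a small toy model in Python. This is only a sketch of the reused-buffer pattern described above, not Arrow's C++ code: `ToyCsvWriter`, its method names, and the `clear_on_empty` flag are hypothetical, and values are written unquoted for brevity.

```python
class ToyCsvWriter:
    """Toy stand-in for a writer that reuses one buffer (hypothetical, not Arrow's code)."""

    def __init__(self, header, clear_on_empty):
        self.clear_on_empty = clear_on_empty
        self.sink = []                       # everything flushed to the output
        self.data_buffer = header + "\n"     # step 1: header is placed in the buffer...
        self.sink.append(self.data_buffer)   # ...and flushed during initialization

    def translate_minimal_batch(self, rows):
        if not rows:
            if self.clear_on_empty:
                self.data_buffer = ""        # the fix: reset the buffer to length 0
            return                           # step 2: early return for an empty batch
        self.data_buffer = "".join(row + "\n" for row in rows)

    def write_batches(self, batches):
        for rows in batches:
            self.translate_minimal_batch(rows)
            self.sink.append(self.data_buffer)  # step 3: writes whatever the buffer holds
        return "".join(self.sink)


batches = [[], ["a", "b", "c"]]  # first batch is empty
buggy = ToyCsvWriter("col1", clear_on_empty=False).write_batches(batches)
fixed = ToyCsvWriter("col1", clear_on_empty=True).write_batches(batches)
```

In this model `buggy` contains the header line twice because the empty batch leaves the flushed header in the buffer, while `fixed` contains it once.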

Are these changes tested?

Yes. Added C++ tests in writer_test.cc and Python tests in test_csv.py:

  • Empty batch at start of table
  • Empty batch in middle of table
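
For readers who want to check the two cases locally, a rough Python sketch follows. It is not the test code added in this PR: the helper names `header_count` and `write_with_empty_batch` are made up for illustration, and it assumes `pyarrow` is installed and that `write_csv` accepts a file-like object as in the example above.

```python
import io


def header_count(csv_bytes, header=b'"col1"'):
    """Count how many output lines are exactly the header line."""
    return sum(1 for line in csv_bytes.splitlines() if line == header)


def write_with_empty_batch(position):
    """Write a table with an empty batch at the 'start' or in the 'middle' (hypothetical helper)."""
    import pyarrow as pa                  # imported here so header_count works without pyarrow
    from pyarrow.csv import write_csv

    table = pa.table({"col1": ["a", "b", "c"]})
    empty = table.schema.empty_table()
    if position == "start":
        parts = [empty, table]
    else:
        parts = [table.slice(0, 1), empty, table.slice(1)]
    buf = io.BytesIO()
    write_csv(pa.concat_tables(parts), buf)
    return buf.getvalue()
```

With the fix in place, `header_count(write_with_empty_batch("start"))` and `header_count(write_with_empty_batch("middle"))` should both be 1.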

Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

…ch is empty

When writing CSV, if the first record batch was empty, the header would be
written twice. This happened because:

1. Header is written to data_buffer_ and flushed during initialization
2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_
3. The loop then writes data_buffer_ which still contains the header

The fix clears the buffer (resize to 0) when encountering an empty batch,
so the subsequent write outputs nothing.

Added C++ and Python tests for empty batches at start and in middle of tables.

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=1%)
Claude-Steers: 2
Claude-Permission-Prompts: 2
Claude-Escapes: 1

github-actions bot commented Jan 3, 2026

⚠️ GitHub issue #36889 has been automatically assigned in GitHub to PR creator.

Signed-off-by: Ruiyang Wang <ruiyang@anthropic.com>

Successfully merging this pull request may close these issues.

[C++][Python] Duplicate csv header when table batches start with empty

1 participant