GH-22232: [C++][Python] Introduce optional default_column_type parameter#47663

Merged
AlenkaF merged 16 commits into
apache:mainfrom
vladborovtsov:default-column-type
May 13, 2026
Conversation

@vladborovtsov
Contributor

@vladborovtsov vladborovtsov commented Sep 27, 2025

Rationale for this change

Add an optional default_column_type parameter to the CSV reading API (C++ and Python) to provide a fallback type when per-column types aren’t specified, improving schema consistency and complementing the existing column_types logic.
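The fallback described above can be illustrated with a small, standard-library-only sketch. It is not the Arrow implementation: `resolve_column_types` is a hypothetical helper, types are represented as plain strings, and the precedence shown (an explicit `column_types` entry first, then `default_column_type`, otherwise type inference) is an assumption based on the rationale and the test names in this PR.

```python
from typing import Dict, Mapping, Optional, Sequence


def resolve_column_types(
    column_names: Sequence[str],
    column_types: Mapping[str, str],
    default_column_type: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Pick a type for every column (hypothetical helper, not Arrow code).

    A value of None means "no type was chosen, fall back to inference".
    """
    resolved: Dict[str, Optional[str]] = {}
    for name in column_names:
        if name in column_types:
            # An explicit per-column type always wins.
            resolved[name] = column_types[name]
        else:
            # Otherwise use the fallback; None keeps the pre-existing behavior.
            resolved[name] = default_column_type
    return resolved


names = ["id", "price", "comment"]

# Partial default: only "id" is typed explicitly, the rest use the fallback.
partial = resolve_column_types(names, {"id": "int64"}, default_column_type="string")

# No fallback given: unspecified columns are left to inference, as before.
unchanged = resolve_column_types(names, {"id": "int64"})
```

With no fallback set, the result for unspecified columns is the same as it would be without the new parameter, which is the backward-compatibility property the description relies on.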

What changes are included in this PR?

Are these changes tested?

Yes. Existing and new tests are passing.

C++:

> [==========] Running 3 tests from 1 test suite.
> [----------] Global test environment set-up.
> [----------] 3 tests from ReaderTests
> [ RUN      ] ReaderTests.DefaultColumnTypePartialDefault
> [       OK ] ReaderTests.DefaultColumnTypePartialDefault (3 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsWithHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsWithHeader (0 ms)
> [ RUN      ] ReaderTests.DefaultColumnTypeAllStringsNoHeader
> [       OK ] ReaderTests.DefaultColumnTypeAllStringsNoHeader (0 ms)
> [----------] 3 tests from ReaderTests (4 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 3 tests from 1 test suite ran. (4 ms total)
> [  PASSED  ] 3 tests.

All:

> [==========] 264 tests from 46 test suites ran. (452 ms total)
> [  PASSED  ] 264 tests.

pyarrow:
New tests are passing.

Are there any user-facing changes?

I believe this change is backward compatible. The parameter is optional, and its default value doesn't change the existing behavior; all the existing tests are passing.

Maybe relevant: #22232

Relates to #47502

@github-actions

⚠️ GitHub issue #47502 has been automatically assigned in GitHub to PR creator.

@vladborovtsov
Contributor Author

@github-actions crossbow submit preview-docs

@github-actions

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/18062577036

@vladborovtsov vladborovtsov marked this pull request as ready for review September 27, 2025 19:26
@kou kou changed the title GH-47502: Introduce optional default_column_type parameter GH-47502: [C++] Introduce optional default_column_type parameter Oct 13, 2025

@AlenkaF
Member

AlenkaF commented Oct 24, 2025

Thank you @vladborovtsov for the contribution.
I will add info about the proposed solution in the original issue (#22232) so I can see opinions from C++ devs on the proposed solution.

@AlenkaF AlenkaF changed the title GH-47502: [C++] Introduce optional default_column_type parameter GH-22232: [C++][Python] Introduce optional default_column_type parameter Oct 24, 2025
@github-actions

⚠️ GitHub issue #22232 has been automatically assigned in GitHub to PR creator.

@vladborovtsov
Contributor Author

Hi @AlenkaF
I'm happy to continue the labour and discussion to get that merged.
As for AI, it wasn't used much, although I tried :) With such a huge codebase, the generation quality is quite low.

@AlenkaF
Member

AlenkaF commented Oct 24, 2025

Happy to see a response!
All good, it is totally OK to use gen AI wisely ;)

I will wait for an opinion from a C++ dev and in the meantime try to look at the Python part.

@vladborovtsov
Contributor Author

Hi @AlenkaF
Any feedback?

Member

@AlenkaF AlenkaF left a comment

The Python part looks good to me. I only added a comment for the docstring examples.

@rok @raulcd do you have any opinions on the C++ approach to handle this use case?

Comment thread python/pyarrow/_csv.pyx Outdated
@github-actions github-actions Bot removed the awaiting review Awaiting review label Dec 12, 2025
Member

@pitrou pitrou left a comment

This looks good to me, just two minor comments below.

Comment thread cpp/src/arrow/csv/options.h
Comment thread cpp/src/arrow/csv/options.cc Outdated
Member

@AlenkaF AlenkaF left a comment

Thank you for your work on this!
I only have two extra comments for the Python part.

First is connected to the repr method for the ConvertOptions. It needs default_column_type to be added in the list:

arrow/python/pyarrow/_csv.pyx

Lines 1144 to 1164 in 0600621

def _repr_base(self):
    return (f"""
check_utf8={self.check_utf8},
column_types={self.column_types},
null_values={self.null_values},
true_values={self.true_values},
false_values={self.false_values},
decimal_point={self.decimal_point!r},
strings_can_be_null={self.strings_can_be_null},
quoted_strings_can_be_null={self.quoted_strings_can_be_null},
include_columns={self.include_columns},
include_missing_columns={self.include_missing_columns},
auto_dict_encode={self.auto_dict_encode},
auto_dict_max_cardinality={self.auto_dict_max_cardinality},
timestamp_parsers={[str(i) for i in self.timestamp_parsers]}""")

def __repr__(self):
    return (f"<pyarrow.csv.ConvertOptions>({self._repr_base()})")

def __str__(self):
    return (f"ConvertOptions({self._repr_base()})")

and in the test:

def test_convert_options(pickle_module):

The pickling compatibility is done ( __getstate__ and __setstate__).

The second comment is one extra assert in the test. Otherwise LGTM!
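As a rough illustration of the two points in this review (the new field appearing in the repr, and surviving a state round-trip), here is a minimal sketch using a hypothetical stand-in class. `OptionsSketch` is not `pyarrow.csv.ConvertOptions`; it only mirrors the `_repr_base` / `__getstate__` / `__setstate__` pattern quoted above, with types represented as strings. `copy.copy` is used because it goes through the same state hooks as pickling.

```python
import copy


class OptionsSketch:
    """Hypothetical stand-in for an options class; not pyarrow code."""

    def __init__(self, check_utf8=True, column_types=None, default_column_type=None):
        self.check_utf8 = check_utf8
        self.column_types = dict(column_types or {})
        self.default_column_type = default_column_type

    def _repr_base(self):
        # Every option, including the new one, is listed here so that
        # repr() and str() stay in sync with the available fields.
        return (f"check_utf8={self.check_utf8}, "
                f"column_types={self.column_types}, "
                f"default_column_type={self.default_column_type!r}")

    def __repr__(self):
        return f"<OptionsSketch>({self._repr_base()})"

    def __getstate__(self):
        # The new field must be part of the saved state...
        return (self.check_utf8, self.column_types, self.default_column_type)

    def __setstate__(self, state):
        # ...and must be restored in the same order.
        (self.check_utf8, self.column_types, self.default_column_type) = state


opts = OptionsSketch(column_types={"id": "int64"}, default_column_type="string")

# copy.copy() rebuilds the object via __getstate__/__setstate__,
# the same hooks pickle relies on.
restored = copy.copy(opts)
```

A test along these lines would assert both that the repr mentions `default_column_type` and that the restored object carries the same value.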

Comment thread python/pyarrow/tests/test_csv.py
@vladborovtsov vladborovtsov requested a review from AlenkaF May 4, 2026 20:41
@AlenkaF
Member

AlenkaF commented May 5, 2026

@vladborovtsov the CI failures seem to be related. Could you have a look?

@vladborovtsov vladborovtsov force-pushed the default-column-type branch from 14606b5 to 4b2dd10 Compare May 7, 2026 10:38
@vladborovtsov
Contributor Author

Hi @AlenkaF
I believe now it should be fine?
I'm running tests locally in docker using pytest -v --pyargs pyarrow.tests.test_csv.
Am I missing something?

Member

@AlenkaF AlenkaF left a comment

Yes, everything is looking good. I will just run the extended builds before merging. Thanks!

@AlenkaF
Member

AlenkaF commented May 12, 2026

@github-actions crossbow submit -g python

@github-actions

Revision: c1ad5f6

Submitted crossbow builds: ursacomputing/crossbow @ actions-9de857b990

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.14 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-debian-13-python-3-amd64 GitHub Actions
test-debian-13-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@AlenkaF
Member

AlenkaF commented May 13, 2026

The failing emscripten build is not related. Will merge.
Thanks!

@AlenkaF AlenkaF merged commit 3d6e138 into apache:main May 13, 2026
58 checks passed
@AlenkaF AlenkaF removed the awaiting change review Awaiting change review label May 13, 2026
@vladborovtsov vladborovtsov deleted the default-column-type branch May 13, 2026 15:01