Skip to content

GH-40348: [Python] Add Python wrapper for VariableShapeTensor#40354

Open
rok wants to merge 1 commit intoapache:mainfrom
rok:38007_python
Open

GH-40348: [Python] Add Python wrapper for VariableShapeTensor#40354
rok wants to merge 1 commit intoapache:mainfrom
rok:38007_python

Conversation

@rok
Copy link
Copy Markdown
Member

@rok rok commented Mar 4, 2024

Rationale for this change

Once C++ implementation of VariableShapeTensor is complete (#38007) we want a Python wrapper for VariableShapeTensor.

What changes are included in this PR?

This adds a Python wrapper for VariableShapeTensor.

Are these changes tested?

This adds appropriate Python tests.

Are there any user-facing changes?

Yes, a new extension and array type are exposed.

Copy link
Copy Markdown
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nits =) Thank you for working on this, takes lots of work!

@rok rok force-pushed the 38007_python branch 8 times, most recently from b6b63a6 to a372fe6 Compare April 6, 2024 10:24
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 6, 2024
@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented Apr 6, 2024

The failures are connected.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 6, 2024
@rok rok force-pushed the 38007_python branch 4 times, most recently from fab047e to f58c973 Compare June 7, 2024 00:28
@rok rok force-pushed the 38007_python branch 2 times, most recently from 5197532 to 55cc40b Compare August 26, 2024 17:55
@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
Comment on lines +265 to 268
internal::Permute<int64_t>(permutation, &shape);
ARROW_ASSIGN_OR_RAISE(
auto strides, internal::ComputeStrides(ext_type.value_type(), shape, permutation));
internal::Permute<int64_t>(permutation, &shape);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking thinking about this still. It's probably a bug or a different interpretation at least.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Feb 27, 2026
@apache apache deleted a comment from github-actions bot Feb 27, 2026
@rok rok force-pushed the 38007_python branch 3 times, most recently from 0df8139 to d5a548d Compare March 3, 2026 17:45
@rok
Copy link
Copy Markdown
Member Author

rok commented Mar 17, 2026

@AlenkaF could you do a quick pass on this? I'm a little unsure about the *row_splits APIs, here's more about the idea.

@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented Mar 18, 2026

@rok sure! Will have a look later today.

@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented Mar 19, 2026

I did a very quick look at the code now. I would suggest splitting this PR into two as there is quite a lot of lines added and there might be even something that can be moved out.

Moving the *row_splits API related code into a separate issue and PR would be my suggestion. Otherwise, from first look, they sound useful. Might also be good to have a chat on the issue if it would be in fact useful for somebody?

Add PyArrow bindings for the VariableShapeTensor extension type,
including VariableShapeTensorType, VariableShapeTensorArray, and
VariableShapeTensorScalar with support for converting to/from
NumPy tensors.
@rok
Copy link
Copy Markdown
Member Author

rok commented Apr 14, 2026

Moving the *row_splits API related code into a separate issue and PR would be my suggestion. Otherwise, from first look, they sound useful. Might also be good to have a chat on the issue if it would be in fact useful for somebody?

Agreed, I moved these into a separate PR and rebased here. I think this is ready for a more final review.

Copy link
Copy Markdown
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this @rok! Huge amount of work was needed here (and in the C++ implementation).

I did a first check but still need to go through most of the Cython code and all the tests.
As for the C++ permutation change - maybe that also fits better into a separate issue&PR?


shared_ptr[CArray] storage()

cdef extern from "arrow/extension/variable_shape_tensor.h" namespace "arrow::extension" nogil:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be moved little lower, after fixed shape tensor ref?

cdef extern from "arrow/extension/fixed_shape_tensor.h" namespace "arrow::extension" nogil:
cdef cppclass CFixedShapeTensorType \
" arrow::extension::FixedShapeTensorType"(CExtensionType):
" arrow::extension::FixedShapeTensorType"(CExtensionType) nogil:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason nogil is added here and used in the variable version also?

const shared_ptr[CBuffer] null_bitmap,
)

@staticmethod
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure where this is needed?

void set_chunksize(int64_t chunksize)

cdef cppclass CTensor" arrow::Tensor":
CTensor(const shared_ptr[CDataType]& type,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, I am not sure if we actually need this?

Comment on lines +5110 to +5111
if isinstance(obj, np.ndarray):
raise TypeError("obj must be a sequence of numpy arrays")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first check falls into the second one and could be omitted?

array_ndim = arrays[0].ndim
arrow_type = from_numpy_dtype(base_dtype)

if value_type is not None and value_type != arrow_type:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to accept value_type if the input is not empty as it has to be same as the base type anyways? Not sure what are other options, but feels a bit funny as we are giving this option but it then has to be specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants