Skip to content

[python] Support BlobView feature#8021

Open
discivigour wants to merge 34 commits into
apache:masterfrom
discivigour:p/blobView1
Open

[python] Support BlobView feature#8021
discivigour wants to merge 34 commits into
apache:masterfrom
discivigour:p/blobView1

Conversation

@discivigour
Copy link
Copy Markdown
Contributor

@discivigour discivigour commented May 28, 2026

Summary

  • Add Python BlobViewStruct / BlobView wire-format support and the blob-view-field option.
  • Store descriptor/view BLOB fields inline, validate bad inline field configuration and payloads, and avoid writing new .blob files for view fields.
  • Resolve blob-view fields during reads through catalog-aware lookup, returning bytes by default or upstream BlobDescriptor bytes when blob-as-descriptor=true.

Tests

  • DataBlobWriterTest.test_blob_view_fields_resolve_upstream_blob()
  • BlobTest.test_blob_view_struct_roundtrip()

@leaves12138
Copy link
Copy Markdown
Contributor

Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

@discivigour discivigour marked this pull request as ready for review June 1, 2026 15:46
@discivigour
Copy link
Copy Markdown
Contributor Author

Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

Thanks, it is ready for a full review now. Could you please take a look at this PR when you have a moment?

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I focused on the Python BlobView implementation against the existing Java semantics. A few behavior gaps below.

continue
values = result.column(field.name).to_pylist()
for row_id, value in zip(row_id_values, values):
if value is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java keeps null upstream blob values as a separate resolved state (PreloadedBlobViews.putNull(...)), and BlobViewResolvingRow then returns null for that view. Here nulls are skipped, so a valid view pointing to a null upstream blob will later fail in resolve_descriptor() as if the row id was missing. Please track null resolutions separately and make the read path return null for those views.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, I have completed this null feature.

return hash(self._descriptor)


class BlobView(Blob):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds the BlobView abstraction, but generic deserialization is still not wired up: Java Blob.fromBytes detects BlobViewStruct before BlobDescriptor and returns an unresolved BlobView. Python Blob.from_bytes() currently only checks descriptors, so serialized BlobViewStruct bytes fall through to BlobData; row.get_blob() / forwarding paths will treat view references as raw bytes instead of BlobView values. Please add BlobViewStruct detection there and cover the non-to_arrow() path in tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

def create_reader(self) -> RecordReader:
reader = self._create_raw_reader()

if (CoreOptions.blob_view_fields(self.table.options)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java has blob-view.resolve.enabled (default true) so callers can set it to false and preserve BlobViewStruct references when forwarding blob views to another table. This Python path resolves whenever blob-view-field is configured, so it loses that mode. Please add the option and gate the BlobInlineConvertReader wrapping on it, matching Java semantics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this param supported.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional pass focused on parity with the Java BlobView read path and schema validation.

def create_reader(self) -> RecordReader:
reader = self._create_raw_reader()

if ((CoreOptions.blob_view_fields(self.table.options) and CoreOptions.blob_view_resolve_enabled(
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Java only enters BlobView resolution in DataEvolutionTableRead when catalogContext != null, and it gives a clear error if readFactory is missing. This path wraps the reader based only on blob-view-field, so tables created with CatalogEnvironment.empty() such as FileStoreTable.from_path() still reach BlobViewLookup._load_table() and fail with an AttributeError on catalog_loader.load(). Please check catalog_environment.catalog_loader before enabling resolution: either skip resolving as Java does when no resolver context exists, or raise a clear error, and add coverage for the no-catalog-loader case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

This still seems not fully fixed. When catalog_loader is None, prescan is skipped, but view fields
still go through descriptor resolution. Blob.from_bytes() returns an unresolved BlobView, and
to_data() raises RuntimeError: BlobView is not resolved.

Should we skip BlobInlineConvertReader when no catalog_loader is available?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed. blob descriptor still needs BlobInlineConvertReader, so we can not skip it.

prescan_reader = self._prescan_reader_factory(self._view_fields)
try:
while True:
batch = prescan_reader.read_arrow_batch()
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This prescan exhausts the prescan reader before the first main batch is returned. In Python, limit is enforced by outer TableRead after a batch is returned or sliced, and DataEvolutionSplitRead itself never receives that limit. Java wires topN/limit into the prescan reader in DataEvolutionTableRead.configureBlobViewPrescanRead. As a result, Python with_limit(1) can still prescan and validate every later BlobViewStruct in the same split, and can even fail on a bad reference that the user-visible result would never return. Please pass limit/topN into the prescan scope, or at least add a blob-view + limit test to lock down the intended behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pypaimon, currently, the limit pushdown in the Read stage of the Append table is not supported. I will support this function in the next pr

Comment thread paimon-python/pypaimon/schema/schema.py Outdated
if 'blob' in str(field.type).lower()
]

if not blob_names:
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This early return skips validation for explicitly configured blob-field / blob-descriptor-field / blob-view-field when the schema currently has no BLOB fields. Java SchemaValidation validates these options unconditionally against BLOB fields. Today a table can be created with, for example, blob-view-field=picture while picture remains BYTES/STRING, and the later write/read paths either ignore or misuse that configuration. Please validate the configured fields first, then apply the BLOB-specific table constraints only when BLOB fields exist; blob-field should be included in that validation as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

Comment thread paimon-python/pypaimon/schema/schema.py Outdated

descriptor_fields = core_options.blob_descriptor_fields()
view_fields = core_options.blob_view_fields()
unknown_inline_fields = descriptor_fields.union(view_fields).difference(blob_field_names)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blob-field is not validated against BLOB fields.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added.

if not prescan_fields:
return EmptyRecordBatchReader()

prescan_read = DataEvolutionSplitRead(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prescan reader only projects the blob-view columns, so SplitRead.__init__ drops any predicate whose fields are outside that reduced projection, and it also does not receive the original read limit. As a result, BlobInlineConvertReader._prescan_view_structs() can preload and resolve BlobViewStructs from rows that the final read would filter out or skip. I reproduced this with two target rows where id = 1 is valid and id = 2 points at a missing upstream table: reading with with_filter(id == 1) or with_limit(1) still fails during prescan on row 2. The prescan should preserve the effective filter/limit semantics, or otherwise only scan rows that can be returned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this case is very rare. If the upstream table does not exist, an error should have been reported. Moreover, I have already introduced the limit pushdown function in another pr.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. The missing upstream table was only a minimal way to make the semantic issue visible. The correctness problem is that the prescan observes rows that the final query would filter out or skip. A filtered-out row may contain a stale BlobViewStruct, a permission/network failure, or simply many expensive references, and that should not make WHERE id = 1 fail or trigger extra upstream reads.

Also, a limit-pushdown change in another PR would not address the predicate case here, because this prescan rebuilds the read type with only blob-view fields, so predicates on non-view columns are trimmed before the prescan reader is built. I think this PR needs to preserve the effective filter/limit in the prescan path, matching the Java path, and add regression tests for both cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When another PR is merged, I will support this feature in BlobView Read

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supported.

Comment thread paimon-python/pypaimon/schema/schema.py Outdated
if primary_keys is not None:
raise ValueError("Blob type is not supported with primary key.")
# Validate Blob type fields in the schema
Schema._validate_blob_fields(fields, options, primary_keys)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation only runs when callers create schemas through Schema.from_pyarrow_schema(...). Callers can still construct Schema(fields=..., options=...) directly and SchemaManager.create_table will commit invalid blob-view-field / blob-descriptor-field settings, for example a blob-view-field pointing at a STRING column. Since direct Schema(...) construction is a supported/tested path, this validation should also run in SchemaManager.create_table after directives are applied, or otherwise be centralized before the schema is committed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python, currently there is no code similar to java Schema Validation to centrally validate option parameters. In the existing code, blob and vector-related parameters are all validated in schema.from pyarrow schema. If there is a need in the future, I will separately propose a pr to unify the validation of schema parameters.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that Python does not yet have a Java-style centralized validation layer, but this PR is adding new invariants and new read/write behavior that depends on them. Leaving the direct Schema(...) path unchecked means users can commit invalid blob-view-field / blob-descriptor-field options today, and the failure then moves to later writes/reads with much less clear errors.

This does not need a full validation refactor in this PR. A narrow fix would be to call the same blob-field validation from the schema commit path (for example in SchemaManager.create_table after the schema is materialized and before schema-0 is written), so both Schema.from_pyarrow_schema(...) and direct Schema(...) creation enforce the same invariants.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

leaves12138 and others added 19 commits June 3, 2026 21:25
# Conflicts:
#	paimon-python/pypaimon/common/options/core_options.py
#	paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py
#	paimon-python/pypaimon/read/reader/data_file_batch_reader.py
#	paimon-python/pypaimon/read/split_read.py
#	paimon-python/pypaimon/schema/schema.py
#	paimon-python/pypaimon/table/row/blob.py
#	paimon-python/pypaimon/tests/blob_test.py
#	paimon-python/pypaimon/write/writer/data_blob_writer.py

# Conflicts:
#	paimon-python/pypaimon/common/options/core_options.py
#	paimon-python/pypaimon/write/writer/dedicated_format_writer.py

# Conflicts:
#	paimon-python/pypaimon/read/reader/data_file_batch_reader.py

# Conflicts:
#	paimon-python/pypaimon/table/row/blob.py
# Conflicts:
#	paimon-python/pypaimon/read/split_read.py
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more pass on the latest revision, focused on schema-change parity with Java.

comment=schema.comment,
)

_validate_blob_fields(schema.fields, schema.options, schema.primary_keys)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This closes the direct create_table(Schema(...)) path, but schema changes still bypass the same Blob validation. commit_changes() builds a new TableSchema and calls commit() without running _validate_blob_fields() or _validate_blob_external_storage_fields(), whereas Java SchemaManager.commit() runs SchemaValidation.validateTableSchema(newSchema) for every schema commit. This means users can still alter_table with SetOption('blob-view-field', 'non_blob_col'), remove row-tracking.enabled / data-evolution.enabled from a table with BLOB columns, or add a BLOB column via directive without the required options, and the invalid schema is committed. Please centralize this validation before writing schema changes as well, ideally in the Python commit() path to match Java.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another schema-validation detail from comparing the Python check with Java's top-level BLOB detection.

options = {}

blob_field_names = {
field.name for field in fields if 'blob' in str(field.type).lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] This detects BLOB columns by substring-matching the rendered type, which is broader than Java's field.type().getTypeRoot() == DataTypeRoot.BLOB. A normal top-level ROW column whose nested field is named or typed with blob will now be treated as a top-level BLOB column, so create-table validation starts requiring row-tracking.enabled and data-evolution.enabled even though Java would not. For example payload ROW<blob_name STRING> renders with blob in the type string and becomes part of blob_field_names. Please check only top-level BLOB types here (for example getattr(field.type, 'type', None) == 'BLOB' or an AtomicType/type-root helper) and add a regression test for nested fields containing the word blob.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants