[python] Support BlobView feature#8021
Conversation
|
Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review. |
Thanks, it is ready for a full review now. Could you please take a look at this PR when you have a moment? |
JingsongLi
left a comment
There was a problem hiding this comment.
I focused on the Python BlobView implementation against the existing Java semantics. A few behavior gaps below.
| continue | ||
| values = result.column(field.name).to_pylist() | ||
| for row_id, value in zip(row_id_values, values): | ||
| if value is None: |
There was a problem hiding this comment.
Java keeps null upstream blob values as a separate resolved state (PreloadedBlobViews.putNull(...)), and BlobViewResolvingRow then returns null for that view. Here nulls are skipped, so a valid view pointing to a null upstream blob will later fail in resolve_descriptor() as if the row id was missing. Please track null resolutions separately and make the read path return null for those views.
There was a problem hiding this comment.
Thanks for the review, I have completed this null feature.
| return hash(self._descriptor) | ||
|
|
||
|
|
||
| class BlobView(Blob): |
There was a problem hiding this comment.
This adds the BlobView abstraction, but generic deserialization is still not wired up: Java Blob.fromBytes detects BlobViewStruct before BlobDescriptor and returns an unresolved BlobView. Python Blob.from_bytes() currently only checks descriptors, so serialized BlobViewStruct bytes fall through to BlobData; row.get_blob() / forwarding paths will treat view references as raw bytes instead of BlobView values. Please add BlobViewStruct detection there and cover the non-to_arrow() path in tests.
| def create_reader(self) -> RecordReader: | ||
| reader = self._create_raw_reader() | ||
|
|
||
| if (CoreOptions.blob_view_fields(self.table.options) |
There was a problem hiding this comment.
Java has blob-view.resolve.enabled (default true) so callers can set it to false and preserve BlobViewStruct references when forwarding blob views to another table. This Python path resolves whenever blob-view-field is configured, so it loses that mode. Please add the option and gate the BlobInlineConvertReader wrapping on it, matching Java semantics.
There was a problem hiding this comment.
this param supported.
| def create_reader(self) -> RecordReader: | ||
| reader = self._create_raw_reader() | ||
|
|
||
| if ((CoreOptions.blob_view_fields(self.table.options) and CoreOptions.blob_view_resolve_enabled( |
There was a problem hiding this comment.
[P2] Java only enters BlobView resolution in DataEvolutionTableRead when catalogContext != null, and it gives a clear error if readFactory is missing. This path wraps the reader based only on blob-view-field, so tables created with CatalogEnvironment.empty() such as FileStoreTable.from_path() still reach BlobViewLookup._load_table() and fail with an AttributeError on catalog_loader.load(). Please check catalog_environment.catalog_loader before enabling resolution: either skip resolving as Java does when no resolver context exists, or raise a clear error, and add coverage for the no-catalog-loader case.
There was a problem hiding this comment.
added.
This still seems not fully fixed. When catalog_loader is None, prescan is skipped, but view fields
still go through descriptor resolution. Blob.from_bytes() returns an unresolved BlobView, and
to_data() raises RuntimeError: BlobView is not resolved.
Should we skip BlobInlineConvertReader when no catalog_loader is available?
There was a problem hiding this comment.
changed. blob descriptor still needs BlobInlineConvertReader, so we can not skip it.
| prescan_reader = self._prescan_reader_factory(self._view_fields) | ||
| try: | ||
| while True: | ||
| batch = prescan_reader.read_arrow_batch() |
There was a problem hiding this comment.
[P2] This prescan exhausts the prescan reader before the first main batch is returned. In Python, limit is enforced by outer TableRead after a batch is returned or sliced, and DataEvolutionSplitRead itself never receives that limit. Java wires topN/limit into the prescan reader in DataEvolutionTableRead.configureBlobViewPrescanRead. As a result, Python with_limit(1) can still prescan and validate every later BlobViewStruct in the same split, and can even fail on a bad reference that the user-visible result would never return. Please pass limit/topN into the prescan scope, or at least add a blob-view + limit test to lock down the intended behavior.
There was a problem hiding this comment.
In pypaimon, currently, the limit pushdown in the Read stage of the Append table is not supported. I will support this function in the next pr
| if 'blob' in str(field.type).lower() | ||
| ] | ||
|
|
||
| if not blob_names: |
There was a problem hiding this comment.
[P2] This early return skips validation for explicitly configured blob-field / blob-descriptor-field / blob-view-field when the schema currently has no BLOB fields. Java SchemaValidation validates these options unconditionally against BLOB fields. Today a table can be created with, for example, blob-view-field=picture while picture remains BYTES/STRING, and the later write/read paths either ignore or misuse that configuration. Please validate the configured fields first, then apply the BLOB-specific table constraints only when BLOB fields exist; blob-field should be included in that validation as well.
|
|
||
| descriptor_fields = core_options.blob_descriptor_fields() | ||
| view_fields = core_options.blob_view_fields() | ||
| unknown_inline_fields = descriptor_fields.union(view_fields).difference(blob_field_names) |
There was a problem hiding this comment.
blob-field is not validated against BLOB fields.
| if not prescan_fields: | ||
| return EmptyRecordBatchReader() | ||
|
|
||
| prescan_read = DataEvolutionSplitRead( |
There was a problem hiding this comment.
This prescan reader only projects the blob-view columns, so SplitRead.__init__ drops any predicate whose fields are outside that reduced projection, and it also does not receive the original read limit. As a result, BlobInlineConvertReader._prescan_view_structs() can preload and resolve BlobViewStructs from rows that the final read would filter out or skip. I reproduced this with two target rows where id = 1 is valid and id = 2 points at a missing upstream table: reading with with_filter(id == 1) or with_limit(1) still fails during prescan on row 2. The prescan should preserve the effective filter/limit semantics, or otherwise only scan rows that can be returned.
There was a problem hiding this comment.
I think this case is very rare. If the upstream table does not exist, an error should have been reported. Moreover, I have already introduced the limit pushdown function in another pr.
There was a problem hiding this comment.
Thanks for checking. The missing upstream table was only a minimal way to make the semantic issue visible. The correctness problem is that the prescan observes rows that the final query would filter out or skip. A filtered-out row may contain a stale BlobViewStruct, a permission/network failure, or simply many expensive references, and that should not make WHERE id = 1 fail or trigger extra upstream reads.
Also, a limit-pushdown change in another PR would not address the predicate case here, because this prescan rebuilds the read type with only blob-view fields, so predicates on non-view columns are trimmed before the prescan reader is built. I think this PR needs to preserve the effective filter/limit in the prescan path, matching the Java path, and add regression tests for both cases.
There was a problem hiding this comment.
When another PR is merged, I will support this feature in BlobView Read
| if primary_keys is not None: | ||
| raise ValueError("Blob type is not supported with primary key.") | ||
| # Validate Blob type fields in the schema | ||
| Schema._validate_blob_fields(fields, options, primary_keys) |
There was a problem hiding this comment.
This validation only runs when callers create schemas through Schema.from_pyarrow_schema(...). Callers can still construct Schema(fields=..., options=...) directly and SchemaManager.create_table will commit invalid blob-view-field / blob-descriptor-field settings, for example a blob-view-field pointing at a STRING column. Since direct Schema(...) construction is a supported/tested path, this validation should also run in SchemaManager.create_table after directives are applied, or otherwise be centralized before the schema is committed.
There was a problem hiding this comment.
In python, currently there is no code similar to java Schema Validation to centrally validate option parameters. In the existing code, blob and vector-related parameters are all validated in schema.from pyarrow schema. If there is a need in the future, I will separately propose a pr to unify the validation of schema parameters.
There was a problem hiding this comment.
I agree that Python does not yet have a Java-style centralized validation layer, but this PR is adding new invariants and new read/write behavior that depends on them. Leaving the direct Schema(...) path unchecked means users can commit invalid blob-view-field / blob-descriptor-field options today, and the failure then moves to later writes/reads with much less clear errors.
This does not need a full validation refactor in this PR. A narrow fix would be to call the same blob-field validation from the schema commit path (for example in SchemaManager.create_table after the schema is materialized and before schema-0 is written), so both Schema.from_pyarrow_schema(...) and direct Schema(...) creation enforce the same invariants.
# Conflicts: # paimon-python/pypaimon/common/options/core_options.py # paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py # paimon-python/pypaimon/read/reader/data_file_batch_reader.py # paimon-python/pypaimon/read/split_read.py # paimon-python/pypaimon/schema/schema.py # paimon-python/pypaimon/table/row/blob.py # paimon-python/pypaimon/tests/blob_test.py # paimon-python/pypaimon/write/writer/data_blob_writer.py # Conflicts: # paimon-python/pypaimon/common/options/core_options.py # paimon-python/pypaimon/write/writer/dedicated_format_writer.py # Conflicts: # paimon-python/pypaimon/read/reader/data_file_batch_reader.py # Conflicts: # paimon-python/pypaimon/table/row/blob.py
JingsongLi
left a comment
There was a problem hiding this comment.
One more pass on the latest revision, focused on schema-change parity with Java.
| comment=schema.comment, | ||
| ) | ||
|
|
||
| _validate_blob_fields(schema.fields, schema.options, schema.primary_keys) |
There was a problem hiding this comment.
[P2] This closes the direct create_table(Schema(...)) path, but schema changes still bypass the same Blob validation. commit_changes() builds a new TableSchema and calls commit() without running _validate_blob_fields() or _validate_blob_external_storage_fields(), whereas Java SchemaManager.commit() runs SchemaValidation.validateTableSchema(newSchema) for every schema commit. This means users can still alter_table with SetOption('blob-view-field', 'non_blob_col'), remove row-tracking.enabled / data-evolution.enabled from a table with BLOB columns, or add a BLOB column via directive without the required options, and the invalid schema is committed. Please centralize this validation before writing schema changes as well, ideally in the Python commit() path to match Java.
JingsongLi
left a comment
There was a problem hiding this comment.
Another schema-validation detail from comparing the Python check with Java's top-level BLOB detection.
| options = {} | ||
|
|
||
| blob_field_names = { | ||
| field.name for field in fields if 'blob' in str(field.type).lower() |
There was a problem hiding this comment.
[P2] This detects BLOB columns by substring-matching the rendered type, which is broader than Java's field.type().getTypeRoot() == DataTypeRoot.BLOB. A normal top-level ROW column whose nested field is named or typed with blob will now be treated as a top-level BLOB column, so create-table validation starts requiring row-tracking.enabled and data-evolution.enabled even though Java would not. For example payload ROW<blob_name STRING> renders with blob in the type string and becomes part of blob_field_names. Please check only top-level BLOB types here (for example getattr(field.type, 'type', None) == 'BLOB' or an AtomicType/type-root helper) and add a regression test for nested fields containing the word blob.
Summary
Tests