[python] Support BlobView feature by discivigour · Pull Request #8021 · apache/paimon

discivigour · 2026-05-28T12:46:28Z

Summary

Add Python BlobViewStruct / BlobView wire-format support and the blob-view-field option.
Store descriptor/view BLOB fields inline, validate bad inline field configuration and payloads, and avoid writing new .blob files for view fields.
Resolve blob-view fields during reads through catalog-aware lookup, returning bytes by default or upstream BlobDescriptor bytes when blob-as-descriptor=true.

Tests

DataBlobWriterTest.test_blob_view_fields_resolve_upstream_blob()
BlobTest.test_blob_view_struct_roundtrip()

leaves12138 · 2026-05-31T12:51:01Z

Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

discivigour · 2026-06-02T04:02:16Z

Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

Thanks, it is ready for a full review now. Could you please take a look at this PR when you have a moment?

JingsongLi

I focused on the Python BlobView implementation against the existing Java semantics. A few behavior gaps below.

JingsongLi · 2026-06-02T06:03:58Z

+                continue
+            values = result.column(field.name).to_pylist()
+            for row_id, value in zip(row_id_values, values):
+                if value is None:


Java keeps null upstream blob values as a separate resolved state (PreloadedBlobViews.putNull(...)), and BlobViewResolvingRow then returns null for that view. Here nulls are skipped, so a valid view pointing to a null upstream blob will later fail in resolve_descriptor() as if the row id was missing. Please track null resolutions separately and make the read path return null for those views.

Thanks for the review, I have completed this null feature.

JingsongLi · 2026-06-02T06:03:58Z

        return hash(self._descriptor)
+
+
+class BlobView(Blob):


This adds the BlobView abstraction, but generic deserialization is still not wired up: Java Blob.fromBytes detects BlobViewStruct before BlobDescriptor and returns an unresolved BlobView. Python Blob.from_bytes() currently only checks descriptors, so serialized BlobViewStruct bytes fall through to BlobData; row.get_blob() / forwarding paths will treat view references as raw bytes instead of BlobView values. Please add BlobViewStruct detection there and cover the non-to_arrow() path in tests.

JingsongLi · 2026-06-02T06:03:58Z

    def create_reader(self) -> RecordReader:
+        reader = self._create_raw_reader()
+
+        if (CoreOptions.blob_view_fields(self.table.options)


Java has blob-view.resolve.enabled (default true) so callers can set it to false and preserve BlobViewStruct references when forwarding blob views to another table. This Python path resolves whenever blob-view-field is configured, so it loses that mode. Please add the option and gate the BlobInlineConvertReader wrapping on it, matching Java semantics.

this param supported.

JingsongLi

Additional pass focused on parity with the Java BlobView read path and schema validation.

JingsongLi · 2026-06-03T05:16:11Z

    def create_reader(self) -> RecordReader:
+        reader = self._create_raw_reader()
+
+        if ((CoreOptions.blob_view_fields(self.table.options) and CoreOptions.blob_view_resolve_enabled(


[P2] Java only enters BlobView resolution in DataEvolutionTableRead when catalogContext != null, and it gives a clear error if readFactory is missing. This path wraps the reader based only on blob-view-field, so tables created with CatalogEnvironment.empty() such as FileStoreTable.from_path() still reach BlobViewLookup._load_table() and fail with an AttributeError on catalog_loader.load(). Please check catalog_environment.catalog_loader before enabling resolution: either skip resolving as Java does when no resolver context exists, or raise a clear error, and add coverage for the no-catalog-loader case.

added.

This still seems not fully fixed. When catalog_loader is None, prescan is skipped, but view fields
still go through descriptor resolution. Blob.from_bytes() returns an unresolved BlobView, and
to_data() raises RuntimeError: BlobView is not resolved.

Should we skip BlobInlineConvertReader when no catalog_loader is available?

changed. blob descriptor still needs BlobInlineConvertReader, so we can not skip it.

JingsongLi · 2026-06-03T05:16:11Z

+        prescan_reader = self._prescan_reader_factory(self._view_fields)
+        try:
+            while True:
+                batch = prescan_reader.read_arrow_batch()


[P2] This prescan exhausts the prescan reader before the first main batch is returned. In Python, limit is enforced by outer TableRead after a batch is returned or sliced, and DataEvolutionSplitRead itself never receives that limit. Java wires topN/limit into the prescan reader in DataEvolutionTableRead.configureBlobViewPrescanRead. As a result, Python with_limit(1) can still prescan and validate every later BlobViewStruct in the same split, and can even fail on a bad reference that the user-visible result would never return. Please pass limit/topN into the prescan scope, or at least add a blob-view + limit test to lock down the intended behavior.

In pypaimon, currently, the limit pushdown in the Read stage of the Append table is not supported. I will support this function in the next pr

JingsongLi · 2026-06-03T05:16:11Z

+            if 'blob' in str(field.type).lower()
+        ]
+
+        if not blob_names:


[P2] This early return skips validation for explicitly configured blob-field / blob-descriptor-field / blob-view-field when the schema currently has no BLOB fields. Java SchemaValidation validates these options unconditionally against BLOB fields. Today a table can be created with, for example, blob-view-field=picture while picture remains BYTES/STRING, and the later write/read paths either ignore or misuse that configuration. Please validate the configured fields first, then apply the BLOB-specific table constraints only when BLOB fields exist; blob-field should be included in that validation as well.

XiaoHongbo-Hope · 2026-06-03T06:56:04Z

+
+        descriptor_fields = core_options.blob_descriptor_fields()
+        view_fields = core_options.blob_view_fields()
+        unknown_inline_fields = descriptor_fields.union(view_fields).difference(blob_field_names)


blob-field is not validated against BLOB fields.

Thanks, added.

JingsongLi · 2026-06-03T09:21:48Z

+        if not prescan_fields:
+            return EmptyRecordBatchReader()
+
+        prescan_read = DataEvolutionSplitRead(


This prescan reader only projects the blob-view columns, so SplitRead.__init__ drops any predicate whose fields are outside that reduced projection, and it also does not receive the original read limit. As a result, BlobInlineConvertReader._prescan_view_structs() can preload and resolve BlobViewStructs from rows that the final read would filter out or skip. I reproduced this with two target rows where id = 1 is valid and id = 2 points at a missing upstream table: reading with with_filter(id == 1) or with_limit(1) still fails during prescan on row 2. The prescan should preserve the effective filter/limit semantics, or otherwise only scan rows that can be returned.

I think this case is very rare. If the upstream table does not exist, an error should have been reported. Moreover, I have already introduced the limit pushdown function in another pr.

Thanks for checking. The missing upstream table was only a minimal way to make the semantic issue visible. The correctness problem is that the prescan observes rows that the final query would filter out or skip. A filtered-out row may contain a stale BlobViewStruct, a permission/network failure, or simply many expensive references, and that should not make WHERE id = 1 fail or trigger extra upstream reads.

Also, a limit-pushdown change in another PR would not address the predicate case here, because this prescan rebuilds the read type with only blob-view fields, so predicates on non-view columns are trimmed before the prescan reader is built. I think this PR needs to preserve the effective filter/limit in the prescan path, matching the Java path, and add regression tests for both cases.

When another PR is merged, I will support this feature in BlobView Read

JingsongLi · 2026-06-03T09:21:48Z

-            if primary_keys is not None:
-                raise ValueError("Blob type is not supported with primary key.")
+        # Validate Blob type fields in the schema
+        Schema._validate_blob_fields(fields, options, primary_keys)


This validation only runs when callers create schemas through Schema.from_pyarrow_schema(...). Callers can still construct Schema(fields=..., options=...) directly and SchemaManager.create_table will commit invalid blob-view-field / blob-descriptor-field settings, for example a blob-view-field pointing at a STRING column. Since direct Schema(...) construction is a supported/tested path, this validation should also run in SchemaManager.create_table after directives are applied, or otherwise be centralized before the schema is committed.

In python, currently there is no code similar to java Schema Validation to centrally validate option parameters. In the existing code, blob and vector-related parameters are all validated in schema.from pyarrow schema. If there is a need in the future, I will separately propose a pr to unify the validation of schema parameters.

I agree that Python does not yet have a Java-style centralized validation layer, but this PR is adding new invariants and new read/write behavior that depends on them. Leaving the direct Schema(...) path unchecked means users can commit invalid blob-view-field / blob-descriptor-field options today, and the failure then moves to later writes/reads with much less clear errors.

This does not need a full validation refactor in this PR. A narrow fix would be to call the same blob-field validation from the schema commit path (for example in SchemaManager.create_table after the schema is materialized and before schema-0 is written), so both Schema.from_pyarrow_schema(...) and direct Schema(...) creation enforce the same invariants.

# Conflicts: # paimon-python/pypaimon/common/options/core_options.py # paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py # paimon-python/pypaimon/read/reader/data_file_batch_reader.py # paimon-python/pypaimon/read/split_read.py # paimon-python/pypaimon/schema/schema.py # paimon-python/pypaimon/table/row/blob.py # paimon-python/pypaimon/tests/blob_test.py # paimon-python/pypaimon/write/writer/data_blob_writer.py # Conflicts: # paimon-python/pypaimon/common/options/core_options.py # paimon-python/pypaimon/write/writer/dedicated_format_writer.py # Conflicts: # paimon-python/pypaimon/read/reader/data_file_batch_reader.py # Conflicts: # paimon-python/pypaimon/table/row/blob.py

# Conflicts: # paimon-python/pypaimon/read/split_read.py

JingsongLi

One more pass on the latest revision, focused on schema-change parity with Java.

JingsongLi · 2026-06-03T14:01:07Z

                comment=schema.comment,
            )

+            _validate_blob_fields(schema.fields, schema.options, schema.primary_keys)


[P2] This closes the direct create_table(Schema(...)) path, but schema changes still bypass the same Blob validation. commit_changes() builds a new TableSchema and calls commit() without running _validate_blob_fields() or _validate_blob_external_storage_fields(), whereas Java SchemaManager.commit() runs SchemaValidation.validateTableSchema(newSchema) for every schema commit. This means users can still alter_table with SetOption('blob-view-field', 'non_blob_col'), remove row-tracking.enabled / data-evolution.enabled from a table with BLOB columns, or add a BLOB column via directive without the required options, and the invalid schema is committed. Please centralize this validation before writing schema changes as well, ideally in the Python commit() path to match Java.

JingsongLi

Another schema-validation detail from comparing the Python check with Java's top-level BLOB detection.

JingsongLi · 2026-06-03T14:17:16Z

+        options = {}
+
+    blob_field_names = {
+        field.name for field in fields if 'blob' in str(field.type).lower()


[P2] This detects BLOB columns by substring-matching the rendered type, which is broader than Java's field.type().getTypeRoot() == DataTypeRoot.BLOB. A normal top-level ROW column whose nested field is named or typed with blob will now be treated as a top-level BLOB column, so create-table validation starts requiring row-tracking.enabled and data-evolution.enabled even though Java would not. For example payload ROW<blob_name STRING> renders with blob in the type string and becomes part of blob_field_names. Please check only top-level BLOB types here (for example getattr(field.type, 'type', None) == 'BLOB' or an AtomicType/type-root helper) and add a regression test for nested fields containing the word blob.

discivigour force-pushed the p/blobView1 branch from 0c80caa to 4db881f Compare June 1, 2026 15:46

discivigour marked this pull request as ready for review June 1, 2026 15:46

JingsongLi reviewed Jun 2, 2026

View reviewed changes

discivigour force-pushed the p/blobView1 branch from 79c3f57 to f32d46f Compare June 3, 2026 02:30

JingsongLi reviewed Jun 3, 2026

View reviewed changes

discivigour force-pushed the p/blobView1 branch from d0536bb to dfc55bf Compare June 3, 2026 06:25

XiaoHongbo-Hope reviewed Jun 3, 2026

View reviewed changes

JingsongLi reviewed Jun 3, 2026

View reviewed changes

discivigour force-pushed the p/blobView1 branch from 27bcb14 to 4a54a87 Compare June 3, 2026 12:23

leaves12138 and others added 19 commits June 3, 2026 21:25

[python] Refine blob view lookup

e75d6c1

[python] Stabilize concurrent update test

45cb386

proto

0fecd79

simplify

975ec42

prescan

528ad93

# Conflicts: # paimon-python/pypaimon/read/split_read.py

fix

60ce34b

fix

a441a33

refine

19c8a53

fmt

96eed81

fix

f8326b9

rm

6feb435

fix

ac8499f

fmt

7aa3836

fix

8ef2ff0

fix

8fb7548

fix

17ae8ce

support blob-view.resolve.enabled and null blob value

967a0c6

remove redundant code

1035cb1

umi added 11 commits June 3, 2026 21:27

fix

9fe7eb0

from bytes supports BlobView

80af025

fix

77e1a62

rebase

a594ad3

fix

b0bae12

fix

ea2509b

fix

179a0bc

schema

819950b

fix

fac4e15

schemaCheck

48c13ff

fix

2ddb95f

JingsongLi reviewed Jun 3, 2026

View reviewed changes

umi added 3 commits June 3, 2026 22:25

supportLimit

41f849f

fmt

0383e5a

fix

c36806b

discivigour force-pushed the p/blobView1 branch from 4a54a87 to c36806b Compare June 3, 2026 14:41

fix

e2b20c6

Conversation

discivigour commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Uh oh!

leaves12138 commented May 31, 2026

Uh oh!

discivigour commented Jun 2, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JingsongLi left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

discivigour commented May 28, 2026 •

edited

Loading

JingsongLi left a comment •

edited

Loading

JingsongLi Jun 3, 2026 •

edited

Loading

JingsongLi Jun 3, 2026 •

edited

Loading

JingsongLi Jun 3, 2026 •

edited

Loading