feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor #795
Merged
…'s complex_types_as_json

The connector's `_use_arrow_native_complex_types` toggle is honoured by the Thrift backend (forwarded server-side as `complexTypesAsArrow`) but was silently ignored by the kernel backend: the kernel always returned native Arrow `List` / `Map` / `Struct` regardless. This was the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in `complex_types_as_json` post-processor (kernel PR #36) that rewrites complex columns to `Utf8` columns of compact JSON text, matching the Thrift wire format byte-for-byte. This change wires the connector's existing kwarg through to that flag:

- `session.py`: pass `_use_arrow_native_complex_types` to the kernel client (it was being dropped on the floor for the kernel branch).
- `backend/kernel/client.py`: read it from kwargs (default `True`, matching the connector-wide default), invert at the boundary, and set `complex_types_as_json=not _use_arrow_native_complex_types` on the kernel `Session()` constructor.
- `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field` to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it already did this for `VARIANT`). When the kernel JSON path is on, the columns arrive as `Utf8` but the kernel preserves the original SQL type name in metadata; `description` should report `array` / `map` / `struct`, matching what the Thrift backend reports under `complexTypesAsArrow=False`.

Verified end-to-end against the pecotesting comparator workspace: the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift server-side bug: Thrift emits invalid JSON for map values containing embedded `"` characters (`{"k":"val with "quote""}`, with an unescaped inner quote), while the kernel emits the correctly-escaped form (`{"k":"val with \"quote\""}`). The kernel is right here; matching Thrift would mean deliberately producing un-parseable output.

Unit tests:

- Parametrised test of `_use_arrow_native_complex_types` (default / True / False) → kernel `Session(complex_types_as_json=…)`.
- Parametrised test of `description_from_arrow_schema` recovering `array` / `map` / `struct` from metadata, case-insensitively.
- Negative test that an unknown `databricks.type_name` defers to the Arrow type rather than corrupting the description.

85 → 94 kernel unit tests; full suite green; black-formatted.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
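The wiring described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the connector's actual code: `FieldStub`, `kernel_session_kwargs`, and `databricks_type_for_field` are hypothetical stand-ins, the stub uses plain `str` metadata rather than a real Arrow field, and the fallback simply echoes the Arrow type name instead of performing the connector's real Arrow-to-SQL mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping


@dataclass
class FieldStub:
    """Hypothetical stand-in for an Arrow field (name, type name, metadata)."""

    name: str
    arrow_type: str  # e.g. "string" for Utf8, "int64"
    metadata: Dict[str, str] = field(default_factory=dict)


# SQL type names that are trusted from `databricks.type_name` metadata.
# VARIANT was already handled; ARRAY / MAP / STRUCT are the ones this PR adds.
_METADATA_NAMED_TYPES = {"variant", "array", "map", "struct"}


def kernel_session_kwargs(connector_kwargs: Mapping[str, Any]) -> Dict[str, Any]:
    """Translate the connector toggle into the kernel flag (assumed shape)."""
    # Default True matches the connector-wide default described above.
    use_native = connector_kwargs.get("_use_arrow_native_complex_types", True)
    # Inversion at the boundary: native Arrow on means JSON post-processing off.
    return {"complex_types_as_json": not use_native}


def databricks_type_for_field(f: FieldStub) -> str:
    """Prefer the metadata type name for known complex types, else the Arrow type."""
    reported = f.metadata.get("databricks.type_name", "").lower()
    if reported in _METADATA_NAMED_TYPES:
        return reported
    # Unknown names (e.g. "INT") defer to the Arrow shape. The real
    # Arrow-to-SQL name mapping is not shown in this sketch.
    return f.arrow_type


if __name__ == "__main__":
    print(kernel_session_kwargs({"_use_arrow_native_complex_types": False}))
    print(databricks_type_for_field(FieldStub("c", "string", {"databricks.type_name": "MAP"})))
```

Keeping the inversion in one small function makes the three parametrised cases (default, `True`, `False`) easy to cover in a unit test.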
gopalldb
approved these changes
May 20, 2026
Summary

The connector's `_use_arrow_native_complex_types` toggle is honoured by the Thrift backend (forwarded server-side as `complexTypesAsArrow`) but was silently ignored by the kernel backend: the kernel always returned native Arrow `List` / `Map` / `Struct` regardless of the flag. This was the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in `complex_types_as_json` post-processor in databricks/databricks-sql-kernel#36 that rewrites complex columns to `Utf8` columns of compact JSON text, matching the Thrift wire format byte-for-byte. This PR wires the connector's existing kwarg through to that flag.

Changes

- `session.py`: pass `_use_arrow_native_complex_types` to `KernelDatabricksClient` (it was being dropped on the floor for the kernel branch in `_create_backend`).
- `backend/kernel/client.py`: read the kwarg in `__init__` (default `True`, matching the connector-wide default), invert at the boundary, and set `complex_types_as_json=not _use_arrow_native_complex_types` on the kernel `Session()` constructor.
- `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field` to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it already did this for `VARIANT`). When the kernel JSON path is on, the columns arrive as `Utf8` but the kernel preserves the original SQL type name in field metadata; `description` should report `array` / `map` / `struct`, matching what the Thrift backend reports under `complexTypesAsArrow=False`.

Why the dependency on kernel#36

This PR is a no-op on its own: without the kernel-side post-processor, passing `complex_types_as_json=True` to `_kernel.Session()` is just an unrecognised kwarg. Once kernel#36 lands and the kernel wheel is rebuilt, this PR completes the wiring and unblocks the comparator's COMPLEX_TYPES_DISABLED suite. The connector code is correct regardless of merge ordering: the kernel side rejects unknown kwargs with a clear error if the consumer somehow gets a stale wheel.

Test plan

Unit tests

- `test_open_session_passes_complex_types_as_json_to_kernel`: verifies the boundary inversion (connector `True` / unset → kernel `False`; connector `False` → kernel `True`).
- `test_description_recovers_complex_type_name_from_metadata`: verifies `description_from_arrow_schema` recovers `array` / `map` / `struct` (case-insensitively) from `databricks.type_name` metadata on `Utf8` columns.
- `test_description_passes_through_unknown_databricks_type_name`: confirms unknown server-reported names (`"INT"` on an `int64` Arrow column) defer to the Arrow shape rather than corrupting the description.
- `black` clean.

End-to-end against pecotesting + comparator harness

Ran the focused `thrift-vs-kernel-complex` config (`COMPLEX_TYPES` suite × `THRIFT_VS_KERNEL_COMPLEX_DISABLED` / `THRIFT_VS_KERNEL_COMPLEX_ENABLED`):

- `COMPLEX_ENABLED` (native Arrow)
- `COMPLEX_DISABLED` (JSON strings)

The remaining row diff is a server-side Thrift bug: Thrift emits invalid JSON for map values containing embedded `"` characters (`{"k":"val with "quote""}`, with an unescaped inner quote) while the kernel emits the correctly-escaped form (`{"k":"val with \"quote\""}`). The kernel is right here; matching Thrift would mean deliberately producing un-parseable JSON. Belongs in the comparator's `result_set_filters` as a known-divergent row, not in either driver.

This pull request and its description were written by Claude Code.