feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor#795

Merged
vikrantpuppala merged 1 commit into main from feat/kernel-complex-types-as-json
May 20, 2026
@vikrantpuppala

Summary

The connector's _use_arrow_native_complex_types toggle is honoured by the Thrift backend (forwarded server-side as complexTypesAsArrow) but was silently ignored by the kernel backend — the kernel always returned native Arrow List / Map / Struct regardless of the flag. This was the root cause of the 5 THRIFT_VS_KERNEL_COMPLEX_DISABLED diffs in the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in complex_types_as_json post-processor in databricks/databricks-sql-kernel#36 that rewrites complex columns to Utf8 columns of compact JSON text, matching the Thrift wire format byte-for-byte. This PR wires the connector's existing kwarg through to that flag.

Changes

  • session.py — pass _use_arrow_native_complex_types to KernelDatabricksClient (it was being dropped on the floor for the kernel branch in _create_backend).
  • backend/kernel/client.py — read the kwarg in __init__ (default True, matching the connector-wide default), invert at the boundary, and set complex_types_as_json=not _use_arrow_native_complex_types on the kernel Session() constructor.
  • backend/kernel/type_mapping.py — extend _databricks_type_for_field to honour databricks.type_name for ARRAY / MAP / STRUCT (it already did this for VARIANT). When the kernel JSON path is on, the columns arrive as Utf8 but the kernel preserves the original SQL type name in field metadata; description should report array / map / struct, matching what the Thrift backend reports under complexTypesAsArrow=False.
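The boundary inversion described for backend/kernel/client.py can be sketched as follows. This is a minimal illustration, not the connector's actual code: `kernel_session_kwargs` is a hypothetical helper name, and the real client passes the flag directly to the kernel `Session()` constructor.

```python
from typing import Any, Dict


def kernel_session_kwargs(connector_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map connector kwargs onto kernel Session kwargs."""
    # Default True matches the connector-wide default described above.
    use_native = connector_kwargs.get("_use_arrow_native_complex_types", True)
    # The two flags have opposite polarity, so invert exactly once, here.
    return {"complex_types_as_json": not use_native}
```

Doing the inversion in a single place keeps the rest of the connector in terms of the public kwarg, so only the boundary needs to know the kernel's flag name.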

Why the dependency on kernel#36

This PR is a no-op on its own — without the kernel-side post-processor, passing complex_types_as_json=True to _kernel.Session() is just an unrecognised kwarg. Once kernel#36 lands and the kernel wheel is rebuilt, this PR completes the wiring and unblocks the comparator's COMPLEX_TYPES_DISABLED suite. The connector code is correct regardless of merge ordering — the kernel side rejects unknown kwargs with a clear error if the consumer somehow gets a stale wheel.

Test plan

Unit tests

  • New parametrised test test_open_session_passes_complex_types_as_json_to_kernel — verifies the boundary inversion: connector True / unset → kernel False; connector False → kernel True.
  • New parametrised test test_description_recovers_complex_type_name_from_metadata — verifies description_from_arrow_schema recovers array / map / struct (case-insensitively) from databricks.type_name metadata on Utf8 columns.
  • New negative test test_description_passes_through_unknown_databricks_type_name — confirms unknown server-reported names ("INT" on an int64 Arrow column) defer to the Arrow shape rather than corrupting the description.
  • 85 → 94 kernel-backend unit tests; full kernel suite green; black clean.
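The behaviour the two description tests pin down can be sketched without the connector or pyarrow. The function below is a hypothetical stand-in for `_databricks_type_for_field`: it takes the name the Arrow type would otherwise map to and a metadata mapping with bytes keys and values (an assumption about how the field metadata is exposed), and the set of trusted names is likewise assumed from the description above.

```python
from typing import Mapping, Optional

# Assumption: the metadata override is trusted only for these SQL type names.
_METADATA_TYPE_NAMES = {"variant", "array", "map", "struct"}


def databricks_type_name(
    arrow_type_name: str, metadata: Optional[Mapping[bytes, bytes]]
) -> str:
    """Hypothetical stand-in for _databricks_type_for_field."""
    raw = (metadata or {}).get(b"databricks.type_name")
    if raw is not None:
        # Compare case-insensitively: the server may report ARRAY or array.
        name = raw.decode("utf-8").lower()
        if name in _METADATA_TYPE_NAMES:
            return name
    # Unknown or missing names defer to the Arrow-derived name.
    return arrow_type_name
```

Under these assumptions, a Utf8 column tagged `ARRAY` reports `array`, while an int64 column tagged `INT` keeps whatever name its Arrow type maps to.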

End-to-end against pecotesting + comparator harness

Ran the focused thrift-vs-kernel-complex config (COMPLEX_TYPES suite × THRIFT_VS_KERNEL_COMPLEX_DISABLED / THRIFT_VS_KERNEL_COMPLEX_ENABLED):

| Run | Before | After |
|---|---|---|
| COMPLEX_ENABLED (native Arrow) | match | match |
| COMPLEX_DISABLED (JSON strings) | 5 type-shape diffs + 1 row diff | 0 type-shape diffs, 1 row diff |

The remaining row diff is a server-side Thrift bug: Thrift emits invalid JSON for map values containing embedded " characters ({"k":"val with "quote""}, with an unescaped inner quote), while the kernel emits the correctly escaped form ({"k":"val with \"quote\""}). The kernel is right here; matching Thrift would mean deliberately producing un-parseable JSON. This row belongs in the comparator's result_set_filters as a known-divergent row, not in either driver.
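The difference between the two strings is easy to check with Python's standard `json` module. This is only an illustration of the parse failure, using the two payloads quoted above; `is_valid_json` is a hypothetical helper and is not part of the comparator.

```python
import json

# The two payloads quoted above, written as raw strings to keep backslashes literal.
thrift_text = r'{"k":"val with "quote""}'
kernel_text = r'{"k":"val with \"quote\""}'


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Compact separators reproduce the kernel's form for the same value.
compact = json.dumps({"k": 'val with "quote"'}, separators=(",", ":"))
```

Here `is_valid_json(thrift_text)` is false and `is_valid_json(kernel_text)` is true, and `compact` equals `kernel_text`, which is consistent with treating the Thrift row as the divergent one.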

This pull request and its description were written by Claude Code.

…'s complex_types_as_json

The connector's `_use_arrow_native_complex_types` toggle is honoured
by the Thrift backend (forwarded server-side as `complexTypesAsArrow`)
but was silently ignored by the kernel backend — the kernel always
returned native Arrow `List` / `Map` / `Struct` regardless. This was
the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in
the comparator's COMPLEX_TYPES suite.

The kernel side gained an opt-in `complex_types_as_json` post-
processor (kernel PR #36) that rewrites complex columns to `Utf8`
columns of compact JSON text, matching the Thrift wire format
byte-for-byte. This change wires the connector's existing kwarg
through to that flag:

- `session.py`: pass `_use_arrow_native_complex_types` to the kernel
  client (it was being dropped on the floor for the kernel branch).
- `backend/kernel/client.py`: read it from kwargs (default `True`,
  matching the connector-wide default), invert at the boundary, and
  set `complex_types_as_json=not _use_arrow_native_complex_types`
  on the kernel `Session()` constructor.
- `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field`
  to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it
  already did this for `VARIANT`). When the kernel JSON path is on,
  the columns arrive as `Utf8` but the kernel preserves the original
  SQL type name in metadata; `description` should report `array` /
  `map` / `struct`, matching what the Thrift backend reports under
  `complexTypesAsArrow=False`.

Verified end-to-end against the pecotesting comparator workspace:
the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape
diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift
server-side bug — Thrift emits invalid JSON for map values containing
embedded `"` characters (`{"k":"val with "quote""}` — unescaped
inner quote), while the kernel emits the correctly-escaped form
(`{"k":"val with \"quote\""}`). The kernel is right here; matching
Thrift would mean deliberately producing un-parseable output.

Unit tests:
- Parametrised test of `_use_arrow_native_complex_types` (default /
  True / False) → kernel `Session(complex_types_as_json=…)`.
- Parametrised test of `description_from_arrow_schema` recovering
  `array` / `map` / `struct` from metadata, case-insensitively.
- Negative test that an unknown `databricks.type_name` defers to the
  Arrow type rather than corrupting the description.

85 → 94 kernel unit tests; full suite green; black-formatted.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala merged commit 0c10d7b into main May 20, 2026
34 checks passed
@jprakash-db jprakash-db mentioned this pull request May 21, 2026
3 tasks