VirtualServer: coerce_column missing support for common Arrow types from external databases #3149
Description
Bug
When using a VirtualServerHandler backed by an external database (e.g. DuckDB WASM), from_arrow_ipc fails with "Unknown Arrow IPC type" errors for several common Arrow types that DuckDB and other databases produce.
The coerce_column function in data.rs only handles a subset of Arrow types. Types it does not handle fall through to the catch-all, which converts to Utf8 via format!("{:?}", array), producing debug representations instead of the actual values.
Missing Types
Unsigned integers (UInt8, UInt16, UInt32, UInt64)
DuckDB uses unsigned integer types for internal columns (e.g. UTINYINT, USMALLINT, UINTEGER, UBIGINT). These should coerce to Int64.
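One caveat worth noting: UInt64 values above i64::MAX have no Int64 representation. A minimal sketch of a checked upcast (the helper name is illustrative, not from the patch):

```rust
// Hypothetical helper: checked upcast of an unsigned 64-bit value to i64.
// Values above i64::MAX cannot be represented in Int64 and map to None,
// which would surface as a null in the coerced column.
fn upcast_u64(v: u64) -> Option<i64> {
    i64::try_from(v).ok()
}
```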
Small integers (Int8, Int16)
DuckDB TINYINT and SMALLINT produce Int8/Int16 Arrow types. These should coerce to Int64.
Float32
Should coerce to Float64.
Decimal128
DuckDB DECIMAL types produce Decimal128. Should coerce to Float64 using the scale factor.
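The scale conversion itself is simple arithmetic; a sketch of the core step, assuming the raw i128 value and scale have already been read off the Decimal128 column (the helper name is illustrative):

```rust
// Hypothetical helper: a Decimal128 cell stores an i128 scaled by
// 10^scale; dividing by the scale factor recovers the value as f64.
fn decimal128_to_f64(raw: i128, scale: i8) -> f64 {
    raw as f64 / 10f64.powi(scale as i32)
}
```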
Date64
Should coerce to Date32 (days since epoch).
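A sketch of the unit conversion, assuming Date64 carries milliseconds since the Unix epoch (the helper name is illustrative). div_euclid floors toward negative infinity, so pre-epoch dates land on the correct day:

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

// Hypothetical helper: Date64 is milliseconds since the epoch,
// Date32 is whole days since the epoch.
fn date64_to_date32(millis: i64) -> i32 {
    millis.div_euclid(MILLIS_PER_DAY) as i32
}
```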
Time32/Time64 (all units)
DuckDB TIME produces various time types. Should coerce to Timestamp(Millisecond).
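The four time variants differ only in their unit; a sketch of the normalization to milliseconds, using a stand-in enum for the Arrow variants (names are illustrative, not arrow-rs types):

```rust
// Hypothetical enum standing in for the four Arrow time variants the
// match arm must cover; each value is normalized to milliseconds.
enum TimeValue {
    Time32Seconds(i32),
    Time32Millis(i32),
    Time64Micros(i64),
    Time64Nanos(i64),
}

fn to_millis(t: TimeValue) -> i64 {
    match t {
        TimeValue::Time32Seconds(s) => s as i64 * 1_000,
        TimeValue::Time32Millis(ms) => ms as i64,
        TimeValue::Time64Micros(us) => us / 1_000,
        TimeValue::Time64Nanos(ns) => ns / 1_000_000,
    }
}
```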
LargeUtf8
DuckDB uses LargeUtf8 for large strings. Should coerce to Utf8.
Dictionary-encoded columns
DuckDB dictionary-encodes low-cardinality string columns. Both coerce_column and extract_scalar need to decode dictionary arrays to plain Utf8, handling all key types (Int8/16/32/64, UInt8/16/32/64) and both Utf8 and LargeUtf8 value types.
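The decode step can be sketched independently of arrow-rs types: a dictionary column stores per-row keys indexing into a shared values array, and decoding replaces each key with the string it points at (the function below is a simplified illustration; in the real fix the keys may be any of the eight integer types and the values Utf8 or LargeUtf8):

```rust
// Hypothetical sketch of dictionary decoding: each key indexes into the
// shared values array; None keys stay null in the decoded column.
fn decode_dictionary(keys: &[Option<usize>], values: &[&str]) -> Vec<Option<String>> {
    keys.iter()
        .map(|k| k.map(|i| values[i].to_string()))
        .collect()
}
```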
Empty RecordBatch (zero data columns)
When a grouped view has only metadata columns (__GROUPING_ID__, __ROW_PATH_N__) and no data columns (e.g. filter dropdown views with columns: []), Phase B of from_arrow_ipc strips all columns, then RecordBatch::try_new fails with "must either specify a row count or at least one column". Should use try_new_with_options with with_row_count(Some(num_rows)).
Fix
The full set of additions to coerce_column in data.rs:
```rust
// Unsigned integers → Int64
DataType::UInt8 => { /* upcast to Int64 */ },
DataType::UInt16 => { /* upcast to Int64 */ },
DataType::UInt32 => { /* upcast to Int64 */ },
DataType::UInt64 => { /* upcast to Int64 */ },
// Small signed integers → Int64
DataType::Int8 => { /* upcast to Int64 */ },
DataType::Int16 => { /* upcast to Int64 */ },
// Float32 → Float64
DataType::Float32 => { /* upcast to Float64 */ },
// Decimal128 → Float64 (divide by 10^scale)
DataType::Decimal128(_, scale) => { /* convert to Float64 */ },
// Date64 → Date32
DataType::Date64 => { /* convert millis to days */ },
// Time types → Timestamp(Millisecond)
DataType::Time32(TimeUnit::Second | TimeUnit::Millisecond)
| DataType::Time64(TimeUnit::Microsecond | TimeUnit::Nanosecond) => { /* ... */ },
// LargeUtf8 → Utf8
DataType::LargeUtf8 => { /* copy to StringBuilder */ },
// Dictionary → decode to Utf8
DataType::Dictionary(key_type, _) => { /* decode all key types to plain strings */ },
```

And in extract_scalar, add Dictionary handling for row path extraction with the same key type coverage.
And in from_arrow_ipc, handle empty batches after Phase B:
```rust
if new_schema.fields().is_empty() {
    self.frozen = Some(RecordBatch::try_new_with_options(
        new_schema,
        new_arrays,
        &RecordBatchOptions::new().with_row_count(Some(num_rows)),
    )?);
}
```

Impact
Without these fixes, any VirtualServerHandler connected to DuckDB (or similar databases that produce these Arrow types) will fail or show garbled data for:
- Tables with unsigned integer columns
- Tables with small integer columns (TINYINT, SMALLINT)
- Tables with DECIMAL columns
- Tables with TIME columns
- Dictionary-encoded string columns (common in DuckDB for low-cardinality data)
- Filter dropdown autocomplete (empty column case)