[Variant] extend shredded null handling for arrays#9599
[Variant] extend shredded null handling for arrays#9599sdf-jkl wants to merge 13 commits intoapache:mainfrom
Conversation
| Variant::List(list) => { | ||
| for element in list.iter() { | ||
| self.element_builder.append_value(element)?; | ||
| match element { |
There was a problem hiding this comment.
Append value would miss the shred_variant Null-handling semantics without it.
VariantToListArrowRowBuilder builds the Arrow typed array in an interesting way. It internally stores a StructArray with both value and typed_value fields, so both are kept.
There was a problem hiding this comment.
Not sure I understand.
We already know we're dealing with an array element, so what goes wrong if we just write Variant::Null? That would append the value and mark it as non-null, which seems to be exactly what append_to_null_row does for ANM::ArrayElement case?
match self {
Self::TopLevelVariant => nulls.append_null(),
=> Self::ObjectField | Self::ArrayElement => nulls.append_non_null(),
}
match self {
Self::TopLevelVariant | Self::ObjectField => value_builder.append_null(),
=> Self::ArrayElement => value_builder.append_value(Variant::Null),
}There was a problem hiding this comment.
It almost seems like there's actually no need for the third enum variant because an array element builder should never experience an append_null call in the first place? If so, we only could rely on that structural invariant to handle variant arrays, and we only need a boolean to control whether append_null causes nulls.append_null (top-level) vs. nulls.append_non_null (object field), with an unconditional value_builder.append_null()?
But I wonder what the actual call stack/object hierarchy is in these various cases? Maybe I'm misunderstanding the layout and how the various builders relate to each other.
There was a problem hiding this comment.
I added a test showing an error with using just Variant::Null.
If we use variant_get(..., List<Int64>) and cast_options(safe: False) // strict casting, appended Variant::Null can't be safely cast to the List's inner type causing an Err.
By using append_null we avoid this error.
There was a problem hiding this comment.
I didn't file one yet. I didn't realize that this was a problem existing code until you pointed it out.
There was a problem hiding this comment.
There was a problem hiding this comment.
Here, we need to distinguish whether the element is Variant::Null or not because Variant::Null can't be cast to any type. Will it be better if we handle this logic in the following code (test whether the passed in is Variant::Null)
arrow-rs/parquet-variant-compute/src/variant_to_arrow.rs
Lines 709 to 724 in 980ea0b
There was a problem hiding this comment.
Yeah, I glanced at that same code yesterday and reached a similar conclusion. We either need to teach the $transform to be a tri-state (e.g. Result<Option<...>>) or the macro needs to handle Variant::Null separately, bypassing the $transform entirely. I don't know which one is better, would need to stare more at the code.
| typed_value_builder: PrimitiveVariantToArrowRowBuilder<'a>, | ||
| nulls: NullBufferBuilder, | ||
| top_level: bool, | ||
| null_mode: AppendNullMode, |
There was a problem hiding this comment.
Is it a reasonable solution to move nulls into null_mode?
There was a problem hiding this comment.
Could you elaborate, please?
There was a problem hiding this comment.
Do we have any suitable sample data from other variant-wielding engines that we could cross-check this change against? This stuff is super subtle and I'm nervous code review alone might not catch all potential issues. Even if this PR exactly replicates what the bug report says is the correct behavior, there's a non-zero chance the bug report is wrong, and/or that other engines have chosen to interpret the variant spec differently. If other engines do differ, that would be good to know ASAP, regardless of who's "right."
scovich
left a comment
There was a problem hiding this comment.
As I try to refresh my memories of how this is all supposed to work, I realize there aren't many unit tests for the top-level and object field cases -- at least not visible in the PR diff. Even tho there is a behavior change for top-level field handling.
Do we have test gaps here?
The only behavior change in this PR is I added a focused unit test to cover that path ( |
|
Need #9612 to pass the tests |
| Variant::List(list) => { | ||
| for element in list.iter() { | ||
| self.element_builder.append_value(element)?; | ||
| match element { |
There was a problem hiding this comment.
First: That example sounds wrong? Shouldn't a variant_get that hits Variant::Null always return NULL?
Variant::Null coerces to a NULL of any type, no? (unlike, say encountering Variant::Float when List<Int64> was requested). Maybe this is a special case symptom of #9606?
Second: But anyway, this specific code is unpacking list elements not the list itself?
What specific variant value and json path did you hit problems for?
Third: This code that unpacks list elements used to just append_value(value) (without ever knowing value = Variant::Null), which would produce a non-NULL value column entry containing Variant::Null. The new code specifically checks for Variant::Null and calls append_null(). But the element builder was initialized in ArrayElement mode, so append_null does the exact same thing -- produces a non-NULL value column entry containing Variant::Null. Based on that, I'm struggling to see how this change would alter behavior at all?
| Variant::List(list) => { | ||
| for element in list.iter() { | ||
| self.element_builder.append_value(element)?; | ||
| match element { |
There was a problem hiding this comment.
Oh wait... this is going from shredded variant to some kind of ListArray, in response to variant_get(..., List<Int64>). That operation shouldn't produce shredded variant as output, so we shouldn't need to constrict it with a NullValue in the first place?? It should produce an actual list.
If we want to extract a shredded list as output, we'd need to pass variant_get(..., List<Struct>) where the list is annotated as variant extension type and the struct has value and typed_value fields of the appropriate types?
Not in strict casting. In strict casting any cast that returns
That list would error on trying to cast
It does end up appending the same
Earlier I mentioned that:
I think that currently its sole purpose is to be used by
I think you're right and it should be a separate behavior of that |
|
|
Right, but the way safe casting is implemented in variant_to_arrow returns an Err when a cast yields arrow-rs/parquet-variant-compute/src/variant_to_arrow.rs Lines 709 to 724 in 980ea0b |
Ah, that's awkward. Doesn't distinguish |
|
Actually, I think we can make it even cleaner than |
scovich
left a comment
There was a problem hiding this comment.
I think the current code is correct. Some thoughts about readability.
| const NULL_VALUES: [NullValue; 3] = [ | ||
| NullValue::TopLevelVariant, | ||
| NullValue::ObjectField, | ||
| NullValue::ArrayElement, | ||
| ]; | ||
|
|
There was a problem hiding this comment.
arrow-avro uses strum_macros crate, whose EnumIter seems relevant? Are we willing to take a dep on that crate tho?
There was a problem hiding this comment.
Looks unnecessary for extra 5loc in tests mod.
Which issue does this PR close?
Rationale for this change
Check issue
What changes are included in this PR?
AppendNullModeenum supporting all semantics.Are these changes tested?
Are there any user-facing changes?