@rynewang rynewang commented Jan 3, 2026

Rationale for this change

Fixes #47995

When merging ByteArray statistics, empty string min/max values were incorrectly discarded. This happened because CleanStatistic() rejected statistics where ptr == nullptr, but empty strings can legitimately have ptr == nullptr with len == 0.

What changes are included in this PR?

Introduces a sentinel pointer (kNoValueSentinel) distinct from nullptr to mark "no value" in ByteArray statistics. This allows CleanStatistic to distinguish between:

  • "no min/max computed" (sentinel)
  • "min/max is empty string" (nullptr with len=0)

FLBA (fixed-length byte array) handling is unchanged: its values have a fixed length, so there is no "empty" value that could be confused with "no value".
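The distinction above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ByteArrayView`, `MinMax`, `CleanStatisticOld`, `CleanStatisticNew`, and the way `kNoValueSentinel` is defined here are all hypothetical stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical stand-in for a ByteArray: a non-owning (len, ptr) view.
struct ByteArrayView {
  uint32_t len = 0;
  const uint8_t* ptr = nullptr;
};

// Assumption for this sketch: the sentinel is the address of a private static
// byte, so it can never equal nullptr or the data pointer of a real value.
inline const uint8_t kNoValueSentinelByte = 0;
inline const uint8_t* const kNoValueSentinel = &kNoValueSentinelByte;

using MinMax = std::pair<ByteArrayView, ByteArrayView>;

// Behaviour described as the bug: any nullptr is read as "no statistics",
// which also throws away a legitimate empty-string min or max.
std::optional<MinMax> CleanStatisticOld(const MinMax& mm) {
  if (mm.first.ptr == nullptr || mm.second.ptr == nullptr) return std::nullopt;
  return mm;
}

// Behaviour described as the fix: only the sentinel means "no statistics";
// (nullptr, len == 0) is kept as a valid empty string.
std::optional<MinMax> CleanStatisticNew(const MinMax& mm) {
  if (mm.first.ptr == kNoValueSentinel || mm.second.ptr == kNoValueSentinel) {
    return std::nullopt;
  }
  return mm;
}
```

With a min of `{0, nullptr}` and a non-empty max, `CleanStatisticOld` returns `std::nullopt` while `CleanStatisticNew` keeps the pair; a pair whose pointers are `kNoValueSentinel` is rejected by `CleanStatisticNew`.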

Are these changes tested?

Yes. Added tests that merge statistics in all combinations of the following states:

  • Empty stats (no min/max)
  • Stats with empty string min ("")
  • Stats with non-empty min
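As a rough sketch of what these combinations exercise, the helper below merges two minimum values under the sentinel convention. It is illustrative only: `ByteArrayView`, `HasValue`, `AsView`, `MergeMin`, and this definition of `kNoValueSentinel` are hypothetical, and the comparison assumes plain bytewise lexicographic order rather than the actual comparator used by the statistics code.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for a ByteArray: a non-owning (len, ptr) view.
struct ByteArrayView {
  uint32_t len = 0;
  const uint8_t* ptr = nullptr;
};

// Assumption for this sketch: the sentinel is the address of a private byte.
inline const uint8_t kNoValueSentinelByte = 0;
inline const uint8_t* const kNoValueSentinel = &kNoValueSentinelByte;

// "No min computed" is signalled by the sentinel, never by nullptr.
bool HasValue(const ByteArrayView& v) { return v.ptr != kNoValueSentinel; }

// View the bytes as a string_view; an empty value maps to an empty view.
std::string_view AsView(const ByteArrayView& v) {
  if (v.len == 0) return std::string_view{};
  return std::string_view(reinterpret_cast<const char*>(v.ptr), v.len);
}

// Merge two minimums: a side without a value contributes nothing,
// otherwise the lexicographically smaller value wins.
ByteArrayView MergeMin(const ByteArrayView& a, const ByteArrayView& b) {
  if (!HasValue(a)) return b;
  if (!HasValue(b)) return a;
  return AsView(a) <= AsView(b) ? a : b;
}
```

Under this convention, merging "no min" with an empty-string min yields the empty string rather than "no min", and merging an empty-string min with any non-empty min also yields the empty string, in either argument order.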

Are there any user-facing changes?

No API changes. This is a bug fix that preserves empty string statistics correctly during merge operations.

…ing lost during merge

Prior to this change, CleanStatistic() for ByteArray rejected statistics
where ptr == nullptr. However, empty strings can have ptr == nullptr with
len == 0, causing valid statistics to be discarded when the minimum value
is an empty string.

The fix introduces a sentinel pointer (kNoValueSentinel) distinct from
nullptr to mark "no value" in ByteArray statistics. This allows
CleanStatistic to distinguish between "no min/max computed" (sentinel)
and "min/max is empty string" (nullptr with len=0).

FLBA is unchanged since it has fixed length and no "empty" concept.

@rynewang rynewang requested a review from wgtmac as a code owner January 3, 2026 22:19

github-actions bot commented Jan 3, 2026

⚠️ GitHub issue #47995 has been automatically assigned in GitHub to PR creator.


Development

Successfully merging this pull request may close these issues.

[C++][Parquet] MinMax statistics for strings may be inaccurate after a merge
