Skip to content

GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183

Open
clee704 wants to merge 5 commits into
apache:mainfrom
clee704:gh-50182-parquet-inf-stats
Open

GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183
clee704 wants to merge 5 commits into
apache:mainfrom
clee704:gh-50182-parquet-inf-stats

Conversation

@clee704

@clee704 clee704 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

When every value in a FLOAT, DOUBLE, or FLOAT16 column is the same infinity, the Parquet writer stored a finite min/max statistic instead of the infinity actually present in the column:

  • all +Infmin written as FLT_MAX / DBL_MAX (should be +Inf)
  • all -Infmax written as -FLT_MAX / -DBL_MAX (should be -Inf)

The running min/max were seeded with std::numeric_limits<T>::max() / ::lowest() — the largest/smallest finite values. These are not identity elements for min/max once the data contains infinities: Min(FLT_MAX, +Inf) keeps FLT_MAX, so an all-+Inf column never displaces the seed (and symmetrically for -Inf). The wrong value is then written with is_min_value_exact = true, so a reader is told the exact minimum of an all-+Inf column is FLT_MAX — a value not even present in the column. Range-based row-group filtering stays correct (the bound is only loosened in the safe direction), but engines that read min/max statistics as values (statistics-based MIN/MAX evaluation, metadata introspection) return wrong results.

See GH-50182 for a full reproduction in both PyArrow and C++.

What changes are included in this PR?

  • CompareHelper::DefaultMin() / DefaultMax() now seed floating-point accumulation with +Inf / -Inf (the true identities) instead of the largest finite values; integral types are unchanged.
  • The same change is applied to the Float16 comparator, via new Float16Constants::positive_infinity() / negative_infinity().
  • The "no valid value was seen" detection in CleanStatistic / CleanFloat16Statistic (which relied on the old finite seeds) is updated to match the new ±Inf seeds. An all-NaN or empty column still leaves min = +Inf, max = -Inf (min > max, which real data can never produce), so it stays unambiguously detectable and continues to emit no statistics.

Are these changes tested?

Yes. A new typed test TestFloatStatistics.Infinities (covering FLOAT, DOUBLE, and FLOAT16) asserts that all-+Inf, all--Inf, and mixed ±Inf columns produce the exact infinity as min/max (the existing AssertMinMaxAre helper also checks is_min_value_exact/is_max_value_exact). The existing all-NaN tests continue to pass, exercising the updated empty/all-NaN detection.

Are there any user-facing changes?

Parquet files written for all-infinity floating-point columns now record the correct ±Inf min/max statistics instead of the largest finite value. No public API signatures change.

Two minor behavior changes on existing public low-level APIs are worth noting (the normal writer accumulation path is unaffected, since a column with ≥1 non-NaN value always yields min ≤ max):

  • TypedComparator<FloatType/DoubleType/Float16LogicalType>::GetMinMax() now returns (+Inf, -Inf) for an all-NaN (or empty) batch, where it previously returned the largest/smallest finite value as a sentinel.
  • TypedStatistics::SetMinMax() now treats any inverted (min > max) input as "no valid statistics", whereas it previously only special-cased that one exact finite sentinel pair.

This PR contains a "Critical Fix". It fixes a bug that produced incorrect data: the writer emitted min/max statistics that are wrong yet flagged exact (is_min_value_exact = true), which can yield incorrect results for readers that evaluate MIN/MAX directly from Parquet statistics.

…ll-infinity float columns

Seed the floating-point and Float16 min/max accumulation with +/-infinity
instead of the largest/smallest finite value. A finite seed is not an identity
for min/max once the data contains infinities, so an all-+Inf column previously
reported min = FLT_MAX/DBL_MAX (and an all--Inf column reported max =
-FLT_MAX/-DBL_MAX) rather than the infinity actually present -- a value written
with is_min_value_exact = true.

Update the empty/all-NaN detection in CleanStatistic and CleanFloat16Statistic
to match the new seeds; an all-NaN or empty column still leaves min = +Inf,
max = -Inf (min > max, impossible for real data) and so emits no statistics.

Add a TestFloatStatistics.Infinities typed test covering FLOAT, DOUBLE and
FLOAT16.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 17:36
@clee704 clee704 requested review from pitrou and wgtmac as code owners June 15, 2026 17:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes Parquet statistics min/max computation for floating-point columns containing infinities (GH-50182), ensuring infinity values are not lost due to finite seeding.

Changes:

  • Seed floating-point running min/max with ±infinity instead of largest/smallest finite values.
  • Update statistic “no valid values” detection to match the new seeding strategy (including Float16).
  • Add unit tests covering all-±Inf and mixed ±Inf columns for float/double/float16.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
cpp/src/parquet/statistics_test.cc Adds regression tests validating min/max behavior for infinity-only and mixed-infinity columns.
cpp/src/parquet/statistics.cc Changes default min/max seeds for floats (and Float16) to ±infinity and adjusts cleaning logic accordingly.

Comment thread cpp/src/parquet/statistics.cc Outdated
Comment thread cpp/src/parquet/statistics.cc
@github-actions github-actions Bot added the awaiting review Awaiting review label Jun 15, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50182 has been automatically assigned in GitHub to PR creator.

…zedMinMax

The empty/all-NaN detection in CleanStatistic / CleanFloat16Statistic now uses a
single IsUninitializedMinMax(min, max) helper (max < min) instead of comparing
against the +/-infinity seeds directly. It relies only on the invariant
DefaultMin > DefaultMax, so the seed and its detection cannot drift.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2026
Switching the Float16 DefaultMin()/DefaultMax() seeds to the +/-infinity
constants left Float16Constants::max()/lowest() (and their backing members)
without any caller. Drop the dead accessors and members.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 18:18

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

clee704 and others added 2 commits June 15, 2026 18:54
The helper tests max < min. "Uninitialized" overstated what it checks: a valid
min/max pair always has min <= max, so max < min is simply an invalid pair --
either the inverted DefaultMin()/DefaultMax() seeds (no value observed) or an
inverted caller-supplied range. Rename to IsInvalidMinMax and reword the comments
so the predicate is definitional rather than implying an "iff uninitialized"
claim.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend TestFloatStatistics.Infinities with [+Inf, NaN, +Inf] and
[-Inf, NaN, -Inf] cases (float/double/Float16). This is the only case that
intersects the new +/-infinity seeds with NaN coalescing on a column whose true
min/max is infinite, and it confirms a NaN does not corrupt the infinite
min/max. Verified to fail on the pre-fix code (min/max came back as the largest
finite value).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 19:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread cpp/src/parquet/statistics.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants