GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183
Open
clee704 wants to merge 5 commits into
Open
GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183clee704 wants to merge 5 commits into
clee704 wants to merge 5 commits into
Conversation
…ll-infinity float columns Seed the floating-point and Float16 min/max accumulation with +/-infinity instead of the largest/smallest finite value. A finite seed is not an identity for min/max once the data contains infinities, so an all-+Inf column previously reported min = FLT_MAX/DBL_MAX (and an all--Inf column reported max = -FLT_MAX/-DBL_MAX) rather than the infinity actually present -- a value written with is_min_value_exact = true. Update the empty/all-NaN detection in CleanStatistic and CleanFloat16Statistic to match the new seeds; an all-NaN or empty column still leaves min = +Inf, max = -Inf (min > max, impossible for real data) and so emits no statistics. Add a TestFloatStatistics.Infinities typed test covering FLOAT, DOUBLE and FLOAT16. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes Parquet statistics min/max computation for floating-point columns containing infinities (GH-50182), ensuring infinity values are not lost due to finite seeding.
Changes:
- Seed floating-point running min/max with ±infinity instead of largest/smallest finite values.
- Update statistic “no valid values” detection to match the new seeding strategy (including Float16).
- Add unit tests covering all-±Inf and mixed ±Inf columns for float/double/float16.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cpp/src/parquet/statistics_test.cc | Adds regression tests validating min/max behavior for infinity-only and mixed-infinity columns. |
| cpp/src/parquet/statistics.cc | Changes default min/max seeds for floats (and Float16) to ±infinity and adjusts cleaning logic accordingly. |
|
|
…zedMinMax The empty/all-NaN detection in CleanStatistic / CleanFloat16Statistic now uses a single IsUninitializedMinMax(min, max) helper (max < min) instead of comparing against the +/-infinity seeds directly. It relies only on the invariant DefaultMin > DefaultMax, so the seed and its detection cannot drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switching the Float16 DefaultMin()/DefaultMax() seeds to the +/-infinity constants left Float16Constants::max()/lowest() (and their backing members) without any caller. Drop the dead accessors and members. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The helper tests max < min. "Uninitialized" overstated what it checks: a valid min/max pair always has min <= max, so max < min is simply an invalid pair -- either the inverted DefaultMin()/DefaultMax() seeds (no value observed) or an inverted caller-supplied range. Rename to IsInvalidMinMax and reword the comments so the predicate is definitional rather than implying an "iff uninitialized" claim. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend TestFloatStatistics.Infinities with [+Inf, NaN, +Inf] and [-Inf, NaN, -Inf] cases (float/double/Float16). This is the only case that intersects the new +/-infinity seeds with NaN coalescing on a column whose true min/max is infinite, and it confirms a NaN does not corrupt the infinite min/max. Verified to fail on the pre-fix code (min/max came back as the largest finite value). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
When every value in a
FLOAT,DOUBLE, orFLOAT16column is the same infinity, the Parquet writer stored a finite min/max statistic instead of the infinity actually present in the column:+Inf→minwritten asFLT_MAX/DBL_MAX(should be+Inf)-Inf→maxwritten as-FLT_MAX/-DBL_MAX(should be-Inf)The running min/max were seeded with
std::numeric_limits<T>::max()/::lowest()— the largest/smallest finite values. These are not identity elements for min/max once the data contains infinities:Min(FLT_MAX, +Inf)keepsFLT_MAX, so an all-+Infcolumn never displaces the seed (and symmetrically for-Inf). The wrong value is then written withis_min_value_exact = true, so a reader is told the exact minimum of an all-+Infcolumn isFLT_MAX— a value not even present in the column. Range-based row-group filtering stays correct (the bound is only loosened in the safe direction), but engines that read min/max statistics as values (statistics-basedMIN/MAXevaluation, metadata introspection) return wrong results.See GH-50182 for a full reproduction in both PyArrow and C++.
What changes are included in this PR?
CompareHelper::DefaultMin()/DefaultMax()now seed floating-point accumulation with+Inf/-Inf(the true identities) instead of the largest finite values; integral types are unchanged.Float16comparator, via newFloat16Constants::positive_infinity()/negative_infinity().CleanStatistic/CleanFloat16Statistic(which relied on the old finite seeds) is updated to match the new±Infseeds. An all-NaN or empty column still leavesmin = +Inf, max = -Inf(min > max, which real data can never produce), so it stays unambiguously detectable and continues to emit no statistics.Are these changes tested?
Yes. A new typed test
TestFloatStatistics.Infinities(coveringFLOAT,DOUBLE, andFLOAT16) asserts that all-+Inf, all--Inf, and mixed±Infcolumns produce the exact infinity as min/max (the existingAssertMinMaxArehelper also checksis_min_value_exact/is_max_value_exact). The existing all-NaN tests continue to pass, exercising the updated empty/all-NaN detection.Are there any user-facing changes?
Parquet files written for all-infinity floating-point columns now record the correct
±Infmin/max statistics instead of the largest finite value. No public API signatures change.Two minor behavior changes on existing public low-level APIs are worth noting (the normal writer accumulation path is unaffected, since a column with ≥1 non-NaN value always yields
min ≤ max):TypedComparator<FloatType/DoubleType/Float16LogicalType>::GetMinMax()now returns(+Inf, -Inf)for an all-NaN (or empty) batch, where it previously returned the largest/smallest finite value as a sentinel.TypedStatistics::SetMinMax()now treats any inverted (min > max) input as "no valid statistics", whereas it previously only special-cased that one exact finite sentinel pair.This PR contains a "Critical Fix". It fixes a bug that produced incorrect data: the writer emitted min/max statistics that are wrong yet flagged exact (
is_min_value_exact = true), which can yield incorrect results for readers that evaluateMIN/MAXdirectly from Parquet statistics.