Skip to content

GH-48210: [C++][Parquet] Fix Bloom Filter logic to enable Parquet DB support on s390x#48211

Open
Vishwanatha-HD wants to merge 1 commit into
apache:mainfrom
Vishwanatha-HD:fixBloomFilters
Open

GH-48210: [C++][Parquet] Fix Bloom Filter logic to enable Parquet DB support on s390x#48211
Vishwanatha-HD wants to merge 1 commit into
apache:mainfrom
Vishwanatha-HD:fixBloomFilters

Conversation

@Vishwanatha-HD

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Copy link
Copy Markdown
Contributor

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes the Bloom Filter logic.

What changes are included in this PR?

The fix includes changes to following file:
cpp/src/parquet/bloom_filter.cc

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #48210 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-48210 Fix Bloom Filter logic to enable Parquet DB support on s390x GH-48210: [C++][Parquet] Fix Bloom Filter logic to enable Parquet DB support on s390x Nov 22, 2025
@k8ika0s

k8ika0s commented Nov 23, 2025

Copy link
Copy Markdown

@Vishwanatha-HD

Bloom filters are one of those parts of Parquet where tiny byte-order details end up mattering way more than you’d expect, so it’s good to see attention landing here.

Something I ran into on s390x is that the xxhash input/output tends to stay a lot more predictable if the bitset words are kept in a single canonical order (LE in our case) and the reader/writer treat them as such. In my own experiments I normalized the bitset once at the boundary and let the rest of the logic operate on native values.

In this patch, the per-word FromLittleEndian/ToLittleEndian inside the find/insert loops definitely keeps things correct, though it does create a slightly tighter coupling between the hashing logic and the byte-swapping. I only mention it because it can sometimes show up in profiling when bloom filters are exercised heavily over wide row groups.

Not calling this out as a problem — the behavior you’re targeting here lines up with what I’ve seen on s390x, especially around making sure the mask checks behave the same across BE/LE hosts. Just sharing observations in case it’s useful while these pieces get polished.

@Vishwanatha-HD

Copy link
Copy Markdown
Contributor Author

@pitrou @kou @kiszk @zanmato1984
Hi All,
I know its been long time since I have my PRs opened.. Can you please help me with review and merging to the upstream. Is there anything more that you people want me to do it, I would be happy to work on it. Please suggest.
I have verified that with my fix all the 113 testcase passes without any issues.
Thanks.

@kou kou requested a review from Copilot June 3, 2026 21:41
@kou

kou commented Jun 3, 2026

Copy link
Copy Markdown
Member

Could you rebase on main?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes Parquet Bloom Filter bitset access so Bloom filters are interpreted/stored in little-endian word order, enabling correct behavior on big-endian architectures (notably s390x) while preserving existing behavior on little-endian systems.

Changes:

  • Read Bloom filter bitset words using arrow::bit_util::FromLittleEndian() during lookups.
  • Write Bloom filter bitset words using a read/modify/write cycle with FromLittleEndian() / ToLittleEndian() during inserts.
  • Add the required Arrow endian utilities include.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

uint32_t* raw_bitset32 = reinterpret_cast<uint32_t*>(data_->mutable_data());

for (int i = 0; i < kBitsSetPerBlock; i++) {
const int word_index = bucket_index * kBitsSetPerBlock + i;
@Vishwanatha-HD

Copy link
Copy Markdown
Contributor Author

Could you rebase on main?

Hi @kou..
I have rebased this PR onto the main. Please check.. Thanks..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants