GH-38558: [C++] Add support for null sort option per sort key#46926

Open
taepper wants to merge 101 commits into apache:main from taepper:GH-38558

Conversation

@taepper

@taepper taepper commented Jun 27, 2025

See #38584 for the original PR, whose description is quoted below.

Rationale for this change

Support null placement per sort key when sorting on multiple keys, for example:

order by i nulls first, j, k nulls first;

Null placement can currently only be set for all sort keys at once, not for an individual sort key, so NullPlacement is extended to be a field of SortKey. Because the underlying framework is well structured, the change mostly consists of passing the null_placement of each SortKey through to the sorting logic.
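To make the idea concrete, here is a minimal standalone sketch of a multi-key comparison in which every key carries its own null placement. It does not use Arrow at all: `KeySpec`, `Row`, `RowLess` and `SortRows` are hypothetical names invented for this illustration, and the local `NullPlacement` enum is only a stand-in for the Arrow one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the Arrow types.
enum class NullPlacement { AtStart, AtEnd };

struct KeySpec {
  std::size_t column;       // index of the column to compare
  NullPlacement placement;  // where nulls of this column are placed
};

using Row = std::vector<std::optional<int>>;

// Returns true if row `a` sorts before row `b` (ascending on every key).
bool RowLess(const Row& a, const Row& b, const std::vector<KeySpec>& keys) {
  for (const KeySpec& key : keys) {
    assert(key.column < a.size() && key.column < b.size());
    const std::optional<int>& x = a[key.column];
    const std::optional<int>& y = b[key.column];
    if (!x && !y) continue;  // both null: tie, move on to the next key
    if (!x) return key.placement == NullPlacement::AtStart;
    if (!y) return key.placement == NullPlacement::AtEnd;
    if (*x != *y) return *x < *y;
  }
  return false;  // equal on all keys
}

// Stable sort of the rows, honoring the null placement of each key.
std::vector<Row> SortRows(std::vector<Row> rows, const std::vector<KeySpec>& keys) {
  std::stable_sort(rows.begin(), rows.end(),
                   [&keys](const Row& a, const Row& b) { return RowLess(a, b, keys); });
  return rows;
}
```

With keys `{0, NullPlacement::AtStart}` and `{1, NullPlacement::AtEnd}`, rows whose first column is null come first, while among rows with equal first column a null in the second column goes last.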

What changes are included in this PR?

1. SortKey structure, NullPlacement transfer logic, sorting logic, Ordering, and the related tests.
2. Substrait-related changes.
3. c_glib-related changes.
4. SelectK-related changes.
5. RankOptions-related changes.

Are these changes tested?

Yes, I changed the tests in vector_sort_test.cc and performed additional tests.

Are there any user-facing changes?

Yes. Null placement can now be specified individually for each of multiple sort keys, as in PostgreSQL.

This PR includes breaking changes to public APIs.

I amended the original PR to be less breaking in public APIs.

Still, Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a std::optional<NullPlacement> instead of a NullPlacement, which led to some changes in downstream APIs and bindings. I also need some help with fixing the c_glib bindings.

Light-City and others added 30 commits November 9, 2023 09:57
1. Reconstruct the SortKey structure and add NullPlacement.

2. Remove NullPlacement from SortOptions.

3. Fix selectk not displaying non-empty results in null AtEnd scenario.

When limit k is greater than the actual table data and the table contains Null/NaN, the data cannot be obtained and only non-empty results are available.
Therefore, we support returning non-null and supporting the order of setting Null for each SortKey.

4. Add relevant unit tests and change the interface implemented by multiple versions.
…8558

# Conflicts:
#	c_glib/arrow-glib/compute.cpp
#	c_glib/arrow-glib/compute.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	python/pyarrow/_acero.pyx
#	python/pyarrow/_compute.pyx
#	python/pyarrow/array.pxi
#	python/pyarrow/tests/test_compute.py
#	python/pyarrow/tests/test_table.py
# Conflicts:
#	cpp/src/arrow/compute/api_vector.cc
#	cpp/src/arrow/compute/api_vector.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	cpp/src/arrow/compute/kernels/vector_sort_test.cc
#	cpp/src/arrow/compute/ordering.cc
#	cpp/src/arrow/compute/ordering.h
@taepper

taepper commented Mar 10, 2026

Author

bump @pitrou

@taepper

taepper commented Apr 30, 2026

Author

Should I remove the changes to select_k and only keep the interface changes in this PR?

@kou

kou commented May 21, 2026

Member

@pitrou Do you want to review this?

@pitrou

pitrou commented Jun 11, 2026

Member

@taepper I get this error when compiling with gcc 15.2.0:

In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/api.h:33,
                 from /home/antoine/arrow/dev/cpp/src/arrow/array/array_dict.cc:33:
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h: In member function 'arrow::compute::Ordering arrow::compute::SortOptions::AsOrdering() &&':
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:123:58: error: converting to 'arrow::compute::Ordering' from initializer list would use explicit constructor 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)'
  123 |   Ordering AsOrdering() && { return {std::move(sort_keys)}; }
      |                                                          ^
In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:24:
/home/antoine/arrow/dev/cpp/src/arrow/compute/ordering.h:66:12: note: 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)' declared here
   66 |   explicit Ordering(std::vector<SortKey> sort_keys) : sort_keys_(std::move(sort_keys)) {}
      |            ^~~~~~~~
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h: In member function 'arrow::compute::Ordering arrow::compute::SortOptions::AsOrdering() const &':
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:124:51: error: converting to 'arrow::compute::Ordering' from initializer list would use explicit constructor 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)'
  124 |   Ordering AsOrdering() const& { return {sort_keys}; }
      |                                                   ^
/home/antoine/arrow/dev/cpp/src/arrow/compute/ordering.h:66:12: note: 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)' declared here
   66 |   explicit Ordering(std::vector<SortKey> sort_keys) : sort_keys_(std::move(sort_keys)) {}
      |            ^~~~~~~~
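For reference, the error can be reproduced outside of Arrow: `return {x};` is copy-list-initialization, which is not allowed to select an `explicit` constructor. The sketch below uses simplified stand-ins for `SortKey`, `Ordering` and `SortOptions` (not the real classes), and naming the return type explicitly is only one possible way to fix it, not necessarily the one this PR should take.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; not the real Arrow classes.
struct SortKey {
  std::string name;
};

class Ordering {
 public:
  explicit Ordering(std::vector<SortKey> sort_keys) : sort_keys_(std::move(sort_keys)) {}
  const std::vector<SortKey>& sort_keys() const { return sort_keys_; }

 private:
  std::vector<SortKey> sort_keys_;
};

// The explicit constructor allows direct initialization but no implicit conversion.
static_assert(std::is_constructible_v<Ordering, std::vector<SortKey>>);
static_assert(!std::is_convertible_v<std::vector<SortKey>, Ordering>);

struct SortOptions {
  std::vector<SortKey> sort_keys;

  // Rejected by the compiler, because the braces request copy-list-initialization:
  //   Ordering AsOrdering() const& { return {sort_keys}; }

  // Accepted: naming the type makes this a direct initialization.
  Ordering AsOrdering() const& { return Ordering(sort_keys); }
  Ordering AsOrdering() && { return Ordering(std::move(sort_keys)); }
};

// Small usage check that exercises both overloads.
inline void CheckAsOrdering() {
  SortOptions options{{SortKey{"i"}, SortKey{"j"}}};
  assert(options.AsOrdering().sort_keys().size() == 2);             // const& overload
  assert(std::move(options).AsOrdering().sort_keys().size() == 2);  // && overload
}
```

Dropping `explicit` from the constructor would also make the original code compile, at the cost of allowing implicit conversions from a vector of sort keys.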

@pitrou pitrou left a comment

Member

Sorry for the delay, @taepper. This is looking very good; here are a bunch of comments.

Comment thread cpp/src/arrow/compute/api_vector.h Outdated
Comment thread cpp/src/arrow/compute/api_vector.h
Comment thread cpp/src/arrow/compute/ordering.h
Comment thread python/pyarrow/_acero.pyx Outdated
Comment thread python/pyarrow/_acero.pyx
Comment thread cpp/src/arrow/compute/kernels/vector_select_k.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/vector_select_k.cc
Comment thread cpp/src/arrow/compute/kernels/vector_select_k.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/vector_select_k.cc Outdated
Comment on lines +347 to +350
int64_t null_count = chunked_array_.null_count();
int64_t nan_count = ComputeNanCount<InType>();
int64_t non_null_like_count = chunked_array_.length() - null_count - nan_count;

Member

Instead of having to separately compute the NaN count, can we call PartitionNullsAndNans on all chunks in advance, and then just sum the resulting NaN counts?
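A rough standalone sketch of that idea, under stated assumptions: each chunk is partitioned once into non-null-like values, NaNs and nulls, and the totals are then obtained by summing the per-chunk counts. Nothing here is Arrow code. `Chunk`, `ChunkPartition`, `PartitionChunk` and `SumCounts` are hypothetical names, and the layout (NaNs before nulls, both at the end) is just one choice made for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in: a chunk is a vector of nullable doubles.
using Chunk = std::vector<std::optional<double>>;

struct ChunkPartition {
  std::vector<int64_t> indices;  // non-null-like values first, then NaNs, then nulls
  int64_t non_null_like_count = 0;
  int64_t nan_count = 0;
  int64_t null_count = 0;
};

// Partitions the indices of one chunk and records the size of each part.
ChunkPartition PartitionChunk(const Chunk& chunk) {
  ChunkPartition out;
  out.indices.resize(chunk.size());
  std::iota(out.indices.begin(), out.indices.end(), int64_t{0});
  auto nulls_begin = std::stable_partition(
      out.indices.begin(), out.indices.end(),
      [&chunk](int64_t i) { return chunk[static_cast<std::size_t>(i)].has_value(); });
  auto nans_begin = std::stable_partition(
      out.indices.begin(), nulls_begin,
      [&chunk](int64_t i) { return !std::isnan(*chunk[static_cast<std::size_t>(i)]); });
  out.non_null_like_count = nans_begin - out.indices.begin();
  out.nan_count = nulls_begin - nans_begin;
  out.null_count = out.indices.end() - nulls_begin;
  assert(out.non_null_like_count + out.nan_count + out.null_count ==
         static_cast<int64_t>(chunk.size()));
  return out;
}

struct Totals {
  int64_t non_null_like_count = 0;
  int64_t nan_count = 0;
  int64_t null_count = 0;
};

// Sums the per-chunk counts, so no separate pass is needed to count NaNs.
Totals SumCounts(const std::vector<Chunk>& chunks) {
  Totals totals;
  for (const Chunk& chunk : chunks) {
    const ChunkPartition part = PartitionChunk(chunk);
    totals.non_null_like_count += part.non_null_like_count;
    totals.nan_count += part.nan_count;
    totals.null_count += part.null_count;
  }
  return totals;
}
```

In real code the per-chunk partitions would be kept around for the subsequent selection step rather than discarded as `SumCounts` does here.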

Author

That sounds like a better way! I will refactor the method

Author

This indeed worked out quite nicely! The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?

And I hope that calling the individual parts of a ChunkedArray "chunks" rather than "arrays" is consistent. At least the other code I checked also referred to the parts as chunks.

Member

The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?

I don't think we care in this PR. Otherwise we have CompressedChunkLocation that can make things more compact.

@taepper

taepper commented Jun 12, 2026

Author

Thank you for the review. I was not sure regarding the introduction of std::span usage, which I find very succinct.

For example, also this std::ranges:: based for-loop replacement has now been detected by my local clangd:

for (uint64_t& reverse_out_iter : std::ranges::reverse_view(output.non_null_like_range))
{...}

instead of

for (auto reverse_out_iter = output.non_null_like_range.rbegin();
     reverse_out_iter != output.non_null_like_range.rend(); reverse_out_iter++) {

Feel free to let me know if these code-style preferences are not wanted

@taepper taepper requested a review from pitrou June 12, 2026 17:09
@taepper

taepper commented Jun 15, 2026

Author

Of course I will rebase onto main and update the code deprecation references from 24.0.0 to 25.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.

@pitrou

pitrou commented Jun 15, 2026

Member

Thank you for the review. I was not sure regarding the introduction of std::span usage, which I find very succinct.

For example, also this std::ranges:: based for-loop replacement has now been detected by my local clangd:

Oh, yes, these are nice additions, thank you.


def __init__(self, sort_keys="ascending", *, null_placement="at_end", tiebreaker="first"):
def __init__(self, sort_keys="ascending", *, null_placement=None, tiebreaker="first"):
self._set_options(sort_keys, null_placement, tiebreaker)

Member

I don't think so (except fixing the advertised deprecation version, sorry!)

input_indices_begin, input_indices_end, arr, 0,
first_remaining_sort_key.null_placement);

// From k = output_range.size(), calculate

Member

Nit

Suggested change
// From k = output_range.size(), calculate
// From k = output_indices.size(), calculate

@pitrou

pitrou commented Jun 15, 2026

Member

Of course I will rebase onto main and update the code deprecation references from 24.0.0 to 25.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.

Actually, please rebase now, I think it will make local testing easier.

@taepper

taepper commented Jun 15, 2026

Author

Of course I will rebase onto main and update the code deprecation references from 24.0.0 to 25.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.

Actually, please rebase now, I think it will make local testing easier.

Okay, the merge actually had no conflicts, so my wariness was not necessary

6 participants