GH-38558: [C++] Add support for null sort option per sort key by taepper · Pull Request #46926 · apache/arrow

taepper · 2025-06-27T12:12:56Z

See #38584 for the original PR. Its description is quoted below for this PR.

Rationale for this change

Support a separate nulls-first setting for each of multiple sort keys, for example:

order by i nulls first, j, k nulls first;

Null placement can currently only be set for all sort keys at once, not for an individual sort key, so NullPlacement is added as a field of SortKey. Because the underlying framework is very well written, the change only requires passing each SortKey's null_placement through to the sorting logic.
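A minimal sketch of what per-key null placement means for a comparator, written against plain standard-library types rather than Arrow. `Row`, `Key`, `CompareNullable`, and `SortRows` are hypothetical names for illustration only, and the local `NullPlacement` enum is a stand-in, not the Arrow type.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Arrow types.
enum class NullPlacement { AtStart, AtEnd };

struct Row {
  std::optional<int> i, j, k;
};

struct Key {
  std::optional<int> Row::* column;  // which column this key compares
  NullPlacement null_placement;      // placement of nulls for this key only
};

// Three-way comparison of two nullable values under one key's null placement.
inline int CompareNullable(const std::optional<int>& a, const std::optional<int>& b,
                           NullPlacement placement) {
  if (!a && !b) return 0;
  if (!a) return placement == NullPlacement::AtStart ? -1 : 1;
  if (!b) return placement == NullPlacement::AtStart ? 1 : -1;
  return (*a < *b) ? -1 : (*a > *b) ? 1 : 0;
}

// Ascending multi-key sort in which every key carries its own null placement.
inline void SortRows(std::vector<Row>& rows, const std::vector<Key>& keys) {
  std::stable_sort(rows.begin(), rows.end(), [&](const Row& l, const Row& r) {
    for (const Key& key : keys) {
      const int c =
          CompareNullable(l.*(key.column), r.*(key.column), key.null_placement);
      if (c != 0) return c < 0;
    }
    return false;  // equal on all keys
  });
}
```

With keys `{&Row::i, AtStart}, {&Row::j, AtEnd}, {&Row::k, AtStart}` this corresponds to `order by i nulls first, j, k nulls first`, assuming nulls sort last for `j` when nothing is specified.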

What changes are included in this PR?

1. SortKey structure, NullPlacement propagation, sorting logic, Ordering, and the related tests.
2. Substrait-related changes.
3. c_glib-related changes.
4. SelectK-related changes.
5. RankOptions-related changes.

Are these changes tested?

Yes. I changed the tests in vector_sort_test.cc and performed additional tests.

Are there any user-facing changes?

Yes. Null ordering can now be specified for each of multiple sort keys, as in the pg (PostgreSQL) database.

This PR includes breaking changes to public APIs.

I amended the original PR to be less breaking in public APIs.

Still, Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a std::optional<NullPlacement> instead of a NullPlacement, which led to some changes in downstream APIs and bindings. ~~I also need some help with fixing the c_glib bindings.~~
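One way to read the switch to std::optional<NullPlacement> is that an unset value means "no option-level setting, use each sort key's own placement". The sketch below shows that reading only. The precedence (a set option-level value overrides the key) is an assumption that this description does not confirm, and `EffectivePlacement` is a hypothetical helper, not an Arrow function.

```cpp
#include <optional>

// Hypothetical stand-in for illustration; NOT the Arrow enum.
enum class NullPlacement { AtStart, AtEnd };

// Assumed precedence (not confirmed by the PR description): an explicitly set
// option-level placement wins, otherwise the key's own placement is used.
inline NullPlacement EffectivePlacement(std::optional<NullPlacement> option_level,
                                        NullPlacement key_level) {
  return option_level.value_or(key_level);
}
```

Under this reading, callers that never set the option-level value get purely per-key behavior, and callers that still set it keep a single placement for all keys.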

GitHub Issue: [C++][Compute] Fix null sorting of multiple sort keys #38558

1. Restructure the SortKey structure and add NullPlacement.
2. Remove NullPlacement from SortOptions.
3. Fix SelectK not displaying non-empty results in the null AtEnd scenario. When the limit k is greater than the actual table data and the table contains Null/NaN, the data cannot be obtained and only non-empty results are available. Therefore, we support returning non-null and supporting the order of setting Null for each SortKey.
4. Add relevant unit tests and change the interface implemented by multiple versions.

…8558 # Conflicts: # c_glib/arrow-glib/compute.cpp # c_glib/arrow-glib/compute.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # python/pyarrow/_acero.pyx # python/pyarrow/_compute.pyx # python/pyarrow/array.pxi # python/pyarrow/tests/test_compute.py # python/pyarrow/tests/test_table.py

# Conflicts: # cpp/src/arrow/compute/api_vector.cc # cpp/src/arrow/compute/api_vector.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # cpp/src/arrow/compute/kernels/vector_sort_test.cc # cpp/src/arrow/compute/ordering.cc # cpp/src/arrow/compute/ordering.h

…most likely human-error while merging)

taepper · 2026-03-10T10:54:03Z

bump @pitrou

taepper · 2026-04-30T07:58:15Z

Should I remove the changes to select_k and only keep the interface changes in this PR?

kou · 2026-05-21T02:04:59Z

@pitrou Do you want to review this?

pitrou · 2026-06-11T14:53:59Z

@taepper I get this error when compiling with gcc 15.2.0:

In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/api.h:33,
                 from /home/antoine/arrow/dev/cpp/src/arrow/array/array_dict.cc:33:
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h: In member function 'arrow::compute::Ordering arrow::compute::SortOptions::AsOrdering() &&':
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:123:58: error: converting to 'arrow::compute::Ordering' from initializer list would use explicit constructor 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)'
  123 |   Ordering AsOrdering() && { return {std::move(sort_keys)}; }
      |                                                          ^
In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:24:
/home/antoine/arrow/dev/cpp/src/arrow/compute/ordering.h:66:12: note: 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)' declared here
   66 |   explicit Ordering(std::vector<SortKey> sort_keys) : sort_keys_(std::move(sort_keys)) {}
      |            ^~~~~~~~
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h: In member function 'arrow::compute::Ordering arrow::compute::SortOptions::AsOrdering() const &':
/home/antoine/arrow/dev/cpp/src/arrow/compute/api_vector.h:124:51: error: converting to 'arrow::compute::Ordering' from initializer list would use explicit constructor 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)'
  124 |   Ordering AsOrdering() const& { return {sort_keys}; }
      |                                                   ^
/home/antoine/arrow/dev/cpp/src/arrow/compute/ordering.h:66:12: note: 'arrow::compute::Ordering::Ordering(std::vector<arrow::compute::SortKey>)' declared here
   66 |   explicit Ordering(std::vector<SortKey> sort_keys) : sort_keys_(std::move(sort_keys)) {}
      |            ^~~~~~~~
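The error above comes from `return {…};`, which is copy-list-initialization and therefore cannot use an `explicit` constructor. A minimal sketch of one possible fix, using simplified stand-in types that only mirror the shapes named in the compiler output (they are not the Arrow definitions): name the return type so the constructor call is explicit.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins mirroring the names in the error message;
// these are NOT the Arrow definitions.
struct SortKey {
  std::string name;
};

class Ordering {
 public:
  explicit Ordering(std::vector<SortKey> sort_keys)
      : sort_keys_(std::move(sort_keys)) {}
  const std::vector<SortKey>& sort_keys() const { return sort_keys_; }

 private:
  std::vector<SortKey> sort_keys_;
};

struct SortOptions {
  std::vector<SortKey> sort_keys;

  // `return {std::move(sort_keys)};` would be rejected here because it would
  // need the explicit constructor. Spelling out `Ordering(...)` is allowed.
  Ordering AsOrdering() && { return Ordering(std::move(sort_keys)); }
  Ordering AsOrdering() const& { return Ordering(sort_keys); }
};
```

The alternative would be to drop `explicit` from the constructor, at the cost of allowing implicit conversions from a vector of sort keys everywhere.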

pitrou

Sorry for the delay @taepper . This is looking very good, here are a bunch of comments.

pitrou · 2026-06-11T15:48:09Z

+    int64_t null_count = chunked_array_.null_count();
+    int64_t nan_count = ComputeNanCount<InType>();
+    int64_t non_null_like_count = chunked_array_.length() - null_count - nan_count;
+


Instead of having to separately compute the NaN count, can we call PartitionNullsAndNans on all chunks in advance, and then just sum the resulting NaN counts?

That sounds like a better way! I will refactor the method

This indeed worked out quite nicely! The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?

And I hope naming the individual parts of a ChunkedArray chunks and not arrays is consistent. At least other code I checked also referred to the parts as chunks

> The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?

I don't think we care in this PR. Otherwise we have CompressedChunkLocation that can make things more compact.
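The idea raised at the start of this thread, partitioning every chunk once and then summing the per-chunk NaN counts instead of counting NaNs in a separate pass, can be sketched with standard-library types. `Chunk`, `ChunkPartition`, `PartitionChunk`, and `TotalNanCount` are hypothetical names for illustration; the sketch does not use or reproduce Arrow's PartitionNullsAndNans.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in: a "chunk" is a vector of nullable doubles.
using Chunk = std::vector<std::optional<double>>;

struct ChunkPartition {
  std::vector<int64_t> indices;  // layout: [ordinary values | NaNs | nulls]
  int64_t nan_count = 0;
  int64_t null_count = 0;
};

// Partition one chunk's indices: ordinary values first, then NaNs, then nulls.
inline ChunkPartition PartitionChunk(const Chunk& chunk) {
  ChunkPartition out;
  out.indices.resize(chunk.size());
  std::iota(out.indices.begin(), out.indices.end(), int64_t{0});
  // Move nulls to the end of the index range.
  auto nulls_begin = std::stable_partition(
      out.indices.begin(), out.indices.end(), [&](int64_t i) {
        return chunk[static_cast<std::size_t>(i)].has_value();
      });
  // Within the non-null part, move NaNs behind the ordinary values.
  auto nans_begin = std::stable_partition(
      out.indices.begin(), nulls_begin, [&](int64_t i) {
        return !std::isnan(*chunk[static_cast<std::size_t>(i)]);
      });
  out.null_count = static_cast<int64_t>(out.indices.end() - nulls_begin);
  out.nan_count = static_cast<int64_t>(nulls_begin - nans_begin);
  return out;
}

// The NaN total falls out of the partitions; no extra scan of the data.
inline int64_t TotalNanCount(const std::vector<ChunkPartition>& partitions) {
  int64_t total = 0;
  for (const ChunkPartition& p : partitions) total += p.nan_count;
  return total;
}
```

Keeping the indices and counts together per chunk is one way to hold the temporary state discussed above; whether that is worth a dedicated struct in the real code is left open in the thread.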

taepper · 2026-06-12T16:23:58Z

Thank you for the review. I was not sure regarding the introduction of std::span usage, which I find very succinct.

For example, also this std::ranges:: based for-loop replacement has now been detected by my local clangd:

for (uint64_t& reverse_out_iter : std::ranges::reverse_view(output.non_null_like_range))
{...}

instead of

for (auto reverse_out_iter = output.non_null_like_range.rbegin();
     reverse_out_iter != output.non_null_like_range.rend(); reverse_out_iter++) {

Feel free to let me know if these code-style preferences are not wanted

taepper · 2026-06-15T13:20:31Z

Of course I will rebase onto main and update the code deprecation references to 25.0.0 from 24.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.

pitrou · 2026-06-15T13:49:32Z

> Thank you for the review. I was not sure regarding the introduction of std::span usage, which I find very succinct.
>
> For example, also this std::ranges:: based for-loop replacement has now been detected by my local clangd:

Oh, yes, these are nice additions, thank you.

pitrou · 2026-06-15T12:54:22Z


-    def __init__(self, sort_keys="ascending", *, null_placement="at_end", tiebreaker="first"):
+    def __init__(self, sort_keys="ascending", *, null_placement=None, tiebreaker="first"):
        self._set_options(sort_keys, null_placement, tiebreaker)


I don't think so (except fixing the advertised deprecation version, sorry!)

pitrou · 2026-06-15T13:42:37Z

+          input_indices_begin, input_indices_end, arr, 0,
+          first_remaining_sort_key.null_placement);
+
+      // From k = output_range.size(), calculate


Nit

Suggested change

// From k = output_range.size(), calculate

// From k = output_indices.size(), calculate

pitrou · 2026-06-15T13:49:07Z

+    int64_t null_count = chunked_array_.null_count();
+    int64_t nan_count = ComputeNanCount<InType>();
+    int64_t non_null_like_count = chunked_array_.length() - null_count - nan_count;
+


> The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?

I don't think we care in this PR. Otherwise we have CompressedChunkLocation that can make things more compact.

pitrou · 2026-06-15T13:53:00Z

> Of course I will rebase onto main and update the code deprecation references to 25.0.0 from 24.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.

Actually, please rebase now, I think it will make local testing easier.

taepper · 2026-06-15T14:39:20Z

> > Of course I will rebase onto main and update the code deprecation references to 25.0.0 from 24.0.0, but I did not want to introduce noise to the history for now. When you are okay with the approach, I will rebase.
>
> Actually, please rebase now, I think it will make local testing easier.

Okay, the merge actually had no conflicts, so my wariness was not necessary

Light-City and others added 30 commits November 9, 2023 09:57

fix test

d8bca7c

fix sortkey assert

970f5bf

fix serde test

6c0bc76

better api

25984c5

merge follow-up

dafbe14

merge follow-up

48a3e0c

formatting

5c9eb50

formatting

12d1b0d

missing unwrap_null_placement in python

0b442f7

do not remove demoted null_placement from python api

4398d16

fix member name in hash_aggregate_test

bdc4069

update ToString method output

8fc630c

format remove extra empty line

dcb3b7a

fix python interface

682a37a

python formatting

e6afb8a

do not pass None to python C binding

82299bc

format python

370cdac

fix minor python api mistakes

503a7d1

python formatting

1d0883e

do not rename member of exposed API struct

7833add

do not rename member of exposed API struct, missed test file

6e2759b

format cc file

0a029ca

updating api that was not using additional argument for some reason (…

3463a59

…most likely human-error while merging)

make null_placement optional in the python acero api

e01a009

minor additional fixes

794a2cd

amend c_glib api to use std::optional<NullPlacement> for RankOptions

adaae37

amend c_glib api to use std::optional<NullPlacement> for RankOptions

14c3914

taepper force-pushed the GH-38558 branch from f7e52a5 to 01fb5f2 Compare February 17, 2026 19:28

formatting

7b59c4f

taepper force-pushed the GH-38558 branch from 6c93868 to 7b59c4f Compare February 18, 2026 14:20

pitrou reviewed Jun 11, 2026

taepper added 9 commits June 12, 2026 15:10

fix compilation errors and correctly place deprecations and suppresions

cc88a60

correctly specify deprecations in python bindings

c7350b5

remove duplicated test

0d2da39

improve coverage in test_record_batch_sort()

ef2d7a3

update select_k doc

df57326

do not check for stable sort when interface is not guaranteed stable

a35a90b

add tests for SortOption serialization and equality

eea17e5

add all fields of SortKey to serialization

fb9fc4f

refactor ChunkedArraySelector and address minor comments

91d1d9e

taepper requested a review from pitrou June 12, 2026 17:09

pitrou reviewed Jun 15, 2026

taepper added 3 commits June 15, 2026 16:02

Merge remote-tracking branch 'origin/main' into apacheGH-38558

4119792

update variable name in comment

29eb202

update deprecation warnings and since doc-strings to arrow 25.0.0

68d59c7

taepper added 2 commits June 15, 2026 17:25

fix test result of added test

073c687

suppress null_placement deprecation warnings in arrow-glib

cb8aa3c

taepper force-pushed the GH-38558 branch from c69e417 to cb8aa3c Compare June 15, 2026 17:40

6 participants