GH-38558: [C++] Add support for null sort option per sort key #46926
taepper wants to merge 101 commits into
1. Restructure the SortKey struct and add NullPlacement to it.
2. Remove NullPlacement from SortOptions.
3. Fix select-k not returning null rows in the null AtEnd scenario. When the limit k is greater than the number of non-null rows and the table contains Null/NaN values, those rows could not be obtained and only non-null results were available. Therefore, null rows are now returned as well, and the null ordering can be set for each SortKey.
4. Add relevant unit tests and update the interfaces implemented by multiple versions.
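To make the idea of a per-key null placement concrete, here is a minimal, self-contained sketch. It does not use Arrow: the `NullPlacement`, `SortOrder`, `SortKey`, `Row`, `RowLess` and `SortIndices` names below are hypothetical stand-ins written for illustration, and the sketch assumes that null placement is independent of the sort order of the key.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Arrow definitions.
enum class NullPlacement { AtStart, AtEnd };
enum class SortOrder { Ascending, Descending };

struct SortKey {
  std::size_t column;            // index of the column inside a row
  SortOrder order;
  NullPlacement null_placement;  // set per key instead of once per sort
};

using Row = std::vector<std::optional<int>>;

// Returns true if row `a` must be ordered before row `b`.
bool RowLess(const Row& a, const Row& b, const std::vector<SortKey>& keys) {
  for (const SortKey& key : keys) {
    const std::optional<int>& x = a[key.column];
    const std::optional<int>& y = b[key.column];
    if (x.has_value() != y.has_value()) {
      // Exactly one side is null: only the placement of this key decides.
      const bool a_is_null = !x.has_value();
      return a_is_null == (key.null_placement == NullPlacement::AtStart);
    }
    if (!x.has_value() || *x == *y) continue;  // tie, look at the next key
    return key.order == SortOrder::Ascending ? *x < *y : *x > *y;
  }
  return false;  // equal on all keys
}

// Returns the row indices in sorted order (stable for fully equal rows).
std::vector<std::size_t> SortIndices(const std::vector<Row>& rows,
                                     const std::vector<SortKey>& keys) {
  std::vector<std::size_t> indices(rows.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&](std::size_t l, std::size_t r) {
                     return RowLess(rows[l], rows[r], keys);
                   });
  return indices;
}
```

With this shape, the first key can put nulls at the start while the second key puts them at the end, which a single sort-wide option cannot express.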
…8558 # Conflicts: # c_glib/arrow-glib/compute.cpp # c_glib/arrow-glib/compute.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # python/pyarrow/_acero.pyx # python/pyarrow/_compute.pyx # python/pyarrow/array.pxi # python/pyarrow/tests/test_compute.py # python/pyarrow/tests/test_table.py
# Conflicts: # cpp/src/arrow/compute/api_vector.cc # cpp/src/arrow/compute/api_vector.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # cpp/src/arrow/compute/kernels/vector_sort_test.cc # cpp/src/arrow/compute/ordering.cc # cpp/src/arrow/compute/ordering.h
…most likely human-error while merging)
bump @pitrou
Should I remove the changes to
@pitrou Do you want to review this?
@taepper I get this error when compiling with gcc 15.2.0:
int64_t null_count = chunked_array_.null_count();
int64_t nan_count = ComputeNanCount<InType>();
int64_t non_null_like_count = chunked_array_.length() - null_count - nan_count;
Instead of having to separately compute the NaN count, can we call PartitionNullsAndNans on all chunks in advance, and then just sum the resulting NaN counts?
That sounds like a better way! I will refactor the method
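The suggestion can be sketched roughly as follows. This is not the Arrow implementation and it does not call the real PartitionNullsAndNans; `Chunk`, `ChunkPartition`, `PartitionChunk` and `TotalNanCount` are hypothetical names, and a chunk is modelled as a vector of optional doubles purely for illustration. Each chunk is partitioned once, and the total NaN count is then just the sum of the per-chunk counts.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical model of one chunk: std::nullopt plays the role of a null.
using Chunk = std::vector<std::optional<double>>;

// Hypothetical result of partitioning one chunk.
struct ChunkPartition {
  std::vector<int64_t> indices;  // non-null-like first, then NaNs, then nulls
  int64_t non_null_like_count = 0;
  int64_t nan_count = 0;
  int64_t null_count = 0;
};

// Partitions the indices of a single chunk and records the three counts.
ChunkPartition PartitionChunk(const Chunk& chunk) {
  ChunkPartition out;
  out.indices.resize(chunk.size());
  std::iota(out.indices.begin(), out.indices.end(), int64_t{0});
  // Move nulls to the back.
  auto nulls_begin = std::stable_partition(
      out.indices.begin(), out.indices.end(), [&](int64_t i) {
        return chunk[static_cast<std::size_t>(i)].has_value();
      });
  // Within the non-null values, move NaNs behind the ordinary values.
  auto nans_begin = std::stable_partition(
      out.indices.begin(), nulls_begin, [&](int64_t i) {
        return !std::isnan(*chunk[static_cast<std::size_t>(i)]);
      });
  out.non_null_like_count = nans_begin - out.indices.begin();
  out.nan_count = nulls_begin - nans_begin;
  out.null_count = out.indices.end() - nulls_begin;
  return out;
}

// Sums the NaN counts that were already computed per chunk.
int64_t TotalNanCount(const std::vector<ChunkPartition>& partitions) {
  int64_t total = 0;
  for (const ChunkPartition& p : partitions) total += p.nan_count;
  return total;
}
```

The point of the design is that no separate pass over the data is needed just to count NaNs; the partitioning step already knows where the NaN range begins and ends.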
This indeed worked out quite nicely! The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?
I also hope that calling the individual parts of a ChunkedArray "chunks" rather than "arrays" is consistent. At least the other code I checked also referred to the parts as chunks.
The temporary storage of all indices for the individual chunks could still be a bit better. Maybe I should introduce a struct that holds all three of array, indices and partitions?
I don't think we care in this PR. Otherwise we have CompressedChunkLocation that can make things more compact.
Thank you for the review. I was not sure regarding the introduction of For example, also this instead of Feel free to let me know if these code-style preferences are not wanted.
Of course, I will rebase onto main and update the deprecation references in the code from 24.0.0 to 25.0.0, but I did not want to introduce noise into the history for now. Once you are okay with the approach, I will rebase.
Oh, yes, these are nice additions, thank you.
def __init__(self, sort_keys="ascending", *, null_placement="at_end", tiebreaker="first"):
def __init__(self, sort_keys="ascending", *, null_placement=None, tiebreaker="first"):
    self._set_options(sort_keys, null_placement, tiebreaker)
I don't think so (except fixing the advertised deprecation version, sorry!)
input_indices_begin, input_indices_end, arr, 0,
first_remaining_sort_key.null_placement);
// From k = output_range.size(), calculate
Nit
// From k = output_range.size(), calculate
// From k = output_indices.size(), calculate
Actually, please rebase now, I think it will make local testing easier.
Okay, the merge actually had no conflicts, so my wariness was not necessary.
See #38584 for original PR. Will be quoted for this PR description.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
This PR includes breaking changes to public APIs.
I amended the original PR to be less breaking in public APIs.
Still, Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a std::optional<NullPlacement> instead of NullPlacement, which led to some changes in downstream APIs and bindings. I also need some help with fixing the c_glib bindings.
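As an illustration of how an optional options-level placement could coexist with per-key placements, here is a minimal sketch. It is an assumption, not the behavior confirmed by this PR: the sketch supposes that an options-level value, when set, overrides every key, and that leaving it unset keeps the per-key values. The types below are hypothetical stand-ins and not the Arrow definitions; `EffectiveSortKeys` is an invented helper name.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Arrow definitions.
enum class NullPlacement { AtStart, AtEnd };

struct SortKey {
  int column;
  NullPlacement null_placement = NullPlacement::AtEnd;
};

struct SortOptions {
  std::vector<SortKey> sort_keys;
  // Unset by default, so callers that only use per-key placement are unaffected.
  std::optional<NullPlacement> null_placement;
};

// Assumed resolution rule: an options-level value, if present, wins for all keys.
std::vector<SortKey> EffectiveSortKeys(const SortOptions& options) {
  std::vector<SortKey> keys = options.sort_keys;
  if (options.null_placement.has_value()) {
    for (SortKey& key : keys) key.null_placement = *options.null_placement;
  }
  return keys;
}
```

Using std::optional here lets the code distinguish "the caller did not specify a placement" from "the caller explicitly asked for AtEnd", which a plain enum with a default value cannot do.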