Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve std.sort.sortContext to use block sort implementation #11117

Open
andrewrk opened this issue Mar 10, 2022 · 8 comments
Open

improve std.sort.sortContext to use block sort implementation #11117

andrewrk opened this issue Mar 10, 2022 · 8 comments
Labels
contributor friendly This issue is limited in scope and/or knowledge of Zig internals. enhancement Solving this issue will likely involve adding new logic or components to the codebase. optimization standard library This issue involves writing Zig code for the standard library.
Milestone

Comments

@andrewrk
Copy link
Member

andrewrk commented Mar 10, 2022

In #11109 I introduced a new function, std.sort.sortContext which allows the caller to override the swap function, which makes it possible to call std lib sort in the MultiArrayList.sort method:

/// `ctx` has the following method:
/// `fn lessThan(ctx: @TypeOf(ctx), a_index: usize, b_index: usize) bool`
pub fn sort(self: Self, ctx: anytype) void {
const SortContext = struct {
sub_ctx: @TypeOf(ctx),
slice: Slice,
pub fn swap(sc: @This(), a_index: usize, b_index: usize) void {
inline for (fields) |field_info, i| {
if (@sizeOf(field_info.field_type) != 0) {
const field = @intToEnum(Field, i);
const ptr = sc.slice.items(field);
mem.swap(field_info.field_type, &ptr[a_index], &ptr[b_index]);
}
}
}
pub fn lessThan(sc: @This(), a_index: usize, b_index: usize) bool {
return sc.sub_ctx.lessThan(a_index, b_index);
}
};
std.sort.sortContext(self.len, SortContext{
.sub_ctx = ctx,
.slice = self.slice(),
});
}

This, in turn, made it possible to add a sort method to ArrayHashMap.

However, I did not convert our main std lib sort function to utilize this context; instead it calls insertion sort:

zig/lib/std/sort.zig

Lines 835 to 839 in 0b82c02

/// TODO currently this just calls `insertionSortContext`. The block sort implementation
/// in this file needs to be adapted to use the sort context.
pub fn sortContext(len: usize, context: anytype) void {
return insertionSortContext(len, context);
}

The main problem to solve here is that the block sort implementation does not exclusively swap elements; it also uses a "cache" of elements that it copies items into and out of. In order to generalize this sort implementation, and make it usable from, for example, MultiArrayList, the cache concept needs to either be eliminated (harming performance) or incorporated in a generalized manner into the context (complicating the API). Since we strive for optimality in Zig, we will pay the price of the more complicated API.

On the plus side, this will make the cache size configurable, which is something that we already should be exposing in the API.

The tricky part of this issue will be designing the Context API carefully such that block sort can be implemented in terms of it, but is as simple as possible otherwise for callers.

@andrewrk andrewrk added enhancement Solving this issue will likely involve adding new logic or components to the codebase. contributor friendly This issue is limited in scope and/or knowledge of Zig internals. optimization standard library This issue involves writing Zig code for the standard library. labels Mar 10, 2022
@andrewrk andrewrk added this to the 0.11.0 milestone Mar 10, 2022
@alichraghi
Copy link
Contributor

alichraghi commented Mar 11, 2022

there's a benchmark result in alichraghi/zort. as you can see Insertion sort is ~160 times slower than block sort. we can try other algorithms such as tim and quick

@matu3ba
Copy link
Contributor

matu3ba commented Mar 11, 2022

@alichraghi
Did you try to implement pattern-defeating quicksort aka pdqsort, timsort ?
Another optimization goal is the generated code size, for which I dont know any numbers yet.
Other possibilities are https://lib.rs/crates/dmsort for 80% sorted data.
Timesort uses a buffer to move whole sections of elements.

Notable for pd quicksort
The current algorithm is based on pattern-defeating quicksort by Orson Peters, which combines the fast average case of randomized quicksort with the fast worst case of heapsort, while achieving linear time on slices with certain patterns. It uses some randomization to avoid degenerate cases, but with a fixed seed to always provide deterministic behavior.
It is typically faster than stable sorting, except in a few special cases, e.g., when the slice consists of several concatenated sorted sequences.

Not really sure, if quadsort is worth the complexity+memory cost https://github.com/scandum/quadsort

Would also be nice to have the sorting algorithms as ascii table overview and which is (simple to ) parallelize etc.

@alichraghi
Copy link
Contributor

alichraghi commented Sep 22, 2022

Zort now has pdq, tim, and many more algorithms.

gantt
    title Sorting (ascending) 10000000 usize (Core i7-4600U CPU 2.10GHz)
    dateFormat x
    axisFormat %S s
    section random
    tim 1.932: 0,1932
    pdq 0.400: 0,400
    quick 0.882: 0,882
    radix 0.311: 0,311
    twin 1.126: 0,1126
    std_block_merge 1.402: 0,1402
    comb 1.631: 0,1631
    shell 2.943: 0,2944
    section sorted
    tim 0.010: 0,10
    pdq 0.011: 0,11
    quick 0.281: 0,281
    radix 0.303: 0,303
    twin 0.085: 0,85
    std_block_merge 0.063: 0,63
    comb 0.500: 0,500
    shell 0.332: 0,332
    section reverse
    tim 0.603: 0,603
    pdq 0.108: 0,108
    quick 0.474: 0,474
    radix 0.251: 0,251
    twin 0.431: 0,431
    std_block_merge 0.477: 0,477
    comb 0.559: 0,559
    shell 0.357: 0,357
    section ascending saw
    tim 0.325: 0,325
    pdq 0.408: 0,408
    quick 1.454: 0,1454
    radix 0.318: 0,318
    twin 0.285: 0,285
    std_block_merge 0.518: 0,518
    comb 0.836: 0,836
    shell 0.590: 0,590
    section descending saw
    tim 0.482: 0,482
    pdq 0.404: 0,404
    quick 49.500: 0,49519
    radix 0.375: 0,375
    twin 0.674: 0,674
    std_block_merge 0.746: 0,746
    comb 1.082: 0,1082
    shell 0.834: 0,834
Loading

Possible alternatives of block sort

@tensorush
Copy link
Contributor

tensorush commented Sep 29, 2022

@alichraghi Nice work on zorting algorithms! It might be better to open another issue if there is a need for other sorting algorithms to be included in the std. Since std.sort.sort is implemented as a stable, zero-allocation, in-place one the current block sort (a.k.a WikiSort) is a pretty good fit here, however, it could be upgraded to an enhanced GrailSort (see issue #2651).

@alichraghi
Copy link
Contributor

alichraghi commented Sep 29, 2022

thank you. pdq, tail, tim and twin is @voroskoi work

Since std.sort.sort is implemented as a stable, zero-allocation, in-place one the current block sort (a.k.a WikiSort) is a pretty good fit here, however, it could be upgraded to an enhanced GrailSort (see issue #2651).

@SpexGuy has implemented GrailSort in Zig
let's see if he likes to include it into zort? so we can benchmark it with other algorithms

EDIT: bench result

gantt
    title Sorting (ascending) 10000000 usize
    dateFormat x
    axisFormat %S s
    section random
    pdq 0.403: 0,403
    grail 1.650: 0,1650
    std_block_merge 1.725: 0,1725
    section sorted
    pdq 0.009: 0,9
    grail 0.717: 0,717
    std_block_merge 0.064: 0,64
    section reverse
    pdq 0.133: 0,133
    grail 0.769: 0,769
    std_block_merge 0.513: 0,513
    section ascending saw
    pdq 0.464: 0,464
    grail 0.939: 0,939
    std_block_merge 0.367: 0,367
    section descending saw
    pdq 0.431: 0,431
    grail 1.043: 0,1043
    std_block_merge 0.688: 0,688
Loading

@voroskoi
Copy link
Sponsor Contributor

voroskoi commented Oct 2, 2022

I think it would make sense to replace std.sort.sort() with std.sort.stable() and std.sort.unstable(). This would be similar to what other languages do.
PDQ is a good candidate for unstable sorting, I would like to see it more broadly tested.
Stable sort could be Timsort, but I would prefer something without allocation in the standard library, dunno.

@cb22
Copy link

cb22 commented Oct 2, 2023

On the plus side, this will make the cache size configurable, which is something that we already should be exposing in the API.

This would be great for TigerBeetle (part of tigerbeetle/tigerbeetle#1191). In our workload, bumping the cache to 2MB works out to around a ~16% throughput improvement. (we can't use an unstable sort due to how we currently handle duplicates.)

It would also be nice to eliminate the stack overflow risk from [512]T that could happen with a large T, but that doesn't affect us currently.

@alichraghi
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor friendly This issue is limited in scope and/or knowledge of Zig internals. enhancement Solving this issue will likely involve adding new logic or components to the codebase. optimization standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

No branches or pull requests

6 participants