Improve random sorting algorithm #4246

Merged
merged 1 commit into from
Nov 2, 2023

Collaborator

@DingDongSoLong4 DingDongSoLong4 commented Oct 26, 2023

This implements a new random sorting algorithm, which is both faster and "more random" than the previous implementation.

These are performance numbers I get for the initial SQL query of a random sort on the images page, using a database of ~2.5M images:

| Sort | Time | SQL |
|---|---|---|
| Before | 650-680ms | `SELECT DISTINCT images.id FROM images ORDER BY (substr(images.id * 0.1583058238029480, length(images.id) + 2)) ASC, COALESCE(images.title, images.id) COLLATE NATURAL_CI ASC LIMIT 40 OFFSET 0` |
| After | 250-280ms | `SELECT DISTINCT images.id FROM images ORDER BY mod((images.id + 15830583) * (images.id + 15830583) * 52959209 + (images.id + 15830583) * 1047483763, 2147483647) ASC, COALESCE(images.title, images.id) COLLATE NATURAL_CI ASC LIMIT 40 OFFSET 0` |
| Created At | 150-170ms | `SELECT DISTINCT images.id FROM images ORDER BY images.created_at ASC, COALESCE(images.title, images.id) COLLATE NATURAL_CI ASC LIMIT 40 OFFSET 0` |

I've included a sort on Created At as a comparison: that column is not indexed, so SQLite has to do a full table scan, just as it does for a random sort.

As for the randomness, the previous algorithm was effectively doing rand = images.id * seed % 100000000. If you plot id against rand, the resulting pattern is extremely obvious, and many seed values produce terrible results. Some particularly bad seeds I found: 10000005, 11111111, 55555555, 99999995 (try them with e.g. http://localhost:9999/scenes?sortby=random_{seed}). Those are admittedly a bit contrived, but I've come across bad seeds by chance several times during normal use of stash.
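
To make the pattern concrete, here is a small Python sketch of the simplified model quoted above (rand = id * seed % 100000000). It uses exact integers rather than the float-and-substr expression in the actual "Before" query, so treat it as an illustration of the striping, not a reproduction of stash's output. The names `old_rank` and `old_order` are made up for this example.

```python
MODULUS = 100_000_000  # modulus from the simplified formula above


def old_rank(image_id: int, seed: int) -> int:
    """Simplified integer model of the previous sort key (an assumption, not stash's code)."""
    return (image_id * seed) % MODULUS


def old_order(ids, seed):
    """Return the ids sorted by the modelled key."""
    return sorted(ids, key=lambda i: old_rank(i, seed))


if __name__ == "__main__":
    # One of the "bad" seeds listed above.
    print(old_order(range(1, 21), 10000005))
```

Under this model, seed 10000005 orders ids 1 to 20 as 10, 20, 1, 11, 2, 12, ..., 9, 19: the ids are simply grouped by their last digit, which is the kind of visible structure described above.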

I've "borrowed" the new algorithm from a comment on StackOverflow here. Its values are far more random when plotted, and give far better output, with no "broken" seed values.
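
For reference, the new sort key from the "After" query can be sketched in Python as below. This is an exact-integer model with made-up names (`new_rank`, `new_order`). In SQLite the intermediate product exceeds the signed 64-bit range, at which point SQLite falls back to floating point, so values computed in the database may differ from this model.

```python
PRIME = 2_147_483_647  # 2**31 - 1, the modulus used in the new ORDER BY


def new_rank(image_id: int, seed: int) -> int:
    """Exact-integer model of the quadratic sort key in the "After" query."""
    x = image_id + seed
    return (x * x * 52959209 + x * 1047483763) % PRIME


def new_order(ids, seed):
    """Return the ids sorted by the modelled key."""
    return sorted(ids, key=lambda i: new_rank(i, seed))


if __name__ == "__main__":
    # Seed taken from the "After" query above.
    print(new_order(range(1, 21), 15830583))
```

Because the key is quadratic in id + seed and reduced modulo a prime, a change of seed does more than shift or scale the ordering, which is plausibly why no single seed degenerates the way the old multiplicative key did.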

As I mention in a comment, ideally we'd be using a custom function, since that would allow the use of uints and overflow rather than floats, but that was much slower than even the previous algorithm in my testing.
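
A rough idea of what the custom-function route looks like, using Python's `sqlite3` module rather than stash's Go code. Everything here is an assumption for illustration: the function name `random_rank`, the choice of 32-bit wrapping, and the throwaway `images` table. It only shows the shape of the approach, which the PR found slower in practice.

```python
import sqlite3

MASK32 = 0xFFFFFFFF  # keep values in uint32 range, emulating unsigned overflow


def random_rank(image_id: int, seed: int) -> int:
    """Hypothetical uint32 variant of the polynomial: wraps instead of using floats."""
    x = (image_id + seed) & MASK32
    return (x * x * 52959209 + x * 1047483763) & MASK32


def open_demo_db(n_images: int = 1000) -> sqlite3.Connection:
    """Build an in-memory table standing in for the real images table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO images (id) VALUES (?)",
        ((i,) for i in range(1, n_images + 1)),
    )
    # Expose the Python callable to SQL as a two-argument function.
    conn.create_function("random_rank", 2, random_rank)
    return conn


def random_page(conn: sqlite3.Connection, seed: int, limit: int = 40):
    """Return one page of image ids ordered by the custom function."""
    rows = conn.execute(
        "SELECT id FROM images ORDER BY random_rank(id, ?) ASC, id ASC LIMIT ?",
        (seed, limit),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(random_page(open_demo_db(), 15830583, 10))
```

Every row has to call back out of the SQLite engine into host code to compute its key, which is one plausible reason a custom function would be slower than the inline arithmetic expression.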

@WithoutPants WithoutPants added this to the Version 0.24.0 milestone Nov 2, 2023
@WithoutPants WithoutPants added the improvement Something needed tweaking. label Nov 2, 2023
@WithoutPants WithoutPants merged commit d965587 into stashapp:develop Nov 2, 2023
2 checks passed
@DingDongSoLong4 DingDongSoLong4 deleted the random-sort-perf branch November 2, 2023 10:30
halkeye pushed a commit to halkeye/stash that referenced this pull request Sep 1, 2024