-
Notifications
You must be signed in to change notification settings - Fork 5
/
003-dplyr-radix-ordering.Rmd
254 lines (175 loc) · 14.6 KB
/
003-dplyr-radix-ordering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
output: github_document
editor_options:
chunk_output_type: console
---
# Tidyup 3: Radix Ordering in `dplyr::arrange()`
**Champion**: Davis
**Co-Champion**: Hadley
**Status**: Accepted
## Abstract
As of dplyr 1.0.9, `arrange()` uses a modified version of `base::order()` to sort columns of a data frame.
Recently, vctrs gained `vec_order_radix()`, a fast radix ordering function with first class support for data frames and custom vctrs types, along with enhanced character ordering.
The purpose of this tidyup is to propose switching `arrange()` from `order()` to `vec_order_radix()` in the most user-friendly way possible.
## Motivation
Thanks to the data.table team, R \>= 3.3 gained support for an extremely fast radix ordering algorithm in `order()`.
This has become the default algorithm for ordering most atomic types, with the notable exception of character vectors.
Radix ordering character vectors is only possible in the C locale, but the shell sort currently in use by `order()` respects the system locale.
Because R is justifiably hesitant to break backwards compatibility, if *any* character vector is present, the entire ordering procedure falls back to a much slower shell sort.
Because dplyr uses `order()` internally, the performance of `arrange()` is negatively affected by this fallback.
Inspired by the performance of the radix ordering algorithm, and by its many practical applications for data science, a radix order-based `vec_order_radix()` was added to vctrs, which has the following benefits:
- Radix ordering on all atomic types.
- First class support for data frames, including specifying sorting direction on a per column basis.
- Support for an optional character transformation function, which generates an intermediate *sort key* that is unique to a particular locale.
When sorted in the C locale, the sort key returns an ordering that would be equivalent to directly sorting the original vector in the locale that the sort key was generated for.
It is worth looking at a quick example that demonstrates just how fast this radix ordering algorithm is when compared against the defaults of `order()`.
```{r}
library(stringi)
library(vctrs)
vec_order_radix <- vctrs:::vec_order_radix
set.seed(123)
# 10,000 random strings, sampled to a total size of 1,000,000
n_unique <- 10000L
n_total <- 1000000L
dictionary <- stri_rand_strings(
n = n_unique,
length = sample(1:30, n_unique, replace = TRUE)
)
x <- sample(dictionary, size = n_total, replace = TRUE)
head(x)
# Force `order()` to use the C locale to match `vec_order_radix()`
bench::mark(
base = withr::with_locale(c(LC_COLLATE = "C"), order(x)),
vctrs = vec_order_radix(x)
)
# Force `vec_order_radix()` to use the American English locale, which is also
# my system locale. To do that we'll need to generate a sort key, which
# we can sort in the C locale, but the result will be like we sorted
# directly in the American English locale.
bench::mark(
base = order(x),
vctrs = vec_order_radix(
x = x,
chr_proxy_collate = ~stri_sort_key(.x, locale = "en_US")
)
)
# Generating the sort key takes most of the time
bench::mark(
sort_key = stri_sort_key(x, locale = "en_US")
)
```
In dplyr, we'd like to utilize `vec_order_radix()` while breaking as little code as possible.
Switching to `vec_order_radix()` has many potential positive impacts, including:
- Improved performance when ordering character vectors.
- Which also results in improved overall performance when ordering character vectors alongside other atomic types, since it would no longer cause the whole procedure to fall back to a shell sort.
- Improved reproducibility across sessions by ensuring that the default behavior doesn't depend on an environment variable, `LC_COLLATE`, which is used by `order()` to determine the locale.
- Improved reproducibility across OSes where `LC_COLLATE` might differ, even for the same locale.
For example, `"en_US"` on a Mac is approximately equivalent to `"English_United States.1252"` on Windows.
- Improved consistency within the tidyverse.
Specifically, with stringr, which has an explicit argument for changing the locale, i.e. `stringr::str_order(locale = "en")` that utilizes the locale identifiers from stringi.
However, making this switch also has potential negative impacts:
- Breaking code that relies strongly on the ordering resulting from using the `LC_COLLATE` environment variable.
- Surprising users if the new ordering does not match the previous one.
## Solution
To switch to `vec_order_radix()` internally while surprising the least amount of users, it is proposed that the data frame method for `arrange()` gain a new argument, `.locale`, with the following properties:
- Defaults to `NULL`, which typically returns the `"C"` locale, but can be overriden to use legacy behavior, see below.
- If stringi is installed, allow a string locale identifier, such as `"fr"` for French, for explicitly adjusting the locale. If stringi is not installed, and the user has explicitly specified a locale identifier, an error will be thrown.
- Allow `"C"` to be specified as a special case, which is an explicit way to request the C locale. If the exact details of the ordering are not critical, this is often much faster than specifying a locale identifier. This does not require stringi.
We understand that a small subset of users will need time to switch over to the new behavior.
These users can set the global option, `dplyr.legacy_locale`, to `TRUE` to retain the current behavior that utilizes the system locale.
This is intended to be a temporary option, and will be removed in a later version of dplyr.
Setting this option is only applicable when `.locale = NULL`.
If it is set to any other value, then `.locale` overrides the global option value.
Note that this is the same global option that is utilized in [Tidyup 006: Ordering of `dplyr::group_by()`](https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md) meaning that setting it will also revert `group_by()` to its legacy behavior.
C has been chosen as the default because it is the most reproducible option.
The C locale is available in all versions of R and across all operating systems, and works the same everywhere.
This satisfies one of the main goals of this tidyup, improving the reproducibility of `arrange()` when compared with the current behavior of `LC_COLLATE`.
The C locale is also fairly close to the American English locale (i.e. `"en"`), with the main difference being how case-sensitivity is treated, but this rarely makes a practical difference in a data analysis.
On certain systems, stringi can be a difficult dependency to install.
Because of this, this proposal recommends that stringi only be *suggested* so that users without stringi can still use dplyr.
This proposal relies on `stringi::stri_sort_key()` when a stringi locale identifier is supplied, which generates the sort key mentioned under Motivation as a proxy that can be ordered in the C locale.
However, sort key generation is expensive.
In fact, it is often the most expensive part of the entire sorting process, which is one of the reasons that the C locale is the default.
That said, generating a sort key + sorting it in the C locale is generally still 5-10x faster than using `order()` directly.
## Implementation
- Using `vec_order_radix()` in `arrange()`, and adding `.locale`
- <https://github.com/tidyverse/dplyr/pull/6263>
## Backwards Compatibility
### arrange()
The proposal outlined above should preserve the results of most programs using the American English locale.
The main difference between the C locale and American English is related to case-sensitivity.
In the C locale, the English alphabet is grouped by *case*, while in English locales the alphabet is grouped by *letter*:
```{r, message=FALSE}
library(dplyr) # tidyverse/dplyr#6263
df <- tibble(x = c("a", "b", "C", "B", "c"))
df
# The C locale groups the English alphabet by case, placing uppercase letters
# before lowercase letters
arrange(df, x)
# The American English locale groups the alphabet by letter
arrange(df, x, .locale = "en")
```
This rarely has a practical difference in a data analysis.
The C locale will return identical results to the American English locale as long as the case is consistent between observations, like with the following two IDs: `"AD25" < "AG66"`, or with proper nouns, which are often either capitalized consistently or not capitalized at all: `"America" < "Japan"` or `"america" < "japan"`.
With non-English Latin script languages, we expect there to be more meaningful differences.
For example, in Spanish the `n` with a tilde above it is sorted directly after `n`, but in the C locale this is placed after `z`.
```{r}
tbl <- tibble(x = c("\u00F1", "n", "z"))
tbl
arrange(tbl, x)
arrange(tbl, x, .locale = "es")
```
Spanish users that have `LC_COLLATE` set to Spanish may be surprised that `arrange()` would now be placing this character in the "wrong order" even though they have set that global option.
The fix would be to either set `.locale = "es"` in their calls to `arrange()`, or to set `options(dplyr.legacy_locale = TRUE)` to revert back to the legacy behavior.
We expect this issue to affect a small number of users, due to the sheer magnitude of users that use English as their system locale.
### arrange_at/if/all()
While these three variants of `arrange()` are superseded, we have decided to add a `.locale` argument to each of them anyways.
## Unresolved Questions
stringi provides various options for fine tuning the sorting method through `stringi::stri_opts_collator()`.
The most useful of these is `numeric`, which allows for natural sorting of strings containing a mix of alphabetical and numeric characters.
```{r}
library(stringi)
x <- c("A1", "A100", "A2")
# Compares the 2nd character of each string as 1 <= 1 <= 2
stri_sort(x, locale = "en")
# Compares 1 <= 2 <= 100
opts <- stri_opts_collator(locale = "en", numeric = TRUE)
stri_sort(x, opts_collator = opts)
```
Feedback suggested that it might be useful to allow `.locale` to accept a `stri_opts_collator()` list to fine tune the procedure used by `arrange()`, but ultimately we decided not to add this at this time because most of the arguments to `stri_opts_collator()` are extreme special cases.
We may return to this in the future if it proves to be a popular request.
## Alternatives
### No global option
The original proposal defaulted to the American English locale, but did not include a way to globally override this.
This had the benefit of being more reproducible across sessions, since as long as stringi was installed it was guaranteed that the locale was always American English unless specified otherwise through `.locale`.
However, in the tidyverse meeting on 2021-06-14 it was determined that not including a way to globally override this default was too aggressive of a change for non-English Latin script users.
The solution here is to at least provide a way to fall back to legacy behavior temporarily if users need more time to fully update their scripts to the new behavior.
We feel that, for the most part, locale is not a particularly important part of arranging data frames until the final stages of your process, i.e. right before producing some human readable output like a table, so there should be relatively few places where you need to update `arrange()` calls.
### `dplyr.locale` global option
A previous version of this proposal contained a `dplyr.locale` global option that would allow users to globally override the locale to one of the stringi locale identifiers.
This would be a way for them to quickly adapt their scripts without having to go through and update all of the `arrange()` calls.
It also was meant to provide an escape hatch for situations when `arrange()` was being called from a function you don't control.
We found that these arguments were not strong enough to retain this global option for a few reasons:
- There should be relatively few places where you need to update your calls to `arrange()`.
- For places where `arrange()` is called but you can't control it, you can generally just `arrange()` the output of that function a second time to your liking.
- This global option would not be respected by `group_by()`, even though `dplyr.legacy_locale` would be, which seems inconsistent.
- This global option actually reduces reproducibility.
One user might set this global option in their `.Rprofile`, pass a script to a co-worker, and then be surprised when the co-worker doesn't get exactly the same results.
This goes directly against one of the core goals of this tidyup.
### Defaulting to the American English locale if stringi was installed
A previous version of this proposal changed the behavior of `dplyr_locale()` so that it:
- Defaulted to `"en"` if stringi was installed
- Fell back to `"C"` with a warning if stringi wasn't installed
- Still allowed `dplyr.locale` to override this behavior
This seemed reasonable since most R users use an English locale, and that would mean their scripts had very little chance of breaking from this change.
However, on 2022-05-11 we realized that in practice this would cause more confusion than it is worth.
In particular, if you are a package developer relying on dplyr, then the default behavior of `arrange()` would change for you depending on whether or not stringi was installed.
This would result in difficult to debug situations where you might locally write tests with stringi installed, but then on CI your tests might fail or throw warnings if you forget to `Import` stringi (even if you weren't working with character columns!).
By defaulting to the C locale, the default behavior of `arrange()` will work the same everywhere, and if package developers explicitly set `.locale` to anything else, such as `"en"`, then this will be a clear signal that they have opted into an optional behavior and need to add stringi as a dependency of their package.
### Tagged character vectors
A final proposed alternative is to implement a "tagged" character vector class, with an additional attribute specifying the locale to be used when ordering.
This would remove the need for any `.locale` argument, and the locale would even be allowed to vary between columns.
If no locale tag was supplied, `arrange()` would default to either `"en"` or `"C"` for the locale.
This approach is relatively clean, but is practically very difficult because it would require cross package effort to generate and retain these locale tags.
Additionally, it doesn't solve the problem of avoiding breakage for existing code that uses a non-English locale.
Lastly, it would require an additional learning curve for users to understand how to use them in conjunction with `arrange()`.