use vctrs:::vec_order_locs() in group_by() and vctrs:::vec_order_radix() in arrange() #5808

romainfrancois · 2021-03-10T10:59:26Z

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  tib %>% group_by(grp), 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tib %>% group_by(grp)    627ms    627ms      1.60     228MB     4.79

vs on master:

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  not_ordered = tib %>% group_by(grp), 
  ordered = {
    o <- order(grp)
    tibo <- tibble(grp=grp[o], x=x[o])
    b <- tibo %>% group_by(grp)
  }, 
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_ordered    2.64s    2.64s     0.379     248MB     1.14
#> 2 ordered        1.33s    1.33s     0.752     400MB     1.50

Created on 2021-03-10 by the reprex package (v0.3.0)

romainfrancois · 2021-03-10T11:06:40Z

@DavisVaughan this does not seem to affect the tests, so presumably the tests don't take C locale ordering into account. We probably need more tests.

romainfrancois · 2021-03-10T11:15:44Z

Maybe vctrs also has something to help with arrange():

  # we can't just use vec_compare_proxy(data) because we need to apply
  # direction for each column, so we get a list of proxies instead
  # and then mimic vctrs:::order_proxy
  #
  # should really be map2(quosures, directions, ...)
  proxies <- map2(data, directions, function(column, direction) {
    proxy <- vec_proxy_order(column)
    desc <- identical(direction, "desc")
    if (is.data.frame(proxy)) {
      proxy <- order(vec_order(proxy,
        direction = direction,
        na_value = if(desc) "smallest" else "largest"
      ))
    } else if(desc) {
      proxy <- desc(proxy)
    }
    proxy
  })

  exec("order", !!!unname(proxies), decreasing = FALSE, na.last = TRUE)

DavisVaughan · 2021-03-10T11:23:20Z

> we need to apply direction for each column

you can do that with vctrs:::vec_order_radix()!
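
To make the per-column direction idea concrete, here is a minimal base R sketch. It does not call vctrs:::vec_order_radix(), which is an internal function whose signature may change. Instead it assumes base order() with method = "radix", which also accepts one decreasing flag per sort key. The data frame and column names are made up for illustration.

```r
# Hypothetical data: sort by g ascending, then by x descending.
df <- data.frame(
  g = c("b", "a", "b", "a"),
  x = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# With method = "radix", `decreasing` may hold one flag per sort key.
idx <- order(df$g, df$x, decreasing = c(FALSE, TRUE), method = "radix")

# Reorder the rows with the computed permutation.
sorted <- df[idx, ]
print(sorted)
```

The radix method compares strings in the C locale, so this sketch is subject to the same ordering caveat raised elsewhere in this thread.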

romainfrancois · 2021-03-10T12:14:18Z

> you can do that with vctrs:::vec_order_radix()!

Let's go then :p

romainfrancois · 2021-03-10T12:37:48Z

again, using vctrs:::vec_order_radix() does not trigger test failures, probably as we lack tests that do stress locale ordering.

romainfrancois · 2021-03-10T13:03:57Z

This helps #4962

hadley · 2021-04-28T13:21:40Z

R/grouped-df.r

-  split_id$loc <- new_list_of(split_id$loc, ptype = integer())
-
-  vec_slice(split_id, vec_order(split_id$key))
+  split_id <- vctrs:::vec_order_locs(x)


@DavisVaughan will this change the behaviour of group_by()? i.e. does it change the ordering of character vectors to always use the C locale?

Yes

library(dplyr)

df <- tibble(g = c("a", "A", "B", "b"), x = 1:4)
df
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 a         1
#> 2 A         2
#> 3 B         3
#> 4 b         4

# Master
df %>% group_by(g) %>% summarise(x = x)
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 a         1
#> 2 A         2
#> 3 b         4
#> 4 B         3

# This PR
df %>% group_by(g) %>% summarise(x = x)
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 A         2
#> 2 B         3
#> 3 a         1
#> 4 b         4
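
The same difference can be reproduced without dplyr. The sketch below assumes only base R: sort() with method = "radix" always compares character vectors in the C locale, while the default method collates according to the session locale. No output is shown for the locale-aware call because it depends on the machine's settings.

```r
x <- c("a", "A", "B", "b")

# Radix sorting of character vectors always uses the C locale,
# where uppercase ASCII letters sort before lowercase ones.
c_sorted <- sort(x, method = "radix")

# The default method for character vectors collates according to the
# session locale, so the result can differ between machines.
locale_sorted <- sort(x)

print(c_sorted)
print(locale_sorted)
```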

We could add an option to sort the grouping data after the fact, if we really wanted to maintain backward compatibility?

group_by() could also get a .locale argument? The groups would be created and sorted in the locale specified (again, defaulting to C), which you'd see the effects of when you do a summarise()
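
A rough base R sketch of the "sort after the fact" idea, under the assumption that groups are first built with C-locale keys and then optionally reordered with locale-aware collation. The helper resort_groups() and the keys/locs layout are hypothetical illustrations, not dplyr or vctrs API.

```r
g <- c("a", "A", "B", "b", "a")

# Step 1: build groups with keys in C-locale order (the fast radix path).
keys <- sort(unique(g), method = "radix")
locs <- lapply(keys, function(k) which(g == k))

# Step 2 (optional): reorder the groups using locale-aware collation.
# order() on a character vector uses the session locale by default.
# Hypothetical helper, for illustration only.
resort_groups <- function(keys, locs) {
  o <- order(keys)
  list(keys = keys[o], locs = locs[o])
}

groups <- resort_groups(keys, locs)
print(groups$keys)
```

Only the group keys are re-sorted here, not the full data, so the extra step scales with the number of groups and not with the number of rows.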

DavisVaughan · 2022-05-09T20:56:33Z

We are tackling this elsewhere, in #6018 and #5942.

use vctrs:::vec_order_locs()

958015f

using vctrs:::vec_order_radix() in arrange()

f959f05

romainfrancois changed the title ~~use vctrs:::vec_order_locs()~~ use vctrs:::vec_order_locs() in group_by() and vctrs:::vec_order_radix() in arrange() Mar 10, 2021

romainfrancois mentioned this pull request Mar 10, 2021

Performance drop-off for arrange() #4962

Closed

hadley reviewed Apr 28, 2021

DavisVaughan mentioned this pull request Apr 28, 2021

Add a .locale argument to arrange() and use radix ordering #5868

Closed

DavisVaughan mentioned this pull request Jul 1, 2021

Add a .locale argument to arrange() and implement dplyr_locale() #5942

Closed

DavisVaughan mentioned this pull request Sep 16, 2021

Update group_by() algorithm to utilize vec_locate_sorted_groups() #6018

Closed

DavisVaughan closed this May 9, 2022

DavisVaughan mentioned this pull request May 10, 2022

Add a .locale argument to arrange() and implement dplyr_locale() #6263

Merged
