Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use vctrs:::vec_order_locs() in group_by() and vctrs:::vec_order_radix() in arrange() #5808

Closed
wants to merge 2 commits into from

Conversation

romainfrancois
Copy link
Member

closes #4406

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  tib %>% group_by(grp), 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tib %>% group_by(grp)    627ms    627ms      1.60     228MB     4.79

vs on master:

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  not_ordered = tib %>% group_by(grp), 
  ordered = {
    o <- order(grp)
    tibo <- tibble(grp=grp[o], x=x[o])
    b <- tibo %>% group_by(grp)
  }, 
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_ordered    2.64s    2.64s     0.379     248MB     1.14
#> 2 ordered        1.33s    1.33s     0.752     400MB     1.50

Created on 2021-03-10 by the reprex package (v0.3.0)

@romainfrancois
Copy link
Member Author

@DavisVaughan this does not seem to affect the tests, so presumably the tests don't take into account the C locale order thing. So we probably need more tests.

@romainfrancois
Copy link
Member Author

Maybe vctrs has also something to help with arrange() :

  # we can't just use vec_compare_proxy(data) because we need to apply
  # direction for each column, so we get a list of proxies instead
  # and then mimic vctrs:::order_proxy
  #
  # should really be map2(quosures, directions, ...)
  proxies <- map2(data, directions, function(column, direction) {
    proxy <- vec_proxy_order(column)
    desc <- identical(direction, "desc")
    if (is.data.frame(proxy)) {
      proxy <- order(vec_order(proxy,
        direction = direction,
        na_value = if(desc) "smallest" else "largest"
      ))
    } else if(desc) {
      proxy <- desc(proxy)
    }
    proxy
  })

  exec("order", !!!unname(proxies), decreasing = FALSE, na.last = TRUE)

@DavisVaughan
Copy link
Member

we need to apply direction for each column

you can do that with vctrs:::vec_order_radix()!

@romainfrancois
Copy link
Member Author

you can do that with vctrs:::vec_order_radix()!

Let's go then :p

@romainfrancois
Copy link
Member Author

romainfrancois commented Mar 10, 2021

again, using vctrs:::vec_order_radix() does not trigger test failures, probably as we lack tests that do stress locale ordering.

@romainfrancois romainfrancois changed the title use vctrs:::vec_order_locs() use vctrs:::vec_order_locs() in group_by() and vctrs:::vec_order_radix() in arrange() Mar 10, 2021
@romainfrancois
Copy link
Member Author

This helps #4962

split_id$loc <- new_list_of(split_id$loc, ptype = integer())

vec_slice(split_id, vec_order(split_id$key))
split_id <- vctrs:::vec_order_locs(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavisVaughan will this change the behaviour of group_by()? i.e. does it change the ordering of character vectors to always use the C locale?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

library(dplyr)

df <- tibble(g = c("a", "A", "B", "b"), x = 1:4)
df
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 a         1
#> 2 A         2
#> 3 B         3
#> 4 b         4

# Master
df %>%
  group_by(g) %>%
  summarise(x = x)
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 a         1
#> 2 A         2
#> 3 b         4
#> 4 B         3

# This PR
df %>%
  group_by(g) %>%
  summarise(x = x)
#> # A tibble: 4 x 2
#>   g         x
#>   <chr> <int>
#> 1 A         2
#> 2 B         3
#> 3 a         1
#> 4 b         4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an option to eventually sort the grouping data after the fact if really we wanted to maintain bw compat ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group_by() could also get a .locale argument? The groups would be created and sorted in the locale specified (again, defaulting to C), which you'd see the effects of when you do a summarise()

@DavisVaughan
Copy link
Member

We know we are tackling this elsewhere. Like #6018 and #5942

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

group_by performance: potential for easy and substantial improvement
3 participants