Implement vec_map() #1227

Closed
wants to merge 15 commits into from

Member

lionel- commented Aug 19, 2020

Branched from #1226.

This implements vec_map(). Depending on the supplied prototype, it covers a range of functionality provided in purrr and more:

  • .ptype = list() implements map()
  • .ptype = integer() or other base atomic types implements map_int() etc
  • .ptype can also be set to any vctrs type

Inferring the prototype from the list results is not supported, because it would offer no advantage over piping into vec_simplify() at the end.

The implementation follows two code paths depending on whether .ptype is a list or an atomic vector.

  • When mapping to a list, we update the input in place in the mapping loop and coerce the complete list of outputs to the target ptype at the end. This way the coercion is vectorised.

  • When mapping to an atomic vector we initialise an output vector and then coerce each result before assignment.
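The two code paths can be illustrated with a rough base R sketch. This is not the C implementation from this PR: `map_sketch()` is a hypothetical name, it allocates a fresh output list rather than updating the input in place, and `as.vector()` only stands in for the vctrs casting rules.

```r
# Hypothetical base-R sketch of the two paths; not the PR's C code.
map_sketch <- function(.x, .fn, ..., .ptype = list()) {
  n <- length(.x)
  if (is.list(.ptype)) {
    # List path: collect every result first. Any coercion to a richer
    # list type would happen once, after the loop (nothing to do here
    # for a bare list).
    out <- vector("list", n)
    for (i in seq_len(n)) {
      # `out[i] <- list(...)` keeps NULL results instead of dropping them.
      out[i] <- list(.fn(.x[[i]], ...))
    }
  } else {
    # Atomic path: allocate the output up front, then check and coerce
    # each result before assigning it.
    out <- vector(typeof(.ptype), n)
    for (i in seq_len(n)) {
      elt <- .fn(.x[[i]], ...)
      if (length(elt) != 1L) {
        stop("Result ", i, " must have size 1, not ", length(elt), ".")
      }
      out[[i]] <- as.vector(elt, mode = typeof(.ptype))
    }
  }
  names(out) <- names(.x)
  out
}

# Assumed definition of the helper used throughout this thread.
plus_one <- function(x) x + 1L

res_list <- map_sketch(list(1L, b = 2L), plus_one)
res_int <- map_sketch(list(1L, b = 2L), plus_one, .ptype = integer())
```

In this sketch the list path never inspects individual results, while the atomic path pays a size check and a coercion per element.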

This vctrs implementation is competitive with purrr and base:

### Mapping to list

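# Assumption: `plus_one()` is not defined anywhere in this thread. A definition
# consistent with the outputs shown later would be `function(x) x + 1L`.
x <- list(1, b = 2)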
bench::mark(
  base = lapply(x, plus_one),
  purrr = map(x, plus_one),
  vctrs = vec_map(x, plus_one)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base         2.43µs   3.51µs   268997.        0B     0    10000     0
#> 2 purrr       14.92µs  17.33µs    54736.        0B     5.47  9999     1
#> 3 vctrs        3.91µs   4.91µs   194467.        0B    19.4   9999     1

short <- rep(list(1, b = 2), 20)
bench::mark(
  base = lapply(short, plus_one),
  purrr = map(short, plus_one),
  vctrs = vec_map(short, plus_one)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base         21.6µs   25.8µs    36666.      368B     29.4  9992     8
#> 2 purrr        35.7µs   41.7µs    22743.      368B     25.0  9989    11
#> 3 vctrs        17.1µs     21µs    44847.      368B     31.4  9993     7

long <- rep(list(1, b = 2), 1e5)
bench::mark(
  base = lapply(long, plus_one),
  purrr = map(long, plus_one),
  vctrs = vec_map(long, plus_one)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base          120ms    126ms      6.81    1.53MB     15.3     4     9
#> 2 purrr         128ms    135ms      7.37    1.53MB     12.9     4     7
#> 3 vctrs          79ms     88ms     11.6     1.53MB     17.4     6     9


### Mapping to atomic vector

x <- list(1L, b = 2L)
bench::mark(
  base = vapply(x, plus_one, integer(1)),
  purrr = map_int(x, plus_one),
  vctrs = vec_map(x, plus_one, .ptype = int())
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base         3.54µs    4.8µs   199405.        0B      0   10000     0
#> 2 purrr        5.67µs   7.28µs   130768.        0B     13.1  9999     1
#> 3 vctrs        6.82µs    8.2µs   114366.        0B      0   10000     0

short <- rep(list(1L, b = 2L), 20)
bench::mark(
  base = vapply(short, plus_one, integer(1)),
  purrr = map_int(short, plus_one),
  vctrs = vec_map(short, plus_one, .ptype = int())
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base         22.7µs   26.9µs    35467.      208B     7.09  9998     2
#> 2 purrr        24.4µs   29.5µs    32371.      208B     6.48  9998     2
#> 3 vctrs        23.5µs   26.8µs    35572.      208B     3.56  9999     1

long <- rep(list(1L, b = 2L), 1e5)
bench::mark(
  base = vapply(long, plus_one, integer(1)),
  purrr = map_int(long, plus_one),
  vctrs = vec_map(long, plus_one, .ptype = int())
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#> 1 base          104ms  104.1ms      9.61     781KB     38.4     1     4
#> 2 purrr         115ms  115.2ms      8.68     781KB     34.7     1     4
#> 3 vctrs          89ms   89.4ms     11.2      781KB     22.4     2     4

The purrr compat file has been updated to use vec_map(), both to get some internal testing and to make it easy to import unit tests from purrr.

Member Author

lionel- commented Aug 19, 2020

modify() could now be implemented with vec_map():

modify <- function(.x, .fn, ...) {
  vec_map(.x, .fn, ..., .ptype = .x)
}

modify(1:3, plus_one)
#> [1] 2 3 4

modify(c(FALSE, FALSE), plus_one)
#> [1] TRUE TRUE

modify(c(FALSE, TRUE), plus_one)
#> Error: Can't convert from <integer> to <logical> due to loss of precision.
#> * Locations: 1

Member Author

lionel- commented Aug 19, 2020

When mapping to a list, we update the input in place in the mapping loop and coerce the complete list of outputs to the target ptype at the end. This way the coercion is vectorised.

Like the vec_assign2() implementation in #1228, this requires the list type to be coercible with bare lists. I think we don't lose any important property by requiring this coercion since it goes in the direction of the richer type.

(The more problematic coercion is towards the narrower type but since we require "list" inheritance, we'll basically enforce narrowing coercions to lists once the base type fallback of #1135 is implemented.)

Member

hadley left a comment

Let me know when it has docs to review, and I'll think more about the interface.

R/compat-purrr.R (outdated, resolved)
@@ -50,7 +50,7 @@ new_partial_factor <- function(partial = factor(), learned = factor()) {
 vec_ptype_full.vctrs_partial_factor <- function(x, ...) {
   empty <- ""

-  levels <- map(x, levels)
+  levels <- map(unclass(x), levels)
Member

Why does this change?

Member Author

To avoid an infinite recursion.

Member

But why did it work before?

Member Author

Because map() didn't use vctrs operations and genericity.

The purrr compat file has been updated to use the vec_map() to get some internal testing and to make it easy to import unit tests from purrr.

Member Author

Oops, sorry: the infinite recursion was a different problem. It doesn't make sense for vec_ptype_full() to be recursed into anyway.

The problem is that map() now assigns names, and partial factors and data frames don't support that:

test-partial-factor.R:5: error: has ok print method
`names<-.vctrs_partial_factor()` not supported.

This is a flaw in the partial types but as usual I'm just working around these types when they cause problems for now.

Member Author

lionel- Aug 20, 2020

We no longer assign NULL names, which is a little more efficient. However, this unclass() change is still needed because partial types inherit from vctrs_sclr, which has an unsupported names<- method but still allows names(), so vec_map() sees the internal field names. Making names() unsupported as well causes a bunch of other issues. It wouldn't solve the problem at hand anyway, because we'd then have a vector type for which names() is an error, which is a big genericity flaw. I think we shouldn't worry about these types too much for now.
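The asymmetry described here can be reproduced with a small base R sketch. The class `demo_sclr` is made up for illustration and is not the real vctrs_sclr; it only mimics a type whose `names<-` method aborts while `names()` still falls through to the default and exposes the internal field names.

```r
# Hypothetical class standing in for a scalar-like type with internal fields.
x <- structure(list(partial = 1, learned = 2), class = "demo_sclr")

# Setting names is rejected...
`names<-.demo_sclr` <- function(x, value) {
  stop("`names<-.demo_sclr()` not supported.", call. = FALSE)
}

# ...but reading them still works and returns the field names.
field_names <- names(x)

res <- tryCatch(
  {
    names(x) <- c("a", "b")
    "ok"
  },
  error = function(e) conditionMessage(e)
)

# Unclassing first avoids the method entirely, as in the diff above.
names(unclass(x))
```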

Member Author

lionel- commented Aug 19, 2020

@hadley

Let me know when it has docs to review, and I'll think more about the interface.

Do you like the genericity model with lists implemented here? This version of map() supports both list and atomic outputs, and so needs the equivalent of [[<-. I thought I'd first get feedback on the mechanism proposed here and in #1226 (the operations in #1228 are also relevant) before documenting and exporting.

Member

hadley left a comment

Overall interface seems reasonable to me.


SEXP elt_out = PROTECT(r_eval_force(vec_map_call, env));
if (vec_size(elt_out) != 1) {
  r_abort("Mapped function must return a size 1 vector.");
Member

It would be useful to include the index in this error message.
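As a rough illustration of the suggestion, here is an R-level sketch of a size check that reports the offending position. The helper name and message wording are hypothetical, and `length()` is used in place of `vec_size()`, so the two would differ for data frames.

```r
# Hypothetical helper; the name and the message wording are illustrative only.
check_size_one <- function(elt, i) {
  if (length(elt) != 1L) {
    stop(
      sprintf(
        "Mapped function must return a size 1 vector.\nResult %d has size %d.",
        i, length(elt)
      ),
      call. = FALSE
    )
  }
  invisible(elt)
}

# Capture the message produced for a size-3 result at position 2.
msg <- tryCatch(
  check_size_one(1:3, i = 2L),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```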


expect_identical(
  vec_map(vctr, identity),
  vec_chop(vctr)
Member

I think it would be better to define the expected output directly rather than in terms of another function, because I don't have enough of vec_chop() loaded in my head to have any idea what this does.


return("Used to work in purrr")

out2 <- map(NULL, identity)
Member

What does this return now?

@DavisVaughan
Member

Generally this looks good, but I think that I have a fairly strong aversion to the list part of this interface.

From slider, I learned that there is a meaningful difference between slide() and slide_vec(.ptype = list()).

  1. The first takes each .f result and assigns it into a bare output list non-generically with SET_VECTOR_ELT() with no restrictions on the result of .f (type or size).

  2. The second works identically to other suffix functions, like slide_dbl(), by casting to the list() ptype and by checking that the size of that list is 1 before assigning it into the output generically with vec_assign().

library(slider)

# Returns a list, assigns each element into the list with SET_VECTOR_ELT()
slide(1:2, ~1)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1

# Each `.f` result must be castable to a list of size 1,
# Assigned into output generically with `vec_assign()`
# (so it would extend nicely to list subclasses)
slide_vec(1:2, ~list(1), .ptype = list())
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1

slide_vec(1:2, ~list(1, 2), .ptype = list())
#> Error: In iteration 1, the result of `.f` had size 2, not 1.

slide_vec(1:2, ~1, .ptype = list())
#> Error: Can't convert <double> to <list>.

I think that the current vec_map(.ptype = list()) implementation allows for 1), but not 2).

For purrr, this isn't a huge deal, because neither map_lst() nor map_vec(.ptype = list()) exists, but from a theoretical perspective it seems like it would be nice for vec_map(.ptype = list()) to work exactly like vec_map(.ptype = dbl()) does (with the same casting and size restrictions).

I would advocate for:

# This is the default
# `.ptype = NULL` is identical to `map()`,
# this is NOT guessing the ptype
vec_map(.x, .fn, ..., .ptype = NULL)

# This is like `slide_vec(.x, .fn, ..., .ptype = list())`
# and follows the exact same code path as the current `atomic_map()`
vec_map(.x, .fn, ..., .ptype = list())

# Genericity can be achieved with the following also going through `atomic_map()`
vec_map(.x, .fn, ..., .ptype = lst_rcrd())

Two additional notes:

  1. .ptype = NULL is used elsewhere to mean "we are going to guess the output ptype". It is worth keeping this in mind, but I would be okay with that meaning something different here.

  2. The current implementation would do .ptype = lst_rcrd() differently from my proposed implementation. It looks like it currently assigns non-generically with SET_VECTOR_ELT() into a bare list, and then would cast at the end to lst_rcrd(). It seems like this works when you can cast a list->list-subclass, but I'm not sure you always can? It works with vctrs_list_rcrd because the attribute fields can be computed from the elements, but I doubt that is always the case. Here is an example where the vectorized attribute fields can't be computed from the data:

library(vctrs)
library(rlang)
#> Warning: package 'rlang' was built under R version 4.0.2

local_methods <- function(..., .frame = caller_env()) {
  local_bindings(..., .env = global_env(), .frame = .frame)
}
local_methods_name_rcrd <- function(frame = caller_env()) {
  local_methods(
    .frame = frame,
    vec_proxy.vctrs_name_rcrd = function(x, ...) data_frame(data = unclass(x), last_names = attr(x, "last_names")),
    vec_restore.vctrs_name_rcrd = function(x, to, ...) new_name_rcrd(x$data, x$last_names),
    vec_ptype2.vctrs_name_rcrd.vctrs_name_rcrd = function(x, y, ...) x,
    vec_cast.vctrs_name_rcrd.vctrs_name_rcrd = function(x, to, ...) x,
    vec_cast.list.vctrs_name_rcrd = function(x, to, ...) vec_data(x)
  )
}

local_methods_name_rcrd()
#> Setting deferred event(s) on global environment.
#>   * Execute (and clear) with `withr::deferred_run()`.
#>   * Clear (without executing) with `withr::deferred_clear()`.

new_name_rcrd <- function(x, last_names) {
  structure(x, last_names = last_names, class = c("vctrs_name_rcrd", "list"))
}

# Example:
new_name_rcrd(list(1, 2:5), c("Foo", "Bar"))
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2 3 4 5
#> 
#> attr(,"last_names")
#> [1] "Foo" "Bar"
#> attr(,"class")
#> [1] "vctrs_name_rcrd" "list"

x <- c("Vaughan", "Henry")
ptype <- new_name_rcrd(list(), character())

# Currently assigns into bare list, then casts to `ptype`,
# but you can't cast a list to vctrs_name_rcrd
vctrs:::vec_map(x, ~1, .ptype = ptype)
#> Error: Can't convert <list> to <vctrs_name_rcrd>.

# Alternatively, could cast each element to vctrs_name_rcrd and
# assign generically with `vec_assign()`.
# NOTE: You can't do this at all with the current impl.
# The following would give the same result:
# vctrs:::vec_map(x, ~new_name_rcrd(1, .x), .ptype = ptype)
new_name_rcrd(list(1, 1), x)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1
#> 
#> attr(,"last_names")
#> [1] "Vaughan" "Henry"  
#> attr(,"class")
#> [1] "vctrs_name_rcrd" "list"

Member Author

lionel- commented Aug 31, 2020

Generally this looks good, but I think that I have a fairly strong aversion to the list part of this interface.

From slider, I learned that there is a meaningful difference between slide() and slide_vec(.ptype = list()).

Sorry to have convinced you to use one approach in slider and then implement another approach for purrr... I now think a unique operation based on [[<- makes more sense, provided that the proposed way to deal with lists (chop2 / assign2) makes sense.

I think that the current vec_map(.ptype = list()) implementation allows for 1), but not 2).

I think it's clearer to just unbox the scalar rather than specify such a ptype. It doesn't seem like there's any practical advantage to an implementation based on [<-. When you think about it, it's weird to require functions that return a list containing a size-one vector.
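The difference between the two assignment styles can be seen in plain base R. This is only a sketch of the semantics being discussed, not code from the PR.

```r
out <- vector("list", 2)

# `[[<-` semantics: the mapped function returns the bare value and the
# mapper stores it as a single element.
out[[1]] <- 1

# `[<-` semantics: the mapped function must already return a size-1 list,
# which is then assigned as a slice of the output.
out[2] <- list(1)

# Both routes end up storing the same element.
identical(out[[1]], out[[2]])
```

With `[<-`, every mapped function has to wrap its result in `list()` just so the mapper can unwrap it again, which is the oddity pointed out above.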

@tzakharko

I am curious whether this effort has been abandoned or is still on the roadmap. I would be happy if I could retire purrr for a more streamlined implementation in vctrs.

Member Author

lionel- commented Sep 28, 2022

Now implemented in purrr.

4 participants