Rectangling function rewrite #1200
auto_name <- names2(pluckers) == "" & is_string

if (any(auto_name)) {
names(pluckers)[auto_name] <- unlist(pluckers[auto_name])
}

if (!is_named(pluckers)) {
stop("All elements of `...` must be named", call. = FALSE)
if (length(pluckers) > 0 && !is_named(pluckers)) {
Would be `is_named2()` if we had dev rlang
# Standardize all pluckers to lists for splicing into `pluck()`
# and for easier handling in `strike()`
is_not_list <- !map_lgl(pluckers, vec_is_list)
pluckers[is_not_list] <- map(pluckers[is_not_list], vec_chop)
It is generally easier to transform a mix of plucker types like `hoist(df, col, x = "x", y = c(1, 2), z = list("foo", "bar"))`
into their equivalent list type, like `hoist(df, col, x = list("x"), y = list(1, 2), z = list("foo", "bar"))`
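The standardization described in this comment can be tried in isolation. This is a minimal sketch, assuming only the exported `vctrs::vec_chop()` and `vctrs::vec_is_list()` plus purrr; the `pluckers` value is made up to mirror the `hoist()` call above and is not taken from the tidyr source.

```r
library(vctrs)
library(purrr)

# Hypothetical pluckers mirroring `x = "x", y = c(1, 2), z = list("foo", "bar")`
pluckers <- list(x = "x", y = c(1, 2), z = list("foo", "bar"))

# Atomic vectors are chopped into one list element per value;
# pluckers that are already lists are left untouched
is_not_list <- !map_lgl(pluckers, vec_is_list)
pluckers[is_not_list] <- map(pluckers[is_not_list], vec_chop)

str(pluckers)
```

After this step every plucker is a list, so downstream code only has to handle one shape.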
# TODO: Use `allow_rename = FALSE`.
# Requires https://github.com/r-lib/tidyselect/issues/225.
cols <- tidyselect::eval_select(enquo(col), data)
For now we just rely on the assumption that the user didn't do any renaming, which should be okay.
data[[col]] <- map(
data[[col]], vec_to_long,
col = col,
if (!vec_is_list(col)) {
General idea is to get the column into a list or list-of form as fast as possible, no matter the original type
for (i in seq_along(col)) {
col[[i]] <- elt_to_long(
x = col[[i]],
index = indices[[i]],
Can't use `map2()` here because `indices` might be `NULL`, but `NULL[[i]]` is valid and just returns `NULL`, so the for loop form worked well
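The point about `NULL[[i]]` can be checked with base R alone. The sketch below uses made-up inputs (`col`, `indices`, and the `picked` result are hypothetical names, not the tidyr internals) and only shows that the loop form tolerates a missing index vector.

```r
# Hypothetical inputs: two elements to unnest and no stored indices
col <- list(1:2, 3:4)
indices <- NULL

picked <- vector("list", length(col))

for (i in seq_along(col)) {
  # `indices[[i]]` is `NULL[[i]]` here, which quietly returns NULL
  index <- indices[[i]]
  picked[[i]] <- list(x = col[[i]], has_index = !is.null(index))
}
```

The same loop works unchanged when `indices` is a list with one entry per element of `col`.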
R/rectangle.R
Outdated
# Extremely special case for data.frames,
# which we want to treat like lists where we know the type of each element
x <- tidyr_new_list(x)
ptypes <- map(x, vec_ptype)
x <- tidyr_chop(x)
x <- map2(x, ptypes, new_list_of)
I actually don't think the data frame behavior of `unnest_wider()` completely follows the semantics that I would expect. But what you have implemented is probably the most practically useful thing for it to do.
Here is what I would have expected (yes, it is weird):
library(tibble)
df <- tibble(x = list(tibble(a = 1:2, b = 2:3, c = 3:4)))
df
#> # A tibble: 1 × 1
#> x
#> <list>
#> 1 <tibble [2 × 3]>
df$x
#> [[1]]
#> # A tibble: 2 × 3
#> a b c
#> <int> <int> <int>
#> 1 1 2 3
#> 2 2 3 4
# unnest_wider(df, x)
# - The size of each element of `df$x` determines the number of columns.
# 2 rows (size) = 2 new columns
# - The `vec_names()` of the element determine the column names, but tibbles
# don't use rownames...
tibble(...1 = df$x[[1]][1,], ...2 = df$x[[1]][2,])
#> # A tibble: 1 × 2
#> ...1$a $b $c ...2$a $b $c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 2 3 4
# This is more useful if you had a data frame with row names
df <- tibble(x = list(data.frame(a = 1:2, b = 2:3, c = 3:4, row.names = c("r1", "r2"))))
df$x
#> [[1]]
#> a b c
#> r1 1 2 3
#> r2 2 3 4
# Then you'd get:
# unnest_wider(df, x)
tibble(r1 = df$x[[1]][1,], r2 = df$x[[1]][2,])
#> # A tibble: 1 × 2
#> r1$a $b $c r2$a $b $c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 2 3 4
Created on 2021-11-08 by the reprex package (v2.0.1)
R/rectangle.R
Outdated
if (!is.null(ptype)) {
x <- tidyr_new_list(x)
x <- vec_cast_common(!!!x, .to = ptype)
x <- new_list_of(x, ptype = ptype)
I think this is pretty slick. It means that if you can't simplify (or choose not to), then you can create a list-of of the non-simplified results by supplying a ptype for that component. This was requested in #998
library(tidyr)
df <- tibble(
x = list(
list(a = 1:2),
list(a = 1)
)
)
df
#> # A tibble: 2 × 1
#> x
#> <list>
#> 1 <named list [1]>
#> 2 <named list [1]>
unnest_wider(df, x)
#> # A tibble: 2 × 1
#> a
#> <list>
#> 1 <int [2]>
#> 2 <dbl [1]>
unnest_wider(df, x, ptype = list(a = integer()))
#> # A tibble: 2 × 1
#> a
#> <list<int>>
#> 1 [2]
#> 2 [1]
Created on 2021-11-08 by the reprex package (v2.0.1)
That is cool!
} else {
abort(glue("Can't simplfy '{nm}'; elements have length > 1"))
}
if (!simplify) {
Note that we now apply the ptype and transform before returning if we don't want to simplify. We also apply the transform then the ptype if they both are given, which I'm fairly confident is correct.
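To illustrate why the transform plausibly has to come before the ptype, here is a small sketch using exported vctrs functions. The inputs (`x`, `transform`, `ptype`) are hypothetical, and this is not the code from `R/rectangle.R`, just the same two steps applied by hand.

```r
library(vctrs)

# Hypothetical component: integers stored as strings
x <- list("1", "2")
transform <- as.integer
ptype <- double()

# 1. Transform each element first
x <- lapply(x, transform)

# 2. Then cast the transformed elements to the requested ptype
x <- vec_cast_common(!!!x, .to = ptype)
x <- new_list_of(x, ptype = ptype)
```

In the reverse order the cast would be attempted on the raw strings, and vctrs does not define a cast from character to double, so the transform would never get a chance to run.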
} else {
stop("Input must be list of vectors", call. = FALSE)
# Ensure empty elements are filled in with their correct size 1 equivalent
info <- list_init_empty(x, null = TRUE, typed = TRUE)
A more correct version of the previous `x[n == 0] <- list(NA)` behavior
# Assume that if combining fails, then we want to return the object
# after the `ptype` and `transform` have been applied, but before the
# empty element filling was applied
i.e. compare CRAN against dev here:
library(tidyr)
df <- tibble(
x = list(
list(a = NULL),
list(a = 1L),
list(a = "x")
)
)
df
#> # A tibble: 3 × 1
#> x
#> <list>
#> 1 <named list [1]>
#> 2 <named list [1]>
#> 3 <named list [1]>
# CRAN behavior (the NULL is replaced by NA)
unnest_wider(df, x)
#> # A tibble: 3 × 1
#> a
#> <list>
#> 1 <lgl [1]>
#> 2 <int [1]>
#> 3 <chr [1]>
unnest_wider(df, x)$a[[1]]
#> [1] NA
# Dev behavior
unnest_wider(df, x)
#> # A tibble: 3 × 1
#> a
#> <list>
#> 1 <NULL>
#> 2 <int [1]>
#> 3 <chr [1]>
unnest_wider(df, x)$a[[1]]
#> NULL
Yeah, that looks much better.
First pass, just through the docs. I'll try and find the time to play with the code a bit too.
R/rectangle.R
Outdated
#' the inner names or positions (if not named) of the values. If multiple
#' columns are specified in `col`, this can also be a glue string containing
#' `"{col}"` to provide a template for the column names. If `NULL`, defaults
#' to `values_to` suffixed with `"_id"`.
#' to `values_to` suffixed with `"_id"`.
#' to `"{col}_id"`.
I actually do mean `values_to` here, because it might be set to something other than `"{col}"`, like `"{col}2"`, making the `indices_to` columns default to `"{col}2_id"`
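The naming rule in this reply can be sketched with glue directly. This is only an illustration of the described default, assuming the template is expanded once per column; `col`, `value_names`, and `index_names` are hypothetical names and this is not the tidyr implementation.

```r
library(glue)

# Hypothetical column names and a non-default `values_to` template
col <- c("x", "y")
values_to <- "{col}2"

# Expand the template per column, then add the "_id" suffix
value_names <- as.character(glue(values_to))
index_names <- paste0(value_names, "_id")
```

Deriving the index names from the expanded `values_to` rather than from `"{col}"` keeps each index column next to the value column it describes.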
So far I only had a look at Some things I found:
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(y = list(a = 1:2, 3:4)) %>%
unnest_longer(y, indices_include = FALSE)
#> Error: If any elements of `.data` are named, all must be named
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
# two ptypes for x
tibble(x = integer()) %>%
unnest_longer(
c(x),
ptype = list(x = double(), x = integer())
)
#> # A tibble: 0 × 1
#> # … with 1 variable: x <dbl>
# unnamed list
tibble(x = integer()) %>%
unnest_longer(
c(x),
ptype = list(double())
)
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>
# maybe warn/inform about unused ptypes?
tibble(x = integer()) %>%
unnest_longer(
c(x),
ptype = list(y = "a")
)
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(
x = list(c(a = 1, b = 2)),
y = c(a = 1, b = 2)
) %>%
unnest_longer(
c(x, y),
indices_include = TRUE
)
#> # A tibble: 4 × 4
#> x x_id y y_id
#> <dbl> <chr> <dbl> <int>
#> 1 1 a 1 1
#> 2 2 b 1 1
#> 3 1 a 2 1
#> 4 2 b 2 1
Created on 2021-11-10 by the reprex package (v2.0.1)
For
tibble(
y = tibble(a = 1, b = 11)
) %>%
unnest_wider(y, names_sep = "_")
tibble(
y = c(a = 1, b = 11)
) %>%
unnest_wider(y, names_sep = "_")
Thanks @mgirlich, very helpful!
library(tidyr)
library(vctrs)
# Say we have this:
df <- tibble(x = c(a = 1L, b = 2L))
# Names are moved to outer list (tidyr_chop())
equivalent1 <- tibble(x = list_of(a = 1L, b = 2L))
# Names are kept on inner elements (vec_chop())
equivalent2 <- tibble(x = list_of(c(a = 1L), c(b = 2L)))
unnest_wider(equivalent1, x)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 2 × 1
#> ...1
#> <int>
#> 1 1
#> 2 2
unnest_wider(equivalent2, x)
#> # A tibble: 2 × 2
#> a b
#> <int> <int>
#> 1 1 NA
#> 2 NA 2
longer1 <- unnest_longer(equivalent1, x)
longer1
#> # A tibble: 2 × 1
#> x
#> <int>
#> 1 1
#> 2 2
longer1$x
#> [1] 1 2
longer2 <- unnest_longer(equivalent2, x)
longer2
#> # A tibble: 2 × 2
#> x x_id
#> <int> <chr>
#> 1 1 a
#> 2 2 b
longer2$x
#> a b
#> 1 2
Unused Currently,
library(tidyr)
df <- tibble(
x = 1:2,
y = list(
list(a = 1),
list(a = 2, b = 2)
)
)
df %>% unnest_wider(y)
#> # A tibble: 2 × 3
#> x a b
#> <int> <dbl> <dbl>
#> 1 1 1 NA
#> 2 2 2 2
df %>% dplyr::slice(1) %>% unnest_wider(y)
#> # A tibble: 1 × 2
#> x a
#> <int> <dbl>
#> 1 1 1
Created on 2021-11-11 by the reprex package (v2.0.1)
The
df %>% dplyr::slice(1) %>% unnest_wider(y, ptype = list(a = double(), b = double()))
#> # A tibble: 1 × 2
#> x a
#> <int> <dbl>
#> 1 1 1
# maybe this should output
#> # A tibble: 1 × 3
#> x a b
#> <int> <dbl> <dbl>
#> 1 1 1 NA
I am still not entirely sure about the handling of indices as they are not type stable:
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
df <- tibble(x = list(c(1, 2), c(a = 3)))
df %>%
unnest_longer(x, indices_include = TRUE)
#> # A tibble: 3 × 2
#> x x_id
#> <dbl> <chr>
#> 1 1 <NA>
#> 2 2 <NA>
#> 3 3 a
df %>%
dplyr::slice(1) %>%
unnest_longer(x, indices_include = TRUE)
#> # A tibble: 2 × 2
#> x x_id
#> <dbl> <int>
#> 1 1 1
#> 2 2 2
Created on 2021-11-11 by the reprex package (v2.0.1)
I am not sure whether this is a common scenario, but it feels like this goes against the idea of shape and type predictability in the tidyverse.
I think that would stray pretty far away from how
Yeah, I think we are probably stuck with this at this point for backwards compatibility's sake. I agree that some alternative API might have been cleaner / more type stable (like
@mgirlich would you like to review any further? At this point I am ready to merge. No revdeps are affected now that wehoop and ffscrapr have already updated their packages on CRAN.
If we are really looking for some way to justify this, Hadley reminded me that one way to look at this is that an index could be an integer or character vector, so it can be seen as a union type of integer+character. And since
A couple more things then I think it is ready to merge
if (is.data.frame(col)) {
out <- col
if (!is.null(names_sep)) {
outer <- name
inner <- colnames(out)
names(out) <- apply_names_sep(outer, inner, names_sep)
}
return(out)
}
drastically improves the performance
set.seed(1)
n <- 10e3
df <- tibble(
x = tibble(a = sample(1:n))
)
bench::mark(unnest_wider(df, x))
# Current
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 unnest_wider(df, x) 2.71s 2.71s 0.370 1.5GB 17.0 1 46
# With early return
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 unnest_wider(df, x) 2.04ms 2.27ms 421. 1.93MB 31.4 134 10
if (!vec_is_list(col)) {
out <- tibble("{name}" := col)
if (is_false(indices_include)) {
return(out)
}
indices <- vec_names(col)
all_unnamed <- is.null(indices)
if (is.null(indices_include) && all_unnamed) {
return(out)
}
if (all_unnamed) {
out[[indices_to]] <- vec_rep(1L, vec_size(col))
} else {
out[[indices_to]] <- indices
}
return(out)
}
set.seed(1)
n <- 10e3
df <- tibble(
x = tibble(a = sample(1:n))
)
bench::mark(unnest_longer(df, x))
# Current
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 unnest_longer(df, x) 212ms 266ms 3.76 3.6MB 9.40 2 5
# With early return
# A tibble: 1 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 unnest_longer(df, x) 2.74ms 3.15ms 316. 2.19MB 38.1 91 11
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
a <- set_names(1:3, c("x", NA, ""))
# get rid of duplicated new names notification?
tibble(a) %>%
unnest_wider(a)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#> x ...1
#> <int> <int>
#> 1 1 NA
#> 2 NA 2
#> 3 NA 3
Created on 2021-11-15 by the reprex package (v2.0.1)
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
a <- set_names(1:3, c("x", NA, ""))
# "a_NA" name?
tibble(a) %>%
unnest_wider(a, names_sep = "_")
#> # A tibble: 3 × 3
#> a_x a_NA a_
#> <int> <int> <int>
#> 1 1 NA NA
#> 2 NA 2 NA
#> 3 NA NA 3
Created on 2021-11-15 by the reprex package (v2.0.1)
library(tidyr)
df <- tibble(x = tibble(a = 1:2))
# This PR
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#> a
#> <list<int>>
#> 1 [1]
#> 2 [1]
# With the mentioned change
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#> a
#> <int>
#> 1 1
#> 2 2
library(tidyr)
library(rlang)
col <- list(
set_names(1, "a"),
set_names(1, NA_character_),
set_names(1:2, c("", ""))
)
df <- tibble(col = col)
# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> # A tibble: 3 × 3
#> a ...1 ...2
#> <dbl> <dbl> <int>
#> 1 1 NA NA
#> 2 NA 1 NA
#> 3 NA 1 2
unnest_wider(df, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 3 × 4
#> col_a col_NA col_...1 col_...2
#> <dbl> <dbl> <int> <int>
#> 1 1 NA NA NA
#> 2 NA 1 NA NA
#> 3 NA NA 1 2
library(tidyr)
library(rlang)
col <- list(
set_names(1, "a"),
set_names(1, NA_character_),
set_names(1, "")
)
df <- tibble(col = col)
# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#> a ...1
#> <dbl> <dbl>
#> 1 1 NA
#> 2 NA 1
#> 3 NA 1
unnest_wider(df, col, names_sep = "_")
#> # A tibble: 3 × 3
#> col_a col_NA col_
#> <dbl> <dbl> <dbl>
#> 1 1 NA NA
#> 2 NA 1 NA
#> 3 NA NA 1
At the very least this means we should do minimal name repair before applying
library(tidyr)
library(rlang)
col1 <- list(set_names(1, ""))
df1 <- tibble(col = col1)
col2 <- list(set_names(1:2, c("", "")))
df2 <- tibble(col = col2)
unnest_wider(df1, col, names_sep = "_")
#> # A tibble: 1 × 1
#> col_
#> <dbl>
#> 1 1
unnest_wider(df2, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 1 × 2
#> col_...1 col_...2
#> <int> <int>
#> 1 1 2
Since we don't have `vec_chop2()` yet r-lib/vctrs#1226
This will be useful for an enhanced `simplify_col()`
Because vctr objects can't currently have `""` names
This uses more explainable vctrs tooling for converting non-primary data types (i.e. non-lists) into lists. This also seems to produce the expected output in more scenarios. Also inlined `tidyr_chop()` into `elt_to_wide()` since that is the only other place it was used. The fact that this removed a helper makes me optimistic that it is the right approach.
Applying it before `names_sep` rather than after means that `""` and `NA_character_` names get repaired early on before they are combined with the prefix and `names_sep`, which can make them mistakenly look like "valid" names
`unnest_longer()`:
Closes #1201
Closes #1199
Closes #1198
Closes #1196
Closes #1195
Closes #1194
Closes #1193
Closes #1034
Closes #1029
Closes #740
`unnest_wider()`:
Closes #1188
Closes #1187
Closes #1186
Closes #1125
Closes #1133
Closes #1060
`hoist()`:
Added test of current behavior for #1203
Closes #1044
Closes #1000
Closes #999
Closes #998
Closes #996
Other:
Part of #1185
Affected revdeps:
ffscrapr
`unnest_wider()` ffverse/ffscrapr#344
wehoop
`unnest_wider()`. PR "Directly use column name in `unnest_wider()` calls" sportsdataverse/wehoop#14
"`unnest_wider()` drops column if it contains empty lists" #1125 (comment) means that columns with all empty elements are now (correctly) retained, which means this set of column names should be updated. Luckily this test isn't tested on CRAN, so it should be easy. PR "Expect a `"notes"` column in the scoreboard functions" sportsdataverse/wehoop#15
Update: Both revdeps have already merged the PRs and have updated their packages on CRAN. So there are now no affected revdeps.
This PR completely rewrites the rectangling functions. They are now much more consistent in a variety of edge cases, and are often much faster than they used to be. I anticipate very few breaking changes, except in some extreme off-label usage situations. I will leave in-line comments, but I think this generally should be reviewed as if these were completely new functions.
Here is the most interesting performance difference I found so far, which comes from https://stackoverflow.com/questions/68897169/any-speedier-alternatives-to-tidyrunnest-longer-when-dealing-with-nested-nam. Mainly this comes from getting rid of `tibble()` in favor of `new_data_frame()`. I think #1133 also identified this. (It is also worth noting that in this case a simple `unnest()` is even faster. Like, in the ~200ms range. But it is nice to see this improvement).