Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rectangling function rewrite #1200

Merged
merged 17 commits into from
Nov 15, 2021

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Nov 4, 2021

unnest_longer():
Closes #1201
Closes #1199
Closes #1198
Closes #1196
Closes #1195
Closes #1194
Closes #1193
Closes #1034
Closes #1029
Closes #740

unnest_wider():
Closes #1188
Closes #1187
Closes #1186
Closes #1125
Closes #1133
Closes #1060

hoist():
Added test of current behavior for #1203
Closes #1044
Closes #1000
Closes #999
Closes #998
Closes #996

Other:
Part of #1185


Affected revdeps:

Update: Both revdeps have already merged the PRs and have updated their packages on CRAN. So there are now no affected revdeps.


This PR completely rewrites the rectangling functions. They are now much more consistent in a variety of edge cases, and are often much faster than they used to be. I anticipate very little breaking changes, except in some extreme off-label usage situations. I will leave in-line comments, but I think this generally should be reviewed as if these were completely new functions.

Here is the most interesting performance difference I found so far, which comes from https://stackoverflow.com/questions/68897169/any-speedier-alternatives-to-tidyrunnest-longer-when-dealing-with-nested-nam. Mainly this comes from getting rid of tibble() in favor of new_data_frame(). I think #1133 also identified this. (It is also worth noting that in this case a simple unnest() is even faster. Like, in the ~200ms range. But it is nice to see this improvement).

library(tidyr)

elt <- 1:5
col <- rep(list(elt), 200000L)

df <- tibble(col = col)
df
#> # A tibble: 200,000 × 1
#>    col      
#>    <list>   
#>  1 <int [5]>
#>  2 <int [5]>
#>  3 <int [5]>
#>  4 <int [5]>
#>  5 <int [5]>
#>  6 <int [5]>
#>  7 <int [5]>
#>  8 <int [5]>
#>  9 <int [5]>
#> 10 <int [5]>
#> # … with 199,990 more rows

# CRAN
bench::mark(unnest_longer(df, col))
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_longer(df, col)    1.14m    1.14m    0.0146     292MB     2.09

# This PR
bench::mark(unnest_longer(df, col))
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_longer(df, col)    2.45s    2.45s     0.408    39.6MB     4.90

auto_name <- names2(pluckers) == "" & is_string

if (any(auto_name)) {
names(pluckers)[auto_name] <- unlist(pluckers[auto_name])
}

if (!is_named(pluckers)) {
stop("All elements of `...` must be named", call. = FALSE)
if (length(pluckers) > 0 && !is_named(pluckers)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be is_named2() if we had dev rlang

Comment on lines +240 to +244
# Standardize all pluckers to lists for splicing into `pluck()`
# and for easier handling in `strike()`
is_not_list <- !map_lgl(pluckers, vec_is_list)
pluckers[is_not_list] <- map(pluckers[is_not_list], vec_chop)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is generally easier to transform a mix of plucker types like: hoist(df, col, x = "x", y = c(1, 2), z = list("foo", "bar")) into their equivalent list type, like hoist(df, col, x = list("x"), y = list(1, 2), z = list("foo", "bar"))

Comment on lines +318 to +323
# TODO: Use `allow_rename = FALSE`.
# Requires https://github.com/r-lib/tidyselect/issues/225.
cols <- tidyselect::eval_select(enquo(col), data)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we just rely on the assumption that the user didn't do any renaming, which should be okay for now

data[[col]] <- map(
data[[col]], vec_to_long,
col = col,
if (!vec_is_list(col)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General idea is to get the column into a list or list-of form as fast as possible, no matter the original type

for (i in seq_along(col)) {
col[[i]] <- elt_to_long(
x = col[[i]],
index = indices[[i]],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use map2() here because indices might be NULL, but NULL[[i]] is valid and just returns NULL so the for loop form worked well

R/rectangle.R Outdated
Comment on lines 652 to 657
# Extremely special case for data.frames,
# which we want to treat like lists where we know the type of each element
x <- tidyr_new_list(x)
ptypes <- map(x, vec_ptype)
x <- tidyr_chop(x)
x <- map2(x, ptypes, new_list_of)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think the data frame behavior of unnest_wider() completely follows the semantics that I would expect. But what you have implemented is probably the most practically useful thing for it to do.

Here is what I would have expected (yes, it is weird):

library(tibble)

df <- tibble(x = list(tibble(a = 1:2, b = 2:3, c = 3:4)))
df
#> # A tibble: 1 × 1
#>   x               
#>   <list>          
#> 1 <tibble [2 × 3]>

df$x
#> [[1]]
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3
#> 2     2     3     4

# unnest_wider(df, x)
# - The size of each element of `df$x` determines the number of columns.
#   2 rows (size) = 2 new columns
# - The `vec_names()` of the element determine the column names, but tibbles
#   don't use rownames...
tibble(...1 = df$x[[1]][1,], ...2 = df$x[[1]][2,])
#> # A tibble: 1 × 2
#>   ...1$a    $b    $c ...2$a    $b    $c
#>    <int> <int> <int>  <int> <int> <int>
#> 1      1     2     3      2     3     4

# This is more useful if you had a data frame with row names
df <- tibble(x = list(data.frame(a = 1:2, b = 2:3, c = 3:4, row.names = c("r1", "r2"))))
df$x
#> [[1]]
#>    a b c
#> r1 1 2 3
#> r2 2 3 4

# Then you'd get:
# unnest_wider(df, x)
tibble(r1 = df$x[[1]][1,], r2 = df$x[[1]][2,])
#> # A tibble: 1 × 2
#>    r1$a    $b    $c  r2$a    $b    $c
#>   <int> <int> <int> <int> <int> <int>
#> 1     1     2     3     2     3     4

Created on 2021-11-08 by the reprex package (v2.0.1)

R/rectangle.R Outdated
if (!is.null(ptype)) {
x <- tidyr_new_list(x)
x <- vec_cast_common(!!!x, .to = ptype)
x <- new_list_of(x, ptype = ptype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty slick. It means that if you can't simplify (or choose not to), then you can create a list-of of the non-simplified results by supplying a ptype for that component. This was requested in #998

library(tidyr)

df <- tibble(
  x = list(
    list(a = 1:2),
    list(a = 1)
  )
)

df
#> # A tibble: 2 × 1
#>   x               
#>   <list>          
#> 1 <named list [1]>
#> 2 <named list [1]>

unnest_wider(df, x)
#> # A tibble: 2 × 1
#>   a        
#>   <list>   
#> 1 <int [2]>
#> 2 <dbl [1]>

unnest_wider(df, x, ptype = list(a = integer()))
#> # A tibble: 2 × 1
#>             a
#>   <list<int>>
#> 1         [2]
#> 2         [1]

Created on 2021-11-08 by the reprex package (v2.0.1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is cool!

} else {
abort(glue("Can't simplfy '{nm}'; elements have length > 1"))
}
if (!simplify) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we now apply the ptype and transform before returning if we don't want to simplify. We also apply the transform then the ptype if they both are given, which I'm fairly confident is correct.

} else {
stop("Input must be list of vectors", call. = FALSE)
# Ensure empty elements are filled in with their correct size 1 equivalent
info <- list_init_empty(x, null = TRUE, typed = TRUE)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more correct version of the previous x[n == 0] <- list(NA) behavior

Comment on lines +844 to +878
# Assume that if combining fails, then we want to return the object
# after the `ptype` and `transform` have been applied, but before the
# empty element filling was applied
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. compare CRAN against dev here:

library(tidyr)

df <- tibble(
  x = list(
    list(a = NULL),
    list(a = 1L),
    list(a = "x")
  )
)

df
#> # A tibble: 3 × 1
#>   x               
#>   <list>          
#> 1 <named list [1]>
#> 2 <named list [1]>
#> 3 <named list [1]>

# CRAN behavior (the NULL is replaced by NA)
unnest_wider(df, x)
#> # A tibble: 3 × 1
#>   a        
#>   <list>   
#> 1 <lgl [1]>
#> 2 <int [1]>
#> 3 <chr [1]>
unnest_wider(df, x)$a[[1]]
#> [1] NA

# Dev behavior
unnest_wider(df, x)
#> # A tibble: 3 × 1
#>   a        
#>   <list>   
#> 1 <NULL>   
#> 2 <int [1]>
#> 3 <chr [1]>
unnest_wider(df, x)$a[[1]]
#> NULL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that looks much better.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, just through the docs. I'll try and find the time to play with the code a bit too.

R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated
#' the inner names or positions (if not named) of the values. If multiple
#' columns are specified in `col`, this can also be a glue string containing
#' `"{col}"` to provide a template for the column names. If `NULL`, defaults
#' to `values_to` suffixed with `"_id"`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' to `values_to` suffixed with `"_id"`.
#' to `"{col}_id"`.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually do mean values_to here, because it might be set to something other than "{col}", like "{col}2", making the indices_to columns default to "{col}2_id"

R/rectangle.R Outdated Show resolved Hide resolved
R/rectangle.R Outdated
if (!is.null(ptype)) {
x <- tidyr_new_list(x)
x <- vec_cast_common(!!!x, .to = ptype)
x <- new_list_of(x, ptype = ptype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is cool!

Comment on lines +844 to +878
# Assume that if combining fails, then we want to return the object
# after the `ptype` and `transform` have been applied, but before the
# empty element filling was applied
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that looks much better.

@mgirlich
Copy link
Contributor

mgirlich commented Nov 10, 2021

So far I only had a look at unnest_longer() and df_unchop(). The code looks good to me and easy to read. Great work!

Some things I found:

  1. Partial inner names cause an error
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(y = list(a = 1:2, 3:4)) %>% 
  unnest_longer(y, indices_include = FALSE)
#> Error: If any elements of `.data` are named, all must be named
  1. The names of the ptype and transform argument should be checked with vec_as_names(repair = "check_unique")
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
# two ptypes for x
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(x = double(), x = integer())
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <dbl>

# unnamed list
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(double())
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>

# maybe warn/inform about unused ptypes?
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(y = "a")
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>
  1. I am not sure about the handling of names in vectors that are not lists. Should y_id be c("a", "a", "b", "b")?
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(
  x = list(c(a = 1, b = 2)),
  y = c(a = 1, b = 2)
) %>% 
  unnest_longer(
    c(x, y),
    indices_include = TRUE
  )
#> # A tibble: 4 × 4
#>       x x_id      y  y_id
#>   <dbl> <chr> <dbl> <int>
#> 1     1 a         1     1
#> 2     2 b         1     1
#> 3     1 a         2     1
#> 4     2 b         2     1

Created on 2021-11-10 by the reprex package (v2.0.1)

@mgirlich
Copy link
Contributor

For unnest_wider()

  1. Add a test with a data frame column and names_sep
tibble(
  y = tibble(a = 1, b = 11)
) %>% 
  unnest_wider(y, names_sep = "_")
  1. Add a test for a column that is not a list and not a data frame
tibble(
  y = c(a = 1, b = 11)
) %>% 
  unnest_wider(y, names_sep = "_")

@DavisVaughan
Copy link
Member Author

Thanks @mgirlich, very helpful!

  1. This is due to an annoying list-of bug that would be fixed by Allow "" as a valid name for <vctrs_vctr> objects r-lib/vctrs#784. I'll try and get that in the next vctrs release, it has affected me many times now. I've worked around it the best I can with tidyr_temporary_new_list_of(). There is one edge case that it affects, which I opened an issue for Remove usage of tidyr_temporary_new_list_of() #1212.

  2. I added more input validation here, but I think unused ptype/transform elements are going to have to go by silently. I don't think it is worth it to check for these.

  3. I think your intuition is right. When converting a non-list to a list, there are two things you can do with the names. You can move them to the outer list (this is what we currently do, and what CRAN tidyr does but doesn't have tests for) or you can keep them on the inner elements after converting to a list. The latter aligns with your suggestion, and I now feel that that is correct because it uses more standard vctrs tooling (i.e. vec_chop() rather than tidyr_chop()) and it generally results in more expected output, even though these are non-primary data types. Below are the two options. I've switched over to the equivalent2 approach you suggested.

library(tidyr)
library(vctrs)

# Say we have this:
df <- tibble(x = c(a = 1L, b = 2L))

# Names are moved to outer list (tidyr_chop())
equivalent1 <- tibble(x = list_of(a = 1L, b = 2L))

# Names are kept on inner elements (vec_chop())
equivalent2 <- tibble(x = list_of(c(a = 1L), c(b = 2L)))

unnest_wider(equivalent1, x)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 2 × 1
#>    ...1
#>   <int>
#> 1     1
#> 2     2

unnest_wider(equivalent2, x)
#> # A tibble: 2 × 2
#>       a     b
#>   <int> <int>
#> 1     1    NA
#> 2    NA     2

longer1 <- unnest_longer(equivalent1, x)
longer1
#> # A tibble: 2 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2

longer1$x
#> [1] 1 2

longer2 <- unnest_longer(equivalent2, x)
longer2
#> # A tibble: 2 × 2
#>       x x_id 
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b

longer2$x
#> a b 
#> 1 2
  1. I've added those two tests, thanks!

@mgirlich
Copy link
Contributor

I added more input validation here, but I think unused ptype/transform elements are going to have to go by silently. I don't think it is worth it to check for these.

Unused ptype/transform element might just be a typo. On the other hand in unnest_wider() unused ptype elements might be used for something different:

Currently, unnest_wider() isn't stable regarding the shape of the output:

library(tidyr)
df <- tibble(
  x = 1:2,
  y = list(
    list(a = 1),
    list(a = 2, b = 2)
  )
)

df %>% unnest_wider(y)
#> # A tibble: 2 × 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1     1    NA
#> 2     2     2     2

df %>% dplyr::slice(1) %>% unnest_wider(y)
#> # A tibble: 1 × 2
#>       x     a
#>   <int> <dbl>
#> 1     1     1

Created on 2021-11-11 by the reprex package (v2.0.1)

The ptype argument can be used to ensure the types of the columns. Maybe one could also use it to ensure to also have the columns x, a, and b in the second case:

df %>% dplyr::slice(1) %>% unnest_wider(y, ptype = list(a = double(), b = double()))
#> # A tibble: 1 × 2
#>       x     a
#>   <int> <dbl>
#> 1     1     1

# maybe this should output
#> # A tibble: 1 × 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1     1    NA

@mgirlich
Copy link
Contributor

I am still not entirely sure about the handling of indices as they are not type stable:

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
df <- tibble(x = list(c(1, 2), c(a = 3)))

df %>% 
  unnest_longer(x, indices_include = TRUE)
#> # A tibble: 3 × 2
#>       x x_id 
#>   <dbl> <chr>
#> 1     1 <NA> 
#> 2     2 <NA> 
#> 3     3 a

df %>% 
  dplyr::slice(1) %>% 
  unnest_longer(x, indices_include = TRUE)
#> # A tibble: 2 × 2
#>       x  x_id
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2

Created on 2021-11-11 by the reprex package (v2.0.1)

I am not sure whether this is a common scenario but it feels like this goes against the idea of the shape and type predictability in the tidyverse.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Nov 11, 2021

Maybe one could also use it to ensure to also have the columns x, a, and b in the second case:

I think that would stray pretty far away from how ptype is used elsewhere in tidyr. But that could be the place of a "spec" object, like mentioned in #835. I doubt we add that in this release, but that seems like the right place for this idea

I am still not entirely sure about the handling of indices as they are not type stable:

Yea I think we are probably stuck with this at this point for backwards compatibility's sake. I agree that some alternative API might have been cleaner / more type stable (like names_to and indices_to you mentioned in #1029), but at this point it is probably slightly better to be backwards compatible. Mainly this PR just makes the current API work as it was intended to in more cases.

@DavisVaughan
Copy link
Member Author

@mgirlich would you like to review any further?

At this point I am ready to merge. No revdeps are affected now that wehoop and ffscrapr have already updated their packages on CRAN.

@DavisVaughan
Copy link
Member Author

the handling of indices as they are not type stable

If we are really looking for some way to justify this, Hadley reminded me that one way to look at this is that an index could be an integer or character vector, so it can be seen as a union type of integer+character. And since x[<index>] works correctly with integer or character vectors, it should be fine for most usage. Obviously not a perfect argument, but that + preserving backwards compatibility is enough to convince me to keep this as is.

@mgirlich
Copy link
Contributor

A couple more things then I think it is ready to merge

  1. An early return to col_to_wide() for data frames would be great. Adding this
  if (is.data.frame(col)) {
    out <- col

    if (!is.null(names_sep)) {
      outer <- name
      inner <- colnames(out)
      names(out) <- apply_names_sep(outer, inner, names_sep)
    }

    return(out)
  }

drastically improves the performance

set.seed(1)

n <- 10e3
df <- tibble(
  x = tibble(a = sample(1:n))
)
bench::mark(unnest_wider(df, x))


# Current
# A tibble: 1 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_wider(df, x)    2.71s    2.71s     0.370     1.5GB     17.0     1    46

# With early return
# A tibble: 1 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_wider(df, x)   2.04ms   2.27ms      421.    1.93MB     31.4   134    10  
  1. Similarly an early return to col_to_long() for non-list vectors would be great. This passes the tests and is way faster
  if (!vec_is_list(col)) {
    out <- tibble("{name}" := col)

    if (is_false(indices_include)) {
      return(out)
    }

    indices <- vec_names(col)
    all_unnamed <- is.null(indices)

    if (is.null(indices_include) && all_unnamed) {
      return(out)
    }

    if (all_unnamed) {
      out[[indices_to]] <- vec_rep(1L, vec_size(col))
    } else {
      out[[indices_to]] <- indices
    }

    return(out)
  }
set.seed(1)

n <- 10e3
df <- tibble(
  x = tibble(a = sample(1:n))
)
bench::mark(unnest_longer(df, x))

# Current
# A tibble: 1 × 13
  expression                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_longer(df, x)    212ms    266ms      3.76     3.6MB     9.40     2     5

# With early return
# A tibble: 1 × 13
  expression                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_longer(df, x)   2.74ms   3.15ms      316.    2.19MB     38.1    91    11
  1. Can we get rid of the duplicated New names notification?
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr

a <- set_names(1:3, c("x", NA, ""))
# get rid of duplicated new names notification?
tibble(a) %>% 
  unnest_wider(a)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#>       x  ...1
#>   <int> <int>
#> 1     1    NA
#> 2    NA     2
#> 3    NA     3

Created on 2021-11-15 by the reprex package (v2.0.1)

  1. In unnest_wider() the names_sep doesn't work well with NA
devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr

a <- set_names(1:3, c("x", NA, ""))
# "a_NA" name?
tibble(a) %>% 
  unnest_wider(a, names_sep = "_")
#> # A tibble: 3 × 3
#>     a_x  a_NA    a_
#>   <int> <int> <int>
#> 1     1    NA    NA
#> 2    NA     2    NA
#> 3    NA    NA     3

Created on 2021-11-15 by the reprex package (v2.0.1)

@DavisVaughan
Copy link
Member Author

  1. This changes the behavior when simplify = FALSE. In general, I am a big fan of the fact that col_to_wide() now always returns a list-of column where each element is a tibble. I'd like to avoid changing that to special case non-primary data types. I did add a test for this now though, so thanks!
library(tidyr)

df <- tibble(x = tibble(a = 1:2))

# This PR
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#>             a
#>   <list<int>>
#> 1         [1]
#> 2         [1]

# With the mentioned change
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#>       a
#>   <int>
#> 1     1
#> 2     2
  1. I can't find any fault with this one, I think it would work, but I am still a bit hesitant about branching too far away from the main code path for this special case. It is already a pretty complex operation, so I think I'd rather keep it simple here (It took me a long time to verify that it wouldn't change anything). I am optimistic that since this is not the intended use case for unnest_longer() anyways, that it won't affect very much actual code.

  2. This is a vector special case of unnest_wider() and messages with unnamed lists #1060. I think that we are going to have to keep these duplicated names messages with unnamed / partially named lists, as otherwise if we stopped after the first name repair message then you could be fooled into thinking you only have 1 element without names. Non-lists obviously get converted to lists with vec_chop(), so possibly we could repair names early on before using vec_chop(), but I'm not sure it is worth it. Again, with this being the non-primary data type, I'm mainly happy with it returning a consistent and theoretically correct result with the least amount of variation from the main code path as is possible.

  3. Yea, I think this one is a bug, in particular, these two cases should return the same data frame structure (even if the column names are a little different)

library(tidyr)
library(rlang)

col <- list(
  set_names(1, "a"),
  set_names(1, NA_character_),
  set_names(1:2, c("", ""))
)

df <- tibble(col = col)

# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> # A tibble: 3 × 3
#>       a  ...1  ...2
#>   <dbl> <dbl> <int>
#> 1     1    NA    NA
#> 2    NA     1    NA
#> 3    NA     1     2
unnest_wider(df, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 3 × 4
#>   col_a col_NA col_...1 col_...2
#>   <dbl>  <dbl>    <int>    <int>
#> 1     1     NA       NA       NA
#> 2    NA      1       NA       NA
#> 3    NA     NA        1        2
library(tidyr)
library(rlang)

col <- list(
  set_names(1, "a"),
  set_names(1, NA_character_),
  set_names(1, "")
)

df <- tibble(col = col)

# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#>       a  ...1
#>   <dbl> <dbl>
#> 1     1    NA
#> 2    NA     1
#> 3    NA     1
unnest_wider(df, col, names_sep = "_")
#> # A tibble: 3 × 3
#>   col_a col_NA  col_
#>   <dbl>  <dbl> <dbl>
#> 1     1     NA    NA
#> 2    NA      1    NA
#> 3    NA     NA     1

At the very least this means we should do minimal name repair before applying names_sep, but really I think we should do "unique" name repair to ensure that these return consistent results (i.e. should probably get "...1" for the first suffix).

library(tidyr)
library(rlang)

col1 <- list(set_names(1, ""))
df1 <- tibble(col = col1)

col2 <- list(set_names(1:2, c("", "")))
df2 <- tibble(col = col2)

unnest_wider(df1, col, names_sep = "_")
#> # A tibble: 1 × 1
#>    col_
#>   <dbl>
#> 1     1

unnest_wider(df2, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 1 × 2
#>   col_...1 col_...2
#>      <int>    <int>
#> 1        1        2

Because vctr objects can't currently have `""` names
This uses more explainable vctrs tooling for converting non-primary data types (i.e. non-lists) into lists. This also seems to produce the expected output in more scenarios.

Also inlined `tidyr_chop()` into `elt_to_wide()` since that is the only other place it was used. The fact that this removed a helper makes me optimistic that it is the right approach.
Applying it before `names_sep` rather than after means that `""` and `NA_character_` names get repaired early on before they are combined with the prefix and `names_sep`, which can make them mistakently look like "valid" names
@DavisVaughan DavisVaughan merged commit a99bfd9 into tidyverse:main Nov 15, 2021
@DavisVaughan DavisVaughan deleted the feature/rework-unnest-wider branch November 15, 2021 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment