Rectangling function rewrite #1200

DavisVaughan · 2021-11-04T20:31:50Z

unnest_longer():
Closes #1201
Closes #1199
Closes #1198
Closes #1196
Closes #1195
Closes #1194
Closes #1193
Closes #1034
Closes #1029
Closes #740

unnest_wider():
Closes #1188
Closes #1187
Closes #1186
Closes #1125
Closes #1133
Closes #1060

hoist():
Added test of current behavior for #1203
Closes #1044
Closes #1000
Closes #999
Closes #998
Closes #996

Other:
Part of #1185

Affected revdeps:

ffscrapr
- Pre-emptive PR already merged Convert all lists to tibbles before calling unnest_wider() ffverse/ffscrapr#344
wehoop
- Lots of off-label usage of unnest_wider(). PR Directly use column name in unnest_wider() calls sportsdataverse/wehoop#14
- Two tests explicitly check for a set of column names after all the unnesting is applied. Fixing the first part of unnest_wider() drops column if it contains empty lists #1125 (comment) means that columns with all empty elements are now (correctly) retained, which means this set of column names should be updated. Luckily this test isn't tested on CRAN, so it should be easy. PR Expect a "notes" column in the scoreboard functions sportsdataverse/wehoop#15

Update: Both revdeps have already merged the PRs and have updated their packages on CRAN. So there are now no affected revdeps.

This PR completely rewrites the rectangling functions. They are now much more consistent in a variety of edge cases, and are often much faster than they used to be. I anticipate very little breaking changes, except in some extreme off-label usage situations. I will leave in-line comments, but I think this generally should be reviewed as if these were completely new functions.

Here is the most interesting performance difference I found so far, which comes from https://stackoverflow.com/questions/68897169/any-speedier-alternatives-to-tidyrunnest-longer-when-dealing-with-nested-nam. Mainly this comes from getting rid of tibble() in favor of new_data_frame(). I think #1133 also identified this. (It is also worth noting that in this case a simple unnest() is even faster. Like, in the ~200ms range. But it is nice to see this improvement).

library(tidyr)

elt <- 1:5
col <- rep(list(elt), 200000L)

df <- tibble(col = col)
df
#> # A tibble: 200,000 × 1
#>    col      
#>    <list>   
#>  1 <int [5]>
#>  2 <int [5]>
#>  3 <int [5]>
#>  4 <int [5]>
#>  5 <int [5]>
#>  6 <int [5]>
#>  7 <int [5]>
#>  8 <int [5]>
#>  9 <int [5]>
#> 10 <int [5]>
#> # … with 199,990 more rows

# CRAN
bench::mark(unnest_longer(df, col))
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_longer(df, col)    1.14m    1.14m    0.0146     292MB     2.09

# This PR
bench::mark(unnest_longer(df, col))
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest_longer(df, col)    2.45s    2.45s     0.408    39.6MB     4.90

DavisVaughan · 2021-11-08T22:20:01Z

R/rectangle.R

  auto_name <- names2(pluckers) == "" & is_string

  if (any(auto_name)) {
    names(pluckers)[auto_name] <- unlist(pluckers[auto_name])
  }

-  if (!is_named(pluckers)) {
-    stop("All elements of `...` must be named", call. = FALSE)
+  if (length(pluckers) > 0 && !is_named(pluckers)) {


Would be is_named2() if we had dev rlang

DavisVaughan · 2021-11-08T22:22:24Z

R/rectangle.R

+  # Standardize all pluckers to lists for splicing into `pluck()`
+  # and for easier handling in `strike()`
+  is_not_list <- !map_lgl(pluckers, vec_is_list)
+  pluckers[is_not_list] <- map(pluckers[is_not_list], vec_chop)


It is generally easier to transform a mix of plucker types like: hoist(df, col, x = "x", y = c(1, 2), z = list("foo", "bar")) into their equivalent list type, like hoist(df, col, x = list("x"), y = list(1, 2), z = list("foo", "bar"))

DavisVaughan · 2021-11-08T22:23:56Z

R/rectangle.R

+  # TODO: Use `allow_rename = FALSE`.
+  # Requires https://github.com/r-lib/tidyselect/issues/225.
+  cols <- tidyselect::eval_select(enquo(col), data)


For now we just rely on the assumption that the user didn't do any renaming, which should be okay for now

DavisVaughan · 2021-11-08T22:25:37Z

R/rectangle.R

-  data[[col]] <- map(
-    data[[col]], vec_to_long,
-    col = col,
+  if (!vec_is_list(col)) {


General idea is to get the column into a list or list-of form as fast as possible, no matter the original type

DavisVaughan · 2021-11-08T22:26:41Z

R/rectangle.R

+  for (i in seq_along(col)) {
+    col[[i]] <- elt_to_long(
+      x = col[[i]],
+      index = indices[[i]],


Can't use map2() here because indices might be NULL, but NULL[[i]] is valid and just returns NULL so the for loop form worked well

DavisVaughan · 2021-11-08T22:36:37Z

R/rectangle.R

+    # Extremely special case for data.frames,
+    # which we want to treat like lists where we know the type of each element
+    x <- tidyr_new_list(x)
+    ptypes <- map(x, vec_ptype)
+    x <- tidyr_chop(x)
+    x <- map2(x, ptypes, new_list_of)


I actually don't think the data frame behavior of unnest_wider() completely follows the semantics that I would expect. But what you have implemented is probably the most practically useful thing for it to do.

Here is what I would have expected (yes, it is weird):

library(tibble) df <- tibble(x = list(tibble(a = 1:2, b = 2:3, c = 3:4))) df #> # A tibble: 1 × 1 #> x #> <list> #> 1 <tibble [2 × 3]> df$x #> [[1]] #> # A tibble: 2 × 3 #> a b c #> <int> <int> <int> #> 1 1 2 3 #> 2 2 3 4 # unnest_wider(df, x) # - The size of each element of `df$x` determines the number of columns. # 2 rows (size) = 2 new columns # - The `vec_names()` of the element determine the column names, but tibbles # don't use rownames... tibble(...1 = df$x[[1]][1,], ...2 = df$x[[1]][2,]) #> # A tibble: 1 × 2 #> ...1$a $b $c ...2$a $b $c #> <int> <int> <int> <int> <int> <int> #> 1 1 2 3 2 3 4 # This is more useful if you had a data frame with row names df <- tibble(x = list(data.frame(a = 1:2, b = 2:3, c = 3:4, row.names = c("r1", "r2")))) df$x #> [[1]] #> a b c #> r1 1 2 3 #> r2 2 3 4 # Then you'd get: # unnest_wider(df, x) tibble(r1 = df$x[[1]][1,], r2 = df$x[[1]][2,]) #> # A tibble: 1 × 2 #> r1$a $b $c r2$a $b $c #> <int> <int> <int> <int> <int> <int> #> 1 1 2 3 2 3 4

^{Created on 2021-11-08 by the reprex package (v2.0.1)}

DavisVaughan · 2021-11-08T22:40:33Z

R/rectangle.R

+  if (!is.null(ptype)) {
+    x <- tidyr_new_list(x)
+    x <- vec_cast_common(!!!x, .to = ptype)
+    x <- new_list_of(x, ptype = ptype)


I think this is pretty slick. It means that if you can't simplify (or choose not to), then you can create a list-of of the non-simplified results by supplying a ptype for that component. This was requested in #998

library(tidyr) df <- tibble( x = list( list(a = 1:2), list(a = 1) ) ) df #> # A tibble: 2 × 1 #> x #> <list> #> 1 <named list [1]> #> 2 <named list [1]> unnest_wider(df, x) #> # A tibble: 2 × 1 #> a #> <list> #> 1 <int [2]> #> 2 <dbl [1]> unnest_wider(df, x, ptype = list(a = integer())) #> # A tibble: 2 × 1 #> a #> <list<int>> #> 1 [2] #> 2 [1]

^{Created on 2021-11-08 by the reprex package (v2.0.1)}

That is cool!

DavisVaughan · 2021-11-08T22:41:31Z

R/rectangle.R

-    } else {
-      abort(glue("Can't simplfy '{nm}'; elements have length > 1"))
-    }
+  if (!simplify) {


Note that we now apply the ptype and transform before returning if we don't want to simplify. We also apply the transform then the ptype if they both are given, which I'm fairly confident is correct.

DavisVaughan · 2021-11-08T22:42:07Z

R/rectangle.R

-  } else {
-    stop("Input must be list of vectors", call. = FALSE)
+  # Ensure empty elements are filled in with their correct size 1 equivalent
+  info <- list_init_empty(x, null = TRUE, typed = TRUE)


A more correct version of the previous x[n == 0] <- list(NA) behavior

DavisVaughan · 2021-11-08T22:45:44Z

R/rectangle.R

+  # Assume that if combining fails, then we want to return the object
+  # after the `ptype` and `transform` have been applied, but before the
+  # empty element filling was applied


i.e. compare CRAN against dev here:

library(tidyr) df <- tibble( x = list( list(a = NULL), list(a = 1L), list(a = "x") ) ) df #> # A tibble: 3 × 1 #> x #> <list> #> 1 <named list [1]> #> 2 <named list [1]> #> 3 <named list [1]> # CRAN behavior (the NULL is replaced by NA) unnest_wider(df, x) #> # A tibble: 3 × 1 #> a #> <list> #> 1 <lgl [1]> #> 2 <int [1]> #> 3 <chr [1]> unnest_wider(df, x)$a[[1]] #> [1] NA # Dev behavior unnest_wider(df, x) #> # A tibble: 3 × 1 #> a #> <list> #> 1 <NULL> #> 2 <int [1]> #> 3 <chr [1]> unnest_wider(df, x)$a[[1]] #> NULL

Yeah, that looks much better.

hadley

First pass, just through the docs. I'll try and find the time to play with the code a bit too.

R/rectangle.R

hadley · 2021-11-09T12:15:12Z

R/rectangle.R

+#'   the inner names or positions (if not named) of the values. If multiple
+#'   columns are specified in `col`, this can also be a glue string containing
+#'   `"{col}"` to provide a template for the column names. If `NULL`, defaults
+#'   to `values_to` suffixed with `"_id"`.


Suggested change

#' to `values_to` suffixed with `"_id"`.

#' to `"{col}_id"`.

I actually do mean values_to here, because it might be set to something other than "{col}", like "{col}2", making the indices_to columns default to "{col}2_id"

R/rectangle.R

hadley · 2021-11-09T12:18:17Z

R/rectangle.R

+  if (!is.null(ptype)) {
+    x <- tidyr_new_list(x)
+    x <- vec_cast_common(!!!x, .to = ptype)
+    x <- new_list_of(x, ptype = ptype)


That is cool!

hadley · 2021-11-09T12:18:51Z

R/rectangle.R

+  # Assume that if combining fails, then we want to return the object
+  # after the `ptype` and `transform` have been applied, but before the
+  # empty element filling was applied


Yeah, that looks much better.

mgirlich · 2021-11-10T08:40:00Z

So far I only had a look at unnest_longer() and df_unchop(). The code looks good to me and easy to read. Great work!

Some things I found:

Partial inner names cause an error

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(y = list(a = 1:2, 3:4)) %>% 
  unnest_longer(y, indices_include = FALSE)
#> Error: If any elements of `.data` are named, all must be named

The names of the ptype and transform argument should be checked with vec_as_names(repair = "check_unique")

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
# two ptypes for x
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(x = double(), x = integer())
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <dbl>

# unnamed list
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(double())
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>

# maybe warn/inform about unused ptypes?
tibble(x = integer()) %>% 
  unnest_longer(
    c(x),
    ptype = list(y = "a")
  )
#> # A tibble: 0 × 1
#> # … with 1 variable: x <int>

I am not sure about the handling of names in vectors that are not lists. Should y_id be c("a", "a", "b", "b")?

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
tibble(
  x = list(c(a = 1, b = 2)),
  y = c(a = 1, b = 2)
) %>% 
  unnest_longer(
    c(x, y),
    indices_include = TRUE
  )
#> # A tibble: 4 × 4
#>       x x_id      y  y_id
#>   <dbl> <chr> <dbl> <int>
#> 1     1 a         1     1
#> 2     2 b         1     1
#> 3     1 a         2     1
#> 4     2 b         2     1

^{Created on 2021-11-10 by the reprex package (v2.0.1)}

mgirlich · 2021-11-10T12:15:08Z

For unnest_wider()

Add a test with a data frame column and names_sep

tibble(
  y = tibble(a = 1, b = 11)
) %>% 
  unnest_wider(y, names_sep = "_")

Add a test for a column that is not a list and not a data frame

tibble(
  y = c(a = 1, b = 11)
) %>% 
  unnest_wider(y, names_sep = "_")

DavisVaughan · 2021-11-10T19:01:27Z

Thanks @mgirlich, very helpful!

This is due to an annoying list-of bug that would be fixed by Allow "" as a valid name for <vctrs_vctr> objects r-lib/vctrs#784. I'll try and get that in the next vctrs release, it has affected me many times now. I've worked around it the best I can with tidyr_temporary_new_list_of(). There is one edge case that it affects, which I opened an issue for Remove usage of tidyr_temporary_new_list_of() #1212.
I added more input validation here, but I think unused ptype/transform elements are going to have to go by silently. I don't think it is worth it to check for these.
I think your intuition is right. When converting a non-list to a list, there are two things you can do with the names. You can move them to the outer list (this is what we currently do, and what CRAN tidyr does but doesn't have tests for) or you can keep them on the inner elements after converting to a list. The latter aligns with your suggestion, and I now feel that that is correct because it uses more standard vctrs tooling (i.e. vec_chop() rather than tidyr_chop()) and it generally results in more expected output, even though these are non-primary data types. Below are the two options. I've switched over to the equivalent2 approach you suggested.

library(tidyr)
library(vctrs)

# Say we have this:
df <- tibble(x = c(a = 1L, b = 2L))

# Names are moved to outer list (tidyr_chop())
equivalent1 <- tibble(x = list_of(a = 1L, b = 2L))

# Names are kept on inner elements (vec_chop())
equivalent2 <- tibble(x = list_of(c(a = 1L), c(b = 2L)))

unnest_wider(equivalent1, x)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 2 × 1
#>    ...1
#>   <int>
#> 1     1
#> 2     2

unnest_wider(equivalent2, x)
#> # A tibble: 2 × 2
#>       a     b
#>   <int> <int>
#> 1     1    NA
#> 2    NA     2

longer1 <- unnest_longer(equivalent1, x)
longer1
#> # A tibble: 2 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2

longer1$x
#> [1] 1 2

longer2 <- unnest_longer(equivalent2, x)
longer2
#> # A tibble: 2 × 2
#>       x x_id 
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b

longer2$x
#> a b 
#> 1 2

I've added those two tests, thanks!

mgirlich · 2021-11-11T06:58:49Z

I added more input validation here, but I think unused ptype/transform elements are going to have to go by silently. I don't think it is worth it to check for these.

Unused ptype/transform element might just be a typo. On the other hand in unnest_wider() unused ptype elements might be used for something different:

Currently, unnest_wider() isn't stable regarding the shape of the output:

library(tidyr)
df <- tibble(
  x = 1:2,
  y = list(
    list(a = 1),
    list(a = 2, b = 2)
  )
)

df %>% unnest_wider(y)
#> # A tibble: 2 × 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1     1    NA
#> 2     2     2     2

df %>% dplyr::slice(1) %>% unnest_wider(y)
#> # A tibble: 1 × 2
#>       x     a
#>   <int> <dbl>
#> 1     1     1

^{Created on 2021-11-11 by the reprex package (v2.0.1)}

The ptype argument can be used to ensure the types of the columns. Maybe one could also use it to ensure to also have the columns x, a, and b in the second case:

df %>% dplyr::slice(1) %>% unnest_wider(y, ptype = list(a = double(), b = double()))
#> # A tibble: 1 × 2
#>       x     a
#>   <int> <dbl>
#> 1     1     1

# maybe this should output
#> # A tibble: 1 × 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1     1    NA

mgirlich · 2021-11-11T07:11:51Z

I am still not entirely sure about the handling of indices as they are not type stable:

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr
df <- tibble(x = list(c(1, 2), c(a = 3)))

df %>% 
  unnest_longer(x, indices_include = TRUE)
#> # A tibble: 3 × 2
#>       x x_id 
#>   <dbl> <chr>
#> 1     1 <NA> 
#> 2     2 <NA> 
#> 3     3 a

df %>% 
  dplyr::slice(1) %>% 
  unnest_longer(x, indices_include = TRUE)
#> # A tibble: 2 × 2
#>       x  x_id
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2

^{Created on 2021-11-11 by the reprex package (v2.0.1)}

I am not sure whether this is a common scenario but it feels like this goes against the idea of the shape and type predictability in the tidyverse.

DavisVaughan · 2021-11-11T19:03:32Z

Maybe one could also use it to ensure to also have the columns x, a, and b in the second case:

I think that would stray pretty far away from how ptype is used elsewhere in tidyr. But that could be the place of a "spec" object, like mentioned in #835. I doubt we add that in this release, but that seems like the right place for this idea

I am still not entirely sure about the handling of indices as they are not type stable:

Yea I think we are probably stuck with this at this point for backwards compatibility's sake. I agree that some alternative API might have been cleaner / more type stable (like names_to and indices_to you mentioned in #1029), but at this point it is probably slightly better to be backwards compatible. Mainly this PR just makes the current API work as it was intended to in more cases.

DavisVaughan · 2021-11-12T16:47:17Z

@mgirlich would you like to review any further?

At this point I am ready to merge. No revdeps are affected now that wehoop and ffscrapr have already updated their packages on CRAN.

DavisVaughan · 2021-11-12T16:55:23Z

the handling of indices as they are not type stable

If we are really looking for some way to justify this, Hadley reminded me that one way to look at this is that an index could be an integer or character vector, so it can be seen as a union type of integer+character. And since x[<index>] works correctly with integer or character vectors, it should be fine for most usage. Obviously not a perfect argument, but that + preserving backwards compatibility is enough to convince me to keep this as is.

mgirlich · 2021-11-15T07:39:34Z

A couple more things then I think it is ready to merge

An early return to col_to_wide() for data frames would be great. Adding this

  if (is.data.frame(col)) {
    out <- col

    if (!is.null(names_sep)) {
      outer <- name
      inner <- colnames(out)
      names(out) <- apply_names_sep(outer, inner, names_sep)
    }

    return(out)
  }

drastically improves the performance

set.seed(1)

n <- 10e3
df <- tibble(
  x = tibble(a = sample(1:n))
)
bench::mark(unnest_wider(df, x))


# Current
# A tibble: 1 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_wider(df, x)    2.71s    2.71s     0.370     1.5GB     17.0     1    46

# With early return
# A tibble: 1 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_wider(df, x)   2.04ms   2.27ms      421.    1.93MB     31.4   134    10

Similarly an early return to col_to_long() for non-list vectors would be great. This passes the tests and is way faster

  if (!vec_is_list(col)) {
    out <- tibble("{name}" := col)

    if (is_false(indices_include)) {
      return(out)
    }

    indices <- vec_names(col)
    all_unnamed <- is.null(indices)

    if (is.null(indices_include) && all_unnamed) {
      return(out)
    }

    if (all_unnamed) {
      out[[indices_to]] <- vec_rep(1L, vec_size(col))
    } else {
      out[[indices_to]] <- indices
    }

    return(out)
  }

set.seed(1)

n <- 10e3
df <- tibble(
  x = tibble(a = sample(1:n))
)
bench::mark(unnest_longer(df, x))

# Current
# A tibble: 1 × 13
  expression                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_longer(df, x)    212ms    266ms      3.76     3.6MB     9.40     2     5

# With early return
# A tibble: 1 × 13
  expression                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 unnest_longer(df, x)   2.74ms   3.15ms      316.    2.19MB     38.1    91    11

Can we get rid of the duplicated New names notification?

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr

a <- set_names(1:3, c("x", NA, ""))
# get rid of duplicated new names notification?
tibble(a) %>% 
  unnest_wider(a)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#>       x  ...1
#>   <int> <int>
#> 1     1    NA
#> 2    NA     2
#> 3    NA     3

^{Created on 2021-11-15 by the reprex package (v2.0.1)}

In unnest_wider() the names_sep doesn't work well with NA

devtools::load_all("~/GitHub/tidyr/")
#> ℹ Loading tidyr

a <- set_names(1:3, c("x", NA, ""))
# "a_NA" name?
tibble(a) %>% 
  unnest_wider(a, names_sep = "_")
#> # A tibble: 3 × 3
#>     a_x  a_NA    a_
#>   <int> <int> <int>
#> 1     1    NA    NA
#> 2    NA     2    NA
#> 3    NA    NA     3

^{Created on 2021-11-15 by the reprex package (v2.0.1)}

DavisVaughan · 2021-11-15T16:21:04Z

This changes the behavior when simplify = FALSE. In general, I am a big fan of the fact that col_to_wide() now always returns a list-of column where each element is a tibble. I'd like to avoid changing that to special case non-primary data types. I did add a test for this now though, so thanks!

library(tidyr)

df <- tibble(x = tibble(a = 1:2))

# This PR
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#>             a
#>   <list<int>>
#> 1         [1]
#> 2         [1]

# With the mentioned change
unnest_wider(df, x, simplify = FALSE)
#> # A tibble: 2 × 1
#>       a
#>   <int>
#> 1     1
#> 2     2

I can't find any fault with this one, I think it would work, but I am still a bit hesitant about branching too far away from the main code path for this special case. It is already a pretty complex operation, so I think I'd rather keep it simple here (It took me a long time to verify that it wouldn't change anything). I am optimistic that since this is not the intended use case for unnest_longer() anyways, that it won't affect very much actual code.
This is a vector special case of unnest_wider() and messages with unnamed lists #1060. I think that we are going to have to keep these duplicated names messages with unnamed / partially named lists, as otherwise if we stopped after the first name repair message then you could be fooled into thinking you only have 1 element without names. Non-lists obviously get converted to lists with vec_chop(), so possibly we could repair names early on before using vec_chop(), but I'm not sure it is worth it. Again, with this being the non-primary data type, I'm mainly happy with it returning a consistent and theoretically correct result with the least amount of variation from the main code path as is possible.
Yea, I think this one is a bug, in particular, these two cases should return the same data frame structure (even if the column names are a little different)

library(tidyr)
library(rlang)

col <- list(
  set_names(1, "a"),
  set_names(1, NA_character_),
  set_names(1:2, c("", ""))
)

df <- tibble(col = col)

# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> # A tibble: 3 × 3
#>       a  ...1  ...2
#>   <dbl> <dbl> <int>
#> 1     1    NA    NA
#> 2    NA     1    NA
#> 3    NA     1     2
unnest_wider(df, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 3 × 4
#>   col_a col_NA col_...1 col_...2
#>   <dbl>  <dbl>    <int>    <int>
#> 1     1     NA       NA       NA
#> 2    NA      1       NA       NA
#> 3    NA     NA        1        2

library(tidyr)
library(rlang)

col <- list(
  set_names(1, "a"),
  set_names(1, NA_character_),
  set_names(1, "")
)

df <- tibble(col = col)

# Should be the same
unnest_wider(df, col)
#> New names:
#> * `` -> ...1
#> New names:
#> * `` -> ...1
#> # A tibble: 3 × 2
#>       a  ...1
#>   <dbl> <dbl>
#> 1     1    NA
#> 2    NA     1
#> 3    NA     1
unnest_wider(df, col, names_sep = "_")
#> # A tibble: 3 × 3
#>   col_a col_NA  col_
#>   <dbl>  <dbl> <dbl>
#> 1     1     NA    NA
#> 2    NA      1    NA
#> 3    NA     NA     1

At the very least this means we should do minimal name repair before applying names_sep, but really I think we should do "unique" name repair to ensure that these return consistent results (i.e. should probably get "...1" for the first suffix).

library(tidyr)
library(rlang)

col1 <- list(set_names(1, ""))
df1 <- tibble(col = col1)

col2 <- list(set_names(1:2, c("", "")))
df2 <- tibble(col = col2)

unnest_wider(df1, col, names_sep = "_")
#> # A tibble: 1 × 1
#>    col_
#>   <dbl>
#> 1     1

unnest_wider(df2, col, names_sep = "_")
#> New names:
#> * col_ -> col_...1
#> * col_ -> col_...2
#> # A tibble: 1 × 2
#>   col_...1 col_...2
#>      <int>    <int>
#> 1        1        2

Since we don't have `vec_chop2()` yet r-lib/vctrs#1226

This will be useful for an enhanced `simplify_col()`

Because vctr objects can't currently have `""` names

This uses more explainable vctrs tooling for converting non-primary data types (i.e. non-lists) into lists. This also seems to produce the expected output in more scenarios. Also inlined `tidyr_chop()` into `elt_to_wide()` since that is the only other place it was used. The fact that this removed a helper makes me optimistic that it is the right approach.

Applying it before `names_sep` rather than after means that `""` and `NA_character_` names get repaired early on before they are combined with the prefix and `names_sep`, which can make them mistakently look like "valid" names

DavisVaughan marked this pull request as ready for review November 8, 2021 22:01

This was referenced Nov 8, 2021

Directly use column name in unnest_wider() calls sportsdataverse/wehoop#14

Merged

Expect a "notes" column in the scoreboard functions sportsdataverse/wehoop#15

Merged

DavisVaughan commented Nov 8, 2021

View reviewed changes

DavisVaughan requested review from hadley and mgirlich November 8, 2021 22:46

hadley reviewed Nov 9, 2021

View reviewed changes

This was referenced Nov 10, 2021

Allow "" as a valid name for <vctrs_vctr> objects r-lib/vctrs#784

Merged

Remove usage of tidyr_temporary_new_list_of() #1212

Closed

DavisVaughan added the ask label Nov 10, 2021

DavisVaughan mentioned this pull request Nov 12, 2021

unnest_wider() and messages with unnamed lists #1060

Closed

DavisVaughan removed the ask label Nov 12, 2021

DavisVaughan added 8 commits November 15, 2021 11:49

Implement tidyr_chop2()

5bb8b5a

Since we don't have `vec_chop2()` yet r-lib/vctrs#1226

Add list_of_ptype() helper for consistent list-of ptype extraction

455a64d

Generalize unchop_col_info() into list_init_empty()

871915a

This will be useful for an enhanced `simplify_col()`

Rename tidyr_chop2() to tidyr_chop()

6cd377c

Reimplement rectangling functions

4051cd4

NEWS bullet

1c37b67

Move details about ptype and transform to the transform param docs

66e777c

Link to list_of() documentation

3294bbf

DavisVaughan added 9 commits November 15, 2021 11:51

Mention recycling of unnest_longer()

bc7a16a

Synchronize values_to and indices_to documentation

c766f46

Replace "can't" with "must not"

bbe1ad2

Introduce tidyr_temporary_new_list_of()

65735a7

Because vctr objects can't currently have `""` names

Provide more input validation for ptype and transform

817af55

Add two tests about names_sep, provided by @mgirlich

7e3b4fa

Test that df-cols result in list-ofs when simplify = FALSE

bf936c0

Apply unique name repair on names before applying names_sep

e0d0d18

Applying it before `names_sep` rather than after means that `""` and `NA_character_` names get repaired early on before they are combined with the prefix and `names_sep`, which can make them mistakently look like "valid" names

DavisVaughan force-pushed the feature/rework-unnest-wider branch from e01b348 to e0d0d18 Compare November 15, 2021 16:53

DavisVaughan merged commit a99bfd9 into tidyverse:main Nov 15, 2021

DavisVaughan deleted the feature/rework-unnest-wider branch November 15, 2021 18:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rectangling function rewrite #1200

Rectangling function rewrite #1200

DavisVaughan commented Nov 4, 2021 •

edited

Loading

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

hadley Nov 9, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

DavisVaughan Nov 8, 2021

hadley Nov 9, 2021

hadley left a comment

hadley Nov 9, 2021

DavisVaughan Nov 9, 2021

hadley Nov 9, 2021

hadley Nov 9, 2021

mgirlich commented Nov 10, 2021 •

edited

Loading

mgirlich commented Nov 10, 2021

DavisVaughan commented Nov 10, 2021

mgirlich commented Nov 11, 2021

mgirlich commented Nov 11, 2021

DavisVaughan commented Nov 11, 2021 •

edited

Loading

DavisVaughan commented Nov 12, 2021

DavisVaughan commented Nov 12, 2021

mgirlich commented Nov 15, 2021

DavisVaughan commented Nov 15, 2021

Rectangling function rewrite #1200

Rectangling function rewrite #1200

Conversation

DavisVaughan commented Nov 4, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hadley left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgirlich commented Nov 10, 2021 • edited Loading

mgirlich commented Nov 10, 2021

DavisVaughan commented Nov 10, 2021

mgirlich commented Nov 11, 2021

mgirlich commented Nov 11, 2021

DavisVaughan commented Nov 11, 2021 • edited Loading

DavisVaughan commented Nov 12, 2021

DavisVaughan commented Nov 12, 2021

mgirlich commented Nov 15, 2021

DavisVaughan commented Nov 15, 2021

DavisVaughan commented Nov 4, 2021 •

edited

Loading

mgirlich commented Nov 10, 2021 •

edited

Loading

DavisVaughan commented Nov 11, 2021 •

edited

Loading