Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rbind() for vctrs_vctr with underlying type character #1186

Closed
krlmlr opened this issue Jul 10, 2020 · 8 comments · Fixed by #1478
Closed

Fix rbind() for vctrs_vctr with underlying type character #1186

krlmlr opened this issue Jul 10, 2020 · 8 comments · Fixed by #1478
Labels
feature a feature request or enhancement type:vctr

Comments

@krlmlr
Copy link
Member

krlmlr commented Jul 10, 2020

Implement levels.vctrs_vctr() if underlying type is character?

Perhaps only with base type fallback, #1135.

With this, rbind() works well enough in this case.

library(vctrs)

foo <- new_vctr(letters, class = "foo")
data <- data.frame(foo)

class(data$foo)
#> [1] "foo"        "vctrs_vctr"

rbind(data, data)
#> Error: `levels.foo()` not supported.

levels.foo <- function(x) {
  NULL
}

class(rbind(data, data)$foo)
#> [1] "foo"        "vctrs_vctr"

Created on 2020-07-10 by the reprex package (v0.3.0)

@krlmlr krlmlr changed the title Implement levels.vctrs_vctr() if underlying type is character? Fix rbind() for vctrs_vctr with underlying type character Mar 18, 2021
@lionel- lionel- added type:vctr feature a feature request or enhancement labels Mar 25, 2021
@jennybc
Copy link
Member

jennybc commented Jul 6, 2021

I independently ran across this and also pretty much convinced myself to implement levels() for my class, also motivated by making a simple rbind() work.

FWIW I am explicitly inheriting base type, i.e. the constructor uses new_vctr(x, class = "drive_id", inherit_base_type = TRUE) and rbind() still doesn't work.

# I'm working on the drive_id class in a googledrive branch,
# but will soon merge this
devtools::load_all("~/rrr/googledrive")
#> ℹ Loading googledrive

x <- data.frame(a = as_id(month.name[1:3]))
x
#>          a
#> 1  January
#> 2 February
#> 3    March
x$a
#> <drive_id[3]>
#> [1] January  February March
class(x$a)
#> [1] "drive_id"   "vctrs_vctr" "character"

rbind(x, x)
#> Error: `levels.drive_id()` not supported.

Created on 2021-07-06 by the reprex package (v2.0.0.9000)

@jennybc
Copy link
Member

jennybc commented Jul 6, 2021

But the more I look into it, I'm not so sure I want rbind() to "just work", because it uses c() and not vec_c(), so seems to circumvent many of the reasons for using vctrs to implement the drive_id class in the first place.

Maybe, all things considered, it's just as well that one needs to use vctrs::vec_rbind() or dplyr::bind_rows() in order to row bind data.frames hosting such a column.

So far my main conclusion is just that this is a super weird error to encounter and that is what leads to the sense of disorientation.

#> Error: `levels.foo()` not supported.

@lionel-
Copy link
Member

lionel- commented Jul 7, 2021

But the more I look into it, I'm not so sure I want rbind() to "just work", because it uses c() and not vec_c(), so seems to circumvent many of the reasons for using vctrs to implement the drive_id class in the first place.

On the other hand, concatenating objects of the same class with c() should always work as expected. And by deriving from vctrs_vctr you automatically gain a c() method. I think the idea behind vctrs_vctr is to have the class work reasonably well with base, so it seems to make sense to implement levels() for vectors that explicitly inherit from character.

I'm just wondering what other consequences implementing levels() will have? I didn't know it was used for other things than factors.

@jennybc
Copy link
Member

jennybc commented Jul 7, 2021

Locally, I've added and exported levels.drive_id.

But, if we make rbind() "just work" by adding a placeholder method for levels.foo, then rbind() combines instances of foo with other things in ways that differ from what vec_rbind(), vec_c() , or even c() (!!) would do.

Is this surprising? Or does this suggest that I've done something odd in my implementation of vctrs methods for drive_id?

devtools::load_all("~/rrr/googledrive")
#> ℹ Loading googledrive

a_drive_id <- as_id(month.name[1:3])
could_be_a_drive_id <- month.name[4:6]
invalid_drive_id <- c("April", "%*@", "June")

x <- data.frame(a = a_drive_id)
class(x$a)
#> [1] "drive_id"   "vctrs_vctr" "character"

# combine drive_id with character (that can be coerced to drive_id)

# inside a data.frame, 
# with rbind: get a drive_id or character, depending on input order
# with vec_rbind: get character
# with c: get character
# with vec_c: get character
y <- data.frame(a = could_be_a_drive_id)
class(y$a)
#> [1] "character"

x_and_y_rbind <- rbind(x, y)
class(x_and_y_rbind$a)
#> [1] "drive_id"   "vctrs_vctr" "character"

y_and_x_rbind <- rbind(y, x)
class(y_and_x_rbind$a)
#> [1] "character"

x_and_y_vec_rbind <- vec_rbind(x, y)
class(x_and_y_vec_rbind$a)
#> [1] "character"

y_and_x_vec_rbind <- vec_rbind(y, x)
class(y_and_x_vec_rbind$a)
#> [1] "character"

c(a_drive_id, could_be_a_drive_id)
#> [1] "January"  "February" "March"    "April"    "May"      "June"
class(c(a_drive_id, could_be_a_drive_id))
#> [1] "character"

vctrs::vec_c(a_drive_id, could_be_a_drive_id)
#> [1] "January"  "February" "March"    "April"    "May"      "June"
class(vctrs::vec_c(a_drive_id, could_be_a_drive_id))
#> [1] "character"

# combine drive_id with character (that can NOT be coerced to drive_id)

# inside a data.frame, 
# with rbind: get an error or character, depending on input order
# with vec_rbind: get character
# with c: get character
# with vec_c: get character
z <- data.frame(a = invalid_drive_id)
class(z$a)
#> [1] "character"

x_and_z_rbind <- rbind(x, z)
#> Error: A <drive_id> must match this regular expression: `^[a-zA-Z0-9_-]+$`
#> Invalid input:
#> x '%*@'

z_and_x_rbind <- rbind(z, x)
class(z_and_x_rbind$a)
#> [1] "character"

x_and_z_vec_rbind <- vec_rbind(x, z)
class(x_and_z_vec_rbind$a)
#> [1] "character"

z_and_x_vec_rbind <- vec_rbind(z, x)
class(z_and_x_vec_rbind$a)
#> [1] "character"

c(a_drive_id, invalid_drive_id)
#> [1] "January"  "February" "March"    "April"    "%*@"      "June"
class(c(a_drive_id, invalid_drive_id))
#> [1] "character"

vctrs::vec_c(a_drive_id, invalid_drive_id)
#> [1] "January"  "February" "March"    "April"    "%*@"      "June"
class(vctrs::vec_c(a_drive_id, invalid_drive_id))
#> [1] "character"

Created on 2021-07-07 by the reprex package (v2.0.0.9000)

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 12, 2021

I don't believe that this is specific to character in any way. You can't use rbind() with a vctrs_vctr class with an underlying integer type either.

library(vctrs)

data <- data.frame(foo = new_vctr(1:5, class = "foo"))

rbind(data, data)
#> Error: `levels.foo()` not supported.

Looking into this, here is where levels() is called once on each column of the data frame. It is a weird way to determine if something is factor-ish?? xj refers to a column in the data frame.
https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/base/R/dataframe.R#L1356-L1361


I think I would be ok with supplying a levels() method that returns NULL. I don't think we'd want to rely on levels.default which returns attr(x, "levels"), as we'd want people to explicitly opt in to levels() support (like I do in probably), but this gives us a decent default behavior that plays nicely with base R in the 99% case where there are no levels. I think we would continue to error with levels<-.vctrs_vctr.

I agree with Jenny that rbind() has some odd behavior that is different from vec_rbind() (see below), but I think we'd probably prioritize compatibility with base R where possible, pointing people to use vec_rbind() if they complain about rbind() doing the "wrong thing". We do a similar thing with c() and vec_c(), so it seems consistent.

data <- data.frame(foo = new_vctr(1:5, class = "foo"))
data2 <- data.frame(foo = 1:5)

levels.vctrs_vctr <- function(x) {
  NULL
}

rbind(data, data)$foo
#> <foo[10]>
#>  [1] 1 2 3 4 5 1 2 3 4 5

rbind(data, data2)$foo
#> <foo[10]>
#>  [1] 1 2 3 4 5 1 2 3 4 5

# integer first, lost class
rbind(data2, data)$foo
#>  [1] 1 2 3 4 5 1 2 3 4 5

@jennybc
Copy link
Member

jennybc commented Jul 12, 2021

I agree with Jenny that rbind() has some odd behavior that is different from vec_rbind() (see below), but I think we'd probably prioritize compatibility with base R where possible

But in this case the base R results are wrong (or, rather, "can be and often are wrong"). So I'm not sure "compatibility" really describes what we'd be enabling. I know my reprex is very long, but it shows rbind() returning a column with type drive_id when it should not (and no other method of combination does) and erroring for a row-bind that is, in fact, possible.

@lionel-
Copy link
Member

lionel- commented Oct 25, 2021

Taking another look at this, I think we should just remove the vctrs_vctr method and allow levels.default do its thing. There is an expectation that levels() returns NULL when unimplemented.

Or return NULL from the method, this way a coincidental levels attribute won't get picked up.

@DavisVaughan
Copy link
Member

I think I'm fine with that too. c() already returns results that "can be and often are wrong" when mixed with vctr types in a particular order (i.e. c(1, new_vctr("foo"))) and there isn't much we can do about that. I think it is just the same argument with rbind().

I vote for returning NULL from the method. I like that my custom factor-ish type in {probably} has to opt in to levels() by explicitly creating a method

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement type:vctr
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants