Shouldn't fcase() recycle? #4258

Closed
sindribaldur opened this issue Feb 25, 2020 · 21 comments · Fixed by #4264
Labels
top request One of our most-requested issues
Milestone

Comments

@sindribaldur

It would be nice if either of these worked:

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  TRUE, as.character(iris$Sepal.Length)
)
# Results in an error


fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  default = as.character(iris$Sepal.Length) # Length of 'default' must be 1.
)

My current solution (maybe I am missing some neater solution):

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  rep(TRUE, nrow(iris)), as.character(iris$Sepal.Length)
)

dplyr::case_when(
  iris$Sepal.Length > 5 ~ ">5",
  iris$Sepal.Length < 4 ~ "<4",
  TRUE ~  as.character(iris$Sepal.Length)
)

@ColeMiller1
Contributor

Just a few SQL-like alternatives:

## an unpaired trailing argument is the ELSE, somewhat similar to Oracle DECODE()
fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  as.character(iris$Sepal.Length)
)

## a new special symbol, similar to the ELSE at the end of an actual CASE WHEN statement
fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  .ELSE, as.character(iris$Sepal.Length)
)

When I first started using dplyr, coming from a SQL background, I kept being surprised that the else branch was implemented as TRUE.

@2005m
Contributor

2005m commented Feb 28, 2020

I can understand the need for a default vector. However, in the example above I would have written the code as follows:

x = iris$Sepal.Length
fcase(
  x > 5, ">5",
  x < 4, "<4",
  x <= 5, as.character(x)
)

That avoids the overhead that is currently being implemented in the PR and was mentioned by @jangorecki.
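
As a quick check of the suggestion above, here is a minimal sketch comparing the explicit complementary condition with the rep(TRUE, ...) catch-all from the first comment. The toy vector x is an assumption made for illustration; it is not data from the thread.

```r
library(data.table)

# Hypothetical toy input standing in for iris$Sepal.Length
x <- c(5.1, 3.9, 4.5, 5.0)

# Explicit complementary condition, as suggested above
explicit <- fcase(
  x > 5,  ">5",
  x < 4,  "<4",
  x <= 5, as.character(x)
)

# Catch-all condition recycled by hand, as in the first comment
catch_all <- fcase(
  x > 5, ">5",
  x < 4, "<4",
  rep(TRUE, length(x)), as.character(x)
)

identical(explicit, catch_all)
```

The two forms agree here because x <= 5 happens to cover every case the earlier conditions leave open; that is not always easy to write down for more involved conditions.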

@2005m
Contributor

2005m commented Feb 28, 2020

It also raises the question: do we want the same behaviour in fifelse?

@sindribaldur
Author

@2005m my example was bad; the default or .ELSE condition can't always be stated so easily.

@2005m
Contributor

2005m commented Feb 28, 2020

A good example would be great.

@shrektan
Member

I've already implemented support for scalar conditions and lazily evaluated defaults in the PR above. Please take a look there.

@sindribaldur
Author

Here is a different example. fcase() is by no means necessary here, but it makes for concise and clean-looking code.

DTmtcars[, rn := fcase(
                   rn %like% "^Merc", sub("^Merc", "Mercedes", rn),
                   rn == "Toyota Corona", paste0(rn, "s"),
                   ...
                   rep(TRUE, length(rn)), rn # else just the original vector
                 )]
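
A runnable variant of the snippet above might look like the following sketch. It assumes DTmtcars is mtcars converted to a data.table with the row names kept in a column rn (the thread does not show how it was built), and it leaves out the conditions elided by "..." rather than guessing at them.

```r
library(data.table)

# Assumption: DTmtcars is mtcars as a data.table, with row names stored in `rn`
DTmtcars <- as.data.table(mtcars, keep.rownames = TRUE)

DTmtcars[, rn := fcase(
  rn %like% "^Merc",       sub("^Merc", "Mercedes", rn),
  rn == "Toyota Corona",   paste0(rn, "s"),
  rep(TRUE, length(rn)),   rn   # else just the original vector
)]

# Rows that matched one of the first two conditions
DTmtcars[rn %like% "^Mercedes|Coronas$", rn]
```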

@knapply

knapply commented Apr 26, 2020

Here's a simple example that may be helpful.

I have a vector of country names...

countries <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago", 
               "Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste",
               "Canada", "Mexico")

... that I'd like to standardize for downstream tasks.

Let's say that I need to modify everything except for Canada and Mexico. Using dplyr::case_when(), I would do so like this...

dplyr::case_when(
  countries == "USA" ~ "United States of America",
  countries == "Britain" ~ "United Kingdom",
  countries == "Russian Federation" ~ "Russia",
  countries == "Trinidad-Tobago" ~ "Trinidad and Tobago",
  countries == "Bahamas" ~ "The Bahamas",
  countries == "Timor-Leste" ~ "East Timor",
  countries == "UAE" ~ "United Arab Emirates",
  countries == "Congo" ~ "Democratic Republic of the Congo",
  countries == "Sao Tome" ~ "Sao Tome and Principe",
  TRUE ~ countries
)
#>  [1] "United States of America"         "United Kingdom"                  
#>  [3] "Russia"                           "Trinidad and Tobago"             
#>  [5] "The Bahamas"                      "Democratic Republic of the Congo"
#>  [7] "United Arab Emirates"             "Sao Tome and Principe"           
#>  [9] "East Timor"                       "Canada"                          
#> [11] "Mexico"

but data.table::fcase() seems to require finding the values you want to leave as-is separately.

conditions <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago", 
                "Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste")

(dont_modify <- !countries %in% conditions)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE


data.table::fcase(
  countries == "USA", "United States of America",
  countries == "Britain", "United Kingdom",
  countries == "Russian Federation", "Russia",
  countries == "Trinidad-Tobago", "Trinidad and Tobago",
  countries == "Bahamas", "The Bahamas",
  countries == "Timor-Leste", "East Timor",
  countries == "UAE", "United Arab Emirates",
  countries == "Congo", "Democratic Republic of the Congo",
  countries == "Sao Tome", "Sao Tome and Principe",
  dont_modify, countries
)
#>  [1] "United States of America"         "United Kingdom"                  
#>  [3] "Russia"                           "Trinidad and Tobago"             
#>  [5] "The Bahamas"                      "Democratic Republic of the Congo"
#>  [7] "United Arab Emirates"             "Sao Tome and Principe"           
#>  [9] "East Timor"                       "Canada"                          
#> [11] "Mexico"
installed.packages()["data.table", "Version"]
#> [1] "1.12.9"

@jangorecki
Member

@knapply, the example you presented is a perfect case where a lookup table should be preferred: it is much easier to maintain and cleaner to use. Then you do countries[countries_new_name, name := i.new_name, on="name"].
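
A minimal sketch of that update-on-join follows. It assumes countries is a data.table with a name column (in the earlier comment it was a plain character vector) and that countries_new_name is a two-column lookup table; both tables here are made up for illustration.

```r
library(data.table)

# Assumption: the names to standardise live in a data.table column `name`
countries <- data.table(name = c("USA", "Britain", "Canada", "Mexico"))

# Hypothetical lookup table: old name -> new name
countries_new_name <- data.table(
  name     = c("USA", "Britain"),
  new_name = c("United States of America", "United Kingdom")
)

# Update on join: only rows of `countries` that match the lookup are changed
countries[countries_new_name, name := i.new_name, on = "name"]

countries$name
```

Rows without a match in the lookup table (Canada and Mexico here) are not touched by the assignment.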

@knapply

knapply commented Apr 26, 2020

I think I must've oversimplified, so here's an expanded example that sticks with the country theme...

library(data.table)

countries <- data.table(
  name = c("Czech Republic", "Czecho-Slovakia", "Mexico", "Czech Republic", 
           "Canada", "Czechoslovakia", "USA", "Britain"),
  year = c(1918, 1990:1996)
); countries
#>               name year
#> 1:  Czech Republic 1918
#> 2: Czecho-Slovakia 1990
#> 3:          Mexico 1991
#> 4:  Czech Republic 1992
#> 5:          Canada 1993
#> 6:  Czechoslovakia 1994
#> 7:             USA 1995
#> 8:         Britain 1996

is_czech_name <- function(x) {
  x %chin% c("Czechoslovak Republic", "Czechoslovakia",
             "Czech Republic", "Czecho-Slovakia")
}

@jangorecki this kind of pattern-matching flexibility is more what I'm getting at.

countries[, name := dplyr::case_when(
  is_czech_name(name) & year <= 1938 ~ "Czechoslovak Republic",
  is_czech_name(name) & year %between% c(1939, 1992) ~ "Czechoslovakia",
  is_czech_name(name) & year >= 1993 ~ "Czech Republic",
  name == "USA" ~ "United States of America",
  name == "Britain" ~ "United Kingdom",
  TRUE ~ name
)]
#>                        name year
#> 1:    Czechoslovak Republic 1918
#> 2:           Czechoslovakia 1990
#> 3:                   Mexico 1991
#> 4:           Czechoslovakia 1992
#> 5:                   Canada 1993
#> 6:           Czech Republic 1994
#> 7: United States of America 1995
#> 8:           United Kingdom 1996
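
For comparison, the same recoding can be sketched with fcase() using the rep(TRUE, ...) workaround from the first comment. This is only an illustration of the workaround, applied to the data and the is_czech_name() helper defined above; .N is the number of rows.

```r
library(data.table)

countries <- data.table(
  name = c("Czech Republic", "Czecho-Slovakia", "Mexico", "Czech Republic",
           "Canada", "Czechoslovakia", "USA", "Britain"),
  year = c(1918, 1990:1996)
)

is_czech_name <- function(x) {
  x %chin% c("Czechoslovak Republic", "Czechoslovakia",
             "Czech Republic", "Czecho-Slovakia")
}

countries[, name := fcase(
  is_czech_name(name) & year <= 1938,                  "Czechoslovak Republic",
  is_czech_name(name) & year %between% c(1939, 1992),  "Czechoslovakia",
  is_czech_name(name) & year >= 1993,                  "Czech Republic",
  name == "USA",                                       "United States of America",
  name == "Britain",                                   "United Kingdom",
  rep(TRUE, .N),                                       name  # catch-all: keep as-is
)]

countries$name
```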

@2005m
Contributor

2005m commented Apr 28, 2020

@jangorecki, countries[countries_new_name, name := i.new_name, on="name"] is nice, but:
- Do you agree that fcase is much faster? (There is actually a faster solution than fcase.)
- How do you deal with missing values (i.e. the default argument or TRUE)?
- Does your solution also work when the output values are vectors?
Thanks.

@jangorecki
Member

  • I have no idea whether it is faster, but I would expect fcase to be slower, unless we are looking at microseconds, where anything can be faster than a [.data.table call. Update on join uses a binary merge to match new values to rows, so it will scale well. fcase is lazy, so it can potentially avoid a lot of computation. There are many factors, but I think scaling up will favor update-on-join.
  • If there is a missing value, the name is not updated at all, so it stays as it was.
  • countries_new_name needs to be a list/data.table.

@2005m
Contributor

2005m commented Apr 28, 2020

Yes, you are right: for large vectors, [.data.table is around 30% faster than fcase.
The fastest solution is a vectorised switch, which cuts the time by half on a single thread.
Oddly, for missing values I get NA, so the name does not stay as it was.
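
The thread does not show the vectorised switch being referred to. Purely as an illustration of the general idea, and not the implementation that was benchmarked, a lookup with fallback can be sketched in base R with a named character vector. The lookup and input vectors below are hypothetical.

```r
# Hypothetical lookup: names are the old values, elements are the replacements
lookup <- c(USA = "United States of America", Britain = "United Kingdom")

x <- c("USA", "Canada", "Britain", "Mexico")

out  <- unname(lookup[x])   # NA wherever x has no entry in the lookup
miss <- is.na(out)
out[miss] <- x[miss]        # fall back to the original value

out
```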

@iago-pssjd
Contributor

Is there any update on this? Also, the solution proposed in the first comment

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  rep(TRUE, nrow(iris)), as.character(iris$Sepal.Length)
)

does not work for me, since I am using fcase inside [.data.table with by = .() groups of different sizes, so I get

Argument #9 has a different length than argument #1. Please make sure all logical conditions have the same length.

Instead, it works if I replace rep(TRUE, nrow(iris)) with rep(TRUE, .N).
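
A small sketch of the grouped case, using a made-up table DT with groups of different sizes (the column names g, x and label are assumptions for illustration). Inside by, .N is the size of the current group, so the catch-all condition has the same length as the other conditions in every group.

```r
library(data.table)

# Hypothetical data with unequal group sizes (3 rows in "a", 2 rows in "b")
DT <- data.table(
  g = c("a", "a", "a", "b", "b"),
  x = c(5.1, 3.9, 4.5, 6.0, 4.2)
)

DT[, label := fcase(
  x > 5, ">5",
  x < 4, "<4",
  rep(TRUE, .N), as.character(x)   # .N is the size of the current group
), by = g]

DT$label
```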

@davidbudzynski
Contributor

Has there been any progress made on this issue? I can confirm that @2005m's nif works exactly as expected.

@Fablepongiste

This should really be looked into. It is a classic issue, and for once the dplyr way with case_when is so much nicer.

@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024