Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case When/Fcase Error #300

Closed
kmishra9 opened this issue Sep 10, 2021 · 10 comments
Closed

Case When/Fcase Error #300

kmishra9 opened this issue Sep 10, 2021 · 10 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@kmishra9
Copy link

Hi there,

I've run into a bug while using dtplyr that doesn't occur while using dplyr and it's been difficult to debug/understand why it's happening, so dropping a reprex for y'all here! This issue felt like the most relevant but really was unable to explain what was going on and after playing with it for a while, just couldn't get it working in dtplyr.

library(tidyverse)
library(dtplyr)
library(lubridate)

smallest_two_month_window = 30 * 2
longest_six_month_window = 31 * 6

df <- tibble(
    id = c(1, 2, 3),
    outcome_yr = 2020,
    outcome_dt = '2020-12-31' %>% ymd(),
    num_vaccine_claims = c(1, 2, 2),
    first_vaccine_claim = '2020-06-30' %>% ymd(),
    last_vaccine_claim = c('2020-06-30', '2020-07-01', '2020-11-01') %>% ymd(),
) %>% 
    mutate(time_between_doses = last_vaccine_claim - first_vaccine_claim)

# Dplyr Successful
df %>%
    group_by(id, outcome_yr) %>% 
    # lazy_dt() %>% 
    summarize(
        vaccination_status = case_when(
            # Fully Vaccinated
            (num_vaccine_claims == 2) &&
                (smallest_two_month_window <= time_between_doses) &&
                (time_between_doses <= longest_six_month_window) ~ 'Fully vaccinated (recommended dose)',
            (num_vaccine_claims >= 2) ~ 'Fully vaccinated (2+ doses',
            
            # Partially Vaccinated
            (num_vaccine_claims == 1) ~ 'Partially vaccinated',
            
            # Unvaccinated
            (num_vaccine_claims == 0) ~ 'Unvaccinated',
            TRUE ~ 'Unvaccinated'
        )
    )
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 3 × 3
#> # Groups:   id [3]
#>      id outcome_yr vaccination_status                 
#>   <dbl>      <dbl> <chr>                              
#> 1     1       2020 Partially vaccinated               
#> 2     2       2020 Fully vaccinated (2+ doses         
#> 3     3       2020 Fully vaccinated (recommended dose)

# Dtplyr Fcase Error
df %>%
    group_by(id, outcome_yr) %>% 
    lazy_dt() %>%
    summarize(
        vaccination_status = case_when(
            # Fully Vaccinated
            (num_vaccine_claims == 2) &&
                (smallest_two_month_window <= time_between_doses) &&
                (time_between_doses <= longest_six_month_window) ~ 'Fully vaccinated (recommended dose)',
            (num_vaccine_claims >= 2) ~ 'Fully vaccinated (2+ doses',
            
            # Partially Vaccinated
            (num_vaccine_claims == 1) ~ 'Partially vaccinated',
            
            # Unvaccinated
            (num_vaccine_claims == 0) ~ 'Unvaccinated',
            TRUE ~ 'Unvaccinated'
        )
    )
#> Error in fcase((num_vaccine_claims == 2) && (..smallest_two_month_window <= : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

Created on 2021-09-10 by the reprex package (v2.0.1)

@hadley
Copy link
Member

hadley commented Sep 10, 2021

The root problem is that you've used && instead of & in the case_when(). I'm surprised that dplyr doesn't give you a warning here, but it might be because you have one row in each group (since nothing in your case_when() varies by group, you don't need group here at all).

I'll leave this issue open because we'll need to think about a better error message, probably both here and in dplyr.

@eutwt
Copy link
Collaborator

eutwt commented Sep 10, 2021

This is because in CRAN dtplyr, calling lazy_dt() results in a lazy dt with no groups even when the input is grouped. So the summarise is incorrectly done on the whole table.

This issue was fixed in #207

Although it's not best practice to mix && and & in case_when, it should make no difference for the output in this case where there is one row per group, if dtplyr is properly preserving the input grouping.

@kmishra9
Copy link
Author

kmishra9 commented Sep 10, 2021

Good to know & vs && differs here -- no error from dplyr in this instance but this is also a boiled down reprex, so with the larger dataset I'm using, there are plenty of groups with plenty of rows per group and the same error occurs w/ no warnings. @eutwt , I also did just check my original, more complex query on the original data and the error persists even though the lazy_dt() call occurs before the group_by(), so I think it's more likely the case that the and operator is causing issues! I'll report back in a bit if I can get it working w/ that change

@eutwt
Copy link
Collaborator

eutwt commented Sep 10, 2021

Here's a smaller reprex for the issue that seems to be causing the error in the above reprex. But yeah, there could be other issues with mixing & and && aside from this with the real data

library(dtplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

gb_after <- 
  data.frame(a = 1:2) %>% 
    lazy_dt() %>% 
    group_by(a) %>% 
    mutate(b = case_when(a == 2 && a != 1 ~ 'row 2', a < 2 ~ 'not row 2'))

gb_before <- 
  data.frame(a = 1:2) %>% 
    group_by(a) %>% 
    lazy_dt() %>% 
    mutate(b = case_when(a == 2 && a != 1 ~ 'row 2', a < 2 ~ 'not row 2'))

# group_by after works fine
gb_after
#> Source: local data table [2 x 2]
#> Groups: a
#> Call:   copy(`_DT1`)[, `:=`(b = fcase(a == 2 && a != 1, "row 2", a < 
#>     2, "not row 2")), by = .(a)]
#> 
#>       a b        
#>   <int> <chr>    
#> 1     1 not row 2
#> 2     2 row 2    
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# with group_by before the summarise is on the whole data
show_query(gb_before)
#> copy(`_DT2`)[, `:=`(b = fcase(a == 2 && a != 1, "row 2", a < 
#>     2, "not row 2"))]

# and cannot produce output
print(gb_before)
#> Error in fcase(a == 2 && a != 1, "row 2", a < 2, "not row 2"): Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

Created on 2021-09-10 by the reprex package (v2.0.1)

@kmishra9
Copy link
Author

kmishra9 commented Sep 11, 2021

Can confirm switching to & from && fixed this bug for me in the complex version of this query (saving me hours upon hours in the process, in a super-resource-constrained sandbox), so thanks for the insight and looking forward to getting a more informative error message in the future!

@KesterJ
Copy link

KesterJ commented Feb 11, 2022

I've encountered a version of this issue that doesn't involve &&, and where group_by() is called after lazy_dt(). Reprex below:

library(dplyr, warn.conflicts = FALSE)
library(dtplyr, warn.conflicts = FALSE)

options(dplyr.summarise.inform = FALSE)

loans <- tibble(
  borrower_id = c(1,1,1,1,2,2),
  loan_id = c("A", "A", "B", "B", "C", "C"),
  year = c(2020, 2021, 2020, 2021, 2020, 2021),
  repayments = c(0, 0, 0, 200, 150, 50)
)

#In dplyr (works)
loans %>%
  group_by(borrower_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup()
#> # A tibble: 4 x 3
#>   borrower_id  year status                     
#>         <dbl> <dbl> <chr>                      
#> 1           1  2020 Did not make any repayments
#> 2           1  2021 Made repayments            
#> 3           2  2020 Made repayments            
#> 4           2  2021 Made repayments

#In dtplyr (does not work)
loans %>%
  lazy_dt() %>%
  group_by(borrower_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup() %>%
  as_tibble()
#> Error in fcase(any(repayments > 0), "Made repayments", rep(TRUE, .N), : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

#In dtplyr with different grouping that includes only one row per group (works)
loans %>%
  lazy_dt() %>%
  group_by(loan_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup() %>%
  as_tibble()
#> # A tibble: 6 x 3
#>   loan_id  year status                     
#>   <chr>   <dbl> <chr>                      
#> 1 A        2020 Did not make any repayments
#> 2 A        2021 Did not make any repayments
#> 3 B        2020 Did not make any repayments
#> 4 B        2021 Made repayments            
#> 5 C        2020 Made repayments            
#> 6 C        2021 Made repayments

Created on 2022-02-11 by the reprex package (v2.0.1)

@eutwt
Copy link
Collaborator

eutwt commented Feb 11, 2022

@KesterJ Since dtplyr handles TRUE as a special case I've opened a separate issue for this.

@eutwt eutwt closed this as completed Feb 11, 2022
@eutwt
Copy link
Collaborator

eutwt commented Feb 11, 2022

Actually I should not have closed this issue. There is still an inconsistency with dplyr, aside from the special case of an explicit T/TRUE (length-1 vectors are not repeated)

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- tibble(x = 1:2)

df %>%
  mutate(y = case_when(x > 1 ~ 'a', any(x > 1) ~ 'b'))
#> # A tibble: 2 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 b    
#> 2     2 a

lazy_dt(df) %>%
  mutate(y = case_when(x > 1 ~ 'a', any(x > 1) ~ 'b'))
#> Error in fcase(x > 1, "a", any(x > 1), "b"): Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

Created on 2022-02-11 by the reprex package (v2.0.1)

@eutwt eutwt reopened this Feb 11, 2022
@markfairbanks
Copy link
Collaborator

@eutwt should this be closed? I don't see an easy workaround for this - it looks more like a data.table issue than a dtplyr one.

@markfairbanks markfairbanks added the bug an unexpected problem or unintended behavior label Jun 21, 2022
@eutwt
Copy link
Collaborator

eutwt commented Jun 21, 2022

I agree

@eutwt eutwt closed this as completed Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants