Case When/Fcase Error #300

kmishra9 · 2021-09-10T19:45:35Z

Hi there,

I've run into a bug while using dtplyr that doesn't occur while using dplyr and it's been difficult to debug/understand why it's happening, so dropping a reprex for y'all here! This issue felt like the most relevant but really was unable to explain what was going on and after playing with it for a while, just couldn't get it working in dtplyr.

library(tidyverse)
library(dtplyr)
library(lubridate)

smallest_two_month_window = 30 * 2
longest_six_month_window = 31 * 6

df <- tibble(
    id = c(1, 2, 3),
    outcome_yr = 2020,
    outcome_dt = '2020-12-31' %>% ymd(),
    num_vaccine_claims = c(1, 2, 2),
    first_vaccine_claim = '2020-06-30' %>% ymd(),
    last_vaccine_claim = c('2020-06-30', '2020-07-01', '2020-11-01') %>% ymd(),
) %>% 
    mutate(time_between_doses = last_vaccine_claim - first_vaccine_claim)

# Dplyr Successful
df %>%
    group_by(id, outcome_yr) %>% 
    # lazy_dt() %>% 
    summarize(
        vaccination_status = case_when(
            # Fully Vaccinated
            (num_vaccine_claims == 2) &&
                (smallest_two_month_window <= time_between_doses) &&
                (time_between_doses <= longest_six_month_window) ~ 'Fully vaccinated (recommended dose)',
            (num_vaccine_claims >= 2) ~ 'Fully vaccinated (2+ doses',
            
            # Partially Vaccinated
            (num_vaccine_claims == 1) ~ 'Partially vaccinated',
            
            # Unvaccinated
            (num_vaccine_claims == 0) ~ 'Unvaccinated',
            TRUE ~ 'Unvaccinated'
        )
    )
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 3 × 3
#> # Groups:   id [3]
#>      id outcome_yr vaccination_status                 
#>   <dbl>      <dbl> <chr>                              
#> 1     1       2020 Partially vaccinated               
#> 2     2       2020 Fully vaccinated (2+ doses         
#> 3     3       2020 Fully vaccinated (recommended dose)

# Dtplyr Fcase Error
df %>%
    group_by(id, outcome_yr) %>% 
    lazy_dt() %>%
    summarize(
        vaccination_status = case_when(
            # Fully Vaccinated
            (num_vaccine_claims == 2) &&
                (smallest_two_month_window <= time_between_doses) &&
                (time_between_doses <= longest_six_month_window) ~ 'Fully vaccinated (recommended dose)',
            (num_vaccine_claims >= 2) ~ 'Fully vaccinated (2+ doses',
            
            # Partially Vaccinated
            (num_vaccine_claims == 1) ~ 'Partially vaccinated',
            
            # Unvaccinated
            (num_vaccine_claims == 0) ~ 'Unvaccinated',
            TRUE ~ 'Unvaccinated'
        )
    )
#> Error in fcase((num_vaccine_claims == 2) && (..smallest_two_month_window <= : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

^{Created on 2021-09-10 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

hadley · 2021-09-10T19:53:00Z

The root problem is that you've used && instead of & in the case_when(). I'm surprised that dplyr doesn't give you a warning here, but it might be because you have one row in each group (since nothing in your case_when() varies by group, you don't need group here at all).

I'll leave this issue open because we'll need to think about a better error message, probably both here and in dplyr.

eutwt · 2021-09-10T20:36:06Z

This is because in CRAN dtplyr, calling lazy_dt() results in a lazy dt with no groups even when the input is grouped. So the summarise is incorrectly done on the whole table.

This issue was fixed in #207

Although it's not best practice to mix && and & in case_when, it should make no difference for the output in this case where there is one row per group, if dtplyr is properly preserving the input grouping.

kmishra9 · 2021-09-10T20:49:55Z

Good to know & vs && differs here -- no error from dplyr in this instance but this is also a boiled down reprex, so with the larger dataset I'm using, there are plenty of groups with plenty of rows per group and the same error occurs w/ no warnings. @eutwt , I also did just check my original, more complex query on the original data and the error persists even though the lazy_dt() call occurs before the group_by(), so I think it's more likely the case that the and operator is causing issues! I'll report back in a bit if I can get it working w/ that change

eutwt · 2021-09-10T21:04:57Z

Here's a smaller reprex for the issue that seems to be causing the error in the above reprex. But yeah, there could be other issues with mixing & and && aside from this with the real data

library(dtplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

gb_after <- 
  data.frame(a = 1:2) %>% 
    lazy_dt() %>% 
    group_by(a) %>% 
    mutate(b = case_when(a == 2 && a != 1 ~ 'row 2', a < 2 ~ 'not row 2'))

gb_before <- 
  data.frame(a = 1:2) %>% 
    group_by(a) %>% 
    lazy_dt() %>% 
    mutate(b = case_when(a == 2 && a != 1 ~ 'row 2', a < 2 ~ 'not row 2'))

# group_by after works fine
gb_after
#> Source: local data table [2 x 2]
#> Groups: a
#> Call:   copy(`_DT1`)[, `:=`(b = fcase(a == 2 && a != 1, "row 2", a < 
#>     2, "not row 2")), by = .(a)]
#> 
#>       a b        
#>   <int> <chr>    
#> 1     1 not row 2
#> 2     2 row 2    
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# with group_by before the summarise is on the whole data
show_query(gb_before)
#> copy(`_DT2`)[, `:=`(b = fcase(a == 2 && a != 1, "row 2", a < 
#>     2, "not row 2"))]

# and cannot produce output
print(gb_before)
#> Error in fcase(a == 2 && a != 1, "row 2", a < 2, "not row 2"): Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

^{Created on 2021-09-10 by the reprex package (v2.0.1)}

kmishra9 · 2021-09-11T08:52:17Z

Can confirm switching to & from && fixed this bug for me in the complex version of this query (saving me hours upon hours in the process, in a super-resource-constrained sandbox), so thanks for the insight and looking forward to getting a more informative error message in the future!

KesterJ · 2022-02-11T12:49:17Z

I've encountered a version of this issue that doesn't involve &&, and where group_by() is called after lazy_dt(). Reprex below:

library(dplyr, warn.conflicts = FALSE)
library(dtplyr, warn.conflicts = FALSE)

options(dplyr.summarise.inform = FALSE)

loans <- tibble(
  borrower_id = c(1,1,1,1,2,2),
  loan_id = c("A", "A", "B", "B", "C", "C"),
  year = c(2020, 2021, 2020, 2021, 2020, 2021),
  repayments = c(0, 0, 0, 200, 150, 50)
)

#In dplyr (works)
loans %>%
  group_by(borrower_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup()
#> # A tibble: 4 x 3
#>   borrower_id  year status                     
#>         <dbl> <dbl> <chr>                      
#> 1           1  2020 Did not make any repayments
#> 2           1  2021 Made repayments            
#> 3           2  2020 Made repayments            
#> 4           2  2021 Made repayments

#In dtplyr (does not work)
loans %>%
  lazy_dt() %>%
  group_by(borrower_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup() %>%
  as_tibble()
#> Error in fcase(any(repayments > 0), "Made repayments", rep(TRUE, .N), : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

#In dtplyr with different grouping that includes only one row per group (works)
loans %>%
  lazy_dt() %>%
  group_by(loan_id, year) %>%
  summarise(
    status = case_when(any(repayments > 0) ~ "Made repayments",
                       TRUE ~ "Did not make any repayments")
  ) %>%
  ungroup() %>%
  as_tibble()
#> # A tibble: 6 x 3
#>   loan_id  year status                     
#>   <chr>   <dbl> <chr>                      
#> 1 A        2020 Did not make any repayments
#> 2 A        2021 Did not make any repayments
#> 3 B        2020 Did not make any repayments
#> 4 B        2021 Made repayments            
#> 5 C        2020 Made repayments            
#> 6 C        2021 Made repayments

^{Created on 2022-02-11 by the reprex package (v2.0.1)}

eutwt · 2022-02-11T14:06:17Z

@KesterJ Since dtplyr handles TRUE as a special case I've opened a separate issue for this.

eutwt · 2022-02-11T15:23:03Z

Actually I should not have closed this issue. There is still an inconsistency with dplyr, aside from the special case of an explicit T/TRUE (length-1 vectors are not repeated)

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- tibble(x = 1:2)

df %>%
  mutate(y = case_when(x > 1 ~ 'a', any(x > 1) ~ 'b'))
#> # A tibble: 2 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 b    
#> 2     2 a

lazy_dt(df) %>%
  mutate(y = case_when(x > 1 ~ 'a', any(x > 1) ~ 'b'))
#> Error in fcase(x > 1, "a", any(x > 1), "b"): Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

^{Created on 2022-02-11 by the reprex package (v2.0.1)}

markfairbanks · 2022-06-21T14:45:12Z

@eutwt should this be closed? I don't see an easy workaround for this - it looks more like a data.table issue than a dtplyr one.

eutwt · 2022-06-21T15:41:20Z

I agree

eutwt mentioned this issue Sep 11, 2021

Give warning for early atomic conditions in case_when tidyverse/dplyr#6012

Closed

eutwt mentioned this issue Feb 11, 2022

Using TRUE in case_when causes error if output vector length does not equal group size .N #339

Open

eutwt closed this as completed Feb 11, 2022

eutwt reopened this Feb 11, 2022

markfairbanks added the bug an unexpected problem or unintended behavior label Jun 21, 2022

eutwt closed this as completed Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Case When/Fcase Error #300

Case When/Fcase Error #300

kmishra9 commented Sep 10, 2021

hadley commented Sep 10, 2021

eutwt commented Sep 10, 2021 •

edited

Loading

kmishra9 commented Sep 10, 2021 •

edited

Loading

eutwt commented Sep 10, 2021

kmishra9 commented Sep 11, 2021 •

edited

Loading

KesterJ commented Feb 11, 2022

eutwt commented Feb 11, 2022

eutwt commented Feb 11, 2022

markfairbanks commented Jun 21, 2022

eutwt commented Jun 21, 2022

Case When/Fcase Error #300

Case When/Fcase Error #300

Comments

kmishra9 commented Sep 10, 2021

hadley commented Sep 10, 2021

eutwt commented Sep 10, 2021 • edited Loading

kmishra9 commented Sep 10, 2021 • edited Loading

eutwt commented Sep 10, 2021

kmishra9 commented Sep 11, 2021 • edited Loading

KesterJ commented Feb 11, 2022

eutwt commented Feb 11, 2022

eutwt commented Feb 11, 2022

markfairbanks commented Jun 21, 2022

eutwt commented Jun 21, 2022

eutwt commented Sep 10, 2021 •

edited

Loading

kmishra9 commented Sep 10, 2021 •

edited

Loading

kmishra9 commented Sep 11, 2021 •

edited

Loading