dtplyr changes result of summarize of empty data.table #282

Open
lutzgruber-quantco opened this issue Aug 13, 2021 · 2 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@lutzgruber-quantco

Hi,

When I summarize a zero-row data.table without dtplyr loaded, I get a table with one row. With dtplyr loaded, the same code returns a table with zero rows:

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




> data.table::data.table(x = -1) %>% filter(x > 0) %>% summarize(nrow = n(), xmean = mean(x)) %>% data.table::as.data.table()
    nrow xmean
   <int> <num>
1:     0   NaN



> library(dtplyr)
> data.table::data.table(x = -1) %>% filter(x > 0) %>% summarize(nrow = n(), xmean = mean(x)) %>% data.table::as.data.table()
Empty data.table (0 rows and 2 cols): nrow,xmean
@hadley
Member

hadley commented Aug 30, 2021

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted so that I can easily re-run it in a local session.

@hadley hadley added the reprex needs a minimal reproducible example label Aug 30, 2021
@eutwt
Collaborator

eutwt commented Aug 30, 2021

Here's a reprex:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
dt <- data.table(x = -1) 

dt %>% 
  filter(x > 0) %>% 
  summarize(nrow = n(), xmean = mean(x)) %>% 
  as.data.table()
#>    nrow xmean
#> 1:    0   NaN

library(dtplyr)
dtp_result <- 
  dt %>% 
    filter(x > 0) %>% 
    summarize(nrow = n(), xmean = mean(x)) 
  
as.data.table(dtp_result)
#> Empty data.table (0 rows and 2 cols): nrow,xmean
show_query(dtp_result)
#> `_DT1`[x > 0, .(nrow = .N, xmean = mean(x))]

Created on 2021-08-30 by the reprex package (v2.0.1)

If you compute() before summarise, the output is the same as without dtplyr:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
dt <- data.table(x = -1) 

dt %>% 
  filter(x > 0) %>% 
  compute() %>% 
  summarize(nrow = n(), xmean = mean(x)) %>% 
  as.data.table()
#>    nrow xmean
#> 1:    0   NaN

Created on 2021-08-31 by the reprex package (v2.0.1)

That's because when this is done as two separate [.data.table calls, the output has one row rather than zero (see below). To match dplyr's output in this situation (without the user explicitly computing results), I can think of three options:

  • always perform filters in a separate step
  • internally evaluate filters and count the rows to decide how to construct the next step, which breaks lazy evaluation
  • specifically check whether one of the summarise dots is n() and treat that as a special case

The first two seem pretty bad. I'm not sure the third option is that useful, since it doesn't cover other cases where dplyr returns one row, e.g. just summarize(xmean = mean(x)) in the example above.

library(data.table)
library(dplyr, warn.conflicts = FALSE)
dt <- data.table(x = -1) 

dt[x > 0, .(nrow = .N, xmean = mean(x))]
#> Empty data.table (0 rows and 2 cols): nrow,xmean

dt[x > 0][, .(nrow = .N, xmean = mean(x))]
#>    nrow xmean
#> 1:    0   NaN

## Same behavior in general

dt[x > 0, .(y = 1)]
#> Empty data.table (0 rows and 1 cols): y
dt[x > 0][, .(y = 1)]
#>    y
#> 1: 1

Created on 2021-08-30 by the reprex package (v2.0.1)
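
The same comparison can be run for the summarize(xmean = mean(x)) case mentioned earlier, which special-casing n() would not cover. This is only a minimal sketch in plain data.table. It assumes the query dtplyr generates for that pipeline has the same one-step shape as the show_query() output in the reprex. The names one_step and two_step are just for illustration.

```r
library(data.table)

dt <- data.table(x = -1)

# One-step form (assumed to mirror the generated query): filter in i and
# summarise in j within a single call. No rows match, so the result has
# zero rows.
one_step <- dt[x > 0, .(xmean = mean(x))]

# Two-step form: subset first, then summarise the empty table. Here j is
# evaluated once on zero rows, so mean() of an empty vector yields one row.
two_step <- dt[x > 0][, .(xmean = mean(x))]

nrow(one_step)
nrow(two_step)
```

With no n() in the summary, the two forms still differ in row count, so a check that only looks for n() would leave this mismatch in place.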

@hadley hadley removed the reprex needs a minimal reproducible example label Aug 31, 2021
@markfairbanks markfairbanks added the bug an unexpected problem or unintended behavior label Jun 21, 2022