Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handling the "variable" column after melting unbalanced wide data #2575

Closed
mrdwab opened this issue Jan 18, 2018 · 8 comments · Fixed by #4731
Closed

Properly handling the "variable" column after melting unbalanced wide data #2575

mrdwab opened this issue Jan 18, 2018 · 8 comments · Fixed by #4731
Labels
reshape dcast melt
Milestone

Comments

@mrdwab
Copy link

mrdwab commented Jan 18, 2018

In reference to my comments at #2551, using patterns with melt on unbalanced wide or panel data results in an incorrect molten dataset.

Here's a minimal example, where "a" values are missing "beta" times, and "b" values are missing "alpha" times:

library(data.table)
set.seed(2334)
DT <- data.table(
  a_alpha = rnorm(3), a_gamma = rnorm(3),
  b_beta = rnorm(3), b_gamma = rnorm(3),
  id = c(1:3)
)

melt just sees that there are two sets of variables:

melt(DT, measure.vars = patterns("a_", "b_"))
#    id variable     value1       value2
# 1:  1        1 -0.1183107  1.682018318
# 2:  2        1  1.2370906 -0.573611132
# 3:  3        1  0.8088209 -0.057320032
# 4:  1        2 -0.7656264 -0.706428227
# 5:  2        2 -0.5919939  0.001899857
# 6:  3        2  0.5279071  1.063851211

Expected behaviour:

melt(DT, "id")[, c("stub", "var") := tstrsplit(variable, "_")][, dcast(.SD, id + var ~ stub)]
#    id   var          a            b
# 1:  1 alpha -0.1183107           NA
# 2:  1  beta         NA  1.682018318
# 3:  1 gamma -0.7656264 -0.706428227
# 4:  2 alpha  1.2370906           NA
# 5:  2  beta         NA -0.573611132
# 6:  2 gamma -0.5919939  0.001899857
# 7:  3 alpha  0.8088209           NA
# 8:  3  beta         NA -0.057320032
# 9:  3 gamma  0.5279071  1.063851211
@mrdwab
Copy link
Author

mrdwab commented Mar 3, 2018

@jangorecki jangorecki added the reshape dcast melt label Apr 5, 2020
@jangorecki jangorecki changed the title Properly handling the "variable" column after melting unbalanced wide data [feature request] Properly handling the "variable" column after melting unbalanced wide data Apr 5, 2020
@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

this is the same issue as #4027 and is solved by #4720, not merged into master yet but you can use it via:

remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'nc' from a github remote, the SHA1 (11b61f8e) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
set.seed(2334)
DT <- data.table(
  a_alpha = rnorm(3), a_gamma = rnorm(3),
  b_beta = rnorm(3), b_gamma = rnorm(3),
  id = c(1:3)
)
(DT.tall <- nc::capture_melt_multiple(DT, column="[ab]", "_", var=".*", fill=TRUE))
#>    id   var          a            b
#> 1:  1 alpha -0.1183107           NA
#> 2:  2 alpha  1.2370906           NA
#> 3:  3 alpha  0.8088209           NA
#> 4:  1  beta         NA  1.682018318
#> 5:  2  beta         NA -0.573611132
#> 6:  3  beta         NA -0.057320032
#> 7:  1 gamma -0.7656264 -0.706428227
#> 8:  2 gamma -0.5919939  0.001899857
#> 9:  3 gamma  0.5279071  1.063851211
DT.tall[order(id, var)]
#>    id   var          a            b
#> 1:  1 alpha -0.1183107           NA
#> 2:  1  beta         NA  1.682018318
#> 3:  1 gamma -0.7656264 -0.706428227
#> 4:  2 alpha  1.2370906           NA
#> 5:  2  beta         NA -0.573611132
#> 6:  2 gamma -0.5919939  0.001899857
#> 7:  3 alpha  0.8088209           NA
#> 8:  3  beta         NA -0.057320032
#> 9:  3 gamma  0.5279071  1.063851211

@tdhock tdhock closed this as completed Sep 29, 2020
@jangorecki
Copy link
Member

Let's wait for PR to merge, tests added and news item provided.

@jangorecki jangorecki reopened this Sep 29, 2020
@tdhock
Copy link
Member

tdhock commented Oct 1, 2020

hi again if you want to use the new data.table::melt without nc you could do

remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
set.seed(2334)
DT <- data.table(
  a_alpha = rnorm(3), a_gamma = rnorm(3),
  b_beta = rnorm(3), b_gamma = rnorm(3),
  id = c(1:3)
)
(DT.tall <- melt(DT, measure=list(
  a=c("a_alpha", NA, "a_gamma"),
  b=c(NA, "b_beta", "b_gamma"))))
#>    id variable          a            b
#> 1:  1        1 -0.1183107           NA
#> 2:  2        1  1.2370906           NA
#> 3:  3        1  0.8088209           NA
#> 4:  1        2         NA  1.682018318
#> 5:  2        2         NA -0.573611132
#> 6:  3        2         NA -0.057320032
#> 7:  1        3 -0.7656264 -0.706428227
#> 8:  2        3 -0.5919939  0.001899857
#> 9:  3        3  0.5279071  1.063851211
DT.tall[, var := c("alpha", "beta", "gamma")[variable] ]
DT.tall[order(id, var), .(id, var, a, b)]
#>    id   var          a            b
#> 1:  1 alpha -0.1183107           NA
#> 2:  1  beta         NA  1.682018318
#> 3:  1 gamma -0.7656264 -0.706428227
#> 4:  2 alpha  1.2370906           NA
#> 5:  2  beta         NA -0.573611132
#> 6:  2 gamma -0.5919939  0.001899857
#> 7:  3 alpha  0.8088209           NA
#> 8:  3  beta         NA -0.057320032
#> 9:  3 gamma  0.5279071  1.063851211

@tdhock
Copy link
Member

tdhock commented Oct 7, 2020

pure data.table solution using #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
set.seed(2334)
DT <- data.table(
  a_alpha = rnorm(3), a_gamma = rnorm(3),
  b_beta = rnorm(3), b_gamma = rnorm(3),
  id = c(1:3)
)
melt(DT, measure.vars=measure(value.name, var))
#>    id   var          a            b
#> 1:  1 alpha -0.1183107           NA
#> 2:  2 alpha  1.2370906           NA
#> 3:  3 alpha  0.8088209           NA
#> 4:  1 gamma -0.7656264 -0.706428227
#> 5:  2 gamma -0.5919939  0.001899857
#> 6:  3 gamma  0.5279071  1.063851211
#> 7:  1  beta         NA  1.682018318
#> 8:  2  beta         NA -0.573611132
#> 9:  3  beta         NA -0.057320032

@PavoDive
Copy link

Thank you all very much for measure. I find an unexpected behavior. With data from this StackOverflow question and data.table 1.14.1 I found this:

df1 <- structure(list(Material_code = 111:112, 
                      actual_202009 = c(30L, 19L), 
                      actual_202010 = c(44L, 70L), 
                      actual_202011 = c(24L, 93L), 
                      pred_202009 = c(25L, 23L), 
                      pred_202010 = c(52L, 68L), 
                      pred_202011 = c(27L, 100L)), 
                 class = c("data.table", "data.frame"), 
                 row.names = c(NA, -2L))

The following code produces almost the expected result:

> melt(df1, id.vars = 1, measure.vars= measure(value.name, date, sep = "_"))
   Material_code   date Material actual pred
1:           111   code      111     NA   NA
2:           112   code      112     NA   NA
3:           111 202009       NA     30   25
4:           112 202009       NA     19   23
5:           111 202010       NA     44   52
6:           112 202010       NA     70   68
7:           111 202011       NA     24   27
8:           112 202011       NA     93  100

with the weird rows with "code" in the date column. I expected melt to leave column 1 (id.vars = 1) untouched by measure, so that's why I say the result is not what I expected. However, if I change the name of the first column, then it produces the output I expected:

df2 <- copy(df1)[, setnames(.SD, 1, "Material.code")]
> melt(df2, id.vars = 1, measure.vars= measure(value.name, date, sep = "_"))
   Material.Code   date actual pred
1:           111 202009     30   25
2:           112 202009     19   23
3:           111 202010     44   52
4:           112 202010     70   68
5:           111 202011     24   27
6:           112 202011     93  100

@avimallu
Copy link
Contributor

Can confirm. You might need to either reopen this, or open a separate issue.

@PavoDive
Copy link

PavoDive commented May 12, 2021

I opened issue 4991, as it most likely is an issue of measure rather than the continuation of the issue described in here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reshape dcast melt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants