-
Notifications
You must be signed in to change notification settings - Fork 16
/
multivarmissing.Rmd
159 lines (137 loc) · 4.32 KB
/
multivarmissing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# Multivariate Missing Data {#multivarmissing}
$$
\DeclareMathOperator{diag}{diag}
$$
```{r multivarmissing_setup,message=FALSE}
library("tidyverse")
library("rstan")
```
This example shows how to impute missing data. See @Stan2016a, Chapter 10 "Missing Data & Partially Known Parameters" for more discussion.[^multivarmissing-src]
Consider a data set of 10 observations on 3 variables
Only one of the variables, $z$, is completely observed.
The other two variables, x$ and $y$, have a non-overlapping pattern of missing data.
```{r multivarmissing}
multivarmissing <- tribble(
~x, ~y, ~z,
1, NA, NA,
2, NA, 4,
3, NA, 3,
4, NA, 5,
5, NA, 7,
NA, 7, 9,
NA, 8, 8,
NA, 9, 11,
NA, 8, 10,
NA, 9, 8)
```
The missing elements of $x$ and $y$ are parameters, and the observed elements of $x$, $y$, and $z$ are data.
These are combined in the `transformed parameters` block, and modeled.
## Separate Regressions
We use $z$ to predict $x$,
and $z$ and $x$ (both observed and imputed) to impute $y$.
$$
\begin{aligned}[t]
x_i &\sim \mathsf{Normal}(\mu_{x,i}, \sigma_x) \\
\mu_{x,i} &= \gamma_1 + \gamma_2 z_i \\
y_i &\sim \mathsf{Normal}(\mu_{y,i}, \sigma_y) \\
\mu_{y,i} &= \beta_1 + \beta_2 y_i + \beta_3 z_i \\
z_i &\sim \mathsf{Normal}(\mu_z, \sigma_z)
\end{aligned}
$$
The parameters are given weakly informative parameters:
$$
\begin{aligned}[t]
\sigma_x,\sigma_y,\sigma_z &\sim \mathsf{HalfCauchy}(0, 5) \\
\gamma_1, \beta_1 &\sim \mathsf{Normal}(0, 10) \\
\gamma_2, \beta_2, \beta_3 &\sim \mathsf{Normal}(0, 2.5)
\end{aligned}
$$
Note that this assumes that $x$, $y$, and $z$ are standardized to have zero mean and unit variance.
```{r data_multivarmissing}
data_multivarmissing <- within(list(), {
N <- nrow(multivarmissing)
x_obs <- multivarmissing$x[!is.na(multivarmissing$x)] %>%
scale() %>% as.numeric()
x_obs_idx <- array(which(!is.na(multivarmissing$x)))
N_x_obs <- length(x_obs_idx)
x_miss_idx <- array(which(is.na(multivarmissing$x)))
N_x_miss <- length(x_miss_idx)
y_obs <- multivarmissing$y[!is.na(multivarmissing$y)] %>%
scale() %>% as.numeric()
y_obs_idx <- array(which(!is.na(multivarmissing$y)))
N_y_obs <- length(y_obs_idx)
y_miss_idx <- array(which(is.na(multivarmissing$y)))
N_y_miss <- length(y_miss_idx)
z_obs <- multivarmissing$z[!is.na(multivarmissing$z)] %>%
scale() %>% as.numeric()
z_obs_idx <- array(which(!is.na(multivarmissing$z)))
N_z_obs <- length(z_obs_idx)
z_miss_idx <- array(which(is.na(multivarmissing$z)))
N_z_miss <- length(z_miss_idx)
alpha_loc <- 0
alpha_scale <- 10
beta_loc <- rep(0, 3)
beta_scale <- c(10, 2.5, 2.5)
gamma_loc <- rep(0, 2)
gamma_scale <- c(10, 2.5)
sigma_x_scale <- 5
sigma_y_scale <- 5
sigma_z_scale <- 5
})
```
```{r mod_multivarmissing,cache.extra=tools::md5sum("stan/multivarmissing.stan"),message=FALSE,warning=FALSE}
mod_multivarmissing <- stan_model("stan/multivarmissing2.stan")
```
```{r}
mod_multivarmissing
```
```{r fit_multivarmissing,results='hide'}
fit_multivarmissing <-
sampling(mod_multivarmissing, data = data_multivarmissing)
```
```{r}
fit_multivarmissing
```
## Multivariate Normal
Alternatively, $x$, $y$, and $z$ could be modeled as coming from a multivariate normal distribution.
$$
\begin{bmatrix}
x_i \\
y_i \\
z_i
\end{bmatrix} \sim
\mathsf{Normal}(\mu, \Sigma)
$$
where $\mu$ and $\Sigma$ are given weakly informative priors,
$$
\begin{aligned}[t]
\mu_{i,k} &\sim \mathsf{Normal}(0, 10) & k \in \{1, 2, 3\}, \\
\Sigma &= \diag{\sigma} R \diag{sigma}, \\
\sigma &\sim \mathsf{HalfCauchy}(0, 5), \\
R &\sim \mathsf{LkjCorr}(2) .
\end{aligned}
$$
```{r data_multivarmissing2}
data_multivarmissing2 <- within(list(), {
N <- nrow(multivarmissing)
K <- ncol(multivarmissing)
mu_loc <- rep(0, 3)
mu_scale <- rep(0, 10)
Sigma_scale_scale <- 5
Sigma_corr_L_eta <- 2
})
```
```{r mod_multivarmissing2,cache.extra=tools::md5sum("stan/multivarmissing2.stan")}
mod_multivarmissing2 <- stan_model("stan/multivarmissing2.stan")
```
```{r}
mod_multivarmissing2
```
```{r fit_multivarmissing2,results='hide'}
fit_multivarmissing <-
sampling(mod_multivarmissing2, data = data_multivarmissing2)
```
```{r}
fit_multivarmissing
```
[^multivarmissing-src]: This example is derived from Simon Jackman, "[Multivariate Missing Data](https://web-beta.archive.org/web/20020618183148/http://jackman.stanford.edu:80/mcmc/multivarmissing.odc)", 2002-06-18.