sum_to_zero_vector case study #229


Merged: 15 commits, Apr 17, 2025
354 changes: 354 additions & 0 deletions jupyter/radon/LICENSE

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions jupyter/radon/README.md
@@ -0,0 +1,6 @@
Notebook based on the radon case study in Gelman and Hill (2007).
Introduces hierarchical models and the Python packages `cmdstanpy` and `plotnine`.

Author: Mitzi Morris


354 changes: 354 additions & 0 deletions jupyter/sum-to-zero/LICENSE

Large diffs are not rendered by default.

18 changes: 18 additions & 0 deletions jupyter/sum-to-zero/README.md
@@ -0,0 +1,18 @@
Case study and Jupyter notebook demonstrating the correctness and efficiency of the
`sum_to_zero_vector` constrained parameter type introduced in Stan 2.36.


- Case study `sum_to_zero_evaluation.qmd` demonstrates use of the `sum_to_zero_vector` type in two models.

- Jupyter notebook `sum_to_zero_evaluation.ipynb` gives a step-by-step explanation of the operations
used to carry out this evaluation.

The repository also includes several Python helper modules; a minimal usage sketch follows the list.

* `eval_efficiencies.py` - runs the models repeatedly and reports average performance statistics.
* `utils.py` - simulates data for the binomial model.
* `utils_html.py` - formats Stan summaries for this notebook.
* `utils_bym2.py` - computes data inputs to the BYM2 model.
* `utils_nyc_map.py` - munges the New York City census tract map.
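
A minimal sketch of timing a single fit with these helpers (assumes `cmdstanpy` is installed and CmdStan is configured; the `simulate_data` arguments mirror the call in `eval_efficiencies.py`):

```python
from cmdstanpy import CmdStanModel
from utils import simulate_data

# simulate the 'small' dataset used in the evaluation
data = simulate_data(3, 5, 9, -3.5, 0.75, 0.9995, 17, seed=45678)
model = CmdStanModel(stan_file="stan/binomial_4preds_ozs.stan")
fit = model.sample(data=data, parallel_chains=4, show_progress=False)
print(fit.summary().loc["lp__", "ESS_bulk/s"])  # bulk effective samples per second
```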

Author: Mitzi Morris
1 change: 1 addition & 0 deletions jupyter/sum-to-zero/binomial_runtimes_large.json
@@ -0,0 +1 @@
{"runtime, ave":{"ozs large":2.41,"hard large":3.25,"soft large":17.41},"runtime, std dev":{"ozs large":0.08,"hard large":0.09,"soft large":0.5},"ESS_bulk\/s":{"ozs large":1305.86,"hard large":981.53,"soft large":181.15}}
1 change: 1 addition & 0 deletions jupyter/sum-to-zero/binomial_runtimes_small.json
@@ -0,0 +1 @@
{"runtime, ave":{"ozs small":2.21,"hard small":3.07,"soft small":54.49},"runtime, std dev":{"ozs small":0.05,"hard small":0.08,"soft small":2.54},"ESS_bulk\/s":{"ozs small":1327.11,"hard small":988.47,"soft small":63.14}}
2,102 changes: 2,102 additions & 0 deletions jupyter/sum-to-zero/data/nyc_study.geojson

Large diffs are not rendered by default.

93 changes: 93 additions & 0 deletions jupyter/sum-to-zero/eval_efficiencies.py
@@ -0,0 +1,93 @@
# Compare ESS/sec across three parameterizations of the binomial model,
# using simulated datasets with smaller and larger numbers of observations.
import os
import numpy as np
import pandas as pd
from typing import Any, Dict, List, Tuple
from cmdstanpy import CmdStanModel, set_cmdstan_path


import logging
cmdstanpy_logger = logging.getLogger("cmdstanpy")
cmdstanpy_logger.setLevel(logging.FATAL)

import warnings
warnings.filterwarnings('ignore')

# local CmdStan installation; adjust this path for your machine
set_cmdstan_path(os.path.join('/Users', 'mitzi', 'github', 'stan-dev', 'cmdstan'))

from utils import simulate_data

# Fit the model to the given dataset N times.
# For each run, record total wall-clock time and bulk effective samples
# per second (ESS_bulk/s for lp__).
# Returns an np.ndarray of shape (N, 2) with this timing information.
def time_fits(N: int, model: CmdStanModel, data: dict) -> np.ndarray:
    fit_times = np.ndarray(shape=(N, 2), dtype=float)
    for i in range(N):
        print('Run', i)
        fit = model.sample(data=data, parallel_chains=4,
                           show_progress=False, show_console=False, refresh=10_000)
        fit_summary = fit.summary()
        total_time = 0
        times = fit.time
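        # fit.time reports per-chain timing; sum the 'total' wall-clock
        # seconds across chains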
        for j in range(len(times)):
            total_time += times[j]['total']

        fit_times[i, 0] = total_time
        fit_times[i, 1] = fit_summary.loc['lp__', 'ESS_bulk/s']
    return fit_times


# Given a list of (label, timings) pairs, populate a dataframe with the
# mean and std dev of wall-clock time and the mean ESS_bulk/s.
def summarize_times(data_pairs: List[Tuple[str, np.ndarray]]) -> pd.DataFrame:
    result_data = []
    for label, array in data_pairs:
        result_data.append({
            'label': label,
            'mean': np.mean(array, axis=0)[0],
            'std dev': np.std(array, axis=0)[0],
            'ESS_bulk/s': np.mean(array, axis=0)[1]
        })
    df = pd.DataFrame(result_data)
    return df.set_index('label').round(2)


# Create the datasets: fixed dimensions, fixed seed.
N_eth = 3
N_edu = 5
N_age = 9
baseline = -3.5
sens = 0.75
spec = 0.9995
data_small = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 17, seed=45678)
data_large = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 200, seed=45678)
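# the final positional argument (17 vs. 200) presumably sets the number of
# observations per cell, giving the 'small' and 'large' datasets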

# sum_to_zero_vector parameterization ('ozs')

binomial_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs.stan'))
times_ozs_large = time_fits(100, binomial_ozs_mod, data_large)
times_ozs_small = time_fits(100, binomial_ozs_mod, data_small)

# hard sum-to-zero constraint

binomial_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard.stan'))
times_hard_small = time_fits(100, binomial_hard_mod, data_small)
times_hard_large = time_fits(100, binomial_hard_mod, data_large)

# soft sum-to-zero constraint

binomial_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_soft.stan'))
times_soft_small = time_fits(100, binomial_soft_mod, data_small)
times_soft_large = time_fits(100, binomial_soft_mod, data_large)


df_small = summarize_times([('ozs small', times_ozs_small),
                            ('hard small', times_hard_small),
                            ('soft small', times_soft_small)])
df_small.to_json("binomial_runtimes_small.json")

df_large = summarize_times([('ozs large', times_ozs_large),
                            ('hard large', times_hard_large),
                            ('soft large', times_soft_large)])
df_large.to_json("binomial_runtimes_large.json")
53 changes: 53 additions & 0 deletions jupyter/sum-to-zero/stan/binomial_4preds_hard.stan
@@ -0,0 +1,53 @@
// multi-level model for binomial data with 4 categorical predictors.
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;

  array[N] int<lower=0> pos_tests;
  array[N] int<lower=0> tests;
  array[N] int<lower=1, upper=2> sex;
  array[N] int<lower=1, upper=N_age> age;
  array[N] int<lower=1, upper=N_eth> eth;
  array[N] int<lower=1, upper=N_edu> edu;

  // hyperparameters
  real<lower=0, upper=1> sens;
  real<lower=0, upper=1> spec;
}
parameters {
  real beta_0;
  real beta_sex_raw;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  vector[N_age - 1] beta_age_raw;
  vector[N_eth - 1] beta_eth_raw;
  vector[N_edu - 1] beta_edu_raw;
}
transformed parameters {
  vector[2] beta_sex = [beta_sex_raw, -beta_sex_raw]';

  // hard sum-to-zero: the last element of each vector is the negative
  // sum of the K - 1 free elements
  vector[N_age] beta_age = append_row(beta_age_raw, -sum(beta_age_raw));
  vector[N_eth] beta_eth = append_row(beta_eth_raw, -sum(beta_eth_raw));
  vector[N_edu] beta_edu = append_row(beta_edu_raw, -sum(beta_edu_raw));

  vector[N] eta = inv_logit(beta_0 + beta_sex[sex] + beta_age[age]
                            + beta_eth[eth] + beta_edu[edu]);
  // adjust true prevalence for test sensitivity and specificity
  vector[N] prob_pos_test = eta * sens + (1 - eta) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, prob_pos_test); // likelihood

  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  // centered parameterization
  beta_age_raw ~ normal(0, sigma_age);
  beta_eth_raw ~ normal(0, sigma_eth);
  beta_edu_raw ~ normal(0, sigma_edu);
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();
}
generated quantities {
  array[N] int<lower=0> y_rep = binomial_rng(tests, prob_pos_test);
}
38 changes: 38 additions & 0 deletions jupyter/sum-to-zero/stan/binomial_4preds_hard_ppc.stan
@@ -0,0 +1,38 @@
// multi-level model for binomial data with 4 categorical predictors:
// priors-only version, for sampling coefficients before seeing any data.
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;

  // hyperparameters
  real<lower=0, upper=1> sens;
  real<lower=0, upper=1> spec;
}
parameters {
  real beta_0;
  real beta_sex_raw;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  vector[N_age - 1] beta_age_raw;
  vector[N_eth - 1] beta_eth_raw;
  vector[N_edu - 1] beta_edu_raw;
}
transformed parameters {
  vector[2] beta_sex = [beta_sex_raw, -beta_sex_raw]';

  vector[N_age] beta_age = append_row(beta_age_raw, -sum(beta_age_raw));
  vector[N_eth] beta_eth = append_row(beta_eth_raw, -sum(beta_eth_raw));
  vector[N_edu] beta_edu = append_row(beta_edu_raw, -sum(beta_edu_raw));
}
model {
  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  // centered parameterization
  beta_age_raw ~ normal(0, sigma_age);
  beta_eth_raw ~ normal(0, sigma_eth);
  beta_edu_raw ~ normal(0, sigma_edu);
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();
}
62 changes: 62 additions & 0 deletions jupyter/sum-to-zero/stan/binomial_4preds_ozs.stan
@@ -0,0 +1,62 @@
// multi-level model for binomial data with 4 categorical predictors.
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;

  array[N] int<lower=0> pos_tests;
  array[N] int<lower=0> tests;
  array[N] int<lower=1, upper=2> sex;
  array[N] int<lower=1, upper=N_age> age;
  array[N] int<lower=1, upper=N_eth> eth;
  array[N] int<lower=1, upper=N_edu> edu;

  // hyperparameters
  real<lower=0, upper=1> sens;
  real<lower=0, upper=1> spec;
}
transformed data {
  real mean_sex = mean(sex);
  vector[N] sex_c = to_vector(sex) - mean_sex;
  // scaling factors for marginal variances of sum_to_zero_vectors
  // https://discourse.mc-stan.org/t/zero-sum-vector-and-normal-distribution/38296
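  // a sum_to_zero_vector of length K with elementwise normal(0, sigma)
  // has marginal variance sigma^2 * (K - 1) / K per element, so
  // multiplying sigma by sqrt(K / (K - 1)) restores marginal scale sigma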
  real s_age = sqrt(N_age * inv(N_age - 1));
  real s_eth = sqrt(N_eth * inv(N_eth - 1));
  real s_edu = sqrt(N_edu * inv(N_edu - 1));
}
parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}
transformed parameters {
  // true prevalence
  vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age]
                          + beta_eth[eth] + beta_edu[edu]);
  // incorporate test sensitivity and specificity.
  vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, p_sample); // likelihood

  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();

  // centered parameterization
  // scale normal priors on sum_to_zero_vectors
  beta_age ~ normal(0, s_age * sigma_age);
  beta_eth ~ normal(0, s_eth * sigma_eth);
  beta_edu ~ normal(0, s_edu * sigma_edu);
}
generated quantities {
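  // undo the centering of the sex predictor to recover the intercept
  // on the original coding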
  real beta_intercept = beta_0 - mean_sex * beta_sex;
  array[N] int<lower=0> y_rep = binomial_rng(tests, p_sample);
}
38 changes: 38 additions & 0 deletions jupyter/sum-to-zero/stan/binomial_4preds_ozs_ppc.stan
@@ -0,0 +1,38 @@
// generate a sample from the model priors (before seeing any data)
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;
  // omit observational data
}
transformed data {
  // scaling factors for marginal variances of sum_to_zero_vectors
  // https://discourse.mc-stan.org/t/zero-sum-vector-and-normal-distribution/38296
  real s_age = sqrt(N_age * inv(N_age - 1));
  real s_eth = sqrt(N_eth * inv(N_eth - 1));
  real s_edu = sqrt(N_edu * inv(N_edu - 1));
}
parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}
model {
  // omit likelihood
  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();

  // centered parameterization
  // scale normal priors on sum_to_zero_vectors
  beta_age ~ normal(0, s_age * sigma_age);
  beta_eth ~ normal(0, s_eth * sigma_eth);
  beta_edu ~ normal(0, s_edu * sigma_edu);
}
58 changes: 58 additions & 0 deletions jupyter/sum-to-zero/stan/binomial_4preds_soft.stan
@@ -0,0 +1,58 @@
// multi-level model for binomial data with 4 categorical predictors.
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;

  array[N] int<lower=0> pos_tests;
  array[N] int<lower=0> tests;
  array[N] int<lower=1, upper=2> sex;
  array[N] int<lower=1, upper=N_age> age;
  array[N] int<lower=1, upper=N_eth> eth;
  array[N] int<lower=1, upper=N_edu> edu;

  // hyperparameters
  real<lower=0, upper=1> sens;
  real<lower=0, upper=1> spec;
}
transformed data {
  real mean_sex = mean(sex);
  vector[N] sex_c = to_vector(sex) - mean_sex;
}
parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  vector[N_age] beta_age;
  vector[N_eth] beta_eth;
  vector[N_edu] beta_edu;
}
transformed parameters {
  // true prevalence; the sensitivity/specificity adjustment makes
  // the effective link function non-standard
  vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age]
                          + beta_eth[eth] + beta_edu[edu]);
  vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, p_sample); // likelihood

  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  // centered parameterization
  beta_age ~ normal(0, sigma_age);
  beta_eth ~ normal(0, sigma_eth);
  beta_edu ~ normal(0, sigma_edu);
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();
  // soft sum-to-zero constraint
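  // tight normal priors on the sums only approximately center the
  // coefficients, and all K elements of each vector remain free
  // parameters; the timings in binomial_runtimes_*.json show this
  // parameterization mixing far more slowly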
  sum(beta_age) ~ normal(0, 0.001 * N_age);
  sum(beta_eth) ~ normal(0, 0.001 * N_eth);
  sum(beta_edu) ~ normal(0, 0.001 * N_edu);
}
generated quantities {
  real beta_intercept = beta_0 - mean_sex * beta_sex;
  array[N] int<lower=0> y_rep = binomial_rng(tests, p_sample);
}